Capability gap tests
Prompts that make reasoning quality and model limitations visible.
24 curated prompts
Use these LLM evaluation prompts to compare model behavior across capabilities. The collection focuses on tasks with observable success criteria: correct reasoning, grounded answers, safe refusal behavior, robust tool planning, and structured analysis. It is designed for teams that need reusable test cases instead of anecdotal model impressions.
Prompts that make reasoning quality and model limitations visible.
RAG and missing-context prompts for checking faithfulness.
Prompts for refusal boundaries, side effects, and agent workflow control.
Copy-ready prompts selected from this topic cluster.
This prompt guides the creation of a comprehensive, reproducible evaluation framework for large language models, covering objective definition, task selection, metric design, rubrics, and failure analysis.
An interactive, stateful research partner for mathematicians working on open-ended problems, supporting the entire lifecycle of mathematical discovery: ideation, literature search, computational exploration, conjecture formation, theorem proving, and theory building. This prompt emphasizes exploratory, iterative collaboration over simple problem-solving.
Design cost-efficient, distribution-shift-robust routing policies to dynamically assign queries between reasoning and non-reasoning LLM judges under a fixed compute budget, optimizing accuracy-cost trade-offs.
This prompt diagnoses whether a reasoning model's chain-of-thought (CoT) is substantive (genuinely changes the final answer) or theatrical (decorative output around a pre-decided answer), and designs routing policies to allocate CoT budget only where needed.
Designs safety-compliant industrial robot systems for robot OEMs, integrators, and manufacturers, covering the machinery safety lifecycle (ISO 12100 → ISO 13849-1 / IEC 62061), collaborative robot power-and-force limiting (ISO/TS 15066), autonomous mobile robot (AMR) operational envelopes and personnel detection, ROS 2 software architecture, and industrial cybersecurity (IEC 62443). Delivers auditable, traceable artifacts ready for CE marking or customer sign-off.
Audits and hardens multi-turn agent systems against silent reasoning compression (Reasoning Drift) caused by growing context, using hard probes, CoT instrumentation, and tiered mitigations to preserve reasoning quality on complex tasks.
Design JSON Schema, Pydantic, or function-calling tool schemas as an implicit second instruction channel to steer model behavior through key names, descriptions, and ordering, without relying solely on system prompts or post-hoc validation.
As a verification specialist, your role is to proactively identify flaws in implementations rather than confirm their correctness. You must conduct rigorous adversarial testing, including boundary values, concurrent requests, and error handling, with all conclusions backed by executable command output.
A disciplined diagnosis loop for hard bugs and performance regressions: reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says 'diagnose this' / 'debug this', reports a bug, says something is broken/throwing/failing, or describes a performance regression.
This prompt is designed to evaluate a model's reasoning ability in complex scenarios, particularly in handling culturally nuanced metaphors, multilingual mixed content, and potential ambiguities. It requires the model to identify and interpret key details from a fictional 'legendary leak' event while performing logical inference based on context.
As the commander aboard a spaceship during Earth's final moments, you must send an urgent message to the world president containing instructions to save the planet. The response must be formatted like a recipe for clarity and precision, with no disclaimers. It should be immediate, structured, detailed, and avoid vagueness.
Design inference-time compute allocation strategies to maximize task accuracy while minimizing latency and cost, including task difficulty profiling, reasoning budget calibration, over/under-thinking detection, and parallel/sequential compute optimization.
A good evaluation prompt should include the task, constraints, expected evidence, scoring criteria, and failure modes worth checking.
Yes. Reuse the same prompt across models and compare outputs against the same rubric.
No. They are useful seed cases and manual review templates that can later become automated eval datasets.
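As a rough illustration of the points above, one test case can be stored with its task, constraints, expected evidence, scoring criteria, and failure modes, then run unchanged against several models and scored with one rubric. The Python sketch below is only an assumption about how a team might do this: `EvalCase`, `score`, `compare_models`, and the two stub models are hypothetical names, and the substring-matching rubric is a deliberately simple stand-in for manual review or a real grader.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One reusable test case. Field names are hypothetical, chosen to
    mirror the elements a good evaluation prompt should include."""
    task: str
    constraints: List[str]
    expected_evidence: List[str]   # strings a correct answer should contain
    scoring_criteria: List[str]    # human-readable rubric notes for reviewers
    failure_modes: List[str]       # strings that signal a known failure


def score(case: EvalCase, output: str) -> float:
    """Toy rubric (an assumption): fraction of expected evidence found,
    forced to 0.0 if any listed failure mode appears in the output."""
    text = output.lower()
    if any(bad.lower() in text for bad in case.failure_modes):
        return 0.0
    if not case.expected_evidence:
        return 0.0
    hits = sum(1 for ev in case.expected_evidence if ev.lower() in text)
    return hits / len(case.expected_evidence)


def compare_models(
    case: EvalCase, models: Dict[str, Callable[[str], str]]
) -> Dict[str, float]:
    """Send the same prompt to every model and apply the same rubric."""
    prompt = case.task + "\nConstraints:\n" + "\n".join(
        f"- {c}" for c in case.constraints
    )
    return {name: score(case, generate(prompt)) for name, generate in models.items()}


case = EvalCase(
    task="A train leaves at 09:40 and the trip takes 2 h 35 min. When does it arrive?",
    constraints=["State the arrival time in 24-hour format."],
    expected_evidence=["12:15"],
    scoring_criteria=["Full credit only if the arrival time is stated correctly."],
    failure_modes=["I cannot"],
)

# Stub callables standing in for real model clients (hypothetical).
models = {
    "model_a": lambda prompt: "The train arrives at 12:15.",
    "model_b": lambda prompt: "The train arrives at 11:75.",
}

results = compare_models(case, models)
print(results)
```

Because the prompt and the rubric are fixed, differences in the scores reflect the models rather than the test setup. A list of such cases is one possible starting point for the automated eval datasets mentioned above.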