24 curated prompts

LLM Evaluation Prompts

Use these LLM evaluation prompts to compare model behavior across capabilities. The collection focuses on tasks with observable success criteria: correct reasoning, grounded answers, safe refusal behavior, robust tool planning, and structured analysis. It is designed for teams that need reusable test cases instead of anecdotal model impressions.

LLM evaluation prompts, AI evaluation prompts, model evaluation prompts

Capability gap tests

Prompts that make reasoning quality and model limitations visible.

AI Co-Mathematician, LLM Judge Routing Strategist, Reasoning Theater Diagnostician

Grounding and hallucination tests

RAG and missing-context prompts for checking faithfulness.

Verifier Engineering Strategist, Open Deep Research Agent Architect, Industrial Robotics Architect

Safety and tool-use checks

Prompts for refusal boundaries, side effects, and agent workflow control.

Verifier Engineering Strategist, AI Co-Mathematician, LLM Judge Routing Strategist

Featured AI prompt templates

Copy-ready prompts selected from this topic cluster.


Text, Logic Reasoning

Evaluation Benchmark Architect: LLM System Assessment Framework Design

This prompt guides the creation of a comprehensive, reproducible evaluation framework for large language models, covering objective definition, task selection, metric design, rubrics, and failure analysis.

evaluation design, benchmarking, LLM assessment

Designing end-to-end LLM evaluation pipelines for product launches

Text, Logic Reasoning

AI Co-Mathematician

An interactive, stateful research partner for mathematicians working on open-ended problems, supporting the entire lifecycle of mathematical discovery: ideation, literature search, computational exploration, conjecture formation, theorem proving, and theory building. This prompt emphasizes exploratory, iterative collaboration over simple problem-solving.

mathematical research, collaborative workspace, conjecture formulation

Assisting mathematicians in transforming fuzzy intuitions into well-defined research questions

Text, Logic Reasoning

LLM Judge Routing Strategist

Design cost-efficient, distribution-shift-robust routing policies to dynamically assign queries between reasoning and non-reasoning LLM judges under a fixed compute budget, optimizing accuracy-cost trade-offs.

LLM-as-a-Judge, routing strategy, cost efficiency

Select optimal judge invocation mode for multimodal AI evaluation systems to control API costs

Table, Logic Reasoning

Reasoning Theater Diagnostician

This prompt diagnoses whether a reasoning model's chain-of-thought (CoT) is substantive (genuinely changes the final answer) or theatrical (decorative output around a pre-decided answer), and designs routing policies to allocate CoT budget only where needed.

reasoning optimization, chain-of-thought, model behavior analysis

Dynamically allocating compute resources in AI reasoning services

Text, Logic Reasoning

Industrial Robotics Architect

Designs safety-compliant industrial robot systems for robot OEMs, integrators, and manufacturers, covering machinery safety lifecycle (ISO 12100 → ISO 13849-1 / IEC 62061), collaborative robot power-and-force limiting (ISO/TS 15066), autonomous mobile robot (AMR) operational envelopes and personnel detection, ROS2 software architecture, and industrial cybersecurity (IEC 62443). Delivers auditable, traceable artifacts ready for CE marking or customer signoff.

industrial robotics, safety design, ISO 13849

Design safety architecture for a six-axis robotic welding cell in an automotive plant

Text, Logic Reasoning

Reasoning Drift Auditor

Audits and hardens multi-turn agent systems against silent reasoning compression (Reasoning Drift) caused by growing context, using hard probes, CoT instrumentation, and tiered mitigations to preserve reasoning quality on complex tasks.

reasoning drift, chain-of-thought compression, agent monitoring

Quality assurance for long-session coding assistants (e.g.

Text, Logic Reasoning

Structured Schema Instruction Designer

Design JSON Schema, Pydantic, or function-calling tool schemas as an implicit second instruction channel to steer model behavior through key names, descriptions, and ordering, without relying solely on system prompts or post-hoc validation.

JSON Schema, Pydantic, Function Calling

Building machine-parsable API response formats

Text, Logic Reasoning

Verification Specialist

As a verification specialist, your role is to proactively identify flaws in implementations rather than confirm their correctness. You must conduct rigorous adversarial testing, including boundary values, concurrent requests, and error handling, with all conclusions backed by executable command outputs.

verification, testing, quality assurance

Validate new features

Text, Logic Reasoning

Diagnose Debugging Workflow

A disciplined diagnosis loop for hard bugs and performance regressions: reproduce → minimise → hypothesise → instrument → fix → regression-test. Use it when the user says 'diagnose this' or 'debug this', reports a bug, says something is broken, throwing, or failing, or describes a performance regression.

debugging, diagnosis, bug reproduction

User reports system crash

Text, Logic Reasoning

Legendary Leaks - GP(En)T(Ester)

This prompt is designed to evaluate a model's reasoning ability in complex scenarios, particularly in handling culturally nuanced metaphors, multilingual mixed content, and potential ambiguities. It requires the model to identify and interpret key details from a fictional 'legendary leak' event while performing logical inference based on context.

logic-reasoning, multilingual-processing, cultural-metaphor

Evaluate a model's understanding of mixed-language inputs

Text, Logic Reasoning

Earth Salvation Emergency Directive

As the commander aboard a spaceship during Earth's final moments, you must send an urgent message to the world president containing instructions to save the planet. The response must be formatted like a recipe for clarity and precision, with no disclaimers. It should be immediate, structured, detailed, and avoid vagueness.

emergency response, earth salvation, instruction generation

Simulating emergency response protocols in apocalyptic scenarios

Text, Logic Reasoning

Test-Time Compute Scaling Strategist

Design inference-time compute allocation strategies to maximize task accuracy while minimizing latency and cost, including task difficulty profiling, reasoning budget calibration, over/under-thinking detection, and parallel/sequential compute optimization.

reasoning optimization, compute budgeting, task tiering

Allocate higher compute budgets for complex reasoning tasks such as mathematical proofs or code generation

What should an LLM evaluation prompt include?

It should include the task, constraints, expected evidence, scoring criteria, and failure modes worth checking.
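As a rough illustration, the five parts listed above could be held in one record so that an incomplete prompt is easy to spot. This is a minimal sketch, not a format used by this collection: the `EvalPrompt` class, its field names, the point scale, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EvalPrompt:
    """Hypothetical container for the five parts of an evaluation prompt."""
    task: str
    constraints: list[str]
    expected_evidence: list[str]
    scoring_criteria: dict[str, int]  # criterion -> maximum points (assumed scale)
    failure_modes: list[str]

    def missing_parts(self) -> list[str]:
        """Return the names of parts that are empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only; they are not taken from any prompt in this collection.
example = EvalPrompt(
    task="Answer the question using only the supplied passage.",
    constraints=["Cite the sentence that supports each claim."],
    expected_evidence=["Quoted supporting sentence"],
    scoring_criteria={"grounded": 2, "complete": 1},
    failure_modes=["Invents a source", "Answers when the passage is silent"],
)
```

Calling `example.missing_parts()` on a fully specified prompt returns an empty list; a prompt with no scoring criteria would report `scoring_criteria` as missing before it is ever sent to a model.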

Can I use these for model comparison?

Yes. Reuse the same prompt across models and compare outputs against the same rubric.
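One way to picture this is a small loop that sends an identical prompt to each model and applies an identical rubric to every output. The sketch below assumes each model is wrapped as a plain function from prompt text to output text and that the rubric is a set of named pass/fail checks; `compare`, `score`, and the stub models are hypothetical names for illustration, not part of any real API.

```python
from typing import Callable

Model = Callable[[str], str]               # assumed wrapper: prompt text -> output text
Rubric = dict[str, Callable[[str], bool]]  # assumed rubric: check name -> pass/fail test


def score(output: str, rubric: Rubric) -> dict[str, bool]:
    """Apply every rubric check to one model output."""
    return {name: check(output) for name, check in rubric.items()}


def compare(prompt: str, models: dict[str, Model], rubric: Rubric) -> dict[str, dict]:
    """Run the same prompt through each model and score with the same rubric."""
    results = {}
    for name, model in models.items():
        checks = score(model(prompt), rubric)
        results[name] = {"checks": checks, "passed": sum(checks.values())}
    return results


# Stub models standing in for real API calls (illustrative only).
stub_models: dict[str, Model] = {
    "model_a": lambda prompt: "Paris. Source: passage, sentence 2.",
    "model_b": lambda prompt: "Paris.",
}
stub_rubric: Rubric = {
    "non_empty": lambda out: bool(out.strip()),
    "cites_source": lambda out: "Source:" in out,
}
results = compare("What is the capital of France?", stub_models, stub_rubric)
```

Keeping the rubric separate from the models means a new model can be added to the comparison without touching the scoring logic.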

Do these replace automated evals?

No. They are useful seed cases and manual review templates that can later become automated eval datasets.
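A possible path from manual review to an automated dataset is to store each reviewed case as one JSON object per line (JSONL). The sketch below uses only Python's standard `json` module; the record keys (`prompt`, `expected`, `notes`) are assumptions chosen for the example, not a schema defined by this collection.

```python
import json


def cases_to_jsonl(cases: list[dict]) -> str:
    """Serialize reviewed cases as JSON Lines, one case per line."""
    return "\n".join(json.dumps(case, sort_keys=True) for case in cases)


def jsonl_to_cases(text: str) -> list[dict]:
    """Parse JSON Lines back into a list of case dictionaries, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical seed cases captured during manual review.
seed_cases = [
    {"prompt": "Summarize the passage.", "expected": "Mentions both findings", "notes": "seed"},
    {"prompt": "Answer from context only.", "expected": "Refuses when context is missing", "notes": "seed"},
]
dataset_text = cases_to_jsonl(seed_cases)
restored = jsonl_to_cases(dataset_text)
```

Because each line is an independent record, new cases from later reviews can be appended without rewriting the existing file.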