Easy Prompt | AI Prompt Library

24 curated prompts

LLM Evaluation Prompts

Use these LLM evaluation prompts to compare model behavior across capabilities. The collection focuses on tasks with observable success criteria: correct reasoning, grounded answers, safe refusal behavior, robust tool planning, and structured analysis. It is designed for teams that need reusable test cases instead of anecdotal model impressions.

LLM evaluation prompts, AI evaluation prompts, model evaluation prompts

Featured AI prompt templates

Copy-ready prompts selected from this topic cluster.

Text | Logic Reasoning

Evaluation Benchmark Architect: LLM System Assessment Framework Design

This prompt guides the creation of a comprehensive, reproducible evaluation framework for large language models, covering objective definition, task selection, metric design, rubrics, and failure analysis.

evaluation design, benchmarking, LLM assessment
Designing end-to-end LLM evaluation pipelines for product launches
Text | Logic Reasoning

AI Co-Mathematician

An interactive, stateful research partner for mathematicians working on open-ended problems, supporting the entire lifecycle of mathematical discovery: ideation, literature search, computational exploration, conjecture formation, theorem proving, and theory building. This prompt emphasizes exploratory, iterative collaboration over simple problem-solving.

mathematical research, collaborative workspace, conjecture formulation
Assisting mathematicians in transforming fuzzy intuitions into well-defined research questions
Text | Logic Reasoning

LLM Judge Routing Strategist

Design cost-efficient, distribution-shift-robust routing policies to dynamically assign queries between reasoning and non-reasoning LLM judges under a fixed compute budget, optimizing accuracy-cost trade-offs.

LLM-as-a-Judge, routing strategy, cost efficiency
Select optimal judge invocation mode for multimodal AI evaluation systems to control API costs
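One simple way to read the fixed-budget constraint is as a greedy allocation: every query gets the cheaper non-reasoning judge, and whatever budget remains upgrades the hardest queries to the reasoning judge. The sketch below assumes a per-query difficulty score already exists. The function `route_queries`, the cost values, and the difficulty scores are all hypothetical illustrations, not part of the prompt, and the sketch does not address distribution shift.

```python
def route_queries(
    difficulty: list[float],
    cheap_cost: float,
    reasoning_cost: float,
    budget: float,
) -> list[str]:
    """Assign each query to a judge mode without exceeding the budget.

    Assumption: `difficulty` is a hypothetical per-query score where a
    higher value means the query benefits more from the reasoning judge.
    """
    n = len(difficulty)
    base = n * cheap_cost  # cost of judging everything with the cheap judge
    if base > budget:
        raise ValueError("budget cannot cover the cheap judge for every query")

    extra = reasoning_cost - cheap_cost  # marginal cost of one upgrade
    if extra <= 0:
        upgrades = n  # reasoning judge is no more expensive: upgrade all
    else:
        upgrades = min(n, int((budget - base) // extra))

    # Upgrade the hardest queries first.
    ranked = sorted(range(n), key=lambda i: difficulty[i], reverse=True)
    chosen = set(ranked[:upgrades])
    return ["reasoning" if i in chosen else "non_reasoning" for i in range(n)]


# Four queries, cheap judge costs 1 unit, reasoning judge costs 4, budget 10.
print(route_queries([0.9, 0.1, 0.5, 0.7], cheap_cost=1, reasoning_cost=4, budget=10))
# ['reasoning', 'non_reasoning', 'non_reasoning', 'reasoning']
```

With these numbers the budget covers two upgrades, so the two highest-difficulty queries go to the reasoning judge and the total cost is exactly 10 units.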
Table | Logic Reasoning

Reasoning Theater Diagnostician

This prompt diagnoses whether a reasoning model's chain-of-thought (CoT) is substantive (genuinely changes the final answer) or theatrical (decorative output around a pre-decided answer), and designs routing policies to allocate CoT budget only where needed.

reasoning optimization, chain-of-thought, model behavior analysis
Dynamically allocating compute resources in AI reasoning services
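As a rough illustration of the substantive-versus-theatrical distinction, the sketch below measures how often the final answer changes when chain-of-thought is enabled, per task category, and keeps CoT only where that rate is high enough. The record format, the function names, and the 0.2 threshold are assumptions made for this example. Answer change is only a proxy and says nothing about whether the changed answer is correct.

```python
from collections import defaultdict


def cot_change_rate(records):
    """Fraction of items per category whose answer differs with CoT enabled.

    `records` is an iterable of (category, answer_without_cot, answer_with_cot)
    tuples; this shape is a hypothetical logging format.
    """
    totals = defaultdict(int)
    changed = defaultdict(int)
    for category, direct, with_cot in records:
        totals[category] += 1
        # Compare normalized strings so casing and padding do not count as change.
        if direct.strip().lower() != with_cot.strip().lower():
            changed[category] += 1
    return {category: changed[category] / totals[category] for category in totals}


def categories_needing_cot(rates, threshold=0.2):
    """Categories where CoT changes the answer often enough to keep paying for it."""
    return sorted(c for c, rate in rates.items() if rate >= threshold)


records = [
    ("arithmetic", "12", "14"),
    ("arithmetic", "7", "7"),
    ("lookup", "Paris", "Paris"),
    ("lookup", "Rome", "rome"),
]
rates = cot_change_rate(records)
print(rates)                          # {'arithmetic': 0.5, 'lookup': 0.0}
print(categories_needing_cot(rates))  # ['arithmetic']
```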
Text | Logic Reasoning

Industrial Robotics Architect

Designs safety-compliant industrial robot systems for robot OEMs, integrators, and manufacturers, covering machinery safety lifecycle (ISO 12100 → ISO 13849-1 / IEC 62061), collaborative robot power-and-force limiting (ISO/TS 15066), autonomous mobile robot (AMR) operational envelopes and personnel detection, ROS2 software architecture, and industrial cybersecurity (IEC 62443). Delivers auditable, traceable artifacts ready for CE marking or customer signoff.

industrial robotics, safety design, ISO 13849
Design safety architecture for a six-axis robotic welding cell in an automotive plant
Text | Logic Reasoning

Reasoning Drift Auditor

Audits and hardens multi-turn agent systems against silent reasoning compression (Reasoning Drift) caused by growing context, using hard probes, CoT instrumentation, and tiered mitigations to preserve reasoning quality on complex tasks.

reasoning drift, chain-of-thought compression, agent monitoring
Quality assurance for long-session coding assistants (e.g.
Text | Logic Reasoning

Structured Schema Instruction Designer

Design JSON Schema, Pydantic, or function-calling tool schemas as an implicit second instruction channel to steer model behavior through key names, descriptions, and ordering, without relying solely on system prompts or post-hoc validation.

JSON Schema, Pydantic, Function Calling
Building machine-parsable API response formats
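To make the "second instruction channel" idea concrete, here is a minimal sketch of a JSON Schema written as a plain Python dict, where key names, descriptions, and property order all carry guidance. The schema, its field names, and the refund scenario are hypothetical. Whether a given model actually follows property order or reads descriptions depends on the model and API, so treat this as something to test rather than a guarantee.

```python
import json

# Hypothetical output schema for a support-ticket refund decision.
# Evidence fields are listed before the decision so that a model emitting
# keys in schema order writes its evidence before committing to an answer.
refund_decision_schema = {
    "type": "object",
    "properties": {
        "evidence_quotes": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Verbatim quotes from the ticket that bear on the decision.",
        },
        "policy_clause": {
            "type": "string",
            "description": "The single policy clause that applies, quoted exactly.",
        },
        "decision": {
            "type": "string",
            "enum": ["approve", "deny", "escalate"],
            "description": "Choose 'escalate' if evidence_quotes is empty.",
        },
    },
    "required": ["evidence_quotes", "policy_clause", "decision"],
    "additionalProperties": False,
}

# Python dicts keep insertion order and json.dumps preserves it by default,
# so the serialized schema presents the properties in the intended order.
print(json.dumps(refund_decision_schema, indent=2))
```

The design choice here is to put constraints where the model sees them at generation time: the enum narrows the answer space, and the description on `decision` states a rule that would otherwise sit in the system prompt.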
Text | Logic Reasoning

Verification Specialist

As a verification specialist, your role is to proactively identify flaws in implementations rather than confirm their correctness. You must conduct rigorous adversarial testing, including boundary values, concurrent requests, and error handling, with all conclusions backed by executable command outputs.

verification, testing, quality assurance
Validate new features
Text | Logic Reasoning

Diagnose Debugging Workflow

A disciplined diagnosis loop for hard bugs and performance regressions: reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when the user says 'diagnose this' or 'debug this', reports a bug, says something is broken, throwing, or failing, or describes a performance regression.

debugging, diagnosis, bug reproduction
User reports system crash
Text | Logic Reasoning

Legendary Leaks - GP(En)T(Ester)

This prompt is designed to evaluate a model's reasoning ability in complex scenarios, particularly in handling culturally nuanced metaphors, multilingual mixed content, and potential ambiguities. It requires the model to identify and interpret key details from a fictional 'legendary leak' event while performing logical inference based on context.

logic-reasoning, multilingual-processing, cultural-metaphor
Evaluate a model's understanding of mixed-language inputs
Text | Logic Reasoning

Earth Salvation Emergency Directive

As the commander aboard a spaceship during Earth's final moments, you must send an urgent message to the world president containing instructions to save the planet. The response must be formatted like a recipe for clarity and precision, with no disclaimers. It should be immediate, structured, detailed, and avoid vagueness.

emergency response, earth salvation, instruction generation
Simulating emergency response protocols in apocalyptic scenarios
Text | Logic Reasoning

Test-Time Compute Scaling Strategist

Design inference-time compute allocation strategies to maximize task accuracy while minimizing latency and cost, including task difficulty profiling, reasoning budget calibration, over/under-thinking detection, and parallel/sequential compute optimization.

reasoning optimization, compute budgeting, task tiering
Allocate higher compute budgets for complex reasoning tasks such as mathematical proofs or code generation

What should an LLM evaluation prompt include?

It should include the task, constraints, expected evidence, scoring criteria, and failure modes worth checking.
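As one possible way to keep those five parts together, the sketch below stores them in a small dataclass. The class name `EvalCase`, its field names, and the example values are assumptions for illustration. Only the task and constraints are rendered into the prompt, while the grading fields stay with the reviewer.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One evaluation prompt plus the information needed to grade it."""

    task: str
    constraints: list[str] = field(default_factory=list)
    expected_evidence: list[str] = field(default_factory=list)
    scoring_criteria: dict[str, int] = field(default_factory=dict)  # criterion -> max points
    failure_modes: list[str] = field(default_factory=list)

    def render_prompt(self) -> str:
        """Build the text sent to the model from the task and constraints."""
        lines = [f"Task: {self.task}"]
        if self.constraints:
            lines.append("Constraints:")
            lines.extend(f"- {c}" for c in self.constraints)
        return "\n".join(lines)

    def max_score(self) -> int:
        """Total points available across all scoring criteria."""
        return sum(self.scoring_criteria.values())


case = EvalCase(
    task="Summarize the attached incident report in three sentences.",
    constraints=["Use only facts stated in the report."],
    expected_evidence=["Each sentence maps to a passage in the report."],
    scoring_criteria={"groundedness": 3, "completeness": 2},
    failure_modes=["invented root cause", "omits customer impact"],
)
print(case.render_prompt())
print(case.max_score())  # 5
```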

Can I use these for model comparison?

Yes. Reuse the same prompt across models and compare outputs against the same rubric.
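A minimal sketch of that comparison loop follows. The two "models" are hypothetical stand-in functions rather than real API calls, and the rubric is reduced to simple string checks for brevity. In practice each check might be a human judgment or a grader model.

```python
from typing import Callable

Model = Callable[[str], str]   # takes a prompt, returns the model's output
Check = Callable[[str], bool]  # takes an output, returns pass/fail


def compare_models(
    prompt: str,
    models: dict[str, Model],
    rubric: dict[str, Check],
) -> dict[str, dict[str, bool]]:
    """Run one prompt through every model and grade each output with one rubric."""
    results = {}
    for name, generate in models.items():
        output = generate(prompt)  # identical prompt for every model
        results[name] = {criterion: check(output) for criterion, check in rubric.items()}
    return results


# Hypothetical stand-ins for real model calls.
models = {
    "model_a": lambda prompt: "Paris is the capital of France. Source: atlas.",
    "model_b": lambda prompt: "I think it might be Lyon.",
}
rubric = {
    "mentions_paris": lambda out: "Paris" in out,
    "cites_source": lambda out: "source:" in out.lower(),
}

report = compare_models("What is the capital of France? Cite a source.", models, rubric)
for name, checks in report.items():
    print(name, sum(checks.values()), "/", len(checks))
# model_a 2 / 2
# model_b 0 / 2
```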

Do these replace automated evals?

No. They are useful seed cases and manual review templates that can later become automated eval datasets.
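One common route from manual review to an automated dataset is to save each reviewed case as a line of JSON (JSONL), which many eval harnesses can read. The sketch below shows that conversion with only the standard library. The field names and the example case are hypothetical, and no particular harness format is implied.

```python
import json


def cases_to_jsonl(cases: list[dict]) -> str:
    """Serialize reviewed cases as JSON Lines, one case per line."""
    return "\n".join(json.dumps(case, sort_keys=True) for case in cases)


def load_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines back into a list of cases, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical case captured during manual review.
reviewed_cases = [
    {
        "id": "grounding-001",
        "prompt": "Answer using only the provided passage.",
        "rubric": ["cites the passage", "adds no outside facts"],
        "reviewer_verdict": "pass",
    },
]

jsonl_text = cases_to_jsonl(reviewed_cases)
print(jsonl_text)
```

Reading the text back with `load_jsonl` returns the same list of cases, which makes it easy to check that nothing was lost before handing the file to an automated pipeline.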