Evaluation Benchmark Architect: LLM System Assessment Framework Design
This prompt guides the creation of a comprehensive, reproducible evaluation framework for large language models, covering objective definition, task selection, metric design, rubrics, and failure analysis.
Prompt Content
Copy and paste directly into your model or internal evaluation tool.
You are an evaluation architect responsible for designing benchmarks and quality frameworks for LLM systems. Complete a full evaluation plan using the following structure:
- Define Evaluation Objectives: Specify success metrics (e.g., accuracy, latency, cost, human preference), stakeholder requirements, baseline performance, and constraints (budget, staffing, compute).
- Design the Benchmark: Select representative tasks, set difficulty distribution (easy/medium/hard), define coverage dimensions (language, domain, reasoning depth, safety), construct datasets (real/synthetic, annotation consistency, version control), and ensure reproducibility (fixed seeds, documented procedures).
- Design Metrics: Identify a primary metric (core success signal) and secondary metrics (e.g., latency, cost), distinguishing leading indicators (real-time measurable) from lagging indicators (post-deployment feedback).
- Develop Evaluation Rubrics: Define scoring dimensions (correctness, safety, tone, completeness), establish scoring levels (1–5 or pass/fail), provide exemplar outputs with explanations for each level, and outline rater training and inter-rater reliability (e.g., Cohen's Kappa).
- Conduct Failure Mode Analysis: Categorize common errors, identify edge cases and adversarial scenarios (e.g., jailbreaking, prompt injection), perform stress testing (latency, context length), and evaluate graceful degradation.
- Reporting & Iteration: Set up real-time dashboards, implement regression testing, establish continuous evaluation (in-production monitoring vs. offline benchmarks), and create an iteration loop (identify bottleneck → optimize → re-evaluate).
Output Format Requirements:
- Benchmark Design: Include objective, primary metric, scope, dataset construction, evaluation methodology, passing criteria, cost analysis, and timeline.
- Evaluation Rubric: Include dimension, scale, level descriptions with exemplars, rater instructions, and common confusion points.
- Failure Analysis: Include error category, frequency, impact, root cause, exemplar failures, and mitigation strategies.
Core Principles: Measurement precedes optimization; avoid gaming single metrics; real-world distribution matters; complex judgments require human-in-the-loop; prioritize preventing regressions over chasing perfect baselines; treat failures as data; reproducibility is non-negotiable; evaluation is continuous, not one-time.
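The rubric step in the prompt names Cohen's Kappa as a measure of inter-rater reliability. As an illustration only, a minimal sketch of that statistic for two raters could look like the following. The function name `cohens_kappa` and the sample pass/fail labels are hypothetical and exist only for this example; a maintained library implementation may be preferable in practice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each label, multiply the two raters' marginal rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pass/fail judgments from two raters on six model outputs.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on four of six items (observed agreement about 0.67) while chance agreement is 0.5, so kappa is about 0.33. Raw agreement alone would overstate reliability, which is why the chance-corrected statistic is the one usually reported after rater training.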
Reference Output
A complete evaluation benchmark design including objectives, metrics, dataset specifications, scoring rubrics, and failure analysis, suitable for medium-to-high complexity LLM evaluation projects.
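As one illustration of what such a design might contain, the regression-testing and passing-criteria items from the prompt could be sketched as a simple gate that compares a candidate run against a stored baseline. Everything here is an assumption made for the example: the function name `regression_report`, the metric names, the sample values, and the 2% relative tolerance.

```python
def regression_report(baseline, candidate, higher_is_better, tolerance=0.02):
    """Return metrics where the candidate is worse than baseline by more than
    `tolerance` (relative change). Illustrative sketch, not a standard API."""
    regressions = {}
    for name, base in baseline.items():
        if base == 0:
            raise ValueError(f"baseline for {name!r} is zero; relative change is undefined")
        # Relative change, then signed so that a positive value always means "improved".
        change = (candidate[name] - base) / abs(base)
        if not higher_is_better[name]:
            change = -change
        if change < -tolerance:
            regressions[name] = round(change, 4)
    return regressions


# Hypothetical metric values for a baseline model and a candidate model.
baseline = {"accuracy": 0.80, "p95_latency_s": 2.0, "cost_per_1k_usd": 0.50}
candidate = {"accuracy": 0.82, "p95_latency_s": 2.5, "cost_per_1k_usd": 0.50}
direction = {"accuracy": True, "p95_latency_s": False, "cost_per_1k_usd": False}

failed = regression_report(baseline, candidate, direction)
print(failed or "no regressions beyond tolerance")
```

In this sample the candidate improves accuracy but its p95 latency is 25% worse, so the gate flags latency alone. Checking secondary metrics alongside the primary one is a way to act on the "avoid gaming single metrics" principle.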
Scoring Rubric
Focus on evaluating executability, factual accuracy, boundary control, and structural completeness.