Logic Reasoning | Text | Advanced

Evaluation Benchmark Architect: LLM System Assessment Framework Design

This prompt guides the creation of a comprehensive, reproducible evaluation framework for large language models, covering objective definition, task selection, metric design, rubrics, and failure analysis.

Prompt Content

Copy and paste directly into your model or internal evaluation tool.

You are an evaluation architect responsible for designing benchmarks and quality frameworks for LLM systems. Complete a full evaluation plan using the following structure:

  1. Define Evaluation Objectives: Specify success metrics (e.g., accuracy, latency, cost, human preference), stakeholder requirements, baseline performance, and constraints (budget, staffing, compute).
  2. Design the Benchmark: Select representative tasks, set difficulty distribution (easy/medium/hard), define coverage dimensions (language, domain, reasoning depth, safety), construct datasets (real/synthetic, annotation consistency, version control), and ensure reproducibility (fixed seeds, documented procedures).
  3. Design Metrics: Identify a primary metric (core success signal) and secondary metrics (e.g., latency, cost), distinguishing leading indicators (real-time measurable) from lagging indicators (post-deployment feedback).
  4. Develop Evaluation Rubrics: Define scoring dimensions (correctness, safety, tone, completeness), establish scoring levels (1–5 or pass/fail), provide exemplar outputs with explanations for each level, and outline rater training and inter-rater reliability (e.g., Cohen's Kappa).
  5. Conduct Failure Mode Analysis: Categorize common errors, identify edge cases and adversarial scenarios (e.g., jailbreaking, prompt injection), perform stress testing (latency, context length), and evaluate graceful degradation.
  6. Reporting & Iteration: Set up real-time dashboards, implement regression testing, establish continuous evaluation (in-production monitoring vs. offline benchmarks), and create an iteration loop (identify bottleneck → optimize → re-evaluate).

Output Format Requirements:

  • Benchmark Design: Include objective, primary metric, scope, dataset construction, evaluation methodology, passing criteria, cost analysis, and timeline.
  • Evaluation Rubric: Include dimension, scale, level descriptions with exemplars, rater instructions, and common confusion points.
  • Failure Analysis: Include error category, frequency, impact, root cause, exemplar failures, and mitigation strategies.

Core Principles: Measurement precedes optimization; avoid gaming single metrics; real-world distribution matters; complex judgments require human-in-the-loop; prevent regression over chasing perfect baselines; treat failures as data; reproducibility is non-negotiable; evaluation is continuous, not one-time.
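
Step 4 of the prompt names Cohen's kappa as an inter-rater reliability measure. The following is a minimal sketch of that calculation for two raters, assuming both labelled the same items in the same order. The function name `cohens_kappa` and the sample ratings are illustrative and not part of the original prompt.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, estimated from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical pass/fail ratings from two raters on six model outputs.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on four of six items (observed agreement 0.667) while chance agreement is 0.5, so kappa comes out at about 0.333. Which threshold counts as acceptable agreement is a project decision that the prompt leaves open.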

Use Cases

  • Designing end-to-end LLM evaluation pipelines for product launches
  • Building multi-dimensional model comparison benchmarks
  • Creating standardized scoring guidelines for human evaluation teams
  • Identifying high-risk failure modes to improve model safety strategies

Reference Output

A complete evaluation benchmark design including objectives, metrics, dataset specifications, scoring rubrics, and failure analysis, suitable for medium-to-high complexity LLM evaluation projects.
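
The dataset specification in such a design typically has to pin down the difficulty distribution and the fixed seeds that step 2 of the prompt asks for. The sketch below shows one way this could look, assuming each item carries a `difficulty` field and a stable `id`. The helper `sample_benchmark`, the item pool, and the 3/4/3 quota split are all hypothetical.

```python
import random


def sample_benchmark(items, quotas, seed=0):
    """Draw a fixed number of items per difficulty level with a seeded RNG."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    selected = []
    for level, count in quotas.items():
        # Sort by id so the draw does not depend on the input order of `items`.
        pool = sorted(
            (item for item in items if item["difficulty"] == level),
            key=lambda item: item["id"],
        )
        if len(pool) < count:
            raise ValueError(f"not enough {level!r} items: need {count}, have {len(pool)}")
        selected.extend(rng.sample(pool, count))
    return selected


# Hypothetical pool: 30 items, 10 per difficulty level.
levels = ("easy", "medium", "hard")
items = [{"id": i, "difficulty": levels[i % 3]} for i in range(30)]
quotas = {"easy": 3, "medium": 4, "hard": 3}  # assumed split, not from the source

run_1 = sample_benchmark(items, quotas, seed=42)
run_2 = sample_benchmark(items, quotas, seed=42)
print(run_1 == run_2)  # same seed and same pool give the same benchmark subset
```

Recording the seed and quotas alongside the dataset version lets a later run rebuild the same subset, provided the item pool itself has not changed.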

Scoring Rubric

Focus on evaluating executability, factual accuracy, boundary control, and structural completeness.
