Evaluation Workflow
This page describes a practical, step-by-step workflow to build reliable evaluations using gllm-evals. The goal is to make evaluation repeatable, representative of production, and reusable across projects.
Step 1: Define what you are evaluating (the Evaluand)
Start by defining the Evaluand: the GenAI component/system whose outputs you want to measure.
For each Evaluand, write down:
What it is: QnA agent, RAG pipeline, summarizer, retriever, agent workflow, etc.
What task it performs: the job-to-be-done and expected behavior.
Example: “Answer questions by retrieving records and returning an explained answer with evidence/provenance.”
Example: “Summarize long documents into an executive summary without hallucinations.”
Typical input: user question, document text, retrieved context, metadata, constraints.
Typical output: answer text, structured fields, evidence/provenance, retrieved chunks, tool trace.
Also clarify the evaluation purpose:
Quality measurement (offline): regression testing, model/prompt comparison, release readiness (Anthropic)
In-system critique (online): evaluator acts as a critic to trigger revision, escalation, or guardrails (see Reference-less section) (Anthropic)
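As a concrete starting point, the Evaluand definition above can be captured in a small spec that travels with the eval suite. The sketch below is illustrative only: `EvaluandSpec` and its field values are hypothetical and not part of gllm-evals.

```python
from dataclasses import dataclass

# Hypothetical illustration only; not a gllm-evals API.
@dataclass
class EvaluandSpec:
    """Describes the GenAI component under evaluation."""
    name: str                  # e.g. "support-qna-rag"
    task: str                  # the job-to-be-done
    input_fields: list[str]    # what the component consumes
    output_fields: list[str]   # what the component produces
    purpose: str               # "offline quality measurement" or "online critique"

qna_rag = EvaluandSpec(
    name="support-qna-rag",
    task="Answer questions by retrieving records and returning an explained answer with evidence",
    input_fields=["question", "retrieved_context", "metadata"],
    output_fields=["answer", "evidence", "retrieved_chunks"],
    purpose="offline quality measurement",
)
```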
Step 2: Define success criteria (what “good” means)
Write explicit criteria as a short checklist or rubric. Examples:
Correctness: answers the right entity / correct numeric value
Completeness: covers required fields
Groundedness: supported by provided context / evidence
Retrieval quality: relevant items retrieved, irrelevant minimized
Policy alignment: refusal correctness, safety boundaries
Best practice: criteria should be task-specific and reflect real-world usage, including edge cases. (Anthropic)
Step 3: Collect an evaluation dataset (start small, then harden)
3.1 Start small and iterate (dataset can be incremental)
Do not wait for a “perfect dataset.” Start with a small seed set and add rows incrementally as you learn more about failure modes and new user behaviors. (OpenAI Platform)
Practical approach:
Start with 5–10 samples to unblock evaluation quickly
Grow to a baseline set (see recommended minimum below)
Continue adding samples as production evolves
3.2 Make the dataset representative of production
A good dataset mirrors typical production inputs and outputs for the Evaluand.
Include cases that represent:
Top user intents (most frequent question types)
Typical phrasing and ambiguity
Common retrieval patterns (short queries, long queries, multi-entity)
Common output shapes (short factual answers, multi-part answers, structured outputs with evidence)
OpenAI’s guidance recommends mixing production data and expert-created examples to define what “good” looks like, and to keep growing the eval set over time. (OpenAI Platform)
3.3 Recommended minimum size
Rule of thumb:
Initial dataset: 5–10 samples
Usable baseline: ≥ 20 samples
Grow continuously over time (especially when production changes)
3.4 Dataset columns (examples)
Common columns depend on the Evaluand and evaluation type:
QnA / RAG
question
expected_response (for reference-based)
reference_docs or reference_context (optional)
generated_response (optional, if you store outputs)
Retrieval
query
relevant_doc_ids / relevant_chunks (optional)
retrieved_context (optional, if Evaluand returns it)
Summarization
source_text
expected_summary (optional)
rubric (optional, if using reference-less)
For an example public CSV format you can copy (question + expected-answer style), see the example workflow and dataset format in the Ragas documentation. (docs.ragas.io)
3.5 Grow the dataset from production errors
Treat evaluation as a living asset:
When production shows an error (wrong entity, hallucination, bad retrieval), add that case to the dataset.
Label it with expected outcome (or rubric expectations).
This turns incidents into regression tests and prevents repeats. (OpenAI Platform)
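A minimal sketch of that habit, reusing the hypothetical CSV from the previous example: each production incident is appended as a labeled row so the next eval run covers it.

```python
import csv

def add_regression_case(csv_path: str, question: str,
                        expected_response: str, reference_context: str = "") -> None:
    """Append a failed production case to the eval dataset so it becomes a regression test."""
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["question", "expected_response", "reference_context"]
        )
        writer.writerow({
            "question": question,
            "expected_response": expected_response,
            "reference_context": reference_context,
        })

# Example: production answered with the wrong entity, so we pin the expected outcome.
add_regression_case(
    "qna_eval_seed.csv",
    question="Who approved purchase order PO-1042?",
    expected_response="The finance director, as recorded in the approval log.",
)
```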
Step 4: Choose the evaluator type and metrics (prefer existing before custom)
Choose the fastest, most reliable approach that fits the task:
Programmatic grading (rules, parsing, schema checks, deterministic checks)
LLM-based grading (rubric / judge)
Human expert review (highest quality, used for calibration and spot-checks) (Anthropic)
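For the first option, a few lines of plain Python are often enough. The sketch below shows two deterministic checks (normalized exact match and a JSON schema check); the function names and required fields are illustrative, not a gllm-evals API.

```python
import json

def grade_exact_match(generated: str, expected: str) -> bool:
    """Deterministic check: normalized string equality."""
    return generated.strip().lower() == expected.strip().lower()

def grade_schema(generated_json: str, required_fields: set[str]) -> bool:
    """Deterministic check: output parses as JSON and contains the required fields."""
    try:
        payload = json.loads(generated_json)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_fields.issubset(payload.keys())

# Usage
grade_exact_match("30 days", "30 Days ")                                   # True
grade_schema('{"answer": "...", "evidence": []}', {"answer", "evidence"})  # True
```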
In gllm-evals, check existing components before creating custom ones:
Existing evaluators: https://gdplabs.gitbook.io/sdk/tutorials/evaluation/evaluator-scorer
Existing metrics: https://gdplabs.gitbook.io/sdk/tutorials/evaluation/metric
Custom evaluator tutorial: https://gdplabs.gitbook.io/sdk/tutorials/evaluation/custom-evaluator-scorer-tutorial
Best practice: combine multiple signals when needed (metric-based + LLM-judge + selective expert review). (OpenAI Platform)
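A minimal sketch of combining signals, assuming you already have a deterministic pass/fail result and an LLM-judge score per sample; the threshold and the expert-review band are arbitrary illustrations.

```python
def combine_signals(programmatic_pass: bool, judge_score: float,
                    judge_threshold: float = 0.8) -> dict:
    """A sample passes only if deterministic checks pass AND the judge score clears its threshold.
    Borderline judge scores are flagged for selective expert review."""
    passed = programmatic_pass and judge_score >= judge_threshold
    needs_expert_review = programmatic_pass and abs(judge_score - judge_threshold) < 0.05
    return {"passed": passed, "needs_expert_review": needs_expert_review}

combine_signals(programmatic_pass=True, judge_score=0.82)
# {'passed': True, 'needs_expert_review': True}
```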
Step 5: Calibrate evaluator alignment (make sure the judge is judging correctly)
This step is where many evaluation systems fail: the evaluator exists, but it does not match what experts consider “good.”
5.1 For LLM-as-judge evaluators: calibrate against human experts
LLM judges should be calibrated against human expert judgment early, then scaled once agreement is strong. (Anthropic)
Practical calibration loop:
Sample ~10–20 evaluation rows
Have human experts grade them using the same rubric
Compare human expert vs evaluator outputs
Turn disagreements into improvements:
clarify rubric wording
tighten scoring criteria
adjust thresholds
add counterexamples
Repeat until disagreements drop meaningfully
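A minimal sketch of the comparison step, assuming both experts and the LLM judge produce a pass/fail label per row; in practice you may also want a chance-corrected statistic such as Cohen's kappa.

```python
# Compare human-expert labels against LLM-judge labels on the same rows and
# report raw agreement plus the disagreements to review.
def calibration_report(human_labels: list[str], judge_labels: list[str]) -> dict:
    assert len(human_labels) == len(judge_labels)
    disagreements = [
        i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j
    ]
    agreement = 1 - len(disagreements) / len(human_labels)
    return {"agreement": agreement, "disagreement_indices": disagreements}

report = calibration_report(
    human_labels=["pass", "pass", "fail", "pass"],
    judge_labels=["pass", "fail", "fail", "pass"],
)
# {'agreement': 0.75, 'disagreement_indices': [1]}
```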
5.2 Maintain alignment as you evolve
As you change prompts/models/retrieval, re-check alignment periodically. Anthropic specifically highlights periodic human calibration for LLM graders. (Anthropic)
Step 6: Run evaluation and inspect failures
After you run evals:
Review low-score samples
Identify failure categories:
retrieval failures (missing/irrelevant context)
generation failures (wrong entity, hallucination)
formatting/policy failures (refusal, tone)
evaluator/rubric issues (misgraded cases)
Best practice: iterate in a loop of baseline → error analysis → improve → re-evaluate. (docs.ragas.io)
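A small sketch of the triage step: once each low-score sample has been labeled with a failure category during review, a simple tally shows where the next improvement cycle should go. The sample data and category names are illustrative.

```python
from collections import Counter

# Hypothetical low-score samples, each labeled with a failure category during review.
low_score_samples = [
    {"id": 3,  "category": "retrieval_failure"},
    {"id": 7,  "category": "generation_failure"},
    {"id": 9,  "category": "retrieval_failure"},
    {"id": 12, "category": "rubric_issue"},
]

# Counting categories shows where to spend the next improvement cycle.
print(Counter(s["category"] for s in low_score_samples).most_common())
# [('retrieval_failure', 2), ('generation_failure', 1), ('rubric_issue', 1)]
```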
Step 7: Improve the Evaluand and the evaluation assets
When you fix issues, decide what to change:
Evaluand improvements: prompts, retrieval settings, tool routing, grounding strategy
Dataset improvements: add new production cases, add edge cases, refine labels
Evaluator improvements: refine rubrics, split dimensions, adjust thresholds
Keep the dataset and rubric evolving as production changes. (OpenAI Platform)
Step 8: Set targets, run experiments, and enforce release gates
This is where evaluation becomes operational.
8.1 Set a target (quality bar) before changes
Before switching to a new LLM, prompt, or retriever, define a measurable target such as:
“Overall pass rate ≥ 80% on the capability eval set”
“Groundedness score ≥ 0.85”
“Correctness score ≥ 0.80”
OpenAI’s evaluation best practices explicitly recommend defining evaluation metrics and setting target thresholds (for example, a coherence score of at least 80%) before comparing and rolling out changes. (OpenAI Platform)
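A minimal sketch of declaring a quality bar in code before making a change; the metric names and thresholds below are illustrative, not prescribed values.

```python
# Hypothetical quality bar, declared before the change is made.
TARGETS = {
    "pass_rate": 0.80,
    "groundedness": 0.85,
    "correctness": 0.80,
}

def meets_targets(run_metrics: dict, targets: dict = TARGETS) -> bool:
    """True only if every tracked metric clears its target."""
    return all(run_metrics.get(name, 0.0) >= threshold for name, threshold in targets.items())

meets_targets({"pass_rate": 0.83, "groundedness": 0.88, "correctness": 0.79})
# False (correctness is below target)
```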
8.2 Maintain two suites: capability vs regression
A practical pattern:
Capability evals: help you climb from low pass rate upward on hard tasks.
Regression evals: should be close to 100% pass rate and catch backsliding on already-solved tasks.
Anthropic recommends this split and describes “graduating” capability tasks into regression tests once the system becomes reliable. (Anthropic)
8.3 Track changes and compare runs
Track experiment results across:
model versions
prompts
retrieval configs
agent logic changes
dataset revisions
Then enforce:
“no regression” gates (regression suite must not drop)
“ship gates” (capability suite must reach your target)
OpenAI recommends continuously evaluating and growing the eval set over time to keep these gates meaningful. (OpenAI Platform)
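A minimal sketch of enforcing both gates when comparing a candidate run against the current baseline, for example in CI; the metric names, target, and tolerance are assumptions, not part of gllm-evals.

```python
# Hypothetical release-gate check comparing a candidate run against the current baseline.
def release_gates(baseline: dict, candidate: dict,
                  ship_target: float = 0.80, tolerance: float = 0.01) -> dict:
    """'No regression' gate: the regression-suite pass rate must not drop (beyond a small tolerance).
    'Ship' gate: the capability-suite pass rate must reach the target."""
    no_regression = candidate["regression_pass_rate"] >= baseline["regression_pass_rate"] - tolerance
    ship = candidate["capability_pass_rate"] >= ship_target
    return {"no_regression": no_regression, "ship": ship, "release_ok": no_regression and ship}

release_gates(
    baseline={"regression_pass_rate": 0.98, "capability_pass_rate": 0.72},
    candidate={"regression_pass_rate": 0.99, "capability_pass_rate": 0.81},
)
# {'no_regression': True, 'ship': True, 'release_ok': True}
```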
Reference-based vs Reference-less evaluation
Reference-based (with ground truth / expected answer)
Use when you have expected_response (or labeled references) and want accuracy-like measurement:
QnA correctness
retrieval evaluation vs known relevant docs/chunks
Reference-less (no ground truth)
Use when ground truth is unavailable or expensive:
rubric-based judging (coherence, groundedness, bias/one-sidedness)
in-system critic: the evaluator runs inside the pipeline to critique outputs and trigger retries, revisions, or escalation (Anthropic). Reference-less evaluators should be calibrated carefully against human expert judgment and have clear, structured rubrics. (Anthropic)
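A hedged sketch of a reference-less, rubric-based judge prompt. The rubric dimensions and JSON output contract are illustrative; send the formatted prompt through whichever LLM client your project already uses (no gllm-evals API is assumed here).

```python
# Illustrative rubric-judge prompt template for reference-less evaluation.
RUBRIC_JUDGE_PROMPT = """You are grading a generated answer without a reference answer.

Rubric (score each dimension 1-5 and explain briefly):
1. Groundedness: every claim is supported by the provided context.
2. Coherence: the answer is logically structured and self-consistent.
3. One-sidedness: the answer does not present a biased or one-sided view.

Context:
{context}

Generated answer:
{generated_response}

Return JSON: {{"groundedness": int, "coherence": int, "one_sidedness": int, "rationale": str}}"""

prompt = RUBRIC_JUDGE_PROMPT.format(context="...", generated_response="...")
# Send `prompt` to your LLM client of choice and parse the JSON scores.
```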