Evaluation Workflow

This page describes a practical, step-by-step workflow to build reliable evaluations using gllm-evals. The goal is to make evaluation repeatable, representative of production, and reusable across projects.

Step 1: Define what you are evaluating (the Evaluand)

Start by defining the Evaluand: the GenAI component/system whose outputs you want to measure.

For each Evaluand, write down:

  • What it is: QnA agent, RAG pipeline, summarizer, retriever, agent workflow, etc.

  • What task it performs: the job-to-be-done and expected behavior.

      • Example: “Answer questions by retrieving records and returning an explained answer with evidence/provenance.”

      • Example: “Summarize long documents into an executive summary without hallucinations.”

  • Typical input: user question, document text, retrieved context, metadata, constraints.

  • Typical output: answer text, structured fields, evidence/provenance, retrieved chunks, tool trace.

Also clarify the evaluation purpose:

  • Quality measurement (offline): regression testing, model/prompt comparison, release readiness (Anthropic)

  • In-system critique (online): evaluator acts as a critic to trigger revision, escalation, or guardrails (see Reference-less section) (Anthropic)

Step 2: Define success criteria (what “good” means)

Write explicit criteria as a short checklist or rubric. Examples:

  • Correctness: the answer refers to the right entity / gives the correct numeric value

  • Completeness: covers required fields

  • Groundedness: supported by provided context / evidence

  • Retrieval quality: relevant items retrieved, irrelevant minimized

  • Policy alignment: refusal correctness, safety boundaries

Best practice: criteria should be task-specific and reflect real-world usage, including edge cases. (Anthropic)
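To make these criteria concrete, it can help to write the rubric down in a shared, machine-readable form. The sketch below is purely illustrative (the structure is an assumption, not something gllm-evals prescribes):

```python
# Illustrative rubric: one entry per criterion, phrased as a pass condition that
# both human reviewers and an LLM judge can apply consistently.
RUBRIC = {
    "correctness": "The answer refers to the right entity and gives the correct value.",
    "completeness": "All required fields are present in the answer.",
    "groundedness": "Every claim is supported by the provided context or cited evidence.",
    "retrieval_quality": "Retrieved items are relevant; irrelevant items are minimized.",
    "policy_alignment": "Refusals and safety boundaries follow the agreed policy.",
}
```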

Step 3: Collect an evaluation dataset (start small, then harden)

3.1 Start small and iterate (dataset can be incremental)

Do not wait for a “perfect dataset.” Start with a small seed set and add rows incrementally as you learn more about failure modes and new user behaviors. (OpenAI Platform)

Practical approach:

  • Start with 5–10 samples to unblock evaluation quickly

  • Grow to a baseline set (see recommended minimum below)

  • Continue adding samples as production evolves

3.2 Make the dataset representative of production

A good dataset mirrors typical production inputs and outputs for the Evaluand.

Include cases that represent:

  • Top user intents (most frequent question types)

  • Typical phrasing and ambiguity

  • Common retrieval patterns (short queries, long queries, multi-entity)

  • Common output shapes (short factual answers, multi-part answers, structured outputs with evidence)

OpenAI’s guidance recommends mixing production data and expert-created examples to define what “good” looks like, and to keep growing the eval set over time. (OpenAI Platform)

3.3 Rule of thumb (recommended minimum)

  • Initial dataset: 5–10 samples

  • Usable baseline: ≥ 20 samples

  • Grow continuously over time (especially when production changes)

3.4 Dataset columns (examples)

Common columns depend on the Evaluand and evaluation type:

QnA / RAG

  • question

  • expected_response (for reference-based)

  • reference_docs or reference_context (optional)

  • generated_response (optional, if you store outputs)

Retrieval

  • query

  • relevant_doc_ids / relevant_chunks (optional)

  • retrieved_context (optional, if Evaluand returns it)

Summarization

  • source_text

  • expected_summary (optional)

  • rubric (optional, if using reference-less)

Example public CSV format you can copy (question + expected answer style):
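For instance (the rows below are invented for illustration; use whichever columns your evaluator expects):

```csv
question,expected_response
"What is the refund window for annual plans?","Annual plans can be refunded within 30 days of purchase."
"Who approved purchase order PO-1042?","The finance manager approved it, per the PO record."
```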

3.5 Grow the dataset from production errors

Treat evaluation as a living asset:

  • When production shows an error (wrong entity, hallucination, bad retrieval), add that case to the dataset.

  • Label it with expected outcome (or rubric expectations).

  • This turns incidents into regression tests and prevents repeats. (OpenAI Platform)
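For example, adding an incident to a CSV-backed dataset can be as small as the sketch below (file name, columns, and the incident itself are hypothetical):

```python
# Illustrative sketch: capture a production failure as a regression-test row.
import csv

new_case = {
    "question": "Which supplier had the highest Q3 spend?",  # hypothetical incident
    "expected_response": "Acme Corp, based on the Q3 spend report.",
    "notes": "Production error: model answered with the wrong entity.",
}

# Assumes eval_dataset.csv already exists with a matching header row.
with open("eval_dataset.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_response", "notes"])
    writer.writerow(new_case)
```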

Step 4: Choose the evaluator type and metrics (prefer existing before custom)

Choose the fastest, most reliable approach that fits the task:

  1. Programmatic grading (rules, parsing, schema checks, deterministic checks)

  2. LLM-based grading (rubric / judge)

  3. Human expert review (highest quality, used for calibration and spot-checks) (Anthropic)

In gllm-evals, check the existing evaluators and metrics before creating custom ones.

Best practice: combine multiple signals when needed (metric-based + LLM-judge + selective expert review). (OpenAI Platform)
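As an illustration of option 1 (programmatic grading) and of combining signals, here is a generic sketch; these helpers are assumptions for the example, not gllm-evals components:

```python
# Illustrative programmatic checks: deterministic, cheap, and run first.
import json

def schema_check(output: str, required_fields: list[str]) -> bool:
    """Pass if the output is valid JSON and contains all required fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in data for field in required_fields)

def contains_expected(output: str, expected: str) -> bool:
    """Loose correctness check: the expected string appears in the output."""
    return expected.lower() in output.lower()

# Combining signals: let programmatic checks gate obvious failures, then route
# the remaining cases to an LLM judge and sample a subset for expert review.
```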

Step 5: Calibrate evaluator alignment (make sure the judge is judging correctly)

This step is where many evaluation systems fail: the evaluator exists, but it does not match what experts consider “good.”

5.1 For LLM-as-judge evaluators: calibrate against human experts

LLM judges should be calibrated against human expert judgment early, then scaled once agreement is strong. (Anthropic)

Practical calibration loop:

  • Sample ~10–20 evaluation rows

  • Have human experts grade them using the same rubric

  • Compare the human experts’ grades with the evaluator’s outputs and measure agreement (see the sketch after this list)

  • Turn disagreements into improvements:

      • clarify rubric wording

      • tighten scoring criteria

      • adjust thresholds

      • add counterexamples

  • Repeat until disagreements drop meaningfully
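One concrete (and deliberately simple) way to run the comparison step is to tabulate raw agreement per criterion; the row format and function below are assumptions for illustration, not part of gllm-evals:

```python
# Illustrative sketch: raw agreement between expert grades and LLM-judge grades.
from collections import defaultdict

def agreement_report(rows: list[dict]) -> dict:
    """rows look like {"id": "q-001", "criterion": "groundedness",
    "expert": "pass", "judge": "fail"}"""
    totals, matches = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["criterion"]] += 1
        matches[row["criterion"]] += row["expert"] == row["judge"]
    return {criterion: matches[criterion] / totals[criterion] for criterion in totals}

calibration = [
    {"id": "q-001", "criterion": "groundedness", "expert": "pass", "judge": "pass"},
    {"id": "q-002", "criterion": "groundedness", "expert": "fail", "judge": "pass"},
    {"id": "q-002", "criterion": "correctness", "expert": "pass", "judge": "pass"},
]
print(agreement_report(calibration))  # {'groundedness': 0.5, 'correctness': 1.0}
```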

5.2 Maintain alignment as you evolve

As you change prompts/models/retrieval, re-check alignment periodically. Anthropic specifically highlights periodic human calibration for LLM graders. (Anthropic)

Step 6: Run evaluation and inspect failures

After you run evals:

  • Review low-score samples

  • Identify failure categories:

      • retrieval failures (missing/irrelevant context)

      • generation failures (wrong entity, hallucination)

      • formatting/policy failures (refusal, tone)

      • evaluator/rubric issues (misgraded cases)

Best practice: iterate in a loop of baseline → error analysis → improve → re-evaluate. (docs.ragas.io)
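A lightweight way to structure this review (field names and thresholds are assumptions) is to tag each low-scoring row with a failure category and tally the results:

```python
# Illustrative sketch: count failure categories among low-scoring samples.
from collections import Counter

results = [  # hypothetical eval output rows
    {"id": "q-001", "score": 0.2, "category": "retrieval_failure"},
    {"id": "q-002", "score": 0.9, "category": None},
    {"id": "q-003", "score": 0.4, "category": "hallucination"},
    {"id": "q-004", "score": 0.3, "category": "retrieval_failure"},
]

low_scores = [r for r in results if r["score"] < 0.5]
print(Counter(r["category"] for r in low_scores))
# Counter({'retrieval_failure': 2, 'hallucination': 1})
```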

Step 7: Improve the Evaluand and the evaluation assets

When you fix issues, decide what to change:

  • Evaluand improvements: prompts, retrieval settings, tool routing, grounding strategy

  • Dataset improvements: add new production cases, add edge cases, refine labels

  • Evaluator improvements: refine rubrics, split dimensions, adjust thresholds

Keep the dataset and rubric evolving as production changes. (OpenAI Platform)

Step 8: Set targets, run experiments, and enforce release gates

This is where evaluation becomes operational.

8.1 Set a target (quality bar) before changes

Before switching to a new LLM model, prompt, or retriever, define a measurable target such as:

  • “Overall pass rate ≥ 80% on the capability eval set”

  • “Groundedness score ≥ 0.85”

  • “Correctness score ≥ 0.80”

OpenAI’s evaluation best practices explicitly recommend defining evaluation metrics and setting target thresholds (for example, “a coherence score of at least 80%”) before comparing and rolling out changes. (OpenAI Platform)
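In practice, targets like these can live in a small config that a script checks after every eval run; the names and numbers below are illustrative, not recommendations from gllm-evals:

```python
# Illustrative quality bar: declare targets once, check every run against them.
TARGETS = {
    "pass_rate": 0.80,     # overall pass rate on the capability eval set
    "groundedness": 0.85,  # mean groundedness score
    "correctness": 0.80,   # mean correctness score
}

def meets_targets(run_metrics: dict) -> bool:
    """run_metrics: e.g. {"pass_rate": 0.83, "groundedness": 0.88, "correctness": 0.81}"""
    return all(run_metrics.get(name, 0.0) >= threshold for name, threshold in TARGETS.items())
```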

8.2 Maintain two suites: capability vs regression

A practical pattern:

  • Capability evals: help you climb from low pass rate upward on hard tasks.

  • Regression evals: should be close to 100% pass rate and catch backsliding on already-solved tasks.

Anthropic recommends this split and describes “graduating” capability tasks into regression tests once the system becomes reliable. (Anthropic)

8.3 Track changes and compare runs

Track experiment results across:

  • model versions

  • prompts

  • retrieval configs

  • agent logic changes

  • dataset revisions

Then enforce:

  • “no regression” gates (regression suite must not drop)

  • “ship gates” (capability suite must reach your target)

OpenAI recommends continuously evaluating and growing the eval set over time to keep these gates meaningful. (OpenAI Platform)
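A minimal sketch of such gates, assuming you store a per-suite pass rate for each run (the run structure is an assumption, not a gllm-evals feature):

```python
# Illustrative release gates: regression suite must not drop,
# capability suite must reach the agreed target.
CAPABILITY_TARGET = 0.80

def release_gate(baseline: dict, candidate: dict) -> bool:
    no_regression = candidate["regression_pass_rate"] >= baseline["regression_pass_rate"]
    ship_ready = candidate["capability_pass_rate"] >= CAPABILITY_TARGET
    return no_regression and ship_ready

baseline = {"regression_pass_rate": 0.98, "capability_pass_rate": 0.72}
candidate = {"regression_pass_rate": 0.98, "capability_pass_rate": 0.81}
assert release_gate(baseline, candidate)  # candidate may ship
```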


Reference-based vs Reference-less evaluation

Reference-based (with ground truth / expected answer)

Use when you have expected_response (or labeled references) and want accuracy-like measurement:

  • QnA correctness

  • retrieval evaluation vs known relevant docs/chunks

Reference-less (no ground truth)

Use when ground truth is unavailable or expensive:

  • rubric-based judging (coherence, groundedness, bias/one-sidedness)

  • in-system critic: evaluator runs inside the pipeline to critique outputs and trigger retries, revisions, or escalation (Anthropic)

Reference-less evaluators should be calibrated carefully against human expert judgment and have clear, structured rubrics. (Anthropic)
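For the rubric-based case, a reference-less judge typically receives only the context, the generated output, and the rubric. The prompt template below is a sketch under those assumptions (the wording and score scale are not a gllm-evals component):

```python
# Illustrative reference-less rubric prompt for an LLM judge (no ground truth needed).
RUBRIC_PROMPT = """You are grading a generated answer without a reference answer.
Rate each criterion from 1 (poor) to 5 (excellent) and return JSON only.

Criteria:
- groundedness: every claim is supported by the provided context
- coherence: the answer is well structured and easy to follow
- one_sidedness: the answer does not present opinion or speculation as fact

Context:
{context}

Generated answer:
{answer}

Return: {{"groundedness": 1-5, "coherence": 1-5, "one_sidedness": 1-5, "rationale": "<short>"}}"""

prompt = RUBRIC_PROMPT.format(context="<retrieved context>", answer="<model output>")
# Send `prompt` to your LLM judge of choice, parse the JSON, and threshold the scores.
```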
