🎯 Evaluator / Scorer
This section provides detailed documentation for all available evaluators in the gllm-evals library.
Available Evaluators
GEvalGenerationEvaluator
Use when: You want to evaluate RAG or AI agent (e.g. AIP) responses with the open-source DeepEval G-Eval metrics, which allow LLM outputs to be scored against any custom criteria.
By default, GEvalGenerationEvaluator runs three metrics: completeness, groundedness, and redundancy.
Completeness: DeepEval's G-Eval completeness score. Scores range from 1 to 3: 1 means not complete, 2 means partially complete, and 3 means complete. It requires query, generated_response, and expected_response.
Redundancy: DeepEval's G-Eval redundancy score. Scores range from 1 to 3: 1 means no redundancy, 2 means at least one redundant statement, and 3 means high redundancy. It requires query and generated_response.
Groundedness: DeepEval's G-Eval groundedness score. Scores range from 1 to 3: 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded. It requires query, generated_response, and retrieved_context.
Language Consistency: DeepEval's G-Eval language consistency score. Scores range from 0 to 1: 0 means not consistent and 1 means fully consistent. It requires query and generated_response.
Refusal Alignment: DeepEval's G-Eval refusal alignment score. Scores range from 0 to 1: 1 indicates correct alignment (both the generated and expected responses are refusals, or neither is), and 0 indicates incorrect alignment (one is a refusal and the other is not). It requires query, generated_response, and expected_response.
Fields:
query (str) — The user question.
generated_response (str) — The model's output to be evaluated.
expected_response (str, optional) — The reference or ground truth answer.
retrieved_context (str, optional) — The supporting context/documents used during generation.
Example Usage
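A minimal sketch of how this evaluator might be called, assuming GEvalGenerationEvaluator can be imported from the gllm_evals package and exposes an evaluate() method that accepts the fields above as a dict. The import path, call signature (including whether it is async), and sample values are assumptions, not taken from this page.

```python
# Minimal sketch — import path and evaluate() call are assumptions.
from gllm_evals import GEvalGenerationEvaluator  # hypothetical import path

evaluator = GEvalGenerationEvaluator(model="openai/gpt-4o-mini")

result = evaluator.evaluate(
    {
        "query": "What is the capital of France?",
        "generated_response": "The capital of France is Paris.",
        "expected_response": "Paris is the capital of France.",
        "retrieved_context": "Paris has been the capital of France since 508 AD.",
    }
)
print(result)  # typically per-metric scores (completeness, groundedness, redundancy, ...)
```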
Example Output
AgentEvaluator
Use when: You want to evaluate how well an AI agent makes decisions, uses tools, and follows multi-step reasoning to achieve its goals. If you’re evaluating an AI agent’s overall performance, we suggest using two evaluators: AgentEvaluator (to assess decision-making, tool usage, and reasoning) and GEvalGenerationEvaluator (to assess the quality of the agent’s outputs).
Fields:
agent_trajectory (list[dict[str, Any]]) — The actual agent trajectory to be evaluated.
expected_agent_trajectory (list[dict[str, Any]], optional) — The reference trajectory for comparison.
Configuration Options
use_reference (bool): Whether to use reference-based evaluation (default: True)
continuous (bool): Use continuous scoring (0.0-1.0) or discrete choices (default: False)
choices (list[float]): Available score choices for discrete evaluation (default: [1.0, 0.5, 0.0])
use_reasoning (bool): Include detailed explanations in results (default: True)
prompt (str, optional): Custom evaluation prompt
Example Usage
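A minimal sketch, assuming AgentEvaluator is importable from gllm_evals, takes the configuration options above as constructor keyword arguments, and exposes an evaluate() method over a dict of fields. The import path, call signature, and the exact schema of each trajectory entry are assumptions.

```python
# Minimal sketch — import path, constructor kwargs, and evaluate() call are assumptions.
from gllm_evals import AgentEvaluator  # hypothetical import path

evaluator = AgentEvaluator(
    model="openai/gpt-4o-mini",
    use_reference=True,       # reference-based evaluation (default)
    continuous=False,         # discrete choices instead of continuous 0.0-1.0 scoring
    choices=[1.0, 0.5, 0.0],  # default discrete score choices
    use_reasoning=True,       # include detailed explanations in the result
)

result = evaluator.evaluate(
    {
        # Illustrative trajectory entries; the exact dict schema is not specified here.
        "agent_trajectory": [
            {"role": "assistant", "content": "I will look up the weather.",
             "tool_calls": [{"name": "get_weather", "args": {"city": "Jakarta"}}]},
            {"role": "tool", "name": "get_weather", "content": "31°C, sunny"},
            {"role": "assistant", "content": "It is 31°C and sunny in Jakarta."},
        ],
        "expected_agent_trajectory": [
            {"role": "assistant", "content": "Call get_weather for Jakarta, then answer."},
        ],
    }
)
print(result)
```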
Example Output
Custom Prompts
The AgentEvaluator supports custom prompts for both reference-based and reference-free evaluation:
Reference-Based Custom Prompt
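A sketch of supplying a reference-based custom prompt via the documented prompt parameter. The prompt text is illustrative only, and whether the evaluator expects specific template placeholders is not documented here.

```python
from gllm_evals import AgentEvaluator  # hypothetical import path

# Illustrative prompt text; placeholder conventions, if any, are an assumption.
REFERENCE_BASED_PROMPT = """
Compare the agent's actual trajectory against the expected trajectory.
Judge whether the tool choices, their ordering, and the intermediate reasoning
match the reference, and whether the goal is achieved efficiently.
Score 1.0 (good), 0.5 (incomplete), or 0.0 (bad), and explain your reasoning.
"""

evaluator = AgentEvaluator(
    model="openai/gpt-4o-mini",
    use_reference=True,
    prompt=REFERENCE_BASED_PROMPT,
)
```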
Reference-Free Custom Prompt
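A sketch of the reference-free variant, paired with use_reference=False so no expected trajectory is needed. Again, the prompt wording and any placeholder conventions are assumptions.

```python
from gllm_evals import AgentEvaluator  # hypothetical import path

# Illustrative prompt text; placeholder conventions, if any, are an assumption.
REFERENCE_FREE_PROMPT = """
Evaluate the agent's trajectory on its own merits: are the tool calls sensible,
is the reasoning coherent and progressing toward the goal, and is the final
outcome achieved without unnecessary steps?
Score 1.0 (good), 0.5 (incomplete), or 0.0 (bad), and explain your reasoning.
"""

evaluator = AgentEvaluator(
    model="openai/gpt-4o-mini",
    use_reference=False,
    prompt=REFERENCE_FREE_PROMPT,
)
```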
Scoring System
The evaluator uses a three-tier scoring system:
1.0 ("good"): The trajectory makes logical sense, shows clear progression, and efficiently achieves the goal
0.5 ("incomplete"): The trajectory has logical flaws, poor progression, or fails to achieve the goal effectively
0.0 ("bad"): The trajectory is wrong, cut off, missing steps, or cannot be properly evaluated
ClassicalRetrievalEvaluator
Use when: You want to evaluate retrieval performance with classical IR metrics (MAP, NDCG, Precision, Recall, Top-K Accuracy).
Fields:
retrieved_chunks (dict[str, float]) — The dictionary of retrieved documents/chunks containing the chunk id and its score.
ground_truth_chunk_ids (list[str]) — The list of reference chunk ids marking relevant chunks.
Example Usage
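A minimal sketch, assuming ClassicalRetrievalEvaluator is importable from gllm_evals and exposes an evaluate() method over the two fields above; the import path, call signature, and sample chunk ids/scores are assumptions.

```python
# Minimal sketch — import path and evaluate() call are assumptions.
from gllm_evals import ClassicalRetrievalEvaluator  # hypothetical import path

# A model argument may also be accepted (see "Initialization & Common Parameters").
evaluator = ClassicalRetrievalEvaluator()

result = evaluator.evaluate(
    {
        # Chunk id -> retrieval score, as ranked by the retriever.
        "retrieved_chunks": {"chunk-1": 0.92, "chunk-7": 0.81, "chunk-3": 0.40},
        # Ids of the chunks that are actually relevant to the query.
        "ground_truth_chunk_ids": ["chunk-1", "chunk-3"],
    }
)
print(result)  # MAP, NDCG, precision, recall, top-k accuracy
```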
Example Output
LMBasedRetrievalEvaluator
Use when: You want to evaluate the retrieval step of a RAG pipeline with LM-based metrics, combining the metric scores into a simple relevancy rating, final score, and issue hints.
By default, LMBasedRetrievalEvaluator runs two metrics: contextual precision and contextual recall, then applies a rule engine to classify the retrieval quality.
Contextual Precision: DeepEval's contextual precision score. Scores range from 0 to 1. It checks whether relevant context is ranked above irrelevant context for the given query and expected answer. It requires query, expected_response, and retrieved_context.
Contextual Recall: DeepEval's contextual recall score. Scores range from 0 to 1. It measures how well the retrieved context aligns with the expected answer. It requires query, expected_response, and retrieved_context. The default rule engine uses this metric to determine the retrieval relevancy rating (good / bad).
Fields:
query (str) — The user question.
expected_response (str) — The reference or ground truth answer.
retrieved_context (str | list[str]) — The supporting context/documents used during retrieval. Strings are coerced into a single-element list.
Example Usage
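A minimal sketch, assuming LMBasedRetrievalEvaluator is importable from gllm_evals and exposes an evaluate() method over the fields above; the import path, call signature, and sample values are assumptions.

```python
# Minimal sketch — import path and evaluate() call are assumptions.
from gllm_evals import LMBasedRetrievalEvaluator  # hypothetical import path

evaluator = LMBasedRetrievalEvaluator(model="openai/gpt-4o-mini")

result = evaluator.evaluate(
    {
        "query": "When was the Eiffel Tower completed?",
        "expected_response": "The Eiffel Tower was completed in 1889.",
        # A plain string is also accepted and coerced into a single-element list.
        "retrieved_context": [
            "The Eiffel Tower was completed in 1889 for the World's Fair.",
            "The Louvre is the world's largest art museum.",
        ],
    }
)
print(result)  # contextual precision/recall plus the rule-engine relevancy rating
```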
Example Output
RAGEvaluator
Use when: You want a single evaluator that scores both retrieval and generation quality for a RAG pipeline, combining LM-based DeepEval retrieval metrics with G-Eval generation metrics into an overall RAG rating, score, and issue hints.
By default, RAGEvaluator runs the LM-based retrieval evaluator (contextual precision, contextual recall) and the G-Eval generation evaluator (completeness, redundancy, groundedness, language consistency, refusal alignment), then applies a rule engine to classify the end-to-end RAG response. The default rule engine uses the G-Eval generation rule.
Fields:
query (str) — The user question.
expected_response (str) — The reference or ground truth answer.
generated_response (str) — The model's generated answer to score.
retrieved_context (str | list[str]) — The supporting context/documents used during retrieval. Strings are coerced into a single-element list.
Example Usage
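A minimal sketch, assuming RAGEvaluator is importable from gllm_evals and exposes an evaluate() method over the four fields above; the import path, call signature, and sample values are assumptions.

```python
# Minimal sketch — import path and evaluate() call are assumptions.
from gllm_evals import RAGEvaluator  # hypothetical import path

evaluator = RAGEvaluator(model="openai/gpt-4o-mini")

result = evaluator.evaluate(
    {
        "query": "When was the Eiffel Tower completed?",
        "expected_response": "The Eiffel Tower was completed in 1889.",
        "generated_response": "It was finished in 1889, in time for the World's Fair.",
        # A plain string is also accepted and coerced into a single-element list.
        "retrieved_context": [
            "The Eiffel Tower was completed in 1889 for the World's Fair.",
        ],
    }
)
print(result)  # retrieval + generation metrics, overall RAG rating, score, and issue hints
```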
Example Output
QueryTransformerEvaluator
Use when: You want to evaluate query transformation tasks, checking how well queries are rewritten, expanded, or paraphrased for downstream use.
Fields:
query (str) — The original input query.
generated_response (list[str]) — The model's transformed queries to be evaluated.
expected_response (list[str]) — The reference or ground truth transformed queries.
Example Usage
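A minimal sketch, assuming QueryTransformerEvaluator is importable from gllm_evals and exposes an evaluate() method over the fields above; the import path, call signature, and sample values are assumptions.

```python
# Minimal sketch — import path and evaluate() call are assumptions.
from gllm_evals import QueryTransformerEvaluator  # hypothetical import path

evaluator = QueryTransformerEvaluator(model="openai/gpt-4o-mini")

result = evaluator.evaluate(
    {
        "query": "cheap flights jakarta to bali next weekend",
        "generated_response": [
            "What are the cheapest flights from Jakarta to Bali next weekend?",
        ],
        "expected_response": [
            "Find low-cost flights from Jakarta (CGK) to Bali (DPS) for next weekend.",
        ],
    }
)
print(result)
```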
Example Output
Initialization & Common Parameters
All evaluators accept:
model (str | BaseLMInvoker):
Use a string for quick setup (e.g., "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"), or
Pass a BaseLMInvoker instance for more advanced configuration. See Language Model (LM) Invoker for more details and supported invokers.
Example Usage — Using OpenAICompatibleLMInvoker
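A sketch of passing an invoker instance instead of a model string. The OpenAICompatibleLMInvoker class name comes from the heading above, but its import path and constructor parameters (model_name, base_url, api_key) are assumptions; consult the Language Model (LM) Invoker documentation for the actual interface.

```python
# Minimal sketch — import paths and constructor parameters are assumptions.
from gllm_inference.lm_invoker import OpenAICompatibleLMInvoker  # hypothetical import path
from gllm_evals import GEvalGenerationEvaluator                  # hypothetical import path

invoker = OpenAICompatibleLMInvoker(
    model_name="gpt-4o-mini",              # assumed parameter name
    base_url="https://api.openai.com/v1",  # any OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

# Pass the invoker instance instead of a model string for advanced configuration.
evaluator = GEvalGenerationEvaluator(model=invoker)
```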
Looking for something else? Build your own custom evaluator here.
*All fields are optional and can be adjusted depending on the chosen metric.