🎯 Evaluator / Scorer

This section provides detailed documentation for all available evaluators in the gllm-evals library.

GEvalGenerationEvaluator

Use when: You want to evaluate RAG or agentic AI (e.g. AIP) responses with DeepEval's open-source G-Eval metrics, which allow LLM outputs to be scored against any custom criteria.

By default, GEvalGenerationEvaluator runs three metrics: completeness, groundedness, and redundancy. Two additional metrics, language consistency and refusal alignment, are also available.

  1. Completeness: DeepEval's G-Eval completeness score. The score ranges from 1 to 3: 1 means not complete, 2 means partially complete, and 3 means fully complete. It needs query, generated_response, and expected_response to work.

  2. Redundancy: DeepEval's G-Eval redundancy score. The score ranges from 1 to 3: 1 means no redundancy, 2 means at least one redundant statement, and 3 means high redundancy. It needs query and generated_response to work.

  3. Groundedness: DeepEval's G-Eval groundedness score. The score ranges from 1 to 3: 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded. It needs query, generated_response, and retrieved_context to work.

  4. Language Consistency: DeepEval's G-Eval language consistency score. The score ranges from 0 to 1: 0 means not consistent and 1 means fully consistent. It needs query and generated_response to work.

  5. Refusal Alignment: DeepEval's G-Eval refusal alignment score. The score ranges from 0 to 1: 1 indicates correct alignment (both the generated and expected responses are refusals, or neither is), and 0 indicates incorrect alignment (one is a refusal and the other is not). It needs query, generated_response, and expected_response to work.

Fields:

  1. query (str) — The user question.

  2. generated_response (str) — The model's output to be evaluated.

  3. expected_response (str, optional) — The reference or ground truth answer.

  4. retrieved_context (str, optional) — The supporting context/documents used during generation.

Example Usage

import asyncio
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.types import RAGData


async def main():
    """Main function."""
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="New York",
        retrieved_context="Paris is the capital of France.",
    )

    evaluator = GEvalGenerationEvaluator(model_credentials=os.getenv("OPENAI_API_KEY"))

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())

Example Output

AgentEvaluator

Use when: You want to evaluate how well an AI agent makes decisions, uses tools, and follows multi-step reasoning to achieve its goals. If you’re evaluating an AI agent’s overall performance, we suggest using two evaluators: AgentEvaluator (to assess decision-making, tool usage, and reasoning) and GEvalGenerationEvaluator (to assess the quality of the agent’s outputs).

Fields:

  1. agent_trajectory (list[dict[str, Any]]) — The actual agent trajectory to be evaluated.

  2. expected_agent_trajectory (list[dict[str, Any]], optional) — The reference trajectory for comparison.

Configuration Options

  • use_reference (bool): Whether to use reference-based evaluation (default: True)

  • continuous (bool): Use continuous scoring (0.0-1.0) or discrete choices (default: False)

  • choices (list[float]): Available score choices for discrete evaluation (default: [1.0, 0.5, 0.0])

  • use_reasoning (bool): Include detailed explanations in results (default: True)

  • prompt (str, optional): Custom evaluation prompt

Example Usage
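
The sketch below mirrors the GEvalGenerationEvaluator example above. The import path, the model_credentials argument, and passing the trajectory fields as a plain dict are assumptions and may differ in your version of gllm-evals.

import asyncio
import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator


async def main():
    """Main function."""
    # Field names follow the list above; the concrete input type
    # (plain dict vs. a dedicated data class) is an assumption.
    data = {
        "agent_trajectory": [
            {"step": 1, "action": "search_web", "input": "capital of France"},
            {"step": 2, "action": "respond", "output": "Paris"},
        ],
        "expected_agent_trajectory": [
            {"step": 1, "action": "search_web", "input": "capital of France"},
            {"step": 2, "action": "respond", "output": "Paris"},
        ],
    }

    evaluator = AgentEvaluator(
        model_credentials=os.getenv("OPENAI_API_KEY"),  # assumed, mirroring the example above
        use_reference=True,   # documented default: compare against the expected trajectory
        continuous=False,     # documented default: discrete choices [1.0, 0.5, 0.0]
        use_reasoning=True,   # documented default: include explanations in the result
    )

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())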

Example Output

Custom Prompts

The AgentEvaluator supports custom prompts for both reference-based and reference-free evaluation:

Reference-Based Custom Prompt
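
A minimal sketch of supplying a reference-based custom prompt through the documented prompt option. The import path and constructor arguments are assumptions, and the prompt wording is purely illustrative.

import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Illustrative reference-based prompt text.
reference_based_prompt = (
    "Compare the agent trajectory against the expected trajectory. "
    "Score 1.0 if the steps match the reference and achieve the goal, "
    "0.5 if they only partially match, and 0.0 otherwise."
)

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),  # assumed, mirroring the examples above
    use_reference=True,
    prompt=reference_based_prompt,
)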

Reference-Free Custom Prompt
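
A minimal sketch of a reference-free custom prompt; note use_reference=False, so no expected trajectory is compared. Constructor arguments are assumptions, as above.

import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Illustrative reference-free prompt text.
reference_free_prompt = (
    "Judge the agent trajectory on its own merits: logical progression, "
    "sensible tool usage, and whether the goal is achieved efficiently."
)

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),  # assumed, mirroring the examples above
    use_reference=False,
    prompt=reference_free_prompt,
)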

Scoring System

The evaluator uses a three-tier scoring system:

  • 1.0 ("good"): The trajectory makes logical sense, shows clear progression, and efficiently achieves the goal

  • 0.5 ("incomplete"): The trajectory has logical flaws, poor progression, or fails to achieve the goal effectively

  • 0.0 ("bad"): The trajectory is wrong, cut off, missing steps, or cannot be properly evaluated


ClassicalRetrievalEvaluator

Use when: You want to evaluate retrieval performance with classical IR metrics (MAP, NDCG, Precision, Recall, Top-K Accuracy).

Fields:

  1. retrieved_chunks (dict[str, float]) — A dictionary of retrieved documents/chunks, mapping each chunk id to its retrieval score.

  2. ground_truth_chunk_ids (list[str]) — The list of reference chunk ids marking relevant chunks.

Example Usage
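
A minimal sketch, assuming the import path below and that the evaluator accepts the two fields listed above as a plain dict. Because the metrics are classical (non-LLM), no model credentials are passed here, which is also an assumption.

import asyncio

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.classical_retrieval_evaluator import ClassicalRetrievalEvaluator


async def main():
    """Main function."""
    # retrieved_chunks maps chunk id -> retrieval score; ground_truth_chunk_ids
    # lists the chunk ids that are actually relevant (field names from the list above).
    data = {
        "retrieved_chunks": {"chunk-1": 0.92, "chunk-2": 0.75, "chunk-3": 0.31},
        "ground_truth_chunk_ids": ["chunk-1", "chunk-3"],
    }

    # Constructor arguments are an assumption; classical IR metrics do not
    # require an LLM, so no credentials are shown.
    evaluator = ClassicalRetrievalEvaluator()

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())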

Example Output

QueryTransformerEvaluator

Use when: You want to evaluate query transformation tasks, checking how well queries are rewritten, expanded, or paraphrased for downstream use.

Fields:

  1. query (str) — The original input query.

  2. generated_response (list[str]) — The model's transformed queries to be evaluated.

  3. expected_response (list[str]) — The reference or ground truth transformed queries.

Example Usage
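
A minimal sketch following the same pattern as the other evaluators; the import path, the model_credentials argument, and the plain-dict input are assumptions.

import asyncio
import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.query_transformer_evaluator import QueryTransformerEvaluator


async def main():
    """Main function."""
    # Field names follow the list above; the concrete input type is an assumption.
    data = {
        "query": "capital France",
        "generated_response": ["What is the capital city of France?"],
        "expected_response": ["What is the capital of France?"],
    }

    evaluator = QueryTransformerEvaluator(model_credentials=os.getenv("OPENAI_API_KEY"))

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())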

Example Output


Initialization & Common Parameters

All evaluators accept:

  • model: str | BaseLMInvoker

    • Use a string for quick setup (e.g., "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"), or

    • Pass a BaseLMInvoker instance for more advanced configuration. See Language Model (LM) Invoker for more details and supported invokers.

Example Usage — Using OpenAICompatibleLMInvoker
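
A sketch of passing a BaseLMInvoker instance through the model parameter instead of a string. The OpenAICompatibleLMInvoker import path and constructor arguments below are assumptions; see the Language Model (LM) Invoker page for the exact interface.

import asyncio
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.types import RAGData

# Assumed import path and constructor; consult the LM Invoker documentation
# for the exact module and parameters.
from gllm_inference.lm_invoker import OpenAICompatibleLMInvoker


async def main():
    """Main function."""
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="Paris is the capital of France.",
        retrieved_context="Paris is the capital of France.",
    )

    # An OpenAI-compatible endpoint (e.g. a self-hosted gateway); the argument
    # names and URL below are illustrative placeholders.
    lm_invoker = OpenAICompatibleLMInvoker(
        model_name="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url="https://your-openai-compatible-endpoint/v1",
    )

    evaluator = GEvalGenerationEvaluator(model=lm_invoker)

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())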


Looking for something else? Build your own custom evaluator here.

*All fields are optional and can be adjusted depending on the chosen metric.
