🎯 Evaluator / Scorer

This section provides detailed documentation for all available evaluators in the gllm-evals library.

GEvalGenerationEvaluator

Use when: You want to evaluate RAG or agentic AI (e.g. AIP) responses with DeepEval's open-source G-Eval metrics, which allow LLM outputs to be scored against any custom criteria.

By default, GEvalGenerationEvaluator runs three metrics: completeness, groundedness, and redundancy. Two additional metrics, language consistency and refusal alignment, are also available.

  1. Completeness: DeepEval's G-Eval completeness score. The score ranges from 1 to 3: 1 means not complete, 2 means partially complete, and 3 means fully complete. It needs query, generated_response, and expected_response to work.

  2. Redundancy: DeepEval's G-Eval redundancy score. The score ranges from 1 to 3: 1 means no redundancy, 2 means at least one redundant statement, and 3 means high redundancy. It needs query and generated_response to work.

  3. Groundedness: DeepEval's G-Eval groundedness score. The score ranges from 1 to 3: 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded. It needs query, generated_response, and retrieved_context to work.

  4. Language Consistency: DeepEval's G-Eval language consistency score. The score ranges from 0 to 1: 0 means not consistent and 1 means fully consistent. It needs query and generated_response to work.

  5. Refusal Alignment: DeepEval's G-Eval refusal alignment score. The score ranges from 0 to 1: 1 indicates correct alignment (both the generated and expected responses are refusals, or neither is), and 0 indicates incorrect alignment (one is a refusal and the other is not). It needs query, generated_response, and expected_response to work.

Fields:

  1. query (str) — The user question.

  2. generated_response (str) — The model's output to be evaluated.

  3. expected_response (str, optional) — The reference or ground truth answer.

  4. retrieved_context (str, optional) — The supporting context/documents used during generation.

Example Usage

import asyncio
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.types import RAGData


async def main():
    """Main function."""
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="New York",
        retrieved_context="Paris is the capital of France.",
    )

    evaluator = GEvalGenerationEvaluator(model_credentials=os.getenv("OPENAI_API_KEY"))

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())

Example Output

AgentEvaluator

Use when: You want to evaluate how well an AI agent makes decisions, uses tools, and follows multi-step reasoning to achieve its goals. If you’re evaluating an AI agent’s overall performance, we suggest using two evaluators: AgentEvaluator (to assess decision-making, tool usage, and reasoning) and GEvalGenerationEvaluator (to assess the quality of the agent’s outputs).

Fields:

  1. agent_trajectory (list[dict[str, Any]]) — The actual agent trajectory to be evaluated.

  2. expected_agent_trajectory (list[dict[str, Any]], optional) — The reference trajectory for comparison.

Configuration Options

  • use_reference (bool): Whether to use reference-based evaluation (default: True)

  • continuous (bool): Use continuous scoring (0.0-1.0) or discrete choices (default: False)

  • choices (list[float]): Available score choices for discrete evaluation (default: [1.0, 0.5, 0.0])

  • use_reasoning (bool): Include detailed explanations in results (default: True)

  • prompt (str, optional): Custom evaluation prompt

Example Usage
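
The sketch below mirrors the GEvalGenerationEvaluator example above. The import path, the model_credentials argument, and passing the trajectory fields as a plain dict are assumptions and may differ in your version of gllm-evals.

import asyncio
import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator


async def main():
    """Main function."""
    # Field names follow the list above; the concrete input type
    # (plain dict vs. a dedicated data class) is an assumption.
    data = {
        "agent_trajectory": [
            {"step": 1, "action": "search_web", "input": "capital of France"},
            {"step": 2, "action": "respond", "output": "Paris"},
        ],
        "expected_agent_trajectory": [
            {"step": 1, "action": "search_web", "input": "capital of France"},
            {"step": 2, "action": "respond", "output": "Paris"},
        ],
    }

    evaluator = AgentEvaluator(
        model_credentials=os.getenv("OPENAI_API_KEY"),  # assumed, mirroring the example above
        use_reference=True,   # documented default: compare against the expected trajectory
        continuous=False,     # documented default: discrete choices [1.0, 0.5, 0.0]
        use_reasoning=True,   # documented default: include explanations in the result
    )

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())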

Example Output

Custom Prompts

The AgentEvaluator supports custom prompts for both reference-based and reference-free evaluation:

Reference-Based Custom Prompt
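
A minimal sketch of supplying a reference-based custom prompt through the documented prompt option. The import path and constructor arguments are assumptions, and the prompt wording is purely illustrative.

import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Illustrative reference-based prompt text.
reference_based_prompt = (
    "Compare the agent trajectory against the expected trajectory. "
    "Score 1.0 if the steps match the reference and achieve the goal, "
    "0.5 if they only partially match, and 0.0 otherwise."
)

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),  # assumed, mirroring the examples above
    use_reference=True,
    prompt=reference_based_prompt,
)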

Reference-Free Custom Prompt
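
A minimal sketch of a reference-free custom prompt; note use_reference=False, so no expected trajectory is compared. Constructor arguments are assumptions, as above.

import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.agent_evaluator import AgentEvaluator

# Illustrative reference-free prompt text.
reference_free_prompt = (
    "Judge the agent trajectory on its own merits: logical progression, "
    "sensible tool usage, and whether the goal is achieved efficiently."
)

evaluator = AgentEvaluator(
    model_credentials=os.getenv("OPENAI_API_KEY"),  # assumed, mirroring the examples above
    use_reference=False,
    prompt=reference_free_prompt,
)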

Scoring System

The evaluator uses a three-tier scoring system:

  • 1.0 ("good"): The trajectory makes logical sense, shows clear progression, and efficiently achieves the goal

  • 0.5 ("incomplete"): The trajectory has logical flaws, poor progression, or fails to achieve the goal effectively

  • 0.0 ("bad"): The trajectory is wrong, cut off, missing steps, or cannot be properly evaluated


ClassicalRetrievalEvaluator

Use when: You want to evaluate retrieval performance with classical IR metrics (MAP, NDCG, Precision, Recall, Top-K Accuracy).

Fields:

  1. retrieved_chunks (dict[str, float]) — A dictionary of retrieved documents/chunks, mapping each chunk id to its retrieval score.

  2. ground_truth_chunk_ids (list[str]) — The list of reference chunk ids marking relevant chunks.

Example Usage
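
A minimal sketch, assuming the import path below and that the evaluator accepts the two fields listed above as a plain dict. Because the metrics are classical (non-LLM), no model credentials are passed here, which is also an assumption.

import asyncio

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.classical_retrieval_evaluator import ClassicalRetrievalEvaluator


async def main():
    """Main function."""
    # retrieved_chunks maps chunk id -> retrieval score; ground_truth_chunk_ids
    # lists the chunk ids that are actually relevant (field names from the list above).
    data = {
        "retrieved_chunks": {"chunk-1": 0.92, "chunk-2": 0.75, "chunk-3": 0.31},
        "ground_truth_chunk_ids": ["chunk-1", "chunk-3"],
    }

    # Constructor arguments are an assumption; classical IR metrics do not
    # require an LLM, so no credentials are shown.
    evaluator = ClassicalRetrievalEvaluator()

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())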

Example Output

QueryTransformerEvaluator

Use when: You want to evaluate query transformation tasks, checking how well queries are rewritten, expanded, or paraphrased for downstream use.

Fields:

  1. query (str) — The original input query.

  2. generated_response (list[str]) — The model's transformed queries to be evaluated.

  3. expected_response (list[str]) — The reference or ground truth transformed queries.

Example Usage
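
A minimal sketch following the same pattern as the other evaluators; the import path, the model_credentials argument, and the plain-dict input are assumptions.

import asyncio
import os

# Assumed import path; adjust to match your gllm-evals version.
from gllm_evals.evaluator.query_transformer_evaluator import QueryTransformerEvaluator


async def main():
    """Main function."""
    # Field names follow the list above; the concrete input type is an assumption.
    data = {
        "query": "capital France",
        "generated_response": ["What is the capital city of France?"],
        "expected_response": ["What is the capital of France?"],
    }

    evaluator = QueryTransformerEvaluator(model_credentials=os.getenv("OPENAI_API_KEY"))

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())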

Example Output


Initialization & Common Parameters

All evaluators accept:

  • model: str | BaseLMInvoker

    • Use a string for quick setup (e.g., "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"), or

    • Pass a BaseLMInvoker instance for more advanced configuration. See Language Model (LM) Invoker for more details and supported invokers.

Example Usage — Using OpenAICompatibleLMInvoker
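
A sketch of passing a BaseLMInvoker instance through the model parameter instead of a string. The OpenAICompatibleLMInvoker import path and constructor arguments below are assumptions; see the Language Model (LM) Invoker page for the exact interface.

import asyncio
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.types import RAGData

# Assumed import path and constructor; consult the LM Invoker documentation
# for the exact module and parameters.
from gllm_inference.lm_invoker import OpenAICompatibleLMInvoker


async def main():
    """Main function."""
    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris",
        generated_response="Paris is the capital of France.",
        retrieved_context="Paris is the capital of France.",
    )

    # An OpenAI-compatible endpoint (e.g. a self-hosted gateway); the argument
    # names and URL below are illustrative placeholders.
    lm_invoker = OpenAICompatibleLMInvoker(
        model_name="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url="https://your-openai-compatible-endpoint/v1",
    )

    evaluator = GEvalGenerationEvaluator(model=lm_invoker)

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())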


Looking for something else? Build your own custom evaluator here.

*All fields are optional and can be adjusted depending on the chosen metric.
