🎯Evaluator / Scorer

An Evaluator orchestrates evaluation workflows by coordinating metrics and evaluation logic. It acts as a manager that:

  • Executes relevant metrics

  • Aggregates and formats results

  • Generates human-readable explanations

All evaluators inherit from BaseEvaluator, an abstract base class that defines the core evaluation interface.
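
As a rough mental model (not the library's actual definition), a BaseEvaluator subclass wraps the evaluation interface around a single evaluate() entry point. The sketch below is illustrative only; the exact method signature and return type are assumptions.

# Illustrative sketch only; not the real BaseEvaluator from the library.
from abc import ABC, abstractmethod
from typing import Any


class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, data: dict[str, Any] | list[dict[str, Any]]) -> dict[str, Any]:
        """Run the configured metrics and return aggregated, namespaced results."""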

Input & Output Types

The Evaluator accepts a dictionary containing the data to evaluate, based on MetricInput. Several predefined data types are also available, such as QAData, RAGData, and AgentData.

Example Input

{
    "query": "What is the capital of France?",
    "retrieved_context": "Paris",
    "expected_response": "New York",
    "generated_response": "Paris is the capital of France."
}
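
If you prefer typed inputs over a raw dictionary, the predefined data types mentioned above can be used instead. The sketch below is for illustration only: the import path and the RAGData field names simply mirror the dictionary keys in the example and may not match the actual constructor.

# Hypothetical sketch: building a typed input instead of a raw dict.
# The import path and field names below are assumptions, not the
# library's confirmed API.
from my_evals.types import RAGData  # hypothetical import path

data = RAGData(
    query="What is the capital of France?",
    retrieved_context="Paris",
    expected_response="New York",
    generated_response="Paris is the capital of France.",
)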

The Evaluator returns an EvaluationOutput that includes several keys, such as global_explanation, score, and namespaced per-metric results.

Example Output

{
    "generation": {
        "global_explanation": "The following metrics failed to meet expectations:\n1. Completeness is 1 (should be 3)\n2. Groundedness is 1 (should be 3)",
        "relevancy_rating": "bad",
        "score": 0.0,
        "possible_issues": ["Retrieval Issue", "Generation Issue"],
        "completeness": {
            "score": 1,
            "explanation": "The response contains a critical factual contradiction. It identifies Barcelona as the capital of Spain, whereas the expected output correctly states that the capital is Madrid.",
        },
        "groundedness": {
            "score": 1,
            "explanation": "The response provides a factually incorrect answer that directly contradicts the retrieval context, which explicitly states that Madrid is the capital of Spain.",
        },
    }
}
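
Because the metric results are namespaced, the aggregated score, the global explanation, and each metric's details can be read back from the corresponding keys. The snippet below treats the output as a plain dictionary shaped like the example above, which is an assumption about how EvaluationOutput exposes its fields.

# Illustrative only: inspecting a result shaped like the example output above.
result = {
    "generation": {
        "global_explanation": "...",
        "relevancy_rating": "bad",
        "score": 0.0,
        "possible_issues": ["Retrieval Issue", "Generation Issue"],
        "completeness": {"score": 1, "explanation": "..."},
        "groundedness": {"score": 1, "explanation": "..."},
    }
}

generation = result["generation"]
print(f"Overall score: {generation['score']}")   # 0.0
print(generation["global_explanation"])          # which metrics failed and why
for metric in ("completeness", "groundedness"):
    print(metric, generation[metric]["score"])   # per-metric scores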

Single vs Batch Evaluation

Evaluators support both modes via the same evaluate() method:

Single Evaluation
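
A minimal sketch of evaluating one record, assuming a hypothetical GenerationEvaluator class; the constructor and whether evaluate() is synchronous or asynchronous depend on the concrete evaluator you use.

# Hypothetical sketch: evaluating a single record.
# GenerationEvaluator and its import path are assumptions, not the
# library's confirmed API; the input dict matches the example above.
from my_evals import GenerationEvaluator  # hypothetical import path

evaluator = GenerationEvaluator(model="openai/gpt-4o-mini")

result = evaluator.evaluate({
    "query": "What is the capital of France?",
    "retrieved_context": "Paris",
    "expected_response": "New York",
    "generated_response": "Paris is the capital of France.",
})
print(result["generation"]["score"])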

Batch Evaluation
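
For batch evaluation, the same evaluate() method can be given a list of records. The sketch below assumes it returns one result per input, which is an assumption about the return shape.

# Hypothetical sketch: evaluating several records with the same method.
# `evaluator` is the instance created in the single-evaluation sketch above.
batch = [
    {
        "query": "What is the capital of France?",
        "retrieved_context": "Paris",
        "expected_response": "New York",
        "generated_response": "Paris is the capital of France.",
    },
    {
        "query": "What is the capital of Spain?",
        "retrieved_context": "Madrid is the capital of Spain.",
        "expected_response": "Madrid",
        "generated_response": "Barcelona is the capital of Spain.",
    },
]

results = evaluator.evaluate(batch)  # assumed: one result per input record
for item in results:
    print(item["generation"]["score"])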

Initialization & Common Parameters

All evaluators accept:

  • model: str | BaseLMInvoker

    • Use a string for quick setup (e.g., "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"), or

    • Pass a BaseLMInvoker instance for more advanced configuration. See Language Model (LM) Invoker for more details and supported invokers.

Example Usage: Using OpenAICompatibleLMInvoker
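
A sketch of passing an invoker instance instead of a model string. The constructor arguments shown for OpenAICompatibleLMInvoker, the import paths, and the evaluator class are assumptions for illustration; see the Language Model (LM) Invoker documentation for the actual parameters and supported invokers.

# Hypothetical sketch: using an LM invoker instance for finer-grained control.
# Import paths and constructor arguments below are assumptions, not the
# library's confirmed API.
from my_inference.lm_invoker import OpenAICompatibleLMInvoker  # hypothetical path
from my_evals import GenerationEvaluator                       # hypothetical path

invoker = OpenAICompatibleLMInvoker(
    model_name="gpt-4o-mini",                 # assumed parameter names
    base_url="https://api.openai.com/v1",
    api_key="YOUR_API_KEY",
)

evaluator = GenerationEvaluator(model=invoker)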

Available Evaluators


Looking for something else? Build your own custom evaluator here.

*All fields are optional and can be adjusted depending on the chosen metric.
