🎯Evaluator / Scorer

An Evaluator orchestrates evaluation workflows by coordinating metrics and evaluation logic. It acts as a manager that:

  • Executes relevant metrics

  • Aggregates and formats results

  • Generates human-readable explanations

All evaluators inherit from BaseEvaluator, an abstract base class that defines the core evaluation interface.
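
As a rough mental model (not the library's actual definition), a BaseEvaluator subclass wraps the evaluation interface around a single evaluate() entry point. The sketch below is illustrative only; the exact method signature and return type are assumptions.

# Illustrative sketch only; not the real BaseEvaluator from the library.
from abc import ABC, abstractmethod
from typing import Any


class BaseEvaluator(ABC):
    @abstractmethod
    def evaluate(self, data: dict[str, Any] | list[dict[str, Any]]) -> dict[str, Any]:
        """Run the configured metrics and return aggregated, namespaced results."""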

Input & Output Types

The Evaluator accepts a dictionary containing the data to evaluate, based on MetricInput. Several predefined data types are also available, such as QAData, RAGData, and AgentData.

Example Input

{
    "query": "What is the capital of France?",
    "retrieved_context": "Paris",
    "expected_response": "New York",
    "generated_response": "Paris is the capital of France."
}
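
If you prefer typed inputs over a raw dictionary, the predefined data types mentioned above can be used instead. The sketch below is for illustration only: the import path and the RAGData field names simply mirror the dictionary keys in the example and may not match the actual constructor.

# Hypothetical sketch: building a typed input instead of a raw dict.
# The import path and field names below are assumptions, not the
# library's confirmed API.
from my_evals.types import RAGData  # hypothetical import path

data = RAGData(
    query="What is the capital of France?",
    retrieved_context="Paris",
    expected_response="New York",
    generated_response="Paris is the capital of France.",
)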

The Evaluator returns an EvaluationOutput that includes several keys, such as global_explanation, score, and namespaced per-metric results.

Example Output

{
    "generation": {
        "global_explanation": "The following metrics failed to meet expectations:\n1. Completeness is 1 (should be 3)\n2. Groundedness is 1 (should be 3)",
        "relevancy_rating": "bad",
        "score": 0.0,
        "possible_issues": ["Retrieval Issue", "Generation Issue"],
        "completeness": {
            "score": 1,
            "explanation": "The response contains a critical factual contradiction. It identifies Barcelona as the capital of Spain, whereas the expected output correctly states that the capital is Madrid.",
        },
        "groundedness": {
            "score": 1,
            "explanation": "The response provides a factually incorrect answer that directly contradicts the retrieval context, which explicitly states that Madrid is the capital of Spain.",
        },
    }
}
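
Because the metric results are namespaced, the aggregated score, the global explanation, and each metric's details can be read back from the corresponding keys. The snippet below treats the output as a plain dictionary shaped like the example above, which is an assumption about how EvaluationOutput exposes its fields.

# Illustrative only: inspecting a result shaped like the example output above.
result = {
    "generation": {
        "global_explanation": "...",
        "relevancy_rating": "bad",
        "score": 0.0,
        "possible_issues": ["Retrieval Issue", "Generation Issue"],
        "completeness": {"score": 1, "explanation": "..."},
        "groundedness": {"score": 1, "explanation": "..."},
    }
}

generation = result["generation"]
print(f"Overall score: {generation['score']}")   # 0.0
print(generation["global_explanation"])          # which metrics failed and why
for metric in ("completeness", "groundedness"):
    print(metric, generation[metric]["score"])   # per-metric scores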

Single vs Batch Evaluation

Evaluators support both modes via the same evaluate() method:

Single Evaluation
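
A minimal sketch of evaluating one record, assuming a hypothetical GenerationEvaluator class; the constructor and whether evaluate() is synchronous or asynchronous depend on the concrete evaluator you use.

# Hypothetical sketch: evaluating a single record.
# GenerationEvaluator and its import path are assumptions, not the
# library's confirmed API; the input dict matches the example above.
from my_evals import GenerationEvaluator  # hypothetical import path

evaluator = GenerationEvaluator(model="openai/gpt-4o-mini")

result = evaluator.evaluate({
    "query": "What is the capital of France?",
    "retrieved_context": "Paris",
    "expected_response": "New York",
    "generated_response": "Paris is the capital of France.",
})
print(result["generation"]["score"])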

Batch Evaluation
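
For batch evaluation, the same evaluate() method can be given a list of records. The sketch below assumes it returns one result per input, which is an assumption about the return shape.

# Hypothetical sketch: evaluating several records with the same method.
# `evaluator` is the instance created in the single-evaluation sketch above.
batch = [
    {
        "query": "What is the capital of France?",
        "retrieved_context": "Paris",
        "expected_response": "New York",
        "generated_response": "Paris is the capital of France.",
    },
    {
        "query": "What is the capital of Spain?",
        "retrieved_context": "Madrid is the capital of Spain.",
        "expected_response": "Madrid",
        "generated_response": "Barcelona is the capital of Spain.",
    },
]

results = evaluator.evaluate(batch)  # assumed: one result per input record
for item in results:
    print(item["generation"]["score"])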

Initialization & Common Parameters

All evaluators accept:

  • model: str | BaseLMInvoker

    • Use a string for quick setup (e.g., "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"), or

    • Pass a BaseLMInvoker instance for more advanced configuration. See Language Model (LM) Invoker for more details and supported invokers.

Example Usage: Using OpenAICompatibleLMInvoker
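
A sketch of passing an invoker instance instead of a model string. The constructor arguments shown for OpenAICompatibleLMInvoker, the import paths, and the evaluator class are assumptions for illustration; see the Language Model (LM) Invoker documentation for the actual parameters and supported invokers.

# Hypothetical sketch: using an LM invoker instance for finer-grained control.
# Import paths and constructor arguments below are assumptions, not the
# library's confirmed API.
from my_inference.lm_invoker import OpenAICompatibleLMInvoker  # hypothetical path
from my_evals import GenerationEvaluator                       # hypothetical path

invoker = OpenAICompatibleLMInvoker(
    model_name="gpt-4o-mini",                 # assumed parameter names
    base_url="https://api.openai.com/v1",
    api_key="YOUR_API_KEY",
)

evaluator = GenerationEvaluator(model=invoker)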

Available Evaluators


Looking for something else? Build your own custom evaluator here.

*All fields are optional and can be adjusted depending on the chosen metric.
