Create Custom Evaluator

If the built-in evaluators don’t cover your use case, you can define your own! There are two main ways to create a custom evaluator:

1. Combining Existing Metrics with CustomEvaluator

The easiest way to build your own evaluator is to combine any set of metrics into a CustomEvaluator. You can mix and match built-in metrics to tailor the evaluation to your needs.

Example

import asyncio
import os

from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.metrics.generation.ragas_factual_correctness import RagasFactualCorrectness
from gllm_evals.metrics.retrieval.ragas_context_precision import RagasContextPrecisionWithoutReference
from gllm_evals.metrics.retrieval.ragas_context_recall import RagasContextRecall
from gllm_evals.types import RAGData

async def main():
    # Each Ragas metric uses an LM as a judge; credentials are read from the
    # OPENAI_API_KEY environment variable.
    ragas_factual_correctness = RagasFactualCorrectness(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_recall = RagasContextRecall(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_precision = RagasContextPrecisionWithoutReference(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Combine the three metrics into a single evaluator.
    evaluator = CustomEvaluator(
        metrics=[ragas_factual_correctness, ragas_context_recall, ragas_context_precision],
        name="my_evaluator",
    )

    data = RAGData(
        query="When was the first super bowl?",
        generated_response="The first superbowl was held on Jan 15, 1967",
        retrieved_contexts=[
            "The First AFL-NFL World Championship Game was an American football game played on January 15, 1967, "
            "at the Los Angeles Memorial Coliseum in Los Angeles."
        ],
        expected_response="The first superbowl was held on Jan 15, 1967",
    )

    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
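
To score more than one data point, you can run the evaluator over a list of items concurrently with asyncio.gather. The helper below is a minimal sketch: evaluate_batch is not part of gllm_evals, and it only assumes the async evaluate method shown above.

import asyncio

async def evaluate_batch(evaluator, dataset):
    # Run `evaluator.evaluate` on every item concurrently and return the
    # results in the same order as the inputs. `dataset` is a list of
    # data objects such as RAGData. (Illustrative helper, not a library API.)
    return await asyncio.gather(*(evaluator.evaluate(item) for item in dataset))

For example, calling await evaluate_batch(evaluator, [data]) inside main() above returns a list with one result per input.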

Available Metrics

Any of the built-in metrics under gllm_evals.metrics (such as the generation and retrieval metrics imported in the example above) can be passed to CustomEvaluator through its metrics argument.

2. Extend BaseEvaluator for Full Control

If you need fully custom evaluation logic, you can create your own class by extending BaseEvaluator and defining the evaluation logic from scratch.

Example

import asyncio

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.evaluator.evaluator import BaseEvaluator
from gllm_evals.types import MetricInput, MetricOutput, EvaluationOutput, QAData

class ExactMatchMetric(BaseMetric):
    """Score 1 if the generated response exactly matches the expected response, else 0."""

    def __init__(self):
        self.name = "exact_match"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Exact string comparison between the generated and expected responses.
        score = int(data["generated_response"] == data["expected_response"])
        return {"score": score}

class ResponseEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(name="response_evaluator")
        self.metric = ExactMatchMetric()

    async def _evaluate(self, data: MetricInput) -> EvaluationOutput:
        # Delegate the evaluation to the single exact-match metric.
        return await self.metric.evaluate(data)

async def main():
    data = QAData(
        query="What is the capital of France?",
        generated_response="The capital of France is Paris.",
        expected_response="The capital of France is Paris.",
    )
    evaluator = ResponseEvaluator()
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
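
Because ExactMatchMetric subclasses BaseMetric, it can also be plugged into the CustomEvaluator from the first approach instead of writing a full evaluator class. The snippet below is a sketch that assumes CustomEvaluator accepts any BaseMetric subclass through its metrics argument, as in the example above.

import asyncio

from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.types import QAData

async def main():
    # Reuse the ExactMatchMetric defined above inside a CustomEvaluator.
    evaluator = CustomEvaluator(
        metrics=[ExactMatchMetric()],
        name="exact_match_evaluator",
    )
    data = QAData(
        query="What is the capital of France?",
        generated_response="The capital of France is Paris.",
        expected_response="The capital of France is Paris.",
    )
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())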

Which Method Should You Use?

| Use Case | Recommended Method |
| --- | --- |
| Combine existing metrics | CustomEvaluator |
| Implement custom logic | Extend BaseEvaluator |
