Create Custom Evaluator

If the built-in evaluators don’t cover your use case, you can define your own! There are two main ways to create a custom evaluator:

1. Combining Existing Metrics with CustomEvaluator

The easiest way to build your own evaluator is to combine any set of metrics into a CustomEvaluator. You can mix and match built-in metrics to tailor the evaluation to your needs.

Example

import asyncio
import os

from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.metrics.generation.ragas_factual_correctness import RagasFactualCorrectness
from gllm_evals.metrics.retrieval.ragas_context_precision import RagasContextPrecisionWithoutReference
from gllm_evals.metrics.retrieval.ragas_context_recall import RagasContextRecall
from gllm_evals.types import RAGData

async def main():
    # Each Ragas metric uses an LM as a judge; credentials are read from the
    # OPENAI_API_KEY environment variable.
    ragas_factual_correctness = RagasFactualCorrectness(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_recall = RagasContextRecall(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_precision = RagasContextPrecisionWithoutReference(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Combine the three metrics into a single evaluator.
    evaluator = CustomEvaluator(
        metrics=[ragas_factual_correctness, ragas_context_recall, ragas_context_precision],
        name="my_evaluator",
    )

    data = RAGData(
        query="When was the first super bowl?",
        generated_response="The first superbowl was held on Jan 15, 1967",
        retrieved_contexts=[
            "The First AFL-NFL World Championship Game was an American football game played on January 15, 1967, "
            "at the Los Angeles Memorial Coliseum in Los Angeles."
        ],
        expected_response="The first superbowl was held on Jan 15, 1967",
    )

    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
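
To score more than one data point, you can run the evaluator over a list of items concurrently with asyncio.gather. The helper below is a minimal sketch: evaluate_batch is not part of gllm_evals, and it only assumes the async evaluate method shown above.

import asyncio

async def evaluate_batch(evaluator, dataset):
    # Run `evaluator.evaluate` on every item concurrently and return the
    # results in the same order as the inputs. `dataset` is a list of
    # data objects such as RAGData. (Illustrative helper, not a library API.)
    return await asyncio.gather(*(evaluator.evaluate(item) for item in dataset))

For example, calling await evaluate_batch(evaluator, [data]) inside main() above returns a list with one result per input.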

Available Metrics

Any of the built-in metrics under gllm_evals.metrics (such as the generation and retrieval metrics imported in the example above) can be passed to CustomEvaluator through its metrics argument.

2. Extend BaseEvaluator for Full Control

If you need fully custom evaluation logic, you can create your own class by extending BaseEvaluator and defining the evaluation logic from scratch.

Example

import asyncio

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.evaluator.evaluator import BaseEvaluator
from gllm_evals.types import MetricInput, MetricOutput, EvaluationOutput, QAData

class ExactMatchMetric(BaseMetric):
    """Score 1 if the generated response exactly matches the expected response, else 0."""

    def __init__(self):
        self.name = "exact_match"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        # Exact string comparison between the generated and expected responses.
        score = int(data["generated_response"] == data["expected_response"])
        return {"score": score}

class ResponseEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(name="response_evaluator")
        self.metric = ExactMatchMetric()

    async def _evaluate(self, data: MetricInput) -> EvaluationOutput:
        # Delegate the evaluation to the single exact-match metric.
        return await self.metric.evaluate(data)

async def main():
    data = QAData(
        query="What is the capital of France?",
        generated_response="The capital of France is Paris.",
        expected_response="The capital of France is Paris.",
    )
    evaluator = ResponseEvaluator()
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
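
Because ExactMatchMetric subclasses BaseMetric, it can also be plugged into the CustomEvaluator from the first approach instead of writing a full evaluator class. The snippet below is a sketch that assumes CustomEvaluator accepts any BaseMetric subclass through its metrics argument, as in the example above.

import asyncio

from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.types import QAData

async def main():
    # Reuse the ExactMatchMetric defined above inside a CustomEvaluator.
    evaluator = CustomEvaluator(
        metrics=[ExactMatchMetric()],
        name="exact_match_evaluator",
    )
    data = QAData(
        query="What is the capital of France?",
        generated_response="The capital of France is Paris.",
        expected_response="The capital of France is Paris.",
    )
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())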

Which Method Should You Use?

| Use Case | Recommended Method |
| --- | --- |
| Combine existing metrics | CustomEvaluator |
| Implement custom logic | Extend BaseEvaluator |
