🛠️ Create Custom Evaluator / Scorer

If the built-in evaluators don’t cover your use case, you can define your own! There are two main ways to create a custom evaluator:

1. Combining Existing Metrics with CustomEvaluator

The easiest way to build your own evaluator is by combining any set of metrics into a CustomEvaluator. You can mix and match built-in metrics to tailor evaluation to your needs.

Example Usage

import asyncio
import os

from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.metrics.generation.ragas_factual_correctness import RagasFactualCorrectness
from gllm_evals.metrics.retrieval.ragas_context_recall import RagasContextRecall
from gllm_evals.types import RAGData

async def main():
    # Each metric uses an LLM as the judge; the API key is read from the environment.
    ragas_factual_correctness = RagasFactualCorrectness(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_recall = RagasContextRecall(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Combine the instantiated metrics into a single named evaluator.
    evaluator = CustomEvaluator(
        metrics=[ragas_factual_correctness, ragas_context_recall],
        name="my_evaluator",
    )

    data = RAGData(
        query="When was the first super bowl?",
        generated_response="The first superbowl was held on Jan 15, 1967",
        retrieved_contexts=[
            "The First AFL-NFL World Championship Game was an American football game played on January 15, 1967, "
            "at the Los Angeles Memorial Coliseum in Los Angeles."
        ],
        expected_response="The first superbowl was held on Jan 15, 1967",
    )

    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output

{
  "my_evaluator": {
    "factual_correctness": {
      "score": 1.0,
      "explanation": null
    },
    "context_recall": {
      "score": 1.0,
      "explanation": null
    }
  }
}
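Once the result is available, the per-metric scores can be pulled out for reporting or thresholding. The following is a minimal sketch, assuming `result` is a plain dictionary shaped like the example output above. The helper name `summarize_scores` is hypothetical and not part of gllm_evals, and the sample data is copied from the example output rather than produced by a live evaluation.

```python
from statistics import mean


def summarize_scores(result: dict, evaluator_name: str) -> dict:
    """Collect per-metric scores and their mean from an evaluator result.

    Assumes `result` mirrors the example output:
    {evaluator_name: {metric_name: {"score": ...}}}.
    """
    metrics = result[evaluator_name]
    scores = {name: float(output["score"]) for name, output in metrics.items()}
    return {"scores": scores, "mean_score": mean(list(scores.values()))}


# Sample data copied from the example output (hypothetical stand-in for a real run).
example_result = {
    "my_evaluator": {
        "factual_correctness": {"score": 1.0, "explanation": None},
        "context_recall": {"score": 1.0, "explanation": None},
    }
}

summary = summarize_scores(example_result, "my_evaluator")
print(summary)
```

Keeping the summary separate from the evaluator means the same helper can be reused for any evaluator whose output follows this nesting.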

2. Extend BaseEvaluator for Full Control

If you need highly customized evaluation logic, you can create your own class by extending BaseEvaluator and implementing the evaluation from scratch.

Example Usage

import asyncio

from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.evaluator.evaluator import BaseEvaluator
from gllm_evals.types import MetricInput, MetricOutput, EvaluationOutput, QAData

class ExactMatchMetric(BaseMetric):
    def __init__(self):
        self.name = "exact_match"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        score = int(data["generated_response"] == data["expected_response"])
        return {"score": score}

class ResponseEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(name="response_evaluator")
        self.metric = ExactMatchMetric()

    async def _evaluate(self, data: MetricInput) -> EvaluationOutput:
        return await self.metric.evaluate(data)

async def main():
    data = QAData(
        query="What is the capital of France?",
        generated_response="The capital of France is Paris.",
        expected_response="The capital of France is Paris.",
    )
    evaluator = ResponseEvaluator()
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output

{
  "response_evaluator": {
    "exact_match": {
      "score": 1
    }
  }
}
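Strict string equality, as used in `ExactMatchMetric`, scores 0 for answers that differ only in casing, punctuation, or whitespace. If that is too strict for your data, the strings can be normalized before comparison. The following is a minimal sketch in plain Python; `normalize` and `normalized_exact_match` are hypothetical helpers, not part of gllm_evals, and the normalization rules are an assumption you should adapt to your own data.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def normalized_exact_match(generated: str, expected: str) -> int:
    """Return 1 if both strings are equal after normalization, else 0."""
    return int(normalize(generated) == normalize(expected))


# Differs only in casing, spacing, and the trailing period: counts as a match.
print(normalized_exact_match("The capital of France is Paris.", "the capital of france is  paris"))
# Differs in content: no match.
print(normalized_exact_match("The capital of France is Lyon.", "The capital of France is Paris."))
```

A helper like this could be called from inside a metric's `_evaluate` method in place of the raw `==` comparison.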

Which Method Should You Use?

| Use Case | Recommended Method |
| --- | --- |
| Combine existing metrics | CustomEvaluator |
| Implement custom logic | Extend BaseEvaluator |

Congratulations! You have successfully created your own custom evaluator.
