🛠️ Create Custom Evaluator / Scorer

If the built-in evaluators don’t cover your use case, you can define your own! There are two main ways to create a custom evaluator:

1. Implementing a Custom Metric and Custom Evaluator

If you need highly customized evaluation logic, you can create your own classes by extending BaseMetric and BaseEvaluator and defining your evaluation logic from scratch.

Example Usage

import asyncio

from gllm_evals.dataset import load_simple_rag_dataset
from gllm_evals.evaluator.evaluator import BaseEvaluator
from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput, EvaluationOutput

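# A custom metric that returns 1 when the generated response exactly
# matches the expected response, and 0 otherwise.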
class ExactMatchMetric(BaseMetric):
    def __init__(self):
        self.name = "exact_match"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        score = int(data["generated_response"] == data["expected_response"])
        return {"score": score, "explanation": None}

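# A custom evaluator that delegates its evaluation to the ExactMatchMetric defined above.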
class ResponseEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(name="response_evaluator")
        self.metric = ExactMatchMetric()

    async def _evaluate(self, data: MetricInput) -> EvaluationOutput:
        return await self.metric.evaluate(data)

async def main():
    evaluator = ResponseEvaluator()
    data = load_simple_rag_dataset()
    result = await evaluator.evaluate(data[0])
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output

2. Combining Existing Metrics with CustomEvaluator

The easiest way to build your own evaluator is to combine any set of metrics into a CustomEvaluator. You can mix and match built-in metrics to tailor evaluation to your needs.

Example Usage
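The snippet below is a minimal sketch, not a verbatim example from the package: it assumes CustomEvaluator is importable from gllm_evals.evaluator.custom_evaluator and accepts a list of metric instances through a metrics argument, and it reuses the ExactMatchMetric class defined in the previous example in place of a specific built-in metric. Check the gllm_evals API reference for the exact import path, constructor signature, and the names of the built-in metrics you want to combine.

import asyncio

from gllm_evals.dataset import load_simple_rag_dataset
# Assumed import path; see the API reference for the actual location of CustomEvaluator.
from gllm_evals.evaluator.custom_evaluator import CustomEvaluator

async def main():
    # Assumed constructor: a name plus the list of metric instances to run.
    # Any mix of built-in metrics and your own BaseMetric subclasses can go in this list;
    # ExactMatchMetric here is the custom metric defined in the previous example.
    evaluator = CustomEvaluator(
        name="my_custom_evaluator",
        metrics=[ExactMatchMetric()],
    )
    data = load_simple_rag_dataset()
    result = await evaluator.evaluate(data[0])
    print(result)

if __name__ == "__main__":
    asyncio.run(main())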

Example Output

Which Method Should You Use?

| Use Case | Recommended Method |
| --- | --- |
| Implement custom logic | Extend BaseEvaluator |
| Combine existing metrics | CustomEvaluator |
