Create Custom Evaluator
If the built-in evaluators don't cover your use case, you can define your own! There are two main ways to create a custom evaluator:
1. Combining Existing Metrics with CustomEvaluator
The easiest way to build your own evaluator is to combine any set of metrics into a CustomEvaluator. You can mix and match built-in metrics to tailor evaluation to your needs.
Example
import asyncio
import os
from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.metrics.generation.ragas_factual_correctness import RagasFactualCorrectness
from gllm_evals.metrics.retrieval.ragas_context_precision import RagasContextPrecisionWithoutReference
from gllm_evals.metrics.retrieval.ragas_context_recall import RagasContextRecall
from gllm_evals.types import RAGData

async def main():
    # Wrap each built-in metric with the model that will judge the outputs.
    ragas_factual_correctness = RagasFactualCorrectness(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_recall = RagasContextRecall(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )
    ragas_context_precision = RagasContextPrecisionWithoutReference(
        lm_model="openai/gpt-4.1",
        lm_model_credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Combine the metrics into a single evaluator.
    evaluator = CustomEvaluator(
        metrics=[ragas_factual_correctness, ragas_context_recall, ragas_context_precision],
        name="my_evaluator",
    )

    data = RAGData(
        query="When was the first super bowl?",
        generated_response="The first superbowl was held on Jan 15, 1967",
        retrieved_contexts=[
            "The First AFL-NFL World Championship Game was an American football game played on January 15, 1967, "
            "at the Los Angeles Memorial Coliseum in Los Angeles."
        ],
        expected_response="The first superbowl was held on Jan 15, 1967",
    )

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
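The same evaluator instance can also be reused across a whole dataset. Below is a minimal sketch of batch evaluation, assuming evaluate is safe to call concurrently (if it is not, replace asyncio.gather with a plain loop); the dataset contents and the batch_evaluator name are illustrative only.

import asyncio
import os
from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.metrics.generation.ragas_factual_correctness import RagasFactualCorrectness
from gllm_evals.types import RAGData


async def main():
    # A single metric keeps the sketch short; any combination of metrics works.
    evaluator = CustomEvaluator(
        metrics=[
            RagasFactualCorrectness(
                lm_model="openai/gpt-4.1",
                lm_model_credentials=os.getenv("OPENAI_API_KEY"),
            )
        ],
        name="batch_evaluator",
    )

    # Illustrative dataset; in practice this would come from your test set.
    dataset = [
        RAGData(
            query="When was the first super bowl?",
            generated_response="The first superbowl was held on Jan 15, 1967",
            retrieved_contexts=["The First AFL-NFL World Championship Game was played on January 15, 1967."],
            expected_response="The first superbowl was held on Jan 15, 1967",
        ),
        RAGData(
            query="Who won the first super bowl?",
            generated_response="The Green Bay Packers won the first super bowl.",
            retrieved_contexts=["The Green Bay Packers defeated the Kansas City Chiefs 35-10."],
            expected_response="The Green Bay Packers won the first super bowl.",
        ),
    ]

    # Evaluate every item concurrently; use a sequential loop if the
    # underlying metrics are not safe to call concurrently.
    results = await asyncio.gather(*(evaluator.evaluate(item) for item in dataset))
    for result in results:
        print(result)


if __name__ == "__main__":
    asyncio.run(main())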
Available Metrics
Here are the available metrics you can use with CustomEvaluator:
Generation Evaluation Metrics
Retrieval Evaluation Metrics
2. Extend BaseEvaluator for Full Control
If you need highly customized evaluation logic, you can create your own class by extending BaseEvaluator and defining your evaluation logic from scratch.
Example
import asyncio
from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.evaluator.evaluator import BaseEvaluator
from gllm_evals.types import MetricInput, MetricOutput, EvaluationOutput, QAData

class ExactMatchMetric(BaseMetric):
    """Scores 1 if the generated response exactly matches the expected response."""

    def __init__(self):
        self.name = "exact_match"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        score = int(data["generated_response"] == data["expected_response"])
        return {"score": score}


class ResponseEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(name="response_evaluator")
        self.metric = ExactMatchMetric()

    async def _evaluate(self, data: MetricInput) -> EvaluationOutput:
        return await self.metric.evaluate(data)


async def main():
    data = QAData(
        query="What is the capital of France?",
        generated_response="The capital of France is Paris.",
        expected_response="The capital of France is Paris.",
    )

    evaluator = ResponseEvaluator()
    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
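Because the scoring logic is entirely yours, it does not have to be a strict comparison. The sketch below follows the same BaseMetric/BaseEvaluator pattern as the example above but swaps the exact-match rule for a token-overlap F1 score; the TokenOverlapMetric and OverlapEvaluator names and the scoring rule are illustrative, not part of gllm_evals.

import asyncio
from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.evaluator.evaluator import BaseEvaluator
from gllm_evals.types import MetricInput, MetricOutput, EvaluationOutput, QAData


class TokenOverlapMetric(BaseMetric):
    """Illustrative metric: F1 overlap between generated and expected tokens."""

    def __init__(self):
        self.name = "token_overlap_f1"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        generated = set(data["generated_response"].lower().split())
        expected = set(data["expected_response"].lower().split())
        if not generated or not expected:
            return {"score": 0.0}
        overlap = len(generated & expected)
        if overlap == 0:
            return {"score": 0.0}
        precision = overlap / len(generated)
        recall = overlap / len(expected)
        return {"score": round(2 * precision * recall / (precision + recall), 4)}


class OverlapEvaluator(BaseEvaluator):
    def __init__(self):
        super().__init__(name="overlap_evaluator")
        self.metric = TokenOverlapMetric()

    async def _evaluate(self, data: MetricInput) -> EvaluationOutput:
        return await self.metric.evaluate(data)


async def main():
    data = QAData(
        query="What is the capital of France?",
        generated_response="Paris is the capital of France.",
        expected_response="The capital of France is Paris.",
    )

    evaluator = OverlapEvaluator()
    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())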
Which Method Should You Use?
Use CustomEvaluator when you want to combine existing metrics. Extend BaseEvaluator when you need to implement custom evaluation logic from scratch.
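The two approaches can also be combined. Since CustomEvaluator takes a list of metric objects, a metric you write yourself can in principle sit alongside the built-in ones. The sketch below assumes CustomEvaluator accepts any BaseMetric implementation, not only the bundled metrics; the ExactMatchMetric reuse and the mixed_evaluator name are illustrative.

import asyncio
import os
from gllm_evals.evaluator.custom_evaluator import CustomEvaluator
from gllm_evals.metrics.generation.ragas_factual_correctness import RagasFactualCorrectness
from gllm_evals.metrics.metric import BaseMetric
from gllm_evals.types import MetricInput, MetricOutput, RAGData


class ExactMatchMetric(BaseMetric):
    """Scores 1 when the generated response exactly matches the expected one."""

    def __init__(self):
        self.name = "exact_match"

    async def _evaluate(self, data: MetricInput) -> MetricOutput:
        return {"score": int(data["generated_response"] == data["expected_response"])}


async def main():
    # Assumption: CustomEvaluator treats a custom BaseMetric the same way
    # as the built-in metrics.
    evaluator = CustomEvaluator(
        metrics=[
            ExactMatchMetric(),
            RagasFactualCorrectness(
                lm_model="openai/gpt-4.1",
                lm_model_credentials=os.getenv("OPENAI_API_KEY"),
            ),
        ],
        name="mixed_evaluator",
    )

    data = RAGData(
        query="When was the first super bowl?",
        generated_response="The first superbowl was held on Jan 15, 1967",
        retrieved_contexts=[
            "The First AFL-NFL World Championship Game was an American football game "
            "played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
        ],
        expected_response="The first superbowl was held on Jan 15, 1967",
    )

    result = await evaluator.evaluate(data)
    print(result)


if __name__ == "__main__":
    asyncio.run(main())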
Congratulations! You have successfully created your own custom evaluator.