RAGEvaluator

Use when: You want a single evaluator that scores both retrieval and generation quality for a RAG pipeline, combining LM-based DeepEval retrieval metrics with G-Eval generation metrics into an overall RAG rating, score, and issue hints.

By default, RAGEvaluator runs the LM-based retrieval evaluator (contextual precision, contextual recall) and the GEval generation evaluator (completeness, redundancy, groundedness, language consistency, refusal alignment), then applies a rule engine to classify the end-to-end RAG response. The default rule set is the GEval generation rule.

Fields:

  1. query (str) — The user question.

  2. expected_response (str) — The reference or ground truth answer.

  3. generated_response (str) — The model's generated answer to score.

  4. retrieved_context (str | list[str]) — The supporting context/documents used during retrieval. Strings are coerced into a single-element list (see the sketch after this list).

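Because retrieved_context accepts either a single string or a list of strings, the two payloads below are treated the same once the string is coerced into a single-element list. This is a minimal sketch reusing the RAGData type from the example that follows; the field values are illustrative only.

from gllm_evals.types import RAGData

# Passing a plain string for retrieved_context ...
single = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    generated_response="Paris is the capital of France.",
    retrieved_context="Paris is the capital city of France.",
)

# ... is equivalent to passing a one-element list explicitly.
as_list = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    generated_response="Paris is the capital of France.",
    retrieved_context=["Paris is the capital city of France."],
)
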
Example Usage

import asyncio
import os
from gllm_evals.evaluator.rag_evaluator import RAGEvaluator
from gllm_evals.types import RAGData

async def main():
    # API key for the evaluation model, read here from an environment variable.
    evaluator = RAGEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))

    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris is the capital of France.",
        generated_response="Paris is the capital of France.",
        retrieved_context=[
            "Paris is the capital city of France with a population of over 2 million people.",
            "Berlin is the capital of Germany.",
        ],
    )

    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output
