RAGEvaluator

Use when: You want a single evaluator that scores both retrieval and generation quality for a RAG pipeline, combining LM-based DeepEval retrieval metrics with G-Eval generation metrics into an overall RAG rating, score, and issue hints.

By default, RAGEvaluator runs the LM-based retrieval evaluator (contextual precision, contextual recall) and the GEval generation evaluator (completeness, redundancy, groundedness, language consistency, refusal alignment), then applies a rule engine to classify the end-to-end RAG response. The default rule set is the GEval generation rule.

Fields:

  1. query (str) — The user question.

  2. expected_response (str) — The reference or ground truth answer.

  3. generated_response (str) — The model's generated answer to score.

  4. retrieved_context (str | list[str]) — The supporting context/documents used during retrieval. Strings are coerced into a single-element list (see the sketch after this list).

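Because retrieved_context accepts either a single string or a list of strings, the two payloads below are treated the same once the string is coerced into a single-element list. This is a minimal sketch reusing the RAGData type from the example that follows; the field values are illustrative only.

from gllm_evals.types import RAGData

# Passing a plain string for retrieved_context ...
single = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    generated_response="Paris is the capital of France.",
    retrieved_context="Paris is the capital city of France.",
)

# ... is equivalent to passing a one-element list explicitly.
as_list = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    generated_response="Paris is the capital of France.",
    retrieved_context=["Paris is the capital city of France."],
)
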
Example Usage

import asyncio
import os
from gllm_evals.evaluator.rag_evaluator import RAGEvaluator
from gllm_evals.types import RAGData

async def main():
    # API key for the evaluation model, read here from an environment variable.
    evaluator = RAGEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))

    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris is the capital of France.",
        generated_response="Paris is the capital of France.",
        retrieved_context=[
            "Paris is the capital city of France with a population of over 2 million people.",
            "Berlin is the capital of Germany.",
        ],
    )

    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output
