LMBasedRetrievalEvaluator

Use when: You want to evaluate the retrieval step of a RAG pipeline with LM-based metrics. The evaluator combines the metric scores into a simple relevancy rating, a final score, and issue hints.

By default, LMBasedRetrievalEvaluator runs two metrics, contextual precision and contextual recall, and then applies a rule engine to classify the retrieval quality.

  1. Contextual Precision: DeepEval contextual precision, scored from 0 to 1. It checks whether relevant context is ranked above irrelevant context for the given query and expected answer. Requires query, expected_response, and retrieved_context.

  2. Contextual Recall: DeepEval contextual recall, scored from 0 to 1. It measures how well the retrieved context aligns with the expected answer. Requires query, expected_response, and retrieved_context. The default rule engine uses this metric to determine the retrieval relevancy rating (good / bad); a sketch of such a rule follows this list.
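
The rule engine's exact logic lives inside the library and is not documented here. The sketch below only illustrates how two scores like these could be turned into a relevancy rating, a final score, and issue hints; the function name, threshold, aggregation, and output keys are assumptions, not the actual implementation.

def classify_retrieval(precision: float, recall: float, threshold: float = 0.5) -> dict:
    # Hypothetical rule engine: the library's real thresholds and output keys may differ.
    issues = []
    if precision < threshold:
        issues.append("Relevant chunks are ranked below irrelevant ones (low contextual precision).")
    if recall < threshold:
        issues.append("Retrieved context is missing information needed for the expected answer (low contextual recall).")
    return {
        "relevancy": "good" if recall >= threshold else "bad",  # recall drives the rating
        "final_score": (precision + recall) / 2,                # illustrative aggregation
        "issues": issues,
    }

print(classify_retrieval(precision=0.75, recall=0.5))
# {'relevancy': 'good', 'final_score': 0.625, 'issues': []}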

Fields:

  1. query (str) — The user question.

  2. expected_response (str) — The reference or ground truth answer.

  3. retrieved_context (str | list[str]) — The supporting context/documents used during retrieval. Strings are coerced into a single-element list.
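
Since retrieved_context accepts either form, both constructions below are equivalent; the string form is coerced into a single-element list as noted above.

from gllm_evals.types import RAGData

# List form: one or more context chunks.
data_from_list = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    retrieved_context=["Paris is the capital city of France."],
)

# String form: coerced into ["Paris is the capital city of France."].
data_from_string = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    retrieved_context="Paris is the capital city of France.",
)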

Example Usage

import asyncio
import os
from gllm_evals.evaluator.lm_based_retrieval_evaluator import LMBasedRetrievalEvaluator
from gllm_evals.types import RAGData

async def main():
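    # Credentials for the underlying LM used by the metrics (here, a Google API key).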
    evaluator = LMBasedRetrievalEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))

    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris is the capital of France.",
        retrieved_context=[
            "Berlin is the capital of Germany.",
            "Paris is the capital city of France with a population of over 2 million people.",
            "London is the capital of the United Kingdom.",
        ],
    )

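    # Runs contextual precision and contextual recall, then applies the rule engine.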
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output

The result combines the contextual precision and contextual recall scores with the relevancy rating, final score, and issue hints produced by the rule engine.
