LMBasedRetrievalEvaluator

Use when: You want to evaluate the retrieval step of a RAG pipeline with LM-based metrics. The evaluator combines the metric scores into a simple relevancy rating, a final score, and issue hints.

By default, LMBasedRetrievalEvaluator runs two metrics, contextual precision and contextual recall, and then applies a rule engine to classify the retrieval quality.

  1. Contextual Precision: DeepEval contextual precision, scored from 0 to 1. It checks whether relevant context is ranked above irrelevant context for the given query and expected answer. Requires query, expected_response, and retrieved_context.

  2. Contextual Recall: DeepEval contextual recall, scored from 0 to 1. It measures how well the retrieved context aligns with the expected answer. Requires query, expected_response, and retrieved_context. The default rule engine uses this metric to determine the retrieval relevancy rating (good / bad); a sketch of such a rule follows this list.
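
The rule engine's exact logic lives inside the library and is not documented here. The sketch below only illustrates how two scores like these could be turned into a relevancy rating, a final score, and issue hints; the function name, threshold, aggregation, and output keys are assumptions, not the actual implementation.

def classify_retrieval(precision: float, recall: float, threshold: float = 0.5) -> dict:
    # Hypothetical rule engine: the library's real thresholds and output keys may differ.
    issues = []
    if precision < threshold:
        issues.append("Relevant chunks are ranked below irrelevant ones (low contextual precision).")
    if recall < threshold:
        issues.append("Retrieved context is missing information needed for the expected answer (low contextual recall).")
    return {
        "relevancy": "good" if recall >= threshold else "bad",  # recall drives the rating
        "final_score": (precision + recall) / 2,                # illustrative aggregation
        "issues": issues,
    }

print(classify_retrieval(precision=0.75, recall=0.5))
# {'relevancy': 'good', 'final_score': 0.625, 'issues': []}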

Fields:

  1. query (str) — The user question.

  2. expected_response (str) — The reference or ground truth answer.

  3. retrieved_context (str | list[str]) — The supporting context/documents used during retrieval. Strings are coerced into a single-element list.
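
Since retrieved_context accepts either form, both constructions below are equivalent; the string form is coerced into a single-element list as noted above.

from gllm_evals.types import RAGData

# List form: one or more context chunks.
data_from_list = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    retrieved_context=["Paris is the capital city of France."],
)

# String form: coerced into ["Paris is the capital city of France."].
data_from_string = RAGData(
    query="What is the capital of France?",
    expected_response="Paris is the capital of France.",
    retrieved_context="Paris is the capital city of France.",
)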

Example Usage

import asyncio
import os
from gllm_evals.evaluator.lm_based_retrieval_evaluator import LMBasedRetrievalEvaluator
from gllm_evals.types import RAGData

async def main():
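    # Credentials for the underlying LM used by the metrics (here, a Google API key).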
    evaluator = LMBasedRetrievalEvaluator(model_credentials=os.getenv("GOOGLE_API_KEY"))

    data = RAGData(
        query="What is the capital of France?",
        expected_response="Paris is the capital of France.",
        retrieved_context=[
            "Berlin is the capital of Germany.",
            "Paris is the capital city of France with a population of over 2 million people.",
            "London is the capital of the United Kingdom.",
        ],
    )

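    # Runs contextual precision and contextual recall, then applies the rule engine.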
    result = await evaluator.evaluate(data)
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Example Output

The result combines the contextual precision and contextual recall scores with the relevancy rating, final score, and issue hints produced by the rule engine.
