GEvalGenerationEvaluator

Use when: You want to evaluate the response or answer of a QnA system, such as a general chatbot, a RAG system, or an agent that answers specific questions. This evaluator focuses on assessing the quality of the answer the QnA system provides.

By default, GEvalGenerationEvaluator runs five metrics: completeness, groundedness, redundancy, language consistency, and refusal alignment.

  1. Completeness: This is deepeval's G-Eval completeness score. The score ranges from 1 to 3: 1 means not complete, 2 means partially complete, and 3 means complete. It requires query, generated_response, and expected_response.

  2. Redundancy: This is deepeval's G-Eval redundancy score. The score ranges from 1 to 3: 1 means no redundancy, 2 means at least one redundant statement, and 3 means high redundancy. It requires query and generated_response.

  3. Groundedness: This is deepeval's G-Eval groundedness score. The score ranges from 1 to 3: 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded. It requires query, generated_response, and retrieved_context.

  4. Language Consistency: This is deepeval's G-Eval language consistency score. The score ranges from 0 to 1: 0 means not consistent, and 1 means fully consistent. It requires query and generated_response.

  5. Refusal Alignment: This is deepeval's G-Eval refusal alignment score. The score ranges from 0 to 1: 1 indicates correct alignment (the generated and expected responses are both refusals, or both non-refusals), and 0 indicates misalignment (one is a refusal and the other is not). It requires query, generated_response, and expected_response. The inputs each metric needs are summarized below.
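For quick reference, the inputs each default metric needs can be collected in one place. This is an illustrative summary compiled from the descriptions above; the snake_case metric keys are an assumption, and the dictionary is not an object exposed by the library.

```python
# Illustrative summary of the inputs each default metric needs,
# compiled from the metric descriptions above. The snake_case keys
# are an assumption; this is not a library object.
REQUIRED_FIELDS = {
    "completeness":         ["query", "generated_response", "expected_response"],
    "redundancy":           ["query", "generated_response"],
    "groundedness":         ["query", "generated_response", "retrieved_context"],
    "language_consistency": ["query", "generated_response"],
    "refusal_alignment":    ["query", "generated_response", "expected_response"],
}
```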

Fields:

  1. query (str) — The user question.

  2. generated_response (str) — The model's output to be evaluated.

  3. expected_response (str, optional) — The reference or ground truth answer.

  4. retrieved_context (str, optional) — The supporting context/documents used during generation.
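A minimal input record using these fields might look like the following. The dict shape is only illustrative; this page documents the field names, not how they are passed to the evaluator.

```python
# Illustrative input record. The field names come from the list above;
# representing them as a plain dict is an assumption, not the
# evaluator's required input type.
record = {
    "query": "What is the capital of France?",
    "generated_response": "The capital of France is Paris.",
    # Optional: needed by completeness and refusal alignment.
    "expected_response": "Paris",
    # Optional: needed by groundedness.
    "retrieved_context": "Paris is the capital and largest city of France.",
}
```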

Output

GEvalGenerationEvaluator returns a score for each enabled metric, together with an explanation of how that score was assigned. It also provides a normalized_score for each metric, ranging from 0 to 1. In aggregation, three scores are provided:

  • score: The default score, derived from a rule-of-three mapping of the relevancy_rating class:

    • bad: 0

    • incomplete: 0.5

    • good: 1

  • binary_score: This score is 1 when the relevancy_rating is good, and 0 for any other relevancy_rating class.

  • avg_score: The average of the normalized_score values across all enabled metrics (a sketch of this aggregation follows).
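As a sketch of the aggregation logic described above: the rule-of-three mapping and the binary_score rule come directly from this page, while the linear 1-3 to 0-1 normalization is an assumption about how normalized_score is computed.

```python
# Sketch of the aggregation described above, in plain Python.
# The rule-of-three mapping (bad=0, incomplete=0.5, good=1) and the
# binary_score rule come from this page; the linear 1-3 -> 0-1
# normalization is an assumption about how normalized_score is computed.

RATING_TO_SCORE = {"bad": 0.0, "incomplete": 0.5, "good": 1.0}

def score(relevancy_rating: str) -> float:
    """Default score from the three-class relevancy_rating."""
    return RATING_TO_SCORE[relevancy_rating]

def binary_score(relevancy_rating: str) -> int:
    """1 only when the rating is 'good'; 0 for any other class."""
    return 1 if relevancy_rating == "good" else 0

def normalize(raw: float, low: float = 1.0, high: float = 3.0) -> float:
    """Assumed linear mapping of a raw 1-3 metric score onto 0-1."""
    return (raw - low) / (high - low)

def avg_score(normalized_scores: list[float]) -> float:
    """Average normalized_score across the enabled metrics."""
    return sum(normalized_scores) / len(normalized_scores)
```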

Example Usage
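A minimal usage sketch, assuming the evaluator can be constructed with defaults and exposes an evaluate method that accepts the documented fields. The import path, constructor, evaluate method, and result handling are assumptions, not confirmed signatures; only the class name and the field names come from this page.

```python
# Minimal usage sketch. Only the class name and the four field names
# are documented; the import path, constructor, and evaluate() call
# below are assumptions about the API.
from your_eval_package import GEvalGenerationEvaluator  # hypothetical import path

evaluator = GEvalGenerationEvaluator()  # assumed default construction

result = evaluator.evaluate(  # assumed method name
    query="What is the capital of France?",
    generated_response="The capital of France is Paris.",
    expected_response="Paris",
    retrieved_context="Paris is the capital and largest city of France.",
)

print(result)  # see Example Output below for an illustrative shape
```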

Example Output
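Based on the Output section above, a result could be shaped roughly like the following. Only the documented pieces are taken from this page (a per-metric score with an explanation, a per-metric normalized_score, and the aggregate score, binary_score, and avg_score); the key names, nesting, and example values are illustrative.

```python
# Illustrative result shape, not actual library output. Documented
# pieces: per-metric score + explanation, per-metric normalized_score,
# and the aggregate score / binary_score / avg_score.
example_result = {
    "completeness":         {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "groundedness":         {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    # Assumes redundancy's 1-3 scale (1 = no redundancy) is inverted
    # during normalization, so the best raw score maps to 1.0.
    "redundancy":           {"score": 1, "normalized_score": 1.0, "explanation": "..."},
    "language_consistency": {"score": 1, "normalized_score": 1.0, "explanation": "..."},
    "refusal_alignment":    {"score": 1, "normalized_score": 1.0, "explanation": "..."},
    "score": 1.0,       # rule-of-three mapping of relevancy_rating ("good" -> 1)
    "binary_score": 1,  # 1 only when relevancy_rating is "good"
    "avg_score": 1.0,   # mean of the per-metric normalized_score values
}
```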
