GEvalGenerationEvaluator

Use when: You want to evaluate the response or answer of a QnA system, such as a general chatbot, a RAG system, or an agent that answers specific questions. This evaluator focuses on assessing the quality of the answer the QnA system provides.

By default, GEvalGenerationEvaluator runs five metrics: completeness, groundedness, redundancy, language consistency, and refusal alignment.

  1. Completeness: This is deepeval's G-Eval completeness score. The score ranges from 1 to 3: 1 means not complete, 2 means partially complete, and 3 means complete. It requires query, generated_response, and expected_response.

  2. Redundancy: This is deepeval's G-Eval redundancy score. The score ranges from 1 to 3: 1 means no redundancy, 2 means at least one redundant statement, and 3 means high redundancy. It requires query and generated_response.

  3. Groundedness: This is deepeval's G-Eval groundedness score. The score ranges from 1 to 3: 1 means not grounded, 2 means partially grounded (at least one claim is grounded), and 3 means fully grounded. It requires query, generated_response, and retrieved_context.

  4. Language Consistency: This is deepeval's G-Eval language consistency score. The score ranges from 0 to 1: 0 means not consistent, and 1 means fully consistent. It requires query and generated_response.

  5. Refusal Alignment: This is deepeval's G-Eval refusal alignment score. The score ranges from 0 to 1: 1 indicates correct alignment (the generated and expected responses are both refusals, or both non-refusals), and 0 indicates misalignment (one is a refusal and the other is not). It requires query, generated_response, and expected_response. The inputs each metric needs are summarized below.
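For quick reference, the inputs each default metric needs can be collected in one place. This is an illustrative summary compiled from the descriptions above; the snake_case metric keys are an assumption, and the dictionary is not an object exposed by the library.

```python
# Illustrative summary of the inputs each default metric needs,
# compiled from the metric descriptions above. The snake_case keys
# are an assumption; this is not a library object.
REQUIRED_FIELDS = {
    "completeness":         ["query", "generated_response", "expected_response"],
    "redundancy":           ["query", "generated_response"],
    "groundedness":         ["query", "generated_response", "retrieved_context"],
    "language_consistency": ["query", "generated_response"],
    "refusal_alignment":    ["query", "generated_response", "expected_response"],
}
```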

Fields:

  1. query (str) — The user question.

  2. generated_response (str) — The model's output to be evaluated.

  3. expected_response (str, optional) — The reference or ground truth answer.

  4. retrieved_context (str, optional) — The supporting context/documents used during generation.
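A minimal input record using these fields might look like the following. The dict shape is only illustrative; this page documents the field names, not how they are passed to the evaluator.

```python
# Illustrative input record. The field names come from the list above;
# representing them as a plain dict is an assumption, not the
# evaluator's required input type.
record = {
    "query": "What is the capital of France?",
    "generated_response": "The capital of France is Paris.",
    # Optional: needed by completeness and refusal alignment.
    "expected_response": "Paris",
    # Optional: needed by groundedness.
    "retrieved_context": "Paris is the capital and largest city of France.",
}
```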

Output

GEvalGenerationEvaluator returns a score for each enabled metric, together with an explanation of how that score was assigned. It also provides a normalized_score for each metric, ranging from 0 to 1. In aggregation, three scores are provided:

  • score: The default score, derived from a rule-of-three mapping of the relevancy_rating class:

    • bad: 0

    • incomplete: 0.5

    • good: 1

  • binary_score: This score is 1 when the relevancy_rating is good, and 0 for any other relevancy_rating class.

  • avg_score: The average of the normalized_score values across all enabled metrics (a sketch of this aggregation follows).
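As a sketch of the aggregation logic described above: the rule-of-three mapping and the binary_score rule come directly from this page, while the linear 1-3 to 0-1 normalization is an assumption about how normalized_score is computed.

```python
# Sketch of the aggregation described above, in plain Python.
# The rule-of-three mapping (bad=0, incomplete=0.5, good=1) and the
# binary_score rule come from this page; the linear 1-3 -> 0-1
# normalization is an assumption about how normalized_score is computed.

RATING_TO_SCORE = {"bad": 0.0, "incomplete": 0.5, "good": 1.0}

def score(relevancy_rating: str) -> float:
    """Default score from the three-class relevancy_rating."""
    return RATING_TO_SCORE[relevancy_rating]

def binary_score(relevancy_rating: str) -> int:
    """1 only when the rating is 'good'; 0 for any other class."""
    return 1 if relevancy_rating == "good" else 0

def normalize(raw: float, low: float = 1.0, high: float = 3.0) -> float:
    """Assumed linear mapping of a raw 1-3 metric score onto 0-1."""
    return (raw - low) / (high - low)

def avg_score(normalized_scores: list[float]) -> float:
    """Average normalized_score across the enabled metrics."""
    return sum(normalized_scores) / len(normalized_scores)
```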

Example Usage
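A minimal usage sketch, assuming the evaluator can be constructed with defaults and exposes an evaluate method that accepts the documented fields. The import path, constructor, evaluate method, and result handling are assumptions, not confirmed signatures; only the class name and the field names come from this page.

```python
# Minimal usage sketch. Only the class name and the four field names
# are documented; the import path, constructor, and evaluate() call
# below are assumptions about the API.
from your_eval_package import GEvalGenerationEvaluator  # hypothetical import path

evaluator = GEvalGenerationEvaluator()  # assumed default construction

result = evaluator.evaluate(  # assumed method name
    query="What is the capital of France?",
    generated_response="The capital of France is Paris.",
    expected_response="Paris",
    retrieved_context="Paris is the capital and largest city of France.",
)

print(result)  # see Example Output below for an illustrative shape
```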

Example Output
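Based on the Output section above, a result could be shaped roughly like the following. Only the documented pieces are taken from this page (a per-metric score with an explanation, a per-metric normalized_score, and the aggregate score, binary_score, and avg_score); the key names, nesting, and example values are illustrative.

```python
# Illustrative result shape, not actual library output. Documented
# pieces: per-metric score + explanation, per-metric normalized_score,
# and the aggregate score / binary_score / avg_score.
example_result = {
    "completeness":         {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "groundedness":         {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    # Assumes redundancy's 1-3 scale (1 = no redundancy) is inverted
    # during normalization, so the best raw score maps to 1.0.
    "redundancy":           {"score": 1, "normalized_score": 1.0, "explanation": "..."},
    "language_consistency": {"score": 1, "normalized_score": 1.0, "explanation": "..."},
    "refusal_alignment":    {"score": 1, "normalized_score": 1.0, "explanation": "..."},
    "score": 1.0,       # rule-of-three mapping of relevancy_rating ("good" -> 1)
    "binary_score": 1,  # 1 only when relevancy_rating is "good"
    "avg_score": 1.0,   # mean of the per-metric normalized_score values
}
```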
