SummarizationEvaluator

Use when: You want to evaluate the quality of a generated summary against its source text (e.g., meeting transcripts).

By default, SummarizationEvaluator runs four metrics: coherence, consistency, relevance, and fluency.

  1. Coherence: GEval summarization coherence score. The score is between 1 and 3. It assesses whether the summary is logically organized, flows smoothly, and maintains clear semantic links across sections. 1 means fragmented flow, 2 means mostly coherent with partial disconnects, and 3 means fully coherent. It requires both input and summary.

  2. Consistency: GEval summarization consistency score. The score is between 1 and 3. It assesses factual alignment between summary claims and the source transcript, without unsupported additions. 1 means hallucinations dominate, 2 means some hallucinations present, and 3 means zero hallucinations. It requires both input and summary.

  3. Relevance: GEval summarization relevance score. The score is between 1 and 3. It assesses how completely the summary captures the important information from the source transcript and how focused it stays on that information. 1 means major omissions, 2 means partial coverage, and 3 means complete and focused coverage. It requires both input and summary.

  4. Fluency: GEval summarization fluency score. The score is between 1 and 3. It assesses the readability, naturalness, grammar, and clarity of the summary text. 1 means major readability problems, 2 means minor language issues, and 3 means clear and natural language. It requires both input and summary.

Fields:

  1. input (str) — The source text to be summarized (e.g., a meeting transcript).

  2. summary (str) — The generated summary to be evaluated.

Output:

SummarizationEvaluator returns a score for each enabled metric, together with an explanation of how that score was assigned under the scoring system. It also provides a normalized_score for each metric, ranging from 0 to 1. During aggregation, three scores are provided:

  • score: The default score, determined by the rule engine based on the relevancy_rating

    • bad: 0

    • incomplete: 0.5

    • good: 1

  • binary_score: 1 when the relevancy_rating is good, and 0 for any other rating.

  • avg_score: The average of the normalized_score values across all enabled metrics.

The rule engine determines the relevancy_rating as follows (see the sketch after this list):

  • good: coherence >= 2 AND consistency >= 3 AND relevance >= 3 AND fluency >= 2

  • bad: coherence <= 1 OR consistency <= 1 OR relevance <= 1 OR fluency <= 1

  • incomplete: anything in between
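The rules above can be condensed into a short sketch. This is purely illustrative Python, not the library's implementation; the function names, dictionary layout, and the linear normalization of the 1-3 scale onto 0-1 are assumptions.

```python
def relevancy_rating(scores: dict) -> str:
    """Classify a result from the per-metric 1-3 scores (only enabled metrics are present)."""
    # bad: any enabled metric is at its minimum
    if any(value <= 1 for value in scores.values()):
        return "bad"
    # good: every enabled metric meets its threshold
    thresholds = {"coherence": 2, "consistency": 3, "relevance": 3, "fluency": 2}
    if all(scores[m] >= thresholds[m] for m in scores):
        return "good"
    # incomplete: anything in between
    return "incomplete"


def aggregate(scores: dict) -> dict:
    """Derive the three aggregate scores from the per-metric scores."""
    rating = relevancy_rating(scores)
    # Assumed linear normalization of the 1-3 scale onto 0-1.
    normalized = {m: (s - 1) / 2 for m, s in scores.items()}
    return {
        "relevancy_rating": rating,
        "score": {"bad": 0, "incomplete": 0.5, "good": 1}[rating],
        "binary_score": 1 if rating == "good" else 0,
        "avg_score": sum(normalized.values()) / len(normalized),
    }


aggregate({"coherence": 3, "consistency": 2, "relevance": 3, "fluency": 3})
# -> {'relevancy_rating': 'incomplete', 'score': 0.5, 'binary_score': 0, 'avg_score': 0.875}
```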

Example Usage
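A minimal usage sketch is shown below. The import path, constructor, and evaluate() call are assumptions for illustration, not the library's confirmed API.

```python
# Illustrative sketch only -- the import path and method names are assumptions.
from evaluators import SummarizationEvaluator  # hypothetical import path

evaluator = SummarizationEvaluator()

# Evaluate a list of records, each holding the two required fields.
dataset = [
    {
        "input": "Full meeting transcript goes here ...",
        "summary": "The team agreed to ship the release on Friday.",
    },
]

results = evaluator.evaluate(dataset)
```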

Alternatively, you can provide the data directly:
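Again a sketch; the keyword arguments simply mirror the input and summary fields described above, and the call signature is an assumption.

```python
# Illustrative sketch -- the call signature is an assumption.
result = evaluator.evaluate(
    input="Full meeting transcript goes here ...",
    summary="The team agreed to ship the release on Friday.",
)
```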

Example Output
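The structure below is illustrative only: the field names follow the description above, but the exact layout, explanations, and numbers are placeholders rather than real output.

```python
# Illustrative output shape only -- layout and numbers are placeholders.
{
    "coherence": {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "consistency": {"score": 2, "normalized_score": 0.5, "explanation": "..."},
    "relevance": {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "fluency": {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "relevancy_rating": "incomplete",
    "score": 0.5,
    "binary_score": 0,
    "avg_score": 0.875,
}
```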

Enabling Specific Metrics

You can enable only a subset of the four metrics:
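For example, a sketch of enabling only two metrics; the metrics parameter name is an assumption for illustration.

```python
# Illustrative sketch -- the constructor parameter name is an assumption.
evaluator = SummarizationEvaluator(metrics=["consistency", "relevance"])

result = evaluator.evaluate(
    input="Full meeting transcript goes here ...",
    summary="The team agreed to ship the release on Friday.",
)
```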

When specific metrics are enabled, the rule engine automatically adjusts to consider only those metrics when determining the relevancy_rating.
