SummarizationEvaluator
Use when: You want to evaluate the quality of a generated summary against its source text (e.g., meeting transcripts).
By default, SummarizationEvaluator runs four metrics: coherence, consistency, relevance, and fluency.
Coherence: GEval summarization coherence score. The score is between 1 and 3. It assesses whether the summary is logically organized, flows smoothly, and maintains clear semantic links across sections. 1 means fragmented flow, 2 means mostly coherent with partial disconnects, and 3 means fully coherent. It needs input and summary to work.
Consistency: GEval summarization consistency score. The score is between 1 and 3. It assesses factual alignment between summary claims and the source transcript without unsupported additions. 1 means hallucinations dominate, 2 means some hallucinations present, and 3 means zero hallucinations. It needs input and summary to work.
Relevance: GEval summarization relevance score. The score is between 1 and 3. It assesses how completely the summary captures important information from the source transcript and how focused it stays on that information. 1 means major omissions, 2 means partial coverage, and 3 means complete and focused coverage. It needs input and summary to work.
Fluency: GEval summarization fluency score. The score is between 1 and 3. It assesses readability, naturalness, grammar, and clarity of the summary text. 1 means major readability problems, 2 means minor language issues, and 3 means clear and natural language. It needs input and summary to work.
Fields:
input (str) — The source text to be summarized (e.g., a meeting transcript).
summary (str) — The generated summary to be evaluated.
Output:
SummarizationEvaluator returns a score for each enabled metric, together with an explanation of how that score was reached. It also provides a normalized_score that ranges between 0 and 1. In aggregation, three scores are provided:
score: The default score, determined by the rule engine from the relevancy_rating:
bad: 0
incomplete: 0.5
good: 1
binary_score: This score is 1 when the relevancy_rating is good, and 0 otherwise.
avg_score: This score averages the normalized_score of each metric that is enabled.
The rule engine determines the relevancy_rating as follows:
good: coherence >= 2 AND consistency >= 3 AND relevance >= 3 AND fluency >= 2
bad: coherence <= 1 OR consistency <= 1 OR relevance <= 1 OR fluency <= 1
incomplete: anything in between
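The rule engine above can be sketched in a few lines. This is an illustrative reimplementation, not the library's actual code; the function names and the assumption that normalized_score maps the 1–3 scale linearly onto 0–1 are my own.

```python
# Sketch of the relevancy_rating rule engine and aggregation described above.
# Hypothetical code for illustration only -- not the library's implementation.

# Minimum per-metric score required for a "good" rating, per the rules above.
GOOD_THRESHOLDS = {"coherence": 2, "consistency": 3, "relevance": 3, "fluency": 2}


def relevancy_rating(scores: dict[str, int]) -> str:
    """Classify 1-3 metric scores into good / bad / incomplete.

    Only the metrics present in `scores` (i.e. the enabled ones) are considered.
    """
    if all(v >= GOOD_THRESHOLDS[k] for k, v in scores.items()):
        return "good"
    if any(v <= 1 for v in scores.values()):
        return "bad"
    return "incomplete"


def aggregate(scores: dict[str, int]) -> dict[str, float]:
    """Produce the three aggregate scores from the per-metric scores."""
    rating = relevancy_rating(scores)
    return {
        "score": {"bad": 0.0, "incomplete": 0.5, "good": 1.0}[rating],
        "binary_score": 1.0 if rating == "good" else 0.0,
        # Assumes normalized_score = (score - 1) / 2, mapping 1-3 onto 0-1.
        "avg_score": sum((v - 1) / 2 for v in scores.values()) / len(scores),
    }
```

For example, a summary scoring 3 on every metric rates good (score 1, binary_score 1, avg_score 1.0), while one with consistency 2 and the rest 3 falls to incomplete (score 0.5, binary_score 0).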
Example Usage
Or you can provide the data directly
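A minimal sketch of data shaped for the evaluator's two fields. The field names input and summary come from this page; how the records are actually passed to SummarizationEvaluator is an assumption.

```python
# Records shaped for SummarizationEvaluator's fields (input, summary).
# The list-of-dicts shape is an assumption for illustration.
data = [
    {
        "input": "Alice: Let's move the launch to June. Bob: Agreed, June it is.",
        "summary": "The team agreed to move the launch to June.",
    }
]

# Both fields must be non-empty strings for every record.
for row in data:
    for field in ("input", "summary"):
        assert isinstance(row[field], str) and row[field]
```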
Example Output
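The shape below is a hypothetical illustration of a result for a fully "good" summary, built from the scores described on this page; the exact key names in real output may differ.

```python
# Hypothetical result shape -- key names are assumptions, not captured output.
result = {
    "coherence": {"score": 3, "normalized_score": 1.0},
    "consistency": {"score": 3, "normalized_score": 1.0},
    "relevance": {"score": 3, "normalized_score": 1.0},
    "fluency": {"score": 3, "normalized_score": 1.0},
    "relevancy_rating": "good",
    "score": 1.0,         # good -> 1 under the rule engine
    "binary_score": 1.0,  # 1 only when the rating is good
    "avg_score": 1.0,     # mean of the per-metric normalized scores
}
```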
Enabling Specific Metrics
You can enable only a subset of the four metrics:
When specific metrics are enabled, the rule book automatically adjusts to only consider those metrics for the relevancy_rating determination.
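The adjustment can be sketched as follows: with only a subset enabled, the rule book simply never looks at the disabled metrics. This is illustrative logic, and the idea of passing a metric subset at construction time is an assumption.

```python
# Sketch of the adjusted rule book when only some metrics are enabled.
# Hypothetical code for illustration only.
GOOD_THRESHOLDS = {"coherence": 2, "consistency": 3, "relevance": 3, "fluency": 2}


def rating(scores: dict[str, int]) -> str:
    """Rate using only the metrics actually present in `scores`."""
    if any(v <= 1 for v in scores.values()):
        return "bad"
    if all(v >= GOOD_THRESHOLDS[k] for k, v in scores.items()):
        return "good"
    return "incomplete"


# With only consistency and relevance enabled, coherence and fluency
# are ignored entirely by the relevancy_rating determination:
print(rating({"consistency": 3, "relevance": 3}))  # -> good
```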