SummarizationEvaluator

Use when: You want to evaluate the quality of a generated summary against its source text (e.g., meeting transcripts).

By default, SummarizationEvaluator runs four metrics: coherence, consistency, relevance, and fluency.

  1. Coherence: GEval summarization coherence score. The score is between 1 and 3. It assesses whether the summary is logically organized, flows smoothly, and maintains clear semantic links across sections. 1 means fragmented flow, 2 means mostly coherent with partial disconnects, and 3 means fully coherent. It requires both input and summary.

  2. Consistency: GEval summarization consistency score. The score is between 1 and 3. It assesses factual alignment between summary claims and the source transcript, without unsupported additions. 1 means hallucinations dominate, 2 means some hallucinations present, and 3 means zero hallucinations. It requires both input and summary.

  3. Relevance: GEval summarization relevance score. The score is between 1 and 3. It assesses how completely the summary captures the important information from the source transcript and how focused it stays on that information. 1 means major omissions, 2 means partial coverage, and 3 means complete and focused coverage. It requires both input and summary.

  4. Fluency: GEval summarization fluency score. The score is between 1 and 3. It assesses the readability, naturalness, grammar, and clarity of the summary text. 1 means major readability problems, 2 means minor language issues, and 3 means clear and natural language. It requires both input and summary.

Fields:

  1. input (str) — The source text to be summarized (e.g., a meeting transcript).

  2. summary (str) — The generated summary to be evaluated.

Output:

SummarizationEvaluator returns a score for each enabled metric, together with an explanation of how that score was assigned under the scoring system. It also provides a normalized_score for each metric, ranging from 0 to 1. During aggregation, three scores are provided:

  • score: The default score, determined by the rule engine based on the relevancy_rating

    • bad: 0

    • incomplete: 0.5

    • good: 1

  • binary_score: 1 when the relevancy_rating is good, and 0 for any other rating.

  • avg_score: The average of the normalized_score values across all enabled metrics.

The rule engine determines the relevancy_rating as follows (see the sketch after this list):

  • good: coherence >= 2 AND consistency >= 3 AND relevance >= 3 AND fluency >= 2

  • bad: coherence <= 1 OR consistency <= 1 OR relevance <= 1 OR fluency <= 1

  • incomplete: anything in between
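The rules above can be condensed into a short sketch. This is purely illustrative Python, not the library's implementation; the function names, dictionary layout, and the linear normalization of the 1-3 scale onto 0-1 are assumptions.

```python
def relevancy_rating(scores: dict) -> str:
    """Classify a result from the per-metric 1-3 scores (only enabled metrics are present)."""
    # bad: any enabled metric is at its minimum
    if any(value <= 1 for value in scores.values()):
        return "bad"
    # good: every enabled metric meets its threshold
    thresholds = {"coherence": 2, "consistency": 3, "relevance": 3, "fluency": 2}
    if all(scores[m] >= thresholds[m] for m in scores):
        return "good"
    # incomplete: anything in between
    return "incomplete"


def aggregate(scores: dict) -> dict:
    """Derive the three aggregate scores from the per-metric scores."""
    rating = relevancy_rating(scores)
    # Assumed linear normalization of the 1-3 scale onto 0-1.
    normalized = {m: (s - 1) / 2 for m, s in scores.items()}
    return {
        "relevancy_rating": rating,
        "score": {"bad": 0, "incomplete": 0.5, "good": 1}[rating],
        "binary_score": 1 if rating == "good" else 0,
        "avg_score": sum(normalized.values()) / len(normalized),
    }


aggregate({"coherence": 3, "consistency": 2, "relevance": 3, "fluency": 3})
# -> {'relevancy_rating': 'incomplete', 'score': 0.5, 'binary_score': 0, 'avg_score': 0.875}
```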

Example Usage
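A minimal usage sketch is shown below. The import path, constructor, and evaluate() call are assumptions for illustration, not the library's confirmed API.

```python
# Illustrative sketch only -- the import path and method names are assumptions.
from evaluators import SummarizationEvaluator  # hypothetical import path

evaluator = SummarizationEvaluator()

# Evaluate a list of records, each holding the two required fields.
dataset = [
    {
        "input": "Full meeting transcript goes here ...",
        "summary": "The team agreed to ship the release on Friday.",
    },
]

results = evaluator.evaluate(dataset)
```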

Alternatively, you can provide the data directly:
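Again a sketch; the keyword arguments simply mirror the input and summary fields described above, and the call signature is an assumption.

```python
# Illustrative sketch -- the call signature is an assumption.
result = evaluator.evaluate(
    input="Full meeting transcript goes here ...",
    summary="The team agreed to ship the release on Friday.",
)
```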

Example Output
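The structure below is illustrative only: the field names follow the description above, but the exact layout, explanations, and numbers are placeholders rather than real output.

```python
# Illustrative output shape only -- layout and numbers are placeholders.
{
    "coherence": {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "consistency": {"score": 2, "normalized_score": 0.5, "explanation": "..."},
    "relevance": {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "fluency": {"score": 3, "normalized_score": 1.0, "explanation": "..."},
    "relevancy_rating": "incomplete",
    "score": 0.5,
    "binary_score": 0,
    "avg_score": 0.875,
}
```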

Enabling Specific Metrics

You can enable only a subset of the four metrics:
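For example, a sketch of enabling only two metrics; the metrics parameter name is an assumption for illustration.

```python
# Illustrative sketch -- the constructor parameter name is an assumption.
evaluator = SummarizationEvaluator(metrics=["consistency", "relevance"])

result = evaluator.evaluate(
    input="Full meeting transcript goes here ...",
    summary="The team agreed to ship the release on Friday.",
)
```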

When specific metrics are enabled, the rule engine automatically adjusts to consider only those metrics when determining the relevancy_rating.
