⚖️ Multiple LLM-as-a-Judge
Multiple LLM-as-a-Judge is an evaluation approach that runs several language models as judges on the same task in parallel and aggregates their results using ensemble methods. Compared with a single judge, it aligns more closely with human judgment and can significantly accelerate human annotation workflows.
Key Benefits
Higher Alignment: Multiple judges provide more reliable and consistent evaluations compared to single-judge approaches.
Faster Human Annotation: Humans can focus on scoring only the cases where the agreement score is below 100%, reducing annotation workload (see the sketch after this list).
Human Alignment: When the agreement score reaches 100%, alignment with human judgment is high, as described in this report.
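For example, an annotation workflow can route only low-agreement cases to human reviewers. The following is a minimal triage sketch, assuming each evaluation result exposes its agreement score under an agreement_score key in the 0.0–1.0 range; the field name and result structure are assumptions for illustration, not the library's documented schema.
# Hypothetical triage loop: "results" is assumed to be a list of evaluation
# results and "agreement_score" an assumed field name, both for illustration.
needs_human_review = []
auto_accepted = []
for result in results:
    if result["agreement_score"] < 1.0:
        # Judges disagreed: send this case to a human annotator.
        needs_human_review.append(result)
    else:
        # Unanimous judges: treat as already aligned with human judgment.
        auto_accepted.append(result)
print(f"{len(needs_human_review)} cases routed to human annotators")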
The current module supports both categorical and numeric evaluations, with flexible ensemble methods for result aggregation.
Example Usage
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.judge.multiple_llm_as_judge import MultipleLLMAsJudge
from gllm_evals import load_simple_qa_dataset
judge_models = [
    {
        "provider_model_id": "openai/gpt-5",
        "model_credentials": os.getenv("OPENAI_API_KEY"),
    },
    {
        "provider_model_id": "openai/gpt-4.1",
        "model_credentials": os.getenv("OPENAI_API_KEY"),
    },
]

data = load_simple_qa_dataset()

evaluator = GEvalGenerationEvaluator(
    judge=MultipleLLMAsJudge(judge_models=judge_models),
    model_credentials=os.getenv("OPENAI_API_KEY"),
)

# evaluate() is a coroutine; await it inside an async function or a running event loop.
result = await evaluator.evaluate(data.dataset[0])
print(result)
Example Output
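The exact fields in the returned result depend on the evaluator and metric configuration. A hypothetical output shape, sketched from the scoring steps described below (aggregated score, agreement score, judge variance, and per-judge results), might look like the following; the key names are assumptions, not the library's documented schema.
# Illustrative only; actual keys and values depend on the evaluator configuration.
{
    "score": 4,               # ensemble score, e.g. weighted median of judge scores
    "agreement_score": 1.0,   # 1.0 means all judges agreed
    "judge_variance": 0.0,    # variance across the individual judge scores
    "judge_results": [...],   # one entry per judge model
}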
How Scoring Works
1. Collect judge results from the multiple LLM judges.
2. Apply the ensemble method:
   Median: uses the weighted median of the judges' scores (default).
   Average Rounded: uses the weighted average of the judges' scores, rounded.
3. Calculate the agreement score to measure consensus among judges (see the sketch after this list):
   For a categorical ensemble: the percentage of judges that gave the same categorical rating.
   For a numerical ensemble: max(0.0, 1.0 - coefficient_of_variation), so lower variation means higher agreement.
4. Calculate the judge variance for statistical analysis.
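To make these steps concrete, here is a minimal, self-contained sketch of the aggregation logic described above. It is not the library's implementation: the function names are hypothetical, judges are given equal weight, and the population standard deviation is assumed for the coefficient of variation.
import statistics

def median_ensemble(scores):
    # Ensemble via the median of judge scores (the library's default uses a
    # weighted median; equal judge weights are assumed here).
    return statistics.median(scores)

def average_rounded_ensemble(scores):
    # Ensemble via the average of judge scores, rounded to the nearest integer
    # (again assuming equal judge weights).
    return round(statistics.fmean(scores))

def categorical_agreement(ratings):
    # Fraction of judges that gave the most common categorical rating.
    most_common = max(set(ratings), key=ratings.count)
    return ratings.count(most_common) / len(ratings)

def numerical_agreement(scores):
    # max(0.0, 1.0 - coefficient_of_variation): lower variation across judges
    # means higher agreement. Population standard deviation is assumed.
    mean = statistics.fmean(scores)
    if mean == 0:
        return 0.0  # edge-case handling assumed; CV is undefined at mean 0
    return max(0.0, 1.0 - statistics.pstdev(scores) / mean)

scores = [4, 4, 5]                                   # scores from three judge models
print(median_ensemble(scores))                       # 4
print(average_rounded_ensemble(scores))              # 4
print(numerical_agreement(scores))                   # ~0.89
print(statistics.pvariance(scores))                  # judge variance, ~0.22
print(categorical_agreement(["yes", "yes", "no"]))   # ~0.67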