Multiple LLM-as-a-Judge is an advanced evaluation approach that runs several language models as judges on the same task in parallel and aggregates their results with ensemble methods. This yields closer alignment with human judgment than a single judge and can significantly accelerate human annotation workflows.
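As a rough, self-contained sketch of the idea (not the module's actual API), the snippet below fans one evaluation task out to two judge models in parallel and collects their individual results; the call_judge helper and the hard-coded model list are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical judge model IDs; the example output later in this section uses these two.
JUDGE_MODELS = ["openai/gpt-5", "openai/gpt-4.1"]

def call_judge(model_id: str, question: str, answer: str, context: str) -> dict:
    # Stub for illustration only: a real implementation would prompt
    # `model_id` with an evaluation rubric and parse its structured reply.
    return {"relevancy_rating": "good", "score": 1, "provider_model_id": model_id}

def run_judges(question: str, answer: str, context: str) -> list[dict]:
    # Send the same task to every judge in parallel and collect the
    # individual results for later ensemble aggregation.
    with ThreadPoolExecutor(max_workers=len(JUDGE_MODELS)) as pool:
        futures = [
            pool.submit(call_judge, m, question, answer, context)
            for m in JUDGE_MODELS
        ]
        return [f.result() for f in futures]
```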
Key Benefits
Higher Alignment: Multiple judges provide more reliable and consistent evaluations compared to single-judge approaches.
Faster Human Annotation: Humans can focus on scoring only the cases where the agreement score is below 100%, which reduces annotation workload (see the sketch after this list).
Human Alignment: When the agreement score reaches 100%, alignment with human judgment is high, so unanimous cases rarely need manual review.
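A minimal sketch of that triage step, assuming a list of per-example results shaped like the JSON sample shown later in this section:

```python
def select_for_human_review(ensemble_results: list[dict]) -> list[dict]:
    # Route only the examples where the judges disagreed (agreement < 100%)
    # to human annotators; unanimous cases are accepted as-is.
    return [
        r for r in ensemble_results
        if r["geval_generation_evals"]["agreement_score"] < 1.0
    ]
```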
The current module supports both categorical and numeric evaluations, with flexible ensemble methods for result aggregation.
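To make the categorical/numeric distinction concrete, here is a hedged sketch of two possible aggregation strategies; the majority vote and (weighted) median/mean shown here are illustrative assumptions, not necessarily the module's exact ensemble methods.

```python
import statistics
from collections import Counter

def aggregate_categorical(ratings: list[str]) -> tuple[str, float]:
    # Majority vote over labels such as "good"/"bad"; also return the fraction
    # of judges that picked the winning label as an agreement score.
    label, count = Counter(ratings).most_common(1)[0]
    return label, count / len(ratings)

def aggregate_numeric(scores: list[float], weights: list[float], method: str = "median") -> float:
    # Combine numeric judge scores with a configurable ensemble method.
    if method == "median":
        # Simple stand-in for a weighted median: repeat each score by its
        # integer weight before taking the median.
        expanded = [s for s, w in zip(scores, weights) for _ in range(int(w))]
        return statistics.median(expanded)
    if method == "mean":
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    raise ValueError(f"unknown ensemble method: {method}")
```

The example below shows an ensemble result for a relevancy evaluation with two judges (openai/gpt-5 and openai/gpt-4.1) using the median method with equal weights.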
```json
{
  "geval_generation_evals": {
    "ensemble_relevancy_rating": "good",
    "ensemble_method": "median",
    "weights": [1, 1],
    "agreement_score": 1.0,
    "judge_variance": 0.0,
    "individual_judge_results": [
      {
        "relevancy_rating": "good",
        "possible_issues": [],
        "score": 1,
        "completeness": {
          "score": 3,
          "explanation": "The expected output has a single substantive statement ('Paris' as the capital). The actual output is 'Paris', exactly matching the expected information with no omissions or errors."
        },
        "groundedness": {
          "score": 3,
          "explanation": "The answer directly addresses the question by stating “Paris,” which is explicitly supported by the context (“Paris is the capital of France”). There are no extraneous or unsupported claims, and the response fully aligns with the question’s intent."
        },
        "redundancy": {
          "score": 1,
          "explanation": "The output is a single-word answer, “Paris,” presenting the key point only once with no repeated phrases, paraphrasing, or unnecessary elaboration. There is no intro or concluding repetition, and the response is fully concise and to the point."
        },
        "provider_model_id": "openai/gpt-5",
        "model_config": {}
      },
      {
        "relevancy_rating": "good",
        "possible_issues": [],
        "score": 1,
        "completeness": {
          "score": 3,
          "explanation": "The generated output correctly matches the substantive statement from the expected output, which is 'Paris', as the answer to the capital of France. There are no missing or incorrect key details."
        },
        "groundedness": {
          "score": 3,
          "explanation": "The response 'Paris' is fully supported by the context, which states 'Paris is the capital of France.' It is accurate and directly addresses the question without including unsupported or extraneous information."
        },
        "redundancy": {
          "score": 1,
          "explanation": "The generated output directly answers the question by stating 'Paris' without any repetition, elaboration, or restatement. The response is concise, presents the key information only once, and contains no unnecessary or redundant content."
        },
        "provider_model_id": "openai/gpt-4.1",
        "model_config": {}
      }
    ]
  }
}
```
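One plausible reading of the summary fields, under the assumption that agreement_score is the fraction of judges matching the ensemble rating and judge_variance is the population variance of the individual scores:

```python
import statistics

# Condensed from the example above: only the fields needed here.
result = {
    "ensemble_relevancy_rating": "good",
    "individual_judge_results": [
        {"relevancy_rating": "good", "score": 1, "provider_model_id": "openai/gpt-5"},
        {"relevancy_rating": "good", "score": 1, "provider_model_id": "openai/gpt-4.1"},
    ],
}

ratings = [j["relevancy_rating"] for j in result["individual_judge_results"]]
scores = [j["score"] for j in result["individual_judge_results"]]

# Both judges returned "good" with a score of 1, so the derived values match
# the sample's agreement_score (1.0) and judge_variance (0.0).
agreement = ratings.count(result["ensemble_relevancy_rating"]) / len(ratings)
variance = statistics.pvariance(scores)
print(agreement, variance)  # -> 1.0 0.0
```

Because both judges agree in this example, it would be skipped by the human-review filter sketched earlier.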