Multiple LLM-as-a-Judge is an evaluation approach that runs several language models as judges on the same task in parallel and aggregates their verdicts with ensemble methods. Compared to a single judge, this yields higher alignment with human judgment and can significantly accelerate human annotation workflows.
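As a minimal sketch of the parallel fan-out step, assuming a call_judge function that sends one example to one judge model (both the function and the model list are illustrative, not the module's API):

import concurrent.futures

JUDGE_MODELS = ["openai/gpt-5", "openai/gpt-4.1"]  # the two judges from the example output below

def judge_in_parallel(example: dict, call_judge) -> list[dict]:
    """Query every judge model concurrently and collect their verdicts."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(JUDGE_MODELS)) as pool:
        futures = [pool.submit(call_judge, model, example) for model in JUDGE_MODELS]
        return [f.result() for f in futures]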
Key Benefits
Higher Alignment: Aggregating several judges yields more reliable and consistent evaluations than any single judge.
Faster Human Annotation: Humans only need to score the cases where the agreement score is below 100%, which reduces the annotation workload (see the sketch after this list).
Human Alignment: When the agreement score reaches 100%, alignment with human judgment is high.
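A minimal sketch of that routing step, assuming each evaluated case carries the agreement_score field shown in the example output further below (the helper names are hypothetical):

def needs_human_review(case: dict, threshold: float = 1.0) -> bool:
    """Flag cases where at least one judge disagreed with the others."""
    return case["generation"]["agreement_score"] < threshold

def annotation_queue(cases: list[dict]) -> list[dict]:
    """Keep only the cases a human annotator still needs to score."""
    return [c for c in cases if needs_human_review(c)]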
The module supports both categorical evaluations (e.g., relevancy ratings) and numeric ones (e.g., metric scores), with flexible ensemble methods for aggregating results.
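As a rough sketch of what the aggregation step might look like (the function names and the weighted-mean option are illustrative, not the module's API):

import statistics
from collections import Counter

def aggregate_numeric(scores, method="median", weights=None):
    # Median is robust to a single outlier judge; a weighted mean is an
    # alternative when some judges should count for more than others.
    if method == "median":
        return statistics.median(scores)
    if method == "weighted_mean":
        weights = weights or [1] * len(scores)
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    raise ValueError(f"unknown ensemble method: {method}")

def aggregate_categorical(ratings):
    # Majority vote; agreement is the share of judges backing the winner.
    winner, votes = Counter(ratings).most_common(1)[0]
    return winner, votes / len(ratings)

With the two judges in the example below, aggregate_numeric([1.0, 1.0]) returns 1.0 and aggregate_categorical(["good", "good"]) returns ("good", 1.0), which is why the output reports an agreement_score of 1.0. For instance, evaluating "What is the capital of France?" with openai/gpt-5 and openai/gpt-4.1 as judges produces an aggregated result like the following: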
{
  "generation": {
    "global_explanation": "Judge: openai/gpt-5\nAll metrics met the expected values.\n\nJudge: openai/gpt-4.1\nAll metrics met the expected values.",
    "ensemble_relevancy_rating": "good",
    "ensemble_method": "median",
    "weights": [1, 1],
    "agreement_score": 1.0,
    "judge_variance": 0.0,
    "individual_judge_results": [
      {
        "relevancy_rating": "good",
        "score": 1.0,
        "possible_issues": [],
        "binary_score": 1,
        "avg_score": 1.0,
        "completeness": {
          "score": 3,
          "explanation": "The question asks for the capital of France. The actual output 'Paris' exactly matches the expected output 'Paris', covering the sole key fact with no contradictions or omissions. No special format was required and the response meets it.",
          "success": true,
          "normalized_score": 1.0
        },
        "groundedness": {
          "score": 3,
          "explanation": "The response 'Paris' directly answers the question and is explicitly supported by the retrieval context ('Paris is the capital of France'), with no unsupported or extraneous details.",
          "success": true,
          "normalized_score": 1.0
        },
        "redundancy": {
          "score": 1,
          "explanation": "The response is a single word, 'Paris', presenting the key information once with no repeated words or restatement. It is concise and contains no unnecessary elaboration.",
          "normalized_score": 1.0
        },
        "language_consistency": {
          "score": 1,
          "explanation": "The instruction language is English ('What is the capital of France?'). The actual output 'Paris' is a proper noun used as an English response and does not switch languages, so it is consistent.",
          "success": true,
          "normalized_score": 1.0
        },
        "refusal_alignment": {
          "score": 1,
          "explanation": "is_refusal was detected from expected response; 'Paris' directly answers the question, so it is not a refusal. The actual output 'Paris' also directly answers with no refusal indicators. Both are not refusal, so the refusal statuses align.",
          "success": true,
          "normalized_score": 1.0
        },
        "provider_model_id": "openai/gpt-5",
        "model_config": {}
      },
      {
        "relevancy_rating": "good",
        "score": 1.0,
        "possible_issues": [],
        "binary_score": 1,
        "avg_score": 1.0,
        "completeness": {
          "score": 3,
          "explanation": "The answer supplies the exact expected output with no missing elements, contradictions, or formatting issues. The question is directly and fully answered, matching all key facts.",
          "success": true,
          "normalized_score": 1.0
        },
        "groundedness": {
          "score": 3,
          "explanation": "The response directly answers the question about the capital of France with the correct city, Paris, which is explicitly stated in the retrieval context. All information is fully grounded and matches the context, with no unsupported statements or irrelevant details.",
          "success": true,
          "normalized_score": 1.0
        },
        "redundancy": {
          "score": 1,
          "explanation": "The response provides the answer concisely and directly, mentioning 'Paris' only once without any repetition, restatement, or unnecessary elaboration. There is no redundancy present.",
          "normalized_score": 1.0
        },
        "language_consistency": {
          "score": 1,
          "explanation": "The controlling instructional language is English, as seen in the input question ('What is the capital of France?'). The actual output ('Paris') is also in English and responds appropriately with a proper noun that is universally recognized and acceptable in any language. No language inconsistency is present.",
          "success": true,
          "normalized_score": 1.0
        },
        "refusal_alignment": {
          "score": 1,
          "explanation": "is_refusal was detected from expected response. The expected response directly answers the question ('Paris') with no refusal indicators, so expected behavior is not refusal. The actual output also answers the question without any refusal indicators. Both are not refusal, so the alignment is correct.",
          "success": true,
          "normalized_score": 1.0
        },
        "provider_model_id": "openai/gpt-4.1",
        "model_config": {}
      }
    ]
  }
}
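The ensemble fields can be sanity-checked against the per-judge entries. Below is a minimal sketch of one plausible recomputation, assuming the structure shown above; the agreement and variance definitions here are assumptions, and the module's exact formulas may differ:

import statistics
from collections import Counter

def summarize(generation: dict) -> dict:
    """Recompute the ensemble fields from the individual judge results."""
    judges = generation["individual_judge_results"]
    scores = [j["score"] for j in judges]
    ratings = [j["relevancy_rating"] for j in judges]
    majority, votes = Counter(ratings).most_common(1)[0]
    return {
        "ensemble_relevancy_rating": majority,
        "ensemble_score": statistics.median(scores),     # matches "ensemble_method": "median"
        "judge_variance": statistics.pvariance(scores),  # 0.0 here: both judges scored 1.0
        "agreement_score": votes / len(ratings),         # 1.0 here: both judges said "good"
    }

For the example above, summarize(result["generation"]) reproduces the ensemble values shown: a "good" majority rating, a median score of 1.0, zero variance, and full agreement.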