Multiple LLM-as-a-Judge is an evaluation approach that runs several language models as judges on the same task in parallel and aggregates their verdicts with ensemble methods. Compared to a single judge, this yields higher alignment with human judgment and can significantly accelerate human annotation workflows.
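As a minimal sketch of the parallel fan-out step, assuming a call_judge function that sends one example to one judge model (both the function and the model list are illustrative, not the module's API):

import concurrent.futures

JUDGE_MODELS = ["openai/gpt-5", "openai/gpt-4.1"]  # the two judges from the example output below

def judge_in_parallel(example: dict, call_judge) -> list[dict]:
    """Query every judge model concurrently and collect their verdicts."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(JUDGE_MODELS)) as pool:
        futures = [pool.submit(call_judge, model, example) for model in JUDGE_MODELS]
        return [f.result() for f in futures]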
Key Benefits
Higher Alignment: Aggregating several judges yields more reliable and consistent evaluations than any single judge.
Faster Human Annotation: Humans only need to score the cases where the agreement score is below 100%, which reduces the annotation workload (see the sketch after this list).
Human Alignment: When the agreement score reaches 100%, alignment with human judgment is high.
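A minimal sketch of that routing step, assuming each evaluated case carries the agreement_score field shown in the example output further below (the helper names are hypothetical):

def needs_human_review(case: dict, threshold: float = 1.0) -> bool:
    """Flag cases where at least one judge disagreed with the others."""
    return case["generation"]["agreement_score"] < threshold

def annotation_queue(cases: list[dict]) -> list[dict]:
    """Keep only the cases a human annotator still needs to score."""
    return [c for c in cases if needs_human_review(c)]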
The module supports both categorical evaluations (e.g., relevancy ratings) and numeric ones (e.g., metric scores), with flexible ensemble methods for aggregating results.
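As a rough sketch of what the aggregation step might look like (the function names and the weighted-mean option are illustrative, not the module's API):

import statistics
from collections import Counter

def aggregate_numeric(scores, method="median", weights=None):
    # Median is robust to a single outlier judge; a weighted mean is an
    # alternative when some judges should count for more than others.
    if method == "median":
        return statistics.median(scores)
    if method == "weighted_mean":
        weights = weights or [1] * len(scores)
        return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
    raise ValueError(f"unknown ensemble method: {method}")

def aggregate_categorical(ratings):
    # Majority vote; agreement is the share of judges backing the winner.
    winner, votes = Counter(ratings).most_common(1)[0]
    return winner, votes / len(ratings)

With the two judges in the example below, aggregate_numeric([1.0, 1.0]) returns 1.0 and aggregate_categorical(["good", "good"]) returns ("good", 1.0), which is why the output reports an agreement_score of 1.0. For instance, evaluating "What is the capital of France?" with openai/gpt-5 and openai/gpt-4.1 as judges produces an aggregated result like the following: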
{
  "generation": {
    "global_explanation": "Judge: openai/gpt-5\nAll metrics met the expected values.\n\nJudge: openai/gpt-4.1\nAll metrics met the expected values.",
    "ensemble_relevancy_rating": "good",
    "ensemble_method": "median",
    "weights": [1, 1],
    "agreement_score": 1.0,
    "judge_variance": 0.0,
    "individual_judge_results": [
      {
        "relevancy_rating": "good",
        "score": 1.0,
        "possible_issues": [],
        "binary_score": 1,
        "avg_score": 1.0,
        "completeness": {
          "score": 3,
          "explanation": "The question asks for the capital of France. The actual output 'Paris' exactly matches the expected output 'Paris', covering the sole key fact with no contradictions or omissions. No special format was required and the response meets it.",
          "success": true,
          "normalized_score": 1.0
        },
        "groundedness": {
          "score": 3,
          "explanation": "The response 'Paris' directly answers the question and is explicitly supported by the retrieval context ('Paris is the capital of France'), with no unsupported or extraneous details.",
          "success": true,
          "normalized_score": 1.0
        },
        "redundancy": {
          "score": 1,
          "explanation": "The response is a single word, 'Paris', presenting the key information once with no repeated words or restatement. It is concise and contains no unnecessary elaboration.",
          "normalized_score": 1.0
        },
        "language_consistency": {
          "score": 1,
          "explanation": "The instruction language is English ('What is the capital of France?'). The actual output 'Paris' is a proper noun used as an English response and does not switch languages, so it is consistent.",
          "success": true,
          "normalized_score": 1.0
        },
        "refusal_alignment": {
          "score": 1,
          "explanation": "is_refusal was detected from expected response; 'Paris' directly answers the question, so it is not a refusal. The actual output 'Paris' also directly answers with no refusal indicators. Both are not refusal, so the refusal statuses align.",
          "success": true,
          "normalized_score": 1.0
        },
        "provider_model_id": "openai/gpt-5",
        "model_config": {}
      },
      {
        "relevancy_rating": "good",
        "score": 1.0,
        "possible_issues": [],
        "binary_score": 1,
        "avg_score": 1.0,
        "completeness": {
          "score": 3,
          "explanation": "The answer supplies the exact expected output with no missing elements, contradictions, or formatting issues. The question is directly and fully answered, matching all key facts.",
          "success": true,
          "normalized_score": 1.0
        },
        "groundedness": {
          "score": 3,
          "explanation": "The response directly answers the question about the capital of France with the correct city, Paris, which is explicitly stated in the retrieval context. All information is fully grounded and matches the context, with no unsupported statements or irrelevant details.",
          "success": true,
          "normalized_score": 1.0
        },
        "redundancy": {
          "score": 1,
          "explanation": "The response provides the answer concisely and directly, mentioning 'Paris' only once without any repetition, restatement, or unnecessary elaboration. There is no redundancy present.",
          "normalized_score": 1.0
        },
        "language_consistency": {
          "score": 1,
          "explanation": "The controlling instructional language is English, as seen in the input question ('What is the capital of France?'). The actual output ('Paris') is also in English and responds appropriately with a proper noun that is universally recognized and acceptable in any language. No language inconsistency is present.",
          "success": true,
          "normalized_score": 1.0
        },
        "refusal_alignment": {
          "score": 1,
          "explanation": "is_refusal was detected from expected response. The expected response directly answers the question ('Paris') with no refusal indicators, so expected behavior is not refusal. The actual output also answers the question without any refusal indicators. Both are not refusal, so the alignment is correct.",
          "success": true,
          "normalized_score": 1.0
        },
        "provider_model_id": "openai/gpt-4.1",
        "model_config": {}
      }
    ]
  }
}
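The ensemble fields can be sanity-checked against the per-judge entries. Below is a minimal sketch of one plausible recomputation, assuming the structure shown above; the agreement and variance definitions here are assumptions, and the module's exact formulas may differ:

import statistics
from collections import Counter

def summarize(generation: dict) -> dict:
    """Recompute the ensemble fields from the individual judge results."""
    judges = generation["individual_judge_results"]
    scores = [j["score"] for j in judges]
    ratings = [j["relevancy_rating"] for j in judges]
    majority, votes = Counter(ratings).most_common(1)[0]
    return {
        "ensemble_relevancy_rating": majority,
        "ensemble_score": statistics.median(scores),     # matches "ensemble_method": "median"
        "judge_variance": statistics.pvariance(scores),  # 0.0 here: both judges scored 1.0
        "agreement_score": votes / len(ratings),         # 1.0 here: both judges said "good"
    }

For the example above, summarize(result["generation"]) reproduces the ensemble values shown: a "good" majority rating, a median score of 1.0, zero variance, and full agreement.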