⚖️ Multiple LLM-as-a-Judge
Multiple LLM-as-a-Judge is an evaluation approach that runs several language models as judges on the same task in parallel and aggregates their results using ensemble methods. Compared with a single judge, it aligns more closely with human judgment and can significantly accelerate human annotation workflows.
Key Benefits
Higher Alignment: Multiple judges provide more reliable and consistent evaluations compared to single-judge approaches.
Faster Human Annotation: Humans can focus on scoring only the cases where the agreement score is below 100%, reducing annotation workload (see the sketch after this list).
Human Alignment: When the agreement score reaches 100%, alignment with human judgment is high, as described in this report.
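For example, an annotation workflow can route only low-agreement cases to human reviewers. The following is a minimal triage sketch, assuming each evaluation result exposes its agreement score under an agreement_score key in the 0.0–1.0 range; the field name and result structure are assumptions for illustration, not the library's documented schema.
# Hypothetical triage loop: "results" is assumed to be a list of evaluation
# results and "agreement_score" an assumed field name, both for illustration.
needs_human_review = []
auto_accepted = []
for result in results:
    if result["agreement_score"] < 1.0:
        # Judges disagreed: send this case to a human annotator.
        needs_human_review.append(result)
    else:
        # Unanimous judges: treat as already aligned with human judgment.
        auto_accepted.append(result)
print(f"{len(needs_human_review)} cases routed to human annotators")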
The current module supports both categorical and numeric evaluations, with flexible ensemble methods for result aggregation.
Example Usage
import os

from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.judge.multiple_llm_as_judge import MultipleLLMAsJudge
from gllm_evals import load_simple_qa_dataset
judge_models = [
    {
        "provider_model_id": "openai/gpt-5",
        "model_credentials": os.getenv("OPENAI_API_KEY"),
    },
    {
        "provider_model_id": "openai/gpt-4.1",
        "model_credentials": os.getenv("OPENAI_API_KEY"),
    },
]

data = load_simple_qa_dataset()

evaluator = GEvalGenerationEvaluator(
    judge=MultipleLLMAsJudge(judge_models=judge_models),
    model_credentials=os.getenv("OPENAI_API_KEY"),
)

# evaluate() is a coroutine; await it inside an async function or a running event loop.
result = await evaluator.evaluate(data.dataset[0])
print(result)
Example Output
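The exact fields in the returned result depend on the evaluator and metric configuration. A hypothetical output shape, sketched from the scoring steps described below (aggregated score, agreement score, judge variance, and per-judge results), might look like the following; the key names are assumptions, not the library's documented schema.
# Illustrative only; actual keys and values depend on the evaluator configuration.
{
    "score": 4,               # ensemble score, e.g. weighted median of judge scores
    "agreement_score": 1.0,   # 1.0 means all judges agreed
    "judge_variance": 0.0,    # variance across the individual judge scores
    "judge_results": [...],   # one entry per judge model
}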
How Scoring Works
1. Collect judge results from the multiple LLM judges.
2. Apply the ensemble method:
   Median: uses the weighted median of the judges' scores (default).
   Average Rounded: uses the weighted average of the judges' scores, rounded.
3. Calculate the agreement score to measure consensus among judges (see the sketch after this list):
   For a categorical ensemble: the percentage of judges that gave the same categorical rating.
   For a numerical ensemble: max(0.0, 1.0 - coefficient_of_variation), so lower variation means higher agreement.
4. Calculate the judge variance for statistical analysis.
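To make these steps concrete, here is a minimal, self-contained sketch of the aggregation logic described above. It is not the library's implementation: the function names are hypothetical, judges are given equal weight, and the population standard deviation is assumed for the coefficient of variation.
import statistics

def median_ensemble(scores):
    # Ensemble via the median of judge scores (the library's default uses a
    # weighted median; equal judge weights are assumed here).
    return statistics.median(scores)

def average_rounded_ensemble(scores):
    # Ensemble via the average of judge scores, rounded to the nearest integer
    # (again assuming equal judge weights).
    return round(statistics.fmean(scores))

def categorical_agreement(ratings):
    # Fraction of judges that gave the most common categorical rating.
    most_common = max(set(ratings), key=ratings.count)
    return ratings.count(most_common) / len(ratings)

def numerical_agreement(scores):
    # max(0.0, 1.0 - coefficient_of_variation): lower variation across judges
    # means higher agreement. Population standard deviation is assumed.
    mean = statistics.fmean(scores)
    if mean == 0:
        return 0.0  # edge-case handling assumed; CV is undefined at mean 0
    return max(0.0, 1.0 - statistics.pstdev(scores) / mean)

scores = [4, 4, 5]                                   # scores from three judge models
print(median_ensemble(scores))                       # 4
print(average_rounded_ensemble(scores))              # 4
print(numerical_agreement(scores))                   # ~0.89
print(statistics.pvariance(scores))                  # judge variance, ~0.22
print(categorical_agreement(["yes", "yes", "no"]))   # ~0.67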