⚖️Multiple LLM-as-a-Judge

Multiple LLM-as-a-Judge is an advanced evaluation approach that uses multiple language models as judges to evaluate tasks in parallel and aggregate their results using ensemble methods. This approach provides higher alignment with human judgment and can significantly accelerate human annotation workflows.

Key Benefits

  1. Higher Alignment: Multiple judges provide more reliable and consistent evaluations compared to single-judge approaches.

  2. Faster Human Annotation: Humans only need to score the cases where the agreement score is below 100%, which reduces the annotation workload.

  3. Human Alignment: When the agreement score reaches 100%, alignment with human judgment is high, as stated in this report.
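The annotation shortcut described in the list above can be sketched as a simple filter. This is only an illustration: the `agreement_score` key, the 0.0 to 1.0 scale, and the `needs_human_review` helper are assumptions made for this example and are not confirmed parts of the gllm_evals result format.

```python
def needs_human_review(results, full_agreement=1.0):
    """Return only the results whose judges did not fully agree.

    Assumes each result is a dict with a hypothetical "agreement_score"
    key holding a fraction between 0.0 and 1.0 (1.0 means 100%).
    """
    return [r for r in results if r["agreement_score"] < full_agreement]


# Placeholder results, invented purely to exercise the filter.
sample_results = [
    {"id": "q1", "agreement_score": 1.0},
    {"id": "q2", "agreement_score": 0.5},
    {"id": "q3", "agreement_score": 1.0},
]

review_queue = needs_human_review(sample_results)
print([r["id"] for r in review_queue])
```

Only the items returned by the filter would be sent to annotators; fully agreed items keep the ensemble result.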

The current module supports both categorical and numeric evaluations, with flexible ensemble methods for result aggregation.

Example Usage

import asyncio
import os

from gllm_evals import load_simple_qa_dataset
from gllm_evals.evaluator.geval_generation_evaluator import GEvalGenerationEvaluator
from gllm_evals.judge.multiple_llm_as_judge import MultipleLLMAsJudge

# Each judge is described by its provider/model ID and the credentials used to call it.
judge_models = [
    {
        "provider_model_id": "openai/gpt-5",
        "model_credentials": os.getenv("OPENAI_API_KEY"),
    },
    {
        "provider_model_id": "openai/gpt-4.1",
        "model_credentials": os.getenv("OPENAI_API_KEY"),
    },
]


async def main():
    # Load the sample QA dataset.
    data = load_simple_qa_dataset()

    evaluator = GEvalGenerationEvaluator(
        judge=MultipleLLMAsJudge(judge_models=judge_models),
        model_credentials=os.getenv("OPENAI_API_KEY"),
    )

    # Evaluate the first record with all judges and print the aggregated result.
    result = await evaluator.evaluate(data.dataset[0])
    print(result)


# In a notebook, run `await main()` instead of asyncio.run().
asyncio.run(main())

Example Output


How Scoring Works

  1. Collect judge results from multiple LLM judges

  2. Apply ensemble method:

    • Median: Uses weighted median of scores (default)

    • Average Rounded: Uses weighted average of scores

  3. Calculate agreement score to measure consensus among judges.

    1. For categorical ensemble: the percentage of judges with the same categorical rating.

    2. For numerical ensemble: max(0.0, 1.0 - coefficient_of_variation) (lower variation = higher agreement)

  4. Calculate judge variance for statistical analysis
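The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the gllm_evals implementation: all function names are hypothetical, the weighted median is taken as the lower weighted median, "same categorical rating" is read as the share of judges voting for the most common label, and the coefficient of variation uses the population standard deviation divided by the mean.

```python
from collections import Counter
from statistics import mean, pstdev, pvariance


def weighted_median(scores, weights=None):
    """Smallest score whose cumulative weight reaches half of the total weight."""
    weights = weights or [1.0] * len(scores)
    half = sum(weights) / 2.0
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= half:
            return score
    raise ValueError("scores must not be empty")


def weighted_average_rounded(scores, weights=None):
    """Weighted mean rounded to the nearest integer (Python rounds exact ties to even)."""
    weights = weights or [1.0] * len(scores)
    return round(sum(s * w for s, w in zip(scores, weights)) / sum(weights))


def categorical_agreement(labels):
    """Fraction of judges that chose the most common label (assumed reading)."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def numeric_agreement(scores):
    """max(0.0, 1.0 - coefficient_of_variation), with CV = std / mean."""
    avg = mean(scores)
    spread = pstdev(scores)  # Assumption: population standard deviation.
    if avg == 0:
        # Assumption: CV is undefined at mean 0, so fall back to all-or-nothing.
        return 1.0 if spread == 0 else 0.0
    return max(0.0, 1.0 - spread / abs(avg))


def judge_variance(scores):
    """Population variance of the judges' scores (assumed definition)."""
    return pvariance(scores)


# Three judges scoring the same answer on a numeric scale.
scores = [4, 5, 5]
print(weighted_median(scores), weighted_average_rounded(scores))
print(numeric_agreement(scores), judge_variance(scores))
print(categorical_agreement(["pass", "pass", "fail"]))
```

Passing per-judge `weights` lets a more trusted judge count for more in both ensemble methods; with no weights, every judge counts equally.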
