📗 Custom Evaluator / Scorer Tutorial
In this guide, we’ll walk through how to build a custom evaluator for a specific use case. You can adapt the tutorial to match your project’s needs. In our example, the evaluator checks whether the summary of a customer’s complaint is correct and accurate based on the detailed description provided.
Step 0: Install the Required Libraries
To install gllm-evals, follow the installation section of the documentation.
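Assuming the package is published to PyPI under the same name (an assumption; check your project's actual distribution channel), installation is a single pip command. DeepEval and pandas are also installed here because later steps use them:

```bash
# Assumes gllm-evals is available on PyPI under this name.
pip install gllm-evals deepeval pandas
```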
Step 1: Prepare Your Dataset
Before running the evaluation, we need a dataset that contains all the information required for scoring. For this example, we’ll use the following data:
This data has 5 columns:
- `no`: The row number.
- `detailed_description`: The client's full complaint description (this serves as the query).
- `detailed_case_gangguan`: The summarized complaint generated from the detailed description (this is the response we will evaluate).
- `gt_detail_case_gangguan`: The ground-truth summary, used to compare against `detailed_case_gangguan` to determine the true score. It is not used during the evaluation itself, but it can be used afterward to measure the evaluator's accuracy.
- `score_detail_case_gangguan`: The ground-truth score representing how well `detailed_case_gangguan` matches `gt_detail_case_gangguan`. It is not used during the evaluation itself, but it is used afterward to measure the evaluator's accuracy as the alignment score.
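As a sketch, assuming the dataset is stored as a CSV file (the filename below is a placeholder), it can be loaded with pandas:

```python
import pandas as pd

# Placeholder filename; point this at your actual dataset file.
df = pd.read_csv("complaint_summaries.csv")

# Expect the five columns described above.
print(df[["no", "detailed_description", "detailed_case_gangguan",
          "gt_detail_case_gangguan", "score_detail_case_gangguan"]].head())
```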
Step 2: Create a Custom Metric
Before we can evaluate our dataset, we need to decide which metric to use. Because this case is unique and specialized, we will create a custom metric using DeepEval's GEval with custom evaluation steps. Before proceeding, check gllm-evals' built-in metrics to decide whether to reuse an existing one or create a custom one.
Here is an example set of custom evaluation steps for the dataset above:
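The steps below are an illustrative sketch (tune the wording to your own rubric); they instruct the judge model to compare the summary against the detailed description:

```python
# Illustrative evaluation steps; adjust the wording to your own rubric.
evaluation_steps = [
    "Read the detailed complaint description provided as the input.",
    "Read the summarized complaint provided as the actual output.",
    "Check that every key fact in the summary (issue, affected service, "
    "location, timing) appears in the detailed description.",
    "Penalize summaries that add facts not present in the description, "
    "omit the core issue, or misstate what the customer reported.",
]
```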
After that, we can create our custom metric:
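A minimal sketch using DeepEval's GEval with the evaluation steps defined above (the metric name and threshold here are assumptions):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# GEval builds an LLM-as-a-judge metric from the evaluation steps above.
# It calls a judge model under the hood (OpenAI by default), so the
# corresponding API key must be configured in your environment.
summary_correctness = GEval(
    name="Complaint Summary Correctness",  # assumed metric name
    evaluation_steps=evaluation_steps,
    # The judge sees the detailed description (input) and the summary
    # (actual output) when scoring.
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.5,  # assumed pass/fail cutoff
)
```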
Step 3: Create a Custom Evaluator
Next, we create our custom evaluator using the metric we just created:
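The class below is a sketch only: gllm-evals' real evaluator base class and method signature may differ, so treat this wrapper as an assumption and follow the library's actual interface. The idea is to wrap the GEval metric in an evaluator that scores one query/response pair at a time:

```python
from deepeval.test_case import LLMTestCase

# Hypothetical wrapper; in gllm-evals you would typically subclass its
# evaluator base class instead of writing a standalone class like this.
class SummaryCorrectnessEvaluator:
    def __init__(self, metric):
        self.metric = metric

    def evaluate(self, query: str, response: str) -> dict:
        # Package the complaint description and its summary as a test case.
        test_case = LLMTestCase(input=query, actual_output=response)
        self.metric.measure(test_case)
        return {
            "score": self.metric.score,        # numeric score from GEval
            "explanation": self.metric.reason, # the judge's reasoning
        }
```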
Step 4: Perform Evaluation
To run the evaluation, we process our data, convert it into QAData, and pass it to the custom evaluator. You can adapt this step to fit your project’s specific requirements.
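A minimal sketch of this step, assuming `QAData` is gllm-evals' container for a query/response pair (its actual schema may differ; here the fields are passed directly to the sketch evaluator above, and the output path is a placeholder):

```python
evaluator = SummaryCorrectnessEvaluator(summary_correctness)

results = []
for _, row in df.iterrows():
    # In gllm-evals this pair would typically be wrapped as a QAData
    # object; the field mapping below is an assumption.
    result = evaluator.evaluate(
        query=row["detailed_description"],
        response=row["detailed_case_gangguan"],
    )
    results.append(result)

# Save each row's score and explanation alongside the original data.
df["eval_score"] = [r["score"] for r in results]
df["eval_explanation"] = [r["explanation"] for r in results]
df.to_csv("evaluation_results.csv", index=False)  # placeholder output path
```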
In addition to saving the evaluation's score and explanation, we can also compute a final alignment score to check the evaluator’s accuracy by comparing its output with the ground-truth score.
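One way to compute the alignment score is sketched below; it assumes the ground-truth `score_detail_case_gangguan` is binary (0 or 1), which may not match your data, so adapt the threshold and comparison as needed:

```python
# Binarize the evaluator's score, then measure how often it agrees
# with the ground-truth score. The 0.5 cutoff is an assumption.
predicted = (df["eval_score"] >= 0.5).astype(int)
alignment_score = (predicted == df["score_detail_case_gangguan"]).mean()
print(f"Alignment score: {alignment_score:.2%}")
```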
Below is the CSV output from the evaluation we just ran:
Conclusion
This cookbook provides a simple guide to evaluating a custom dataset with a custom metric and a custom evaluator. By following these steps, you can:
- Monitor your QA system's performance
- Automate the evaluation process once the alignment score is high enough
- Ensure reliable, high-quality QA responses in production