Evaluate GLChat Tutorial
In this guide, we will learn how to use evaluate_glchat to generate GLChat message responses and evaluate their performance on a QA dataset.
The evaluation focuses on question-answering capabilities, with support for web search and PII handling. The dataset and experiment results can then be accessed in Langfuse for monitoring. To view more details on each component, click it in the sidebar of the Evaluation page.
Prerequisites
Before you can start evaluating a GLChat QA dataset, you need to prepare the following:
Required Parameters
1. User ID (user_id)
The user_id is a unique identifier for the user who will be interacting with the specified GLChat application. This information is needed to create a conversation or message.
Where to get it:
From your existing user in GLChat: if you already have a user in your GLChat application, you can use its ID as the user_id.
You can also provide any user that has access to the application you want to test.
2. Chatbot ID (chatbot_id)
The chatbot_id identifies which chatbot or application configuration to use for the conversation. This information is needed to create a conversation or message.
3. [Optional] Model Name (model_name)
The model_name specifies which language model to use when generating GLChat responses. It can be set to a model's display name within an application / chatbot. If not specified, responses will be generated using that application's default model.

Required Keys
For GLChat
We will also need access to the GLChat credentials to generate responses. Please contact the GLChat team if you do not have them yet. The required keys are: GLCHAT_BASE_URL and GLCHAT_API_KEY.
For Langfuse
We will need Langfuse credentials to trace, debug, and view the evaluation results for our GLChat QA system. If you do not have any Langfuse credentials yet, you can follow the New User Configuration to get them. The required keys are: LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST.
Step 0: Install the Required Libraries
We need to install the required libraries for GLChat evaluation, including the GLChat SDK and Langfuse.
Install GLChat SDK with evals
The evals module inside glchat-sdk is currently private and requires special access. To use the evaluation functionality, you need to install the package with the evals extra.
Using poetry
Using pip
gllm-evals and langfuse are included automatically when you install glchat-sdk with the evals extra.
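To verify the installation, you can try importing the packages. This is a quick sanity check only; the Python module names below (glchat_sdk, gllm_evals) are inferred from the package names and may differ in your environment.

```python
# Quick sanity check that the evaluation dependencies are importable.
# Note: the module names below are assumptions based on the package names.
import glchat_sdk   # installed via the `evals` extra
import gllm_evals   # pulled in by the `evals` extra
import langfuse     # pulled in by the `evals` extra

print("glchat-sdk:", getattr(glchat_sdk, "__version__", "unknown"))
print("langfuse:", getattr(langfuse, "__version__", "unknown"))
```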
Step 1: Set Up Environment and Configuration
Prepare the environment variables for the evaluation script:
With the environment variables set, we can now verify and use the GLChat SDK and Langfuse.
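For example, the credentials can be set from Python before the evaluation runs. This is a sketch only; replace the placeholder values with your own credentials.

```python
import os

# GLChat credentials (obtain these from the GLChat team)
os.environ["GLCHAT_BASE_URL"] = "https://your-glchat-instance.example.com"  # placeholder
os.environ["GLCHAT_API_KEY"] = "your-glchat-api-key"                        # placeholder

# Langfuse credentials (see the New User Configuration guide)
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."                             # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."                             # placeholder
os.environ["LANGFUSE_HOST"] = "https://langfuse.obrol.id"                   # host used in this guide
```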
Step 2: Prepare Your Dataset
Before we can evaluate, we need to prepare a dataset with all the information needed for evaluation.
To ensure compatibility, your dataset must use standardized column names. We enforce a strict naming convention so the module can automatically recognize and process your data correctly.
Before using the module, please make sure your dataset columns match the required names exactly (case sensitive).
question_id: Unique identifier for each query.
query: The question to ask.
expected_response: The expected answer to compare against.
search_type ("normal" or "search"): Whether to enable search functionality in GLChat. If the column is not provided, all rows default to "normal" (no search capability).
enable_pii (True or False): Whether to enable PII processing. If the column is not provided, all rows default to False (no PII masking).
model_name: The model used for response generation for each row in GLChat. If the column is not provided, the global configuration in the provided config is used; if that is also not set, the default model of the provided chatbot is used.
chatbot_id: The chatbot ID used for response generation for each row in GLChat. If the column is not provided, the global configuration in the provided config is used.
attachments: The file names to be used for each row. Leave it empty for rows that do not use any attachments. This column is mandatory ONLY if you have attachment(s) to be used for response generation. For more details, see the Attachments page.
Other additional fields: Any additional fields you deem necessary to include. They will not affect the evaluation process.
As an example, you can download the following CSV file:
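Alternatively, if you prefer to build an equivalent dataset in code, the sketch below constructs a small example with the required column names and writes it to CSV. The sample rows are purely illustrative.

```python
import pandas as pd

# Minimal QA dataset following the required (case-sensitive) column names.
dataset = pd.DataFrame(
    [
        {
            "question_id": "q1",
            "query": "Who is Kartini?",
            "expected_response": "Kartini was an Indonesian national hero who pioneered women's education.",
            "search_type": "normal",   # "normal" (no search) or "search"
            "enable_pii": False,       # True to enable PII processing
        },
        {
            "question_id": "q2",
            "query": "What is the capital city of Indonesia?",
            "expected_response": "Jakarta.",
            "search_type": "search",   # enable web search for this row
            "enable_pii": False,
        },
    ]
)

dataset.to_csv("glchat_qa_dataset.csv", index=False)
```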
Step 3: Instrument your GLChat Configuration
Before we can evaluate our GLChat system, we need to create a GLChat configuration using GLChatConfig to specify which settings to use.
🟢 Minimal Configuration (Bare Minimum)
Use this when you just want to get started fast with default settings.
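A minimal sketch of what this could look like, assuming GLChatConfig accepts the user_id and chatbot_id parameters described in the Prerequisites; the exact import path may differ in your glchat-sdk version.

```python
# Assumed import path; adjust to match your installed glchat-sdk version.
from glchat_sdk.evals import GLChatConfig

config = GLChatConfig(
    user_id="your-user-id",        # a user with access to the target application
    chatbot_id="your-chatbot-id",  # the chatbot / application to evaluate
)
```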
That's all you need; the rest will be handled by the module using defaults.
🔵 Full Configuration (Complete Example)
Use this version if you want full control over every parameter and behavior.
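A fuller sketch might look like the following. The model_name parameter comes from the Prerequisites section, while the credential fields are hypothetical illustrations; the SDK may instead read GLCHAT_BASE_URL and GLCHAT_API_KEY from the environment.

```python
import os

from glchat_sdk.evals import GLChatConfig  # assumed import path

config = GLChatConfig(
    user_id="your-user-id",
    chatbot_id="your-chatbot-id",
    model_name="your-model-display-name",     # optional: a model display name in the chatbot
    base_url=os.environ["GLCHAT_BASE_URL"],   # hypothetical field shown for illustration;
    api_key=os.environ["GLCHAT_API_KEY"],     # credentials may be read from the environment instead
)
```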
💡 Tip: Start with the minimal config and gradually add more if you need more customization. If a config parameter is also available as a dataset column, the value in the dataset column takes priority.
Step 4: Prepare Attachments (Optional)
If your dataset does not need any attachments, feel free to skip this step. To find out more about which attachment types are currently supported and how to set the attachment configuration for each type, visit the Attachments page.
In this example, we use a local attachment as the simplest setup. Regardless of the attachment type you choose, make sure your files are already stored in a storage location we currently support.
For this dataset example, you can download the file and put it in your local directory:
After that, you can create a local attachment configuration. For example, if you put the above image at the local path /home/user/Documents/files/gambar kartini.jpg, you can add the following parameter based on the attachment type:
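As a sketch, a local attachment configuration could be expressed like this. The attachment_config name and its fields are assumptions for illustration; refer to the Attachments page for the exact schema.

```python
# Hypothetical local attachment configuration, mapping the file names referenced in
# the dataset's `attachments` column to their location on disk.
attachment_config = {
    "type": "local",
    "base_path": "/home/user/Documents/files",  # directory containing "gambar kartini.jpg"
}
```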
Step 5: Perform End-to-End Evaluation
To run the end-to-end evaluation, we can use a convenience function in glchat-sdk called evaluate_glchat. This function provides a streamlined interface for evaluating GLChat models using the existing gllm-evals framework. It eliminates the need to manually implement inference functions by providing a pre-built GLChat integration.
We can pass the dataset, the GLChat configuration, and the attachment config to the evaluate_glchat() function. In this example, we will use GEvalGenerationEvaluator, which is suitable for evaluating QA datasets. Since we want to use Langfuse, we will also use LangfuseExperimentTracker as the dedicated experiment tracker.
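Putting the pieces together, a sketch of the end-to-end call could look like the following. The import paths and keyword arguments are assumptions based on the names used in this guide, and GEvalGenerationEvaluator may require additional judge-model configuration of its own.

```python
import pandas as pd

# Assumed import paths; adjust to your installed versions of glchat-sdk and gllm-evals.
from glchat_sdk.evals import evaluate_glchat, GLChatConfig
from gllm_evals import GEvalGenerationEvaluator, LangfuseExperimentTracker

# Dataset prepared in Step 2.
dataset = pd.read_csv("glchat_qa_dataset.csv")

# GLChat configuration from Step 3.
config = GLChatConfig(
    user_id="your-user-id",
    chatbot_id="your-chatbot-id",
)

# Optional attachment configuration from Step 4 (hypothetical structure).
attachment_config = {
    "type": "local",
    "base_path": "/home/user/Documents/files",
}

results = evaluate_glchat(
    dataset=dataset,                                 # the QA dataset to evaluate
    config=config,                                   # the GLChat configuration
    attachment_config=attachment_config,             # optional; omit if no attachments are used
    evaluator=GEvalGenerationEvaluator(),            # LLM-as-a-judge evaluation for QA
    experiment_tracker=LangfuseExperimentTracker(),  # logs the dataset and results to Langfuse
)
```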
Step 6: View Evaluation Results in Langfuse
After running the evaluation, the dataset and experiment results you've provided will automatically be logged to Langfuse. This step shows you how to navigate the Langfuse UI and interpret your evaluation results.
Accessing Your Langfuse Dashboard
Navigate to your Langfuse project: Go to https://langfuse.obrol.id/.
Select your organization and project: Choose your dedicated organization and project (or the one you've just created).
Access the dashboard: You'll see various sections for analyzing your data.
View Dataset
To view the dataset you've just created, go to: Project → Datasets → select a dataset → Items. On this page, you can see all the data rows you have just evaluated, based on the provided Langfuse mapping. This dataset can also be reused for future evaluations.

To see more detail for each row, click one of the data items above.

View Dataset Runs (Leaderboard)
Dataset runs are executions over a dataset with per-item output. A dataset run represents an experiment. To view the dataset runs, go to: Project → Datasets → select a dataset → Runs. Here, you can view all the scores for each experiment, including LLM-as-a-judge score columns, both as aggregations and per-row values.

You can also click a specific dataset run to view all the data row results for that experiment:

View Traces / Observations
Traces / observations let you drill into individual spans and view the inputs, outputs (our evaluation results), and metadata. Go to: Project → Traces.
Below is the trace example:

View Sessions (Experiment Results)
Sessions group traces per experiment; you can review and annotate each trace within a session. You can access sessions at Project → Sessions.
Below is the session screenshot example:

Congratulations! You have just created your first GLChat QA evaluation!
Export to CSV
You can also optionally export the experiment results to CSV, for example by running a script like the following:
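A minimal sketch, assuming evaluate_glchat returns the per-row experiment results (for example, a list of dicts); the exact return type may differ in your SDK version.

```python
import pandas as pd

# `results` is assumed to be the per-row output returned by evaluate_glchat in Step 5.
pd.DataFrame(results).to_csv("glchat_qa_experiment_results.csv", index=False)
```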
Conclusion
This cookbook provides a simple guide to evaluating GLChat QA systems using Langfuse. By following these steps, you can:
Monitor your QA system's performance
Evaluate different models and configurations systematically
Track quality metrics and identify improvement opportunities
Ensure reliable and high-quality QA responses in production
Note: This is a simple guide to get you started with GLChat QA evaluation using Langfuse. For more comprehensive evaluation information and advanced techniques, please refer to the evaluation gitbook.