Group Relative Policy Optimization (GRPO)
What is Group Relative Policy Optimization (GRPO)?
Group Relative Policy Optimization (GRPO) is a reinforcement learning-based fine-tuning approach that optimizes a model using relative feedback across groups of candidate responses, rather than requiring absolute scores for individual outputs. For each input, the model generates multiple candidate responses that are evaluated by a reward function. GRPO then updates the policy by increasing the likelihood of higher-scoring responses and decreasing the likelihood of lower-scoring ones within the same group. This approach is particularly effective when you have preference data or quality comparisons between responses, and it typically produces more robust and preference-aligned model behaviors.
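To make the group-relative idea concrete, here is a small illustrative Python snippet (not SDK code) showing how a group of rewards for one prompt can be turned into relative advantages:
# Illustrative only - not the SDK's internal implementation.
# For a single prompt, each candidate response gets a reward; GRPO-style updates
# use the reward relative to the rest of the group rather than the raw score.
rewards = [0.2, 0.9, 0.5, 0.7]  # rewards for four candidate responses
mean = sum(rewards) / len(rewards)
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
advantages = [(r - mean) / (std + 1e-8) for r in rewards]
# Responses scoring above the group mean get positive advantages (likelihood increased),
# responses below the mean get negative advantages (likelihood decreased).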
Installation
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
On Windows (Command Prompt), use:
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
Quickstart
Let's move on to a basic fine-tuning example using GRPOTrainer. To run GRPO fine-tuning, you need to specify a reward function, model name, column_mapping, and dataset path.
# Main Code
from gllm_training import GRPOTrainer
from examples.llm_as_judge_reward_function import output_format_reward

grpo_trainer = GRPOTrainer(
    model_name="Qwen/Qwen3-0.6b",            # base model to fine-tune
    datasets_path="examples/grpo_csv",        # path to the dataset used for training and validation
    reward_functions=[output_format_reward]   # reward function(s) used to score generated responses
)
grpo_trainer.train()
Fine-tuning a model using a YAML file.
We can run experiments in a more structured way by using a YAML file. The current GRPO fine-tuning SDK supports both online data from Google Spreadsheets and local data in CSV format.
Example 1: Fine-tuning using online data.
We can prepare our experiment using a YAML file, with the training and validation data loaded from a Google Spreadsheet.
Configure environment variables (.env)
Fill in the GOOGLE_SHEETS_CLIENT_EMAIL and GOOGLE_SHEETS_PRIVATE_KEY fields. If you don’t have these keys, please contact the infrastructure team.
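For reference, the .env entries look like this (placeholder values only):
GOOGLE_SHEETS_CLIENT_EMAIL=<service-account-email>
GOOGLE_SHEETS_PRIVATE_KEY=<service-account-private-key>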
Share the spreadsheet
Share your Google Spreadsheet containing the training and validation data with the GOOGLE_SHEETS_CLIENT_EMAIL.
Experiment configuration (grpo_experiment_config.yml)
You can use a YAML file to plan your fine-tuning experiments. To fine-tune with YAML, you need to define the required variables in the file.
Reward function
Reward functions evaluate model outputs and convert them into numerical feedback that GRPO uses to update the policy. In practice, a reward function:
Takes a batch of completions (model-generated outputs).
Computes a float reward for each completion.
Returns a list of floats where each element corresponds to exactly one completion (see the sketch below).
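The following is a minimal sketch of a reward function. The (completions, **kwargs) signature and the length-based criterion are assumptions for illustration only; check the interface expected by GRPOTrainer and use criteria that match your task.
# A minimal sketch - not the SDK's built-in output_format_reward.
def length_penalty_reward(completions, **kwargs):
    """Hypothetical example: reward shorter completions (illustrative criterion only)."""
    rewards = []
    for completion in completions:
        text = completion if isinstance(completion, str) else str(completion)
        # 1.0 for completions up to 200 characters, decaying toward 0.0 for longer ones.
        rewards.append(min(1.0, 200.0 / max(len(text), 1)))
    return rewards  # one float per completion, in the same order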
Fine-tuning
To run your fine-tuning, you need to load the YAML data using the YamlConfigLoader function, and select the experiment ID when executing the load function.
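A rough sketch of this step is shown below. The import path, the load() signature, and the experiment ID value are assumptions (the actual API may differ); it is only meant to show the flow of loading a config and selecting an experiment.
# A rough sketch only - the YamlConfigLoader import path and load() signature are assumed.
from gllm_training import GRPOTrainer, YamlConfigLoader

config = YamlConfigLoader().load("grpo_experiment_config.yml", experiment_id="exp_1")  # experiment ID assumed
grpo_trainer = GRPOTrainer(**config)
grpo_trainer.train()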
(Notes) Output format
Our SDK supports dictionary or string output formats for fine-tuned models.
Output dictionary
YAML format
Expected output
Output string
YAML format
Expected output
Example 2: Fine-tuning using local data.
The remaining hyperparameter configurations for fine-tuning are the same as when using online data. Below is an example YAML configuration for using local data for training and validation.
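A rough sketch of such a configuration is shown below. The field names mirror the quickstart arguments and are assumptions, not the SDK's confirmed schema; replace them with the keys your version of the config file actually uses.
# Sketch only - field names assumed from the quickstart parameters, not the confirmed schema.
exp_1:
  model_name: "Qwen/Qwen3-0.6b"
  datasets_path: "examples/grpo_csv"      # local directory containing the training/validation CSVs
  reward_functions:
    - output_format_reward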
Logging and Monitoring
During the GRPO fine-tuning process, the SDK automatically generates comprehensive logs to help you monitor training progress and debug issues. These logs are stored in two formats:
JSONL Logs (Structured Training Metrics)
The SDK generates structured JSONL logs that capture detailed training metrics at each step. These logs are stored in:
Example path: data/grpo/model/exp_999/Qwen3-0.6b/logs/grpo_train_steps.jsonl
Each line in the JSONL file contains a JSON object with training metrics such as:
step: Training step number
loss: Training loss at that step
learning_rate: Current learning rate
epoch: Current epoch number
rewards/mean: Average reward across generated responses
rewards/std: Standard deviation of rewards
rewards/margin: Difference between best and worst rewards in the group
policy_loss: Policy optimization loss
And other relevant metrics
You can parse these logs programmatically or use tools like jq to analyze the training progression:
Note: The JSONL file contains both training steps and reward metrics. Use select() to filter for the specific type of data you need.
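For example, a jq one-liner along these lines pulls the loss and mean reward per step out of the log (the metric names follow the list above; adjust the file path to your own experiment):
jq -c 'select(.loss != null) | {step: .step, loss: .loss, reward_mean: .["rewards/mean"]}' data/grpo/model/exp_999/Qwen3-0.6b/logs/grpo_train_steps.jsonl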
TensorBoard Logs (Visual Monitoring)
For visual monitoring and analysis, the SDK also generates TensorBoard-compatible logs stored in:
Example path: data/grpo/model/exp_1/Qwen3-0.6b/logs_tensorboard
To visualize your training progress:
Launch TensorBoard:
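A typical invocation points TensorBoard at the experiment's log directory (substitute your own experiment path):
tensorboard --logdir data/grpo/model/exp_1/Qwen3-0.6b/logs_tensorboard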
Open your browser and navigate to
http://localhost:6006
Monitor key metrics:
Training/Validation loss curves
Learning rate scheduling
Step-by-step progress
Custom metrics (if configured)
Log Configuration
You can customize logging behavior through the hyperparameters configuration:
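The exact keys depend on the SDK's hyperparameter schema; the sketch below assumes Hugging Face-style training arguments are passed through, so the field names are assumptions rather than confirmed options.
# Sketch only - key names (logging_steps, report_to) are assumed, not confirmed SDK options.
hyperparameters:
  logging_steps: 10        # assumed: how often (in steps) metrics are written to the logs
  report_to: tensorboard   # assumed: enables TensorBoard-compatible logging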
Best Practices
Monitor reward signals: Check that average rewards increase over time - if rewards stay flat or decrease, your reward function may not be aligned with your goals or the model isn't learning the desired behavior.
Track policy stability: Use TensorBoard to see policy loss trends - sudden spikes in policy loss or reward variance indicate training instability and may require reducing the learning rate.
Debug reward function: JSONL logs show reward statistics for each step - look for reward mean trends, check if reward margin is too small (responses are too similar), or if standard deviation is too high (inconsistent quality).
Monitor generation quality: GRPO generates multiple responses per input, so watch that reward mean increases while maintaining reasonable diversity (reward std shouldn't be zero).
Compare experiments: Save logs from each experiment to compare which reward functions and hyperparameters produce the best policy improvements - use TensorBoard to view multiple experiments side-by-side.
Upload model to cloud storage
When running experiments, we don’t always save the model directly to the cloud. Instead, we may first evaluate its performance before uploading it to cloud storage. To support this workflow, we provide a save_model function that allows you to upload the model as a separate step after fine-tuning.
Configure environment variable (.env)
Fill in the AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_REGION fields. If you don’t have these keys, please contact the infrastructure team.
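For reference, the .env entries look like this (placeholder values only):
AWS_ACCESS_KEY=<your-access-key>
AWS_SECRET_KEY=<your-secret-key>
AWS_REGION=<your-region>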
Upload model
To upload the model, you need to set up the storage configuration and specify the model path in the save_model function. The model path should point to the directory of your best adapter model.
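A rough sketch of this step, continuing from the quickstart's grpo_trainer, is shown below. Exposing save_model on the trainer and its parameter name are assumptions, and the adapter path is a placeholder; check the SDK for the actual signature.
# Sketch only - assumes save_model is available on the trainer and takes a model path;
# the actual parameter names may differ.
grpo_trainer.save_model(
    model_path="data/grpo/model/exp_1/Qwen3-0.6b/best_adapter"  # placeholder: directory of your best adapter model
)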