Group Relative Policy Optimization (GRPO)
What is Group Relative Policy Optimization (GRPO)?
Group Relative Policy Optimization (GRPO) is a reinforcement learning-based fine-tuning approach that optimizes a model using relative feedback across groups of candidate responses, rather than requiring absolute scores for individual outputs. For each input, the model generates multiple candidate responses that are evaluated by a reward function. GRPO then updates the policy by increasing the likelihood of higher-scoring responses and decreasing the likelihood of lower-scoring ones within the same group. This approach is particularly effective when you have preference data or quality comparisons between responses, and it typically produces more robust and preference-aligned model behaviors.
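To make the group-relative idea concrete, here is a minimal sketch (not this SDK's internal implementation) of how per-response rewards within one group can be converted into relative advantages:
# Sketch only: standardize each response's reward against its group's mean and
# standard deviation, so above-average responses get positive advantages and
# below-average responses get negative ones.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four candidate responses to the same prompt, scored by a reward function.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))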
Installation
On Linux/macOS:
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
On Windows (CMD):
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
Quickstart
Let's move on to a basic fine-tuning example using GRPOTrainer. To run GRPO fine-tuning, you need to specify a reward function, a model name, a column_mapping, and a dataset path.
# Main Code
from gllm_training import GRPOTrainer
from examples.llm_as_judge_reward_function import output_format_reward

# Configure the trainer with the base model, the dataset location, and the
# reward function(s) used to score candidate responses.
grpo_trainer = GRPOTrainer(
    model_name="Qwen/Qwen3-0.6B",
    datasets_path="examples/grpo_csv",
    reward_functions=[output_format_reward],
)

# Start GRPO fine-tuning.
grpo_trainer.train()
Fine-tuning a model using a YAML file
We can run experiments in a more structured way by using a YAML file. The current GRPO fine-tuning SDK supports both online data from Google Spreadsheets and local data in CSV format.
Example 1: Fine-tuning using online data
We can prepare our experiment using a YAML file, with training and validation data loaded from a Google Spreadsheet.
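The YAML schema itself is defined by the SDK and is not reproduced here; the sketch below is illustrative only. Every key mirrors the Python quickstart as an assumption rather than the SDK's documented configuration fields, and the spreadsheet URL is a placeholder.
# Illustrative sketch; keys are assumptions mirroring the Python quickstart,
# not the SDK's documented schema.
model_name: Qwen/Qwen3-0.6B
datasets_path: <your-google-spreadsheet-url>  # training and validation sheets
reward_functions:
  - examples.llm_as_judge_reward_function.output_format_reward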
Reward function
Reward functions evaluate model outputs and convert them into numerical feedback that GRPO uses to update the policy. In practice, a reward function (see the sketch after this list):
Takes a batch of completions (model-generated outputs).
Computes a float reward for each completion.
Returns a list of floats where each element corresponds to exactly one completion.
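As a concrete illustration, here is a minimal reward function following that contract. The function name, the completions argument, and the **kwargs catch-all are assumptions made for this sketch; check the SDK reference (or examples/llm_as_judge_reward_function.py) for the exact signature GRPOTrainer expects.
# Sketch only: the signature (completions plus **kwargs) is an assumption, not
# the SDK's confirmed interface.
def length_penalty_reward(completions, **kwargs):
    """Return one float per completion, rewarding concise outputs."""
    return [max(0.0, 1.0 - len(text) / 1000.0) for text in completions]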
Example 2: Fine-tuning using local data
The remaining hyperparameter configurations for fine-tuning are the same as when using online data. Below is an example YAML configuration for using local data for training and validation.
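As with the online-data example, the sketch below is illustrative only and its keys are assumptions rather than the SDK's documented schema; the main difference is that datasets_path points at a local CSV directory.
# Illustrative sketch; keys are assumptions, not the SDK's documented schema.
model_name: Qwen/Qwen3-0.6B
datasets_path: examples/grpo_csv  # local directory containing train/validation CSVs
reward_functions:
  - examples.llm_as_judge_reward_function.output_format_reward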
Upload model to cloud storage
When running experiments, we don’t always save the model directly to the cloud. Instead, we may first evaluate its performance before uploading it to cloud storage. To support this workflow, we provide a save_model function that allows you to upload the model as a separate step after fine-tuning.
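A minimal usage sketch follows, assuming save_model is exposed on the trainer instance and takes a destination for the uploaded artifacts; both the attachment point and the parameter shown are assumptions, so consult the SDK reference for the actual signature.
# Sketch only: where save_model lives and what it accepts are assumptions.
grpo_trainer.train()
# ... evaluate the fine-tuned model locally before deciding to publish it ...
grpo_trainer.save_model("gs://your-bucket/grpo-qwen3-0.6b")  # hypothetical destination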