Group Relative Policy Optimization (GRPO)
What is Group Relative Policy Optimization (GRPO)?
Group Relative Policy Optimization (GRPO) is a reinforcement learning-based fine-tuning approach that optimizes a model using relative feedback across groups of candidate responses, rather than requiring absolute scores for individual outputs. For each input, the model generates multiple candidate responses that are evaluated by a reward function. GRPO then updates the policy by increasing the likelihood of higher-scoring responses and decreasing the likelihood of lower-scoring ones within the same group. This approach is particularly effective when you have preference data or quality comparisons between responses, and it typically produces more robust and preference-aligned model behaviors.
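To make the group-relative idea concrete, here is a minimal sketch (not this SDK's internal implementation) of how per-response rewards within one group can be converted into relative advantages:
# Sketch only: standardize each response's reward against its group's mean and
# standard deviation, so above-average responses get positive advantages and
# below-average responses get negative ones.
def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four candidate responses to the same prompt, scored by a reward function.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))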
Installation
On Linux/macOS:
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
On Windows (CMD):
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
Quickstart
Let's move on to a basic fine-tuning example using GRPOTrainer. To run GRPO fine-tuning, you need to specify a reward function, a model name, a column_mapping, and a dataset path.
# Main Code
from gllm_training import GRPOTrainer
from examples.llm_as_judge_reward_function import output_format_reward

# Configure the trainer with the base model, the dataset location, and the
# reward function(s) used to score candidate responses.
grpo_trainer = GRPOTrainer(
    model_name="Qwen/Qwen3-0.6B",
    datasets_path="examples/grpo_csv",
    reward_functions=[output_format_reward],
)

# Start GRPO fine-tuning.
grpo_trainer.train()
Fine-tuning a model using a YAML file
We can run experiments in a more structured way by using a YAML file. The current GRPO fine-tuning SDK supports both online data from Google Spreadsheets and local data in CSV format.
Example 1: Fine-tuning using online data
We can prepare our experiment using a YAML file, with training and validation data loaded from a Google Spreadsheet.
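The YAML schema itself is defined by the SDK and is not reproduced here; the sketch below is illustrative only. Every key mirrors the Python quickstart as an assumption rather than the SDK's documented configuration fields, and the spreadsheet URL is a placeholder.
# Illustrative sketch; keys are assumptions mirroring the Python quickstart,
# not the SDK's documented schema.
model_name: Qwen/Qwen3-0.6B
datasets_path: <your-google-spreadsheet-url>  # training and validation sheets
reward_functions:
  - examples.llm_as_judge_reward_function.output_format_reward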
Reward function
Reward functions evaluate model outputs and convert them into numerical feedback that GRPO uses to update the policy. In practice, a reward function (see the sketch after this list):
Takes a batch of completions (model-generated outputs).
Computes a float reward for each completion.
Returns a list of floats where each element corresponds to exactly one completion.
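As a concrete illustration, here is a minimal reward function following that contract. The function name, the completions argument, and the **kwargs catch-all are assumptions made for this sketch; check the SDK reference (or examples/llm_as_judge_reward_function.py) for the exact signature GRPOTrainer expects.
# Sketch only: the signature (completions plus **kwargs) is an assumption, not
# the SDK's confirmed interface.
def length_penalty_reward(completions, **kwargs):
    """Return one float per completion, rewarding concise outputs."""
    return [max(0.0, 1.0 - len(text) / 1000.0) for text in completions]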
Example 2: Fine-tuning using local data
The remaining hyperparameter configurations for fine-tuning are the same as when using online data. Below is an example YAML configuration for using local data for training and validation.
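As with the online-data example, the sketch below is illustrative only and its keys are assumptions rather than the SDK's documented schema; the main difference is that datasets_path points at a local CSV directory.
# Illustrative sketch; keys are assumptions, not the SDK's documented schema.
model_name: Qwen/Qwen3-0.6B
datasets_path: examples/grpo_csv  # local directory containing train/validation CSVs
reward_functions:
  - examples.llm_as_judge_reward_function.output_format_reward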
Upload model to cloud storage
When running experiments, we don’t always save the model directly to the cloud. Instead, we may first evaluate its performance before uploading it to cloud storage. To support this workflow, we provide a save_model function that allows you to upload the model as a separate step after fine-tuning.
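A minimal usage sketch follows, assuming save_model is exposed on the trainer instance and takes a destination for the uploaded artifacts; both the attachment point and the parameter shown are assumptions, so consult the SDK reference for the actual signature.
# Sketch only: where save_model lives and what it accepts are assumptions.
grpo_trainer.train()
# ... evaluate the fine-tuned model locally before deciding to publish it ...
grpo_trainer.save_model("gs://your-bucket/grpo-qwen3-0.6b")  # hypothetical destination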