Direct Preference Optimization (DPO)
What is Direct Preference Optimization (DPO)?
Direct Preference Optimization (DPO) is a preference-based fine-tuning technique that aligns a model using paired comparisons between responses, rather than relying on reinforcement learning or reward models. For each input prompt, DPO uses a chosen response (preferred) and a rejected response (less preferred) to directly increase the likelihood of generating the chosen output while decreasing the likelihood of the rejected one. This is achieved through a closed-form optimization objective that simplifies training while still capturing preference signals effectively. DPO is particularly useful when you have datasets that express relative human preferences, and it typically produces stable, efficient, and preference-aligned model behaviors.
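To make the objective concrete, here is a minimal, self-contained sketch of the standard DPO loss for a single preference pair. This is illustrative code, not the SDK's internal implementation; beta is the usual preference-strength temperature.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a response under
    the trained policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy vs. reference likelihood.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy clearly prefers the
    # chosen response, large when it prefers the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss is the negative log-sigmoid of the implicit reward margin, so it decreases as the policy separates chosen from rejected responses more strongly than the reference model does.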
Prerequisites
Before installing, make sure you have:
Pip
gcloud CLI - required because gllm-training is a private library hosted in a private Google Cloud repository
After installing, run gcloud auth login to authorize gcloud to access the Cloud Platform with your Google user credentials.
Our internal gllm-training package is hosted in a secure Google Cloud Artifact Registry. You need to authenticate via the gcloud CLI to access and download the package during installation.
The minimum requirements:
CUDA-compatible GPU
Recommended GPUs:
RTX A5000
RTX 40/50 series
Windows/Linux; macOS is currently not supported
Installation
Linux (bash):
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
Windows (cmd):
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" "gllm-training"
Quickstart
Let's move on to a basic fine-tuning example using DPOTrainer. To run DPO fine-tuning, you need to specify a model name, a dpo_column_mapping, and a dataset path. Make sure your dataset contains prompt, chosen (the preferred response), and rejected (the less preferred response) columns.
# Main Code
from gllm_training import DPOTrainer

# Point the trainer at a base model and a folder of CSV preference data.
dpo_trainer = DPOTrainer(
    model_name="Qwen/Qwen3-0.6b",
    datasets_path="examples/dpo_csv"
)
dpo_trainer.train()
Fine-tuning a model using a YAML file
We can run experiments in a more structured way by using a YAML file. The current DPO fine-tuning SDK supports both online data from Google Spreadsheets and local data in CSV format.
Example 1: Fine-tuning using online data.
We can prepare our experiment using a YAML file, with the training and validation data coming from a Google Spreadsheet.
Configure environment variables (.env)
Fill in the GOOGLE_SHEETS_CLIENT_EMAIL and GOOGLE_SHEETS_PRIVATE_KEY fields. If you don’t have these keys, please contact the infrastructure team.
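The resulting .env entries might look like this (the values shown are placeholders, not real credentials):

```
GOOGLE_SHEETS_CLIENT_EMAIL=my-service-account@my-project.iam.gserviceaccount.com
GOOGLE_SHEETS_PRIVATE_KEY="-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
```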
Share the spreadsheet
Share your Google Spreadsheet containing the training and validation data with the GOOGLE_SHEETS_CLIENT_EMAIL.
Experiment configuration (dpo_experiment_config.yml)
You can use a YAML file to plan your fine-tuning experiments. To fine-tune with a YAML file, you need to define the required variables in it.
(Notes) column_mapping_config
The configuration is split into two main parts: input_columns and label_columns.
Input columns
The input_columns section maps placeholders in your user prompt template to the actual column names in your dataset.
placeholder_in_user_prompt: The placeholder name inside the user prompt template (e.g., query)
column_name_in_your_data: The actual column name from your Google Sheet or CSV file (e.g., user_query)
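For illustration, an input_columns entry for a user prompt template containing a {query} placeholder might look like this. Treat it as a hypothetical sketch; the exact YAML layout may differ in your SDK version.

```yaml
input_columns:
  - placeholder_in_user_prompt: query        # fills {query} in the user prompt template
    column_name_in_your_data: user_query     # actual column in your Google Sheet or CSV
```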
Output columns
Output columns support either dictionary or string formats for fine-tuned models.
Output dictionary
YAML format
Expected output
Output string
YAML format
Expected output
Fine tuning
To run your fine-tuning, load the YAML configuration using the YamlConfigLoader function and select the experiment ID when calling the load function.
Example 2: Fine-tuning using local data.
The remaining hyperparameter configurations for fine-tuning are the same as when using online data. Below is an example YAML configuration for using local data for training and validation.
Datasets Format
Training and validation data
The column names should correspond to what you define in the column_mapping_config.
Minimum Required Columns:
prompt: The input query or instruction provided to the model.
chosen: The preferred or "good" response that the model should learn to favor.
rejected: The less preferred or "bad" response that the model should learn to avoid.
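As a concrete illustration, a minimal training CSV with the three required columns could be generated like this (the file name and example row are made up for demonstration):

```python
import csv

# One illustrative preference example with the three required columns.
rows = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
]

# Write the dataset in the prompt/chosen/rejected format described above.
with open("dpo_train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "chosen", "rejected"])
    writer.writeheader()
    writer.writerows(rows)
```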
Prompts
The prompt data should contain the following columns:
name: A unique identifier for the prompt template.
system: The system prompt, which sets the model's role and context. It does not contain placeholders.
user: The user prompt template. It must contain placeholders (e.g., {prompt}) that will be replaced by data from your input columns.
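Placeholder substitution behaves like standard Python string formatting; a hypothetical template and data row would be combined like this:

```python
# Illustrative user prompt template with a {prompt} placeholder.
user_template = "Answer the following question concisely:\n{prompt}"

# At training time the placeholder is replaced by the mapped input column.
row = {"prompt": "What is the capital of France?"}
rendered = user_template.format(**row)
```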
Logging and Monitoring
During the DPO fine-tuning process, the SDK automatically generates comprehensive logs to help you monitor training progress and debug issues. These logs are stored in two formats:
JSONL Logs (Structured Training Metrics)
The SDK generates structured JSONL logs that capture detailed training metrics at each step. These logs are stored in:
Example path: data/dpo/model/exp_1/Qwen3-0.6b/logs/dpo_train_steps.jsonl
Each line in the JSONL file contains a JSON object with training metrics such as:
step: Training step number
loss: DPO training loss at that step
learning_rate: Current learning rate
epoch: Current epoch number
rewards/chosen: Implicit rewards for chosen (preferred) responses
rewards/rejected: Implicit rewards for rejected (less preferred) responses
rewards/margins: Margin between chosen and rejected rewards
logps/chosen: Log probabilities of chosen responses
logps/rejected: Log probabilities of rejected responses
And other relevant metrics
You can parse these logs programmatically or use tools like jq to analyze the training progression:
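For example, this stand-alone snippet parses such a file and extracts the reward margin per step (the two sample lines mirror the fields listed above; real logs contain more keys):

```python
import json

# Two illustrative JSONL log lines in the format described above.
sample = (
    '{"step": 1, "loss": 0.69, "rewards/margins": 0.00}\n'
    '{"step": 2, "loss": 0.52, "rewards/margins": 0.15}\n'
)
with open("dpo_train_steps.jsonl", "w", encoding="utf-8") as f:
    f.write(sample)

# Parse each line and collect the preference margin per training step.
margins = []
with open("dpo_train_steps.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        margins.append((record["step"], record["rewards/margins"]))
```

In a healthy run, rewards/margins trends upward across steps while the loss decreases.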
Note: The JSONL file contains training metrics including preference margins between chosen and rejected responses. Use select() to filter for the specific type of data you need. The rewards/accuracies metric shows how often the model correctly predicts which response is preferred.
TensorBoard Logs (Visual Monitoring)
For visual monitoring and analysis, the SDK also generates TensorBoard-compatible logs stored in:
Example path: data/dpo/model/exp_999/Qwen3-0.6b/logs_tensorboard
To visualize your training progress:
Launch TensorBoard:
Open your browser and navigate to http://localhost:6006
Monitor key metrics:
Training loss curves
Reward margins (chosen vs rejected)
Implicit reward trends
Learning rate scheduling
Log probability distributions
Step-by-step progress
Epoch progression
Log Configuration
You can customize logging behavior through the hyperparameters configuration:
Best Practices
Monitor reward margins: Check that the margin between chosen and rejected rewards increases over time - if margins stay flat or decrease, the model isn't learning to distinguish between preferred and non-preferred responses
Track loss patterns: Use TensorBoard to see DPO loss trends - the loss should decrease steadily; sudden spikes may indicate learning rate issues or data quality problems with chosen/rejected pairs
Debug preference learning: JSONL logs show reward statistics for each step - look for positive margins (chosen rewards higher than rejected), check if margins are too small (weak preferences), or if they fluctuate wildly (inconsistent data)
Monitor implicit rewards: Both chosen and rejected rewards should be reasonable - if rejected rewards are too high or chosen rewards are too low, your preference data may have labeling issues
Compare experiments: Save logs from each experiment to compare which hyperparameters and preference datasets produce the best alignment - use TensorBoard to view multiple experiments side-by-side
Upload model to cloud storage
When running experiments, we don’t always save the model directly to the cloud. Instead, we may first evaluate its performance before uploading it to cloud storage. To support this workflow, we provide a save_model function that lets you upload the model as a separate step after fine-tuning.
Configure environment variable (.env)
Fill in the AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_REGION fields. If you don’t have these keys, please contact the infrastructure team.
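The corresponding .env entries might look like this (placeholder values; the region is just an example):

```
AWS_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_REGION=us-east-1
```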
Upload model
To upload the model, set up the storage configuration and pass the model path to the save_model function. The model path should point to the directory of your best adapter model.