
Releases: huggingface/trl

v0.20.0

29 Jul 04:59
30576d2

Breaking and major changes

🎞️ GSPO

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.


📜 Paper: https://huggingface.co/papers/2507.18071

To reproduce the paper's setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)

by @qgallouedec in #3775

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer


The GRPOTrainer can now be used for VLM training. Give it a try with this dummy example:

from trl import GRPOTrainer
from datasets import load_dataset

# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=[reward_num_unique_chars],
    train_dataset=dataset,
)

trainer.train()

by @CompN3rd and @kashif in #3072 and #3760

🐙 MPO


The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.

To combine multiple losses, specify the loss types and corresponding weights as lists:

from trl import DPOConfig

# MPO: Combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # Loss types to combine
    loss_weights=[0.8, 0.2, 1.0]  # Corresponding weights, as used in the MPO paper
)

by @qgallouedec in #2544

Add support for Continuous Batching (CB) with native transformers

Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    use_transformers_paged=True,
)

by @ArthurZucker in #3471

Add entropy-based filtering inside the GRPOTrainer


In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that using only the top 20% highest-entropy tokens yields performance similar to using all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile=0.2 in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    top_entropy_quantile=0.2,  # Use only the top 20% of tokens based on entropy
)

by @pramodith in #3563

👐 FSDP2+GRPO

GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:

accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py
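Here, run_grpo.py stands in for your GRPO training script; the minimal sketch below is illustrative (the dataset and reward function are placeholders chosen for this example, not part of the release):

# run_grpo.py -- minimal GRPO training script, for illustration only
from datasets import load_dataset
from trl import GRPOTrainer

# Any prompt-only dataset works; "trl-lib/tldr" is used here as an example
dataset = load_dataset("trl-lib/tldr", split="train")

# Dummy reward: prefer completions close to 200 characters
def reward_len(completions, **kwargs):
    return [-abs(200 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    train_dataset=dataset,
)
trainer.train()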

by @SalmanMohammadi in #3687

What's Changed


v0.19.1

08 Jul 01:07

What's Changed

New Contributors

Full Changelog: v0.19.0...v0.19.1

v0.19.0

21 Jun 14:04
5b3ea9d

Breaking and major changes

🧰 [SFT] Tool support

SFTTrainer now supports training with tools! You just have to add a tools column to your dataset containing a list of tool definitions as JSON schemas. The tools will be automatically registered and can be used during training.

from datasets import Dataset
from transformers.utils import get_json_schema
from trl import SFTTrainer

# Fictitious functions to simulate tool calls
def start_timer(duration: int) -> int:
    """
    Starts a timer for the specified duration in seconds.

    Args:
        duration: Duration in seconds to set the timer for.

    Returns:
        The duration set for the timer.
    """
    return duration

def create_reminder(time: str, note: str) -> str:
    """
    Creates a reminder for the specified time and note.

    Args:
        time: The time for the reminder.
        note: The note for the reminder.

    Returns:
        A confirmation message indicating that the reminder has been set.
    """
    return "I'll remind you to call mom at 7 PM."

# Define the JSON schemas for the tools
start_timer = get_json_schema(start_timer)
create_reminder = get_json_schema(create_reminder)

dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "Set a timer for 10 minutes."},
            {"role": "assistant", "tool_calls": [{"type": "function", "function": {"name": "start_timer", "arguments": {"duration": 600}}}]},
            {"role": "tool", "name": "start_timer", "content": "600"},
            {"role": "assistant", "content": "Timer set for 10 minutes."},
        ],
        ...,
    ],
    "tools": [
        [start_timer, create_reminder],
        ...,
    ]
})

# Initialize the trainer
trainer = SFTTrainer(model="Qwen/Qwen3-0.6B", train_dataset=dataset)

# Train the model
trainer.train()

by @qgallouedec in #3597

📉 FFD packing

We introduce a new packing method: FFD (First Fit Decreasing) packing. It groups examples more efficiently, reducing the size of the packed training dataset. Previously, we used a wrapped packing method, which often truncated sequences even when they were not longer than the maximum sequence length. The new FFD packing method avoids this unnecessary truncation by grouping sequences more intelligently, and it is now the default strategy when packing is enabled.

training_args = SFTConfig(..., packing=True)
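If you need the previous behavior, a strategy switch is exposed alongside this change; the sketch below assumes the parameter is named packing_strategy (check your TRL version for the exact name):

from trl import SFTConfig

# Assumed parameter name: "ffd" is the new default, "wrapped" restores the old behavior
training_args = SFTConfig(packing=True, packing_strategy="wrapped")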

by @qgallouedec in #3521 and accelerated by @mariosasko in #3537

[Liger] liger DPO support

The DPOTrainer now supports the Liger-powered DPO loss, enabling faster training with lower memory usage.

training_args = DPOConfig(..., use_liger_loss=True)

by @kashif in #2568

💬 Fix setup_chat_format and add clone_chat_template

We introduce clone_chat_template, a more convenient and flexible function for setting up chat templates from any tokenizer that already includes one. It handles EOS tokens and copies all added tokens from the source tokenizer, preserving their "special" status.
You can either use this function directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-4B")

or use the chat_template_path parameter in SFTConfig to specify a chat template, which will be automatically cloned when the SFTTrainer is initialized.

from trl import SFTConfig

training_args = SFTConfig(chat_template_path="Qwen/Qwen3-4B")

by @qgallouedec in #3404 and #3599

📚 SFTTrainer support chat template kwargs

SFTTrainer now supports passing additional keyword arguments to the chat template. This allows for more flexibility in customizing the chat format during training. To enable it, just add a chat_template_kwargs column to your dataset.

example = {'messages': [{'content': 'What is better than ugly?', 'role': 'user'},
                        {'content': 'Beautiful.', 'role': 'assistant'}],
           'chat_template_kwargs': {'my_template_arg': 'my_value'}}
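A minimal end-to-end sketch, assuming a kwarg that your model's chat template actually understands (here enable_thinking, which the Qwen3 template accepts; substitute your own):

from datasets import Dataset
from trl import SFTTrainer

# Each example carries its own chat template kwargs via the "chat_template_kwargs" column
dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "What is better than ugly?"},
            {"role": "assistant", "content": "Beautiful."},
        ],
    ],
    "chat_template_kwargs": [
        {"enable_thinking": False},
    ],
})

trainer = SFTTrainer(model="Qwen/Qwen3-0.6B", train_dataset=dataset)
trainer.train()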

by @qgallouedec in #3609

🤵‍♂️ SFT on assistant messages only

The SFTTrainer now supports training on assistant messages only:

example = {'messages': [
    {'role': 'user', 'content': 'What is better than ugly?'},          # masked in the loss
    {'role': 'assistant', 'content': 'Beautiful.'},                    # used in the loss
    {'role': 'user', 'content': 'And what is better than implicit?'},  # masked in the loss
    {'role': 'assistant', 'content': 'Explicit.'},                     # used in the loss
]}
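A minimal sketch of enabling this, assuming the masking is controlled by an assistant_only_loss flag in SFTConfig (verify the exact option name against your TRL version):

from trl import SFTConfig

# Assumed flag name: masks every non-assistant token out of the loss
training_args = SFTConfig(assistant_only_loss=True)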

by @qgallouedec in #3586

🧬 Add generation_kwargs as a property of GRPOConfig to support additional generation arguments

The GRPOConfig now includes a generation_kwargs property, allowing users to specify additional generation arguments for the GRPOTrainer. This allows for further customization of the generation behavior, such as setting suppress_tokens, num_beams, etc.
Depending on the generation backend used (transformers or vLLM), this property will be passed either to transformers.GenerationConfig (if using transformers) or vllm.SamplingParams (if using vLLM).

from trl import GRPOConfig

training_args = GRPOConfig(..., generation_kwargs={"length_penalty": -0.1})

by @pramodith in #3617

New defaults

Minor changes

  • Add support for IterableDataset in DPO Trainer by @h-tonywu in #3559
  • 🔖 Fix: ensure user-provided labels are retained in self._signature_columns by @sxndqc in #3589
  • ⭐ Add vllm_gpu_memory_utilization recommendation script by @toslali-ibm in #3554

What's Changed


v0.18.2

15 Jun 22:15

What's Changed

  • 🏗️ Add test for training with multiple dataloader workers and update worker initialization for compatibility with transformers 4.52.0 by @qgallouedec in #3568

Full Changelog: v0.18.1...v0.18.2

v0.18.1

03 Jun 01:31

What's Changed

Full Changelog: v0.18.0...v0.18.1

v0.18.0

28 May 01:56
ef4b0b2

Major or breaking

What's Changed

New Contributors

Full Changelog: v0.17.0...v0.18.0

v0.17.0

25 Apr 02:20
cd6b3de

Major and breaking

The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.


These three changes are:

  • Data parallelism (DP) for the vLLM server
  • A new GRPO training strategy that generates once per effective batch
  • Support for the V1 engine in vLLM

Below, we provide a summary of these changes and how to use them.

⚡ Up to 4x faster: Data Parallel for vLLM server

The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds—especially for smaller models. This new feature can be used by adding the --data_parallel_size N argument when launching the vLLM server.

trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2

by @qgallouedec in #3310

☝️ [GRPO] Generate once per effective batch

Previously, GRPO made one generation request per global batch. The global batch is the sum of all local batches, without accounting for gradient accumulation. In other words, if the number of gradient accumulation steps was 8, GRPO would make 8 generation requests per training step.

Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.

No changes are required in the training script, as this is handled internally by the GRPO trainer.


by @qgallouedec in #3283

⏱️ Fix vLLM server to support V1 Engine

vLLM provides two versions of its engine (V0 and V1), and V1 is significantly faster. This version is now supported by TRL and requires vLLM version 0.8.3 or higher.
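With vLLM 0.8.3+ the V1 engine is typically selected by default; if your installation still falls back to V0, vLLM's VLLM_USE_V1 environment variable can usually force it when launching the server (this is an assumption about your vLLM setup, not part of the TRL change):

VLLM_USE_V1=1 trl vllm-serve --model Qwen/Qwen2.5-7B-Instruct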

by @I-l-l-I in #3276

👎 [GRPO] Adds option to disable dropout

Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting disable_dropout=True in the GRPO config.

from trl import GRPOConfig

training_args = GRPOConfig(..., disable_dropout=True)

by @edbeeching in #3234

🩺 Dr. GRPO loss

GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., loss_type="dr_grpo")

by @qgallouedec in #3256

🎲 [GRPO] Make training dataset shuffle optional

The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.

from trl import GRPOConfig

training_args = GRPOConfig(..., shuffle_dataset=False)

by @LeonEricsson in #3334

☕ Overlong-filtering for GRPO

Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL!

It simply consists of masking the loss of truncated completions:

from trl import GRPOConfig

training_args = GRPOConfig(..., mask_truncated_completions=True)


by @shirinyamani in #3248

🐯 Integrate Liger GRPO Loss to GRPO Trainer

Liger allows you to significantly reduce the peak memory usage of the loss computation. You can now use it in TRL with the use_liger_loss argument in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., use_liger_loss=True)

by @shivam15s in #3184

Bug fixes

What's Changed


v0.16.1

04 Apr 18:23

What's Changed

Full Changelog: v0.16.0...v0.16.1

v0.16.0

22 Mar 21:18
23a635e

Major and breaking

🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication

Previously, vLLM could only be used by dedicating a single GPU to it, which prevented both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!

GRPO can now scale efficiently with models exceeding 70B parameters, supporting multi-node training with super-fast performance.


To take advantage of this, simply launch a vLLM server using the following command:

trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>

Then, start GRPO training with use_vllm=True.
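On the training side, a minimal sketch of the config (options for pointing at a remote server host and port also exist; check their names and defaults against your TRL version):

from trl import GRPOConfig

# Route generation through the running vLLM server
training_args = GRPOConfig(..., use_vllm=True)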

Below is a comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.

@binary-husky and @qgallouedec in #3094

🐦‍🔥 6x faster GRPO with multi-step optimization

This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.

To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.


To use it, simply set num_iterations to a value greater than 1.

training_args = GRPOConfig(..., num_iterations=4)

by @qgallouedec in #2899

🌍 Use global normalization in GRPO

As demonstrated in Dr. GRPO, sequence-level normalization can introduce a response-level length bias.


To address this, we have now switched to normalizing the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.

- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()

by @edbeeching in #2881

⚖️ Add option not to scale rewards

As demonstrated in Dr GRPO, scaling rewards can introduce a question-level difficulty bias. To address this, we have now added an option to disable reward scaling in GRPO.

training_args = GRPOConfig(..., scale_rewards=False)

  advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+     advantages = advantages / std_grouped_rewards

It's likely that we'll make this (scale_rewards=False) the default behavior in the future.

by @qgallouedec in #3135

🤸‍♀️ Domain-specific rewards in GRPO

When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.

It is now possible to return None for rewards that do not make sense for a given sample. For instance, when the domain is specified in a column like domain, you can implement it as follows:


def math_reward(completions, domain, **kwargs):
    rewards = []
    for completion, dom in zip(completions, domain):
        if dom == "math":
            rewards.append(verify(completion))
        else:
            rewards.append(None)
    return rewards
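For illustration, a hypothetical second reward and the trainer wiring could look like the sketch below (verify above is a placeholder for your own math checker, and grammar_reward is made up for this example):

from trl import GRPOTrainer

# Hypothetical second domain reward: returns None for samples outside its domain
def grammar_reward(completions, domain, **kwargs):
    return [1.0 if dom == "grammar" else None for dom in domain]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[math_reward, grammar_reward],
    train_dataset=dataset,  # must contain a "domain" column
)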

This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don’t interfere with optimization.

by @shirinyamani in #3079

🍃 Do not load reference model when beta == 0.0

It has been observed that not minimizing the KL divergence between the trained model and the reference model can still yield good results, while significantly reducing memory usage and compute. This is because there is no need to store the reference model in memory or perform a forward pass for it.

When beta is set to 0.0, the reference model is not loaded, and the KL divergence is not computed, leading to savings in both time and memory.

training_args = GRPOConfig(..., beta=0.0)

by @ingambe in #2806

🕊️ Padding-free for SFT

Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.

To enable padding-free batching in SFT, simply set padding_free=True in the SFTConfig, and make sure to use flash_attention_2 as the attention implementation.

training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})

by @qgallouedec in #3076

🎬 Clip Higher for Better Exploration

As outlined in the DAPO paper, increasing the upper bound epsilon leads to higher entropy during generation, promoting better exploration. To enable this, we’ve added support for adjusting the upper bound epsilon directly in the default GRPO trainer.

training_args = GRPOConfig(epsilon_high=0.28)

by @shirinyamani in #3118

Bug fixes

Minor

What's Changed

  • [SFT] fix check for AutoLigerKernelForCausalLM by @kashif in #2874
  • 🆙 Bump vLLM min version to 0.7.2 by @edbeeching in #2860
  • [GRPO] Fix loss normalization by @edbeeching in #2881
  • 💬 Add maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in #2862
  • 🧶 [GRPO][vLLM + LoRA] Move unmerge of PEFT model after weight loading by @XZ-X in #2873
  • 🍟 [SFT] Handles the dataset if it has been preprocessed by @BenasdTW in #2863
  • Optimize vllm num_generations ...

v0.15.2

25 Feb 22:40

What's Changed

Full Changelog: v0.15.1...v0.15.2