Dynamic sampling option in GRPO trainer based on DAPO paper #3758

Open
almeidava93 wants to merge 2 commits into main

Conversation

almeidava93

What does this PR do?

This PR adds the option to enable dynamic sampling in the GRPO trainer. Dynamic sampling is described in the DAPO paper and consists of over-sampling and filtering out prompts whose rewards have zero variance. This makes the algorithm more efficient by avoiding training on samples that carry no learning signal. As stated in the DAPO paper, this procedure alone increased their model's accuracy on AIME24 (avg@32) from 42% to 50%.

Specifically, this contribution adds two GRPOConfig options:

  • use_dynamic_sampling: a boolean argument that enables dynamic sampling
  • max_num_samplings: an integer argument that sets the maximum number of resampling rounds.

This algorithm resamples completions from the model until a non-zero standard deviation is found in the computed rewards, or until the maximum number of samplings is reached. The GRPOConfig docstring was also updated.
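
Here is a simplified sketch of the logic (illustration only, not the actual trainer code; `generate` and `reward_fn` are placeholders for the model's generation step and the reward computation):

```python
import torch

def sample_with_dynamic_sampling(generate, reward_fn, prompts,
                                 num_generations=8, max_num_samplings=10):
    """Resample a group of completions per prompt until the group's rewards have a
    non-zero standard deviation, or until max_num_samplings is reached."""
    all_completions, all_rewards = [], []
    for prompt in prompts:
        for _ in range(max_num_samplings):
            completions = generate(prompt, n=num_generations)  # list of completions
            rewards = torch.tensor([reward_fn(prompt, c) for c in completions],
                                   dtype=torch.float32)
            if rewards.std() > 0:  # non-zero variance: the group carries a learning signal
                break
        # if every attempt had zero variance, fall back to the last sampled group
        all_completions.append(completions)
        all_rewards.append(rewards)
    return all_completions, all_rewards
```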

Fixes #3708 ([Feature Request] support dynamic sampling for GRPO trainer)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec
Member

Thanks for the PR!
Were you able to reproduce this result?

@almeidava93
Author

Unfortunately not. They used a 32B base model. I could try something similar with a much smaller model, but it wouldn't be the same thing.

@almeidava93
Author

almeidava93 commented Jul 23, 2025

I have a hardware limitation. I'm trying to reproduce the experiment with a llama 1B model, but this will take time. I would need to first do GRPO training and then retrain using dynamic sampling. But looking at their implementation, it is pretty similar to what I've added.

@LeonEricsson
Collaborator

LeonEricsson commented Jul 25, 2025

I would also like to see a wall-time comparison; dynamic sampling certainly increases sample efficiency, but since it requires re-sampling, does it actually result in a net gain in training time?

@qgallouedec
Member

Yes! It was the lack of information on this that prompted us to wait a bit before implementing it.

@almeidava93
Author

Do you believe that an experiment with smaller models is enough? I'm talking about a 1B model trained with LoRA. I thought about using the same data they used in the paper for training and evaluation and comparing the performance of GRPO only versus GRPO + dynamic sampling.

@LeonEricsson
Collaborator

> Do you believe that an experiment with smaller models is enough? I'm talking about a 1B model trained with LoRA. I thought about using the same data they used in the paper for training and evaluation and comparing the performance of GRPO only versus GRPO + dynamic sampling.

That sounds like a solid starting point. Depending on the results, we might consider running tests on larger models later. I can potentially handle that if necessary, but please proceed with the initial experiment first; that would be great!

@almeidava93
Author

almeidava93 commented Jul 28, 2025

I have some results to share. I did two experiments using the Llama 3.2 1B model with LoRA: one with standard GRPO, the other with GRPO + dynamic sampling.

In the paper, they mention that they resampled completions until every prompt in the batch had rewards with a non-zero standard deviation. Due to hardware limitations, I had to cap resampling at a maximum, which was set to 10. The maximum number of generated tokens was set to 1000. The other techniques presented in the paper (Clip-Higher, Token-Level Loss, Soft Overlong Punishment) were not applied. Also, to try to answer your question regarding training time, standard GRPO was trained for 1000 training steps, and GRPO + dynamic sampling was trained for the same wall-clock time, regardless of the number of steps it completed.

Performance on AIME-2024:

  • Llama 1B GRPO - 1000 training steps: 0.0052%
  • Llama 1B GRPO + dynamic sampling - 250 training steps: 0.0031%

Total training time was about 7h30 for each experiment. Here's the hardware I'm using: a 13th Gen Intel(R) Core(TM) i9-13900K (3.00 GHz, 24 cores), 64 GB DDR5 RAM, and an NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM.

Contrary to my findings, they specifically state that training time was not longer with dynamic sampling. I wonder what kind of optimizations I should make to replicate their setup. What are your thoughts?

One possible optimization would be oversampling prompts up front instead of only resampling completions, roughly as in the sketch below.
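
Something like this (a sketch only, not trainer code; `prompt_stream`, `generate_group`, and `reward_std` are placeholders for the dataloader, the generation step, and the per-group reward standard deviation):

```python
def fill_batch_by_oversampling(prompt_stream, generate_group, reward_std, batch_size):
    """Draw fresh prompts until `batch_size` groups with non-zero reward variance
    are collected; zero-variance groups are discarded instead of being resampled."""
    batch = []
    for prompt in prompt_stream:
        group = generate_group(prompt)      # one group of completions for this prompt
        if reward_std(group) > 0:           # keep only groups with a learning signal
            batch.append((prompt, group))
        if len(batch) == batch_size:
            break
    return batch
```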

@LeonEricsson
Collaborator

LeonEricsson commented Jul 28, 2025

Am I understanding you correctly that with just max_num_samplings=10, training was 4× slower in terms of steps?

@almeidava93
Author

almeidava93 commented Jul 28, 2025

Yep, that was my finding. But I didn't use a couple of strategies that could reduce the need for resampling (oversampling) or speed up inference (vLLM; see the sketch below). Do you have any suggestions on how to tackle this given my hardware limitations?

I believe this result may also change depending on the base model. If the base model has a higher probability of generating a correct answer, the need for resampling is lower and training becomes more efficient as well. This Llama 3.2 1B model is very limited compared with the Qwen 32B they used.
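
For the inference side, something like this configuration is what I have in mind (use_vllm and num_generations already exist in GRPOConfig; use_dynamic_sampling and max_num_samplings are the options added in this PR, so the exact combination is only a sketch):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="llama-1b-grpo-dynamic-sampling",
    num_generations=8,            # completions sampled per prompt (group size)
    use_vllm=True,                # offload generation to vLLM to speed up sampling
    use_dynamic_sampling=True,    # enable resampling of zero-variance prompts (this PR)
    max_num_samplings=10,         # cap on resampling rounds (this PR)
)
```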
