Dynamic sampling option in GRPO trainer based on DAPO paper #3758

Open
almeidava93 wants to merge 2 commits into main

Conversation

almeidava93

What does this PR do?

This PR adds the option to enable dynamic sampling in the GRPO trainer. Dynamic sampling is described in the DAPO paper and consists of over-sampling and filtering out prompts whose rewards have zero variance. This makes the algorithm more efficient by avoiding training on samples that carry no learning signal. As stated in the DAPO paper, this procedure alone increased their model's accuracy on AIME24 (avg@32) from 42% to 50%.

Specifically, this contribution adds two GRPOConfig options:

  • use_dynamic_sampling: a boolean argument that enables dynamic sampling
  • max_num_samplings: an integer argument that sets the maximum number of resampling rounds.

This algorithm resamples completions from the model until a non-zero standard deviation is found in the computed rewards, or until the maximum number of samplings is reached. The GRPOConfig docstring was also updated.
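
Here is a simplified sketch of the logic (illustration only, not the actual trainer code; `generate` and `reward_fn` are placeholders for the model's generation step and the reward computation):

```python
import torch

def sample_with_dynamic_sampling(generate, reward_fn, prompts,
                                 num_generations=8, max_num_samplings=10):
    """Resample a group of completions per prompt until the group's rewards have a
    non-zero standard deviation, or until max_num_samplings is reached."""
    all_completions, all_rewards = [], []
    for prompt in prompts:
        for _ in range(max_num_samplings):
            completions = generate(prompt, n=num_generations)  # list of completions
            rewards = torch.tensor([reward_fn(prompt, c) for c in completions],
                                   dtype=torch.float32)
            if rewards.std() > 0:  # non-zero variance: the group carries a learning signal
                break
        # if every attempt had zero variance, fall back to the last sampled group
        all_completions.append(completions)
        all_rewards.append(rewards)
    return all_completions, all_rewards
```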

Fixes #3708 ([Feature Request] support dynamic sampling for GRPO trainer)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec
Member

Thanks for the PR!
Were you able to reproduce this result?

@almeidava93
Author

Unfortunately not. They used a 32B base model. I could try something similar with a much smaller model, but it wouldn't be the same thing.

@almeidava93
Author

almeidava93 commented Jul 23, 2025

I have a hardware limitation. I'm trying to reproduce the experiment with a llama 1B model, but this will take time. I would need to first do GRPO training and then retrain using dynamic sampling. But looking at their implementation, it is pretty similar to what I've added.

@LeonEricsson
Collaborator

LeonEricsson commented Jul 25, 2025

I would also like to see a wall-time comparison; dynamic sampling certainly increases sample efficiency, but since it requires re-sampling, does it actually result in a net gain in training time?

@qgallouedec
Member

Yes! It was the lack of information on this that prompted us to wait a bit before implementing it.

@almeidava93
Author

Do you believe that an experiment with smaller models is enough? I'm talking about a 1B model trained with LoRA. I thought about using the same data they used in the paper for training and evaluation and comparing the performance of GRPO only versus GRPO + dynamic sampling.

@LeonEricsson
Collaborator

> Do you believe that an experiment with smaller models is enough? I'm talking about a 1B model trained with LoRA. I thought about using the same data they used in the paper for training and evaluation and comparing the performance of GRPO only versus GRPO + dynamic sampling.

That sounds like a solid starting point. Depending on the results, we might consider running tests on larger models later. I can potentially handle that if necessary, but please proceed with the initial experiment first; that would be great!

@almeidava93
Author

almeidava93 commented Jul 28, 2025

I have some results to share. I did two experiments using the Llama 3.2 1B model with LoRA: one with standard GRPO, the other with GRPO + dynamic sampling.

In the paper, they mention that they resampled completions until every prompt in the batch had rewards with a non-zero standard deviation. Due to hardware limitations, I had to cap resampling at a maximum, which was set to 10. The maximum number of generated tokens was set to 1000. The other techniques presented in the paper (Clip-Higher, Token-Level Loss, Soft Overlong Punishment) were not applied. Also, to try to answer your question regarding training time, standard GRPO was trained for 1000 training steps, and GRPO + dynamic sampling was trained for the same wall-clock time, regardless of the number of steps it completed.

Performance on AIME-2024:

  • Llama 1B GRPO - 1000 training steps: 0.0052%
  • Llama 1B GRPO + dynamic sampling - 250 training steps: 0.0031%

Total training time was about 7h30 for each experiment. Here's the hardware I'm using: a 13th Gen Intel(R) Core(TM) i9-13900K (3.00 GHz, 24 cores), 64 GB DDR5 RAM, and an NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM.

Contrary to my findings, they specifically state that training time was not longer with dynamic sampling. I wonder what kind of optimizations I should make to replicate their setup. What are your thoughts?

One possible optimization would be oversampling prompts up front instead of only resampling completions, roughly as in the sketch below.
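
Something like this (a sketch only, not trainer code; `prompt_stream`, `generate_group`, and `reward_std` are placeholders for the dataloader, the generation step, and the per-group reward standard deviation):

```python
def fill_batch_by_oversampling(prompt_stream, generate_group, reward_std, batch_size):
    """Draw fresh prompts until `batch_size` groups with non-zero reward variance
    are collected; zero-variance groups are discarded instead of being resampled."""
    batch = []
    for prompt in prompt_stream:
        group = generate_group(prompt)      # one group of completions for this prompt
        if reward_std(group) > 0:           # keep only groups with a learning signal
            batch.append((prompt, group))
        if len(batch) == batch_size:
            break
    return batch
```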

@LeonEricsson
Collaborator

LeonEricsson commented Jul 28, 2025

Am I understanding you correctly that with just max_num_samplings=10, training was 4× slower in terms of steps?

@almeidava93
Author

almeidava93 commented Jul 28, 2025

Yep, that was my finding. But I didn't use a couple of strategies that could reduce the need for resampling (oversampling) or speed up inference (vLLM; see the sketch below). Do you have any suggestions on how to tackle this given my hardware limitations?

I believe this result may also change depending on the base model. If the base model has a higher probability of generating a correct answer, the need for resampling is lower and training becomes more efficient as well. This Llama 3.2 1B model is very limited compared with the Qwen 32B they used.
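
For the inference side, something like this configuration is what I have in mind (use_vllm and num_generations already exist in GRPOConfig; use_dynamic_sampling and max_num_samplings are the options added in this PR, so the exact combination is only a sketch):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="llama-1b-grpo-dynamic-sampling",
    num_generations=8,            # completions sampled per prompt (group size)
    use_vllm=True,                # offload generation to vLLM to speed up sampling
    use_dynamic_sampling=True,    # enable resampling of zero-variance prompts (this PR)
    max_num_samplings=10,         # cap on resampling rounds (this PR)
)
```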
