Dynamic sampling option in GRPO trainer based on DAPO paper #3758
Conversation
Thanks for the PR!
Unfortunately not. They've used a 32B base model. I could try something similar with a much smaller model, but it wouldn't be the same thing.
I have a hardware limitation. I'm trying to reproduce the experiment with a Llama 1B model, but this will take time. I would need to first do GRPO training and then retrain using dynamic sampling. But looking at their implementation, it is pretty similar to what I've added.
I would also like to see a wall-time comparison; dynamic sampling certainly increases sample efficiency, but since it requires re-sampling, does it actually result in a net gain in training time?
Yes! It was the lack of information on this that prompted us to wait a bit before implementing it.
Do you believe that an experiment with smaller models is enough? I'm talking about a 1B model trained with LoRA. I thought about using the same data they used in the paper for training and evaluation, and comparing the performance between GRPO only and GRPO + dynamic sampling.
That sounds like a solid starting point. Depending on the results, we might consider running tests on larger models later. I can potentially handle that if necessary, but please proceed with the initial experiment first; that would be great!
I have some results to share. I ran two experiments using the Llama 3.2 1B model with LoRA: one with standard GRPO and one with GRPO + dynamic sampling. In the paper, they mention that they resampled the model until every sample had rewards with a non-zero standard deviation. Due to hardware limitations, I had to cap resampling at a maximum of 10 attempts. The maximum number of generated tokens was set to 1000. The other techniques presented in the paper (Clip-Higher, Token-Level Loss, Soft Overlong Punishment) were not applied. Also, to address your question about training time, the standard GRPO run was trained for 1000 steps, and the GRPO + dynamic sampling run was trained for the same wall-clock time, regardless of the number of training steps. Performance on AIME-2024:
Total training time was about 7h30 for each experiment. Here's the hardware I'm using: 13th Gen Intel(R) Core(TM) i9-13900K (3.00 GHz, 24 cores), 64 GB DDR5 RAM, and an NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM. Unlike my finding, the paper specifically states that training time was not longer with dynamic sampling. I wonder what kind of optimizations I should make to replicate their setup. What are your thoughts? One possible optimization would be oversampling instead of only resampling completions, as sketched below.
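As a rough sketch of that oversampling idea (not code from this PR; `generate_and_score_batch` and `oversample_factor` are hypothetical), one could generate for more prompts than the batch needs in a single pass and keep only the groups with reward variance:

```python
def oversample_batch(prompt_pool, batch_size, generate_and_score_batch, oversample_factor=2):
    """Generate completions for `oversample_factor * batch_size` prompts in one
    batched pass, then keep up to `batch_size` prompts whose reward groups have
    non-zero standard deviation. This trades extra upfront generation for fewer
    sequential resampling rounds.

    `generate_and_score_batch` is a hypothetical helper returning a tensor of
    rewards with shape (num_prompts, num_generations).
    """
    candidates = prompt_pool[: oversample_factor * batch_size]
    rewards = generate_and_score_batch(candidates)
    # Keep only prompts whose reward group has variance, up to the batch size.
    keep = [i for i, r in enumerate(rewards) if r.std() > 0][:batch_size]
    return [candidates[i] for i in keep], rewards[keep]
```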
Am I understanding you correctly that with just |
Yep, that was my finding. But I didn't use a couple of strategies that could reduce the need for resampling (oversampling) or speed up inference (with vLLM). Do you have any suggestions on how to tackle this given my hardware limitations? I believe this result may also change depending on the base model: if the base model has a higher probability of generating a correct answer, the need for resampling is lower and training becomes more efficient as well. This Llama 3.2 1B model is very limited compared with the Qwen 32B they used.
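On the inference-speed side, the GRPO trainer can already offload generation to vLLM via `GRPOConfig`; a minimal toggle is sketched below (the output path is a placeholder, and whether vLLM fits in 24 GB of VRAM alongside training is an open question):

```python
from trl import GRPOConfig

# Use vLLM for generation to speed up the repeated (re)sampling passes.
training_args = GRPOConfig(
    output_dir="grpo-vllm",  # placeholder path
    use_vllm=True,
)
```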
What does this PR do?
This PR adds the option to enable dynamic sampling in the GRPO trainer. Dynamic sampling is described in the DAPO paper and consists of resampling completions for prompts whose rewards have zero variance. This makes the algorithm more efficient by avoiding training on samples that carry no learning signal. As stated in the DAPO paper, this procedure alone was able to increase their model's accuracy on AIME24 (avg@32) from 42% to 50%.
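For intuition, a minimal sketch of that filtering/resampling step is shown below. This is an illustration of the idea rather than the code in this PR; `generate_and_score` is a hypothetical helper that generates a group of completions for one prompt and returns their rewards.

```python
def dynamic_sampling_step(prompts, generate_and_score, max_num_samplings=10):
    """Keep only prompts whose group of rewards has non-zero variance,
    resampling each prompt up to `max_num_samplings` times.

    `generate_and_score` is a hypothetical helper that returns the rewards of
    one group of completions as a 1-D tensor of shape (num_generations,).
    """
    kept_prompts, kept_rewards = [], []
    for prompt in prompts:
        for _ in range(max_num_samplings):
            rewards = generate_and_score(prompt)
            if rewards.std() > 0:  # non-degenerate group -> usable gradient signal
                kept_prompts.append(prompt)
                kept_rewards.append(rewards)
                break
        # Prompts that never yield reward variance are dropped: their advantages
        # are all zero, so they contribute no learning signal.
    return kept_prompts, kept_rewards
```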
Specifically, this contribution adds two GRPOConfig options:
- `use_dynamic_sampling`: a boolean argument that enables dynamic sampling
- `max_num_samplings`: an integer argument that provides a maximum number of resamplings

This algorithm resamples completions from the model until a non-zero standard deviation is found in the computed rewards or until the maximum number of samplings is reached.
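Assuming the PR is merged with these option names, enabling the feature from user code would look roughly like the following; the model, dataset, output path, and reward function below are placeholders, not part of this PR.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-dynamic-sampling",  # placeholder path
    use_dynamic_sampling=True,           # new option added by this PR
    max_num_samplings=10,                # cap on resampling rounds, as in the experiment above
)

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B",     # placeholder model
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```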
The `GRPOConfig` docstring was also updated.

Fixes #3708 ([Feature Request] support dynamic sampling for GRPO trainer)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.