
Add Entropy Control to GRPOTrainer #3628


Open
wants to merge 43 commits into main

Conversation

@1485840691 (Contributor)

What does this PR do?

Fixes #3320

The initial step is to support static entropy control.
The next step is to support adaptive entropy control.
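For context, static entropy control amounts to subtracting a weighted entropy bonus from the policy loss. A minimal sketch of the idea (names and shapes are illustrative, not the PR's actual code):

import torch

def add_entropy_bonus(loss, logits, completion_mask, ent_coef):
    """Illustrative sketch: subtract a weighted mean token entropy from the loss."""
    # Per-token entropy of the policy distribution: H = -sum_v p(v) * log p(v).
    logps = torch.log_softmax(logits, dim=-1)
    token_entropy = -(logps.exp() * logps).sum(dim=-1)  # (batch, seq_len)
    # Average over completion tokens only.
    entropy = (token_entropy * completion_mask).sum() / completion_mask.sum()
    # A positive ent_coef rewards higher entropy, encouraging exploration.
    return loss - ent_coef * entropy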

Before submitting

  • [N] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline, Pull Request section?
  • [Y] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [Y] Did you make sure to update the documentation with your changes?
  • [N] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@1485840691 1485840691 marked this pull request as draft June 22, 2025 08:39
@LeonEricsson (Collaborator)

LeonEricsson commented Jun 22, 2025

Note that there is a parallel PR (#3563) working on entropy-based filtering; we're going to need to sync these.

@1485840691 1485840691 reopened this Jun 27, 2025
@1485840691 1485840691 marked this pull request as ready for review June 29, 2025 09:22
@1485840691 (Contributor Author)

@LeonEricsson @qgallouedec could you please help review the latest changes? Thanks

@1485840691 1485840691 requested a review from LeonEricsson July 24, 2025 06:58
@LeonEricsson (Collaborator) left a comment

Thanks for the work. A few comments on my end

@1485840691 (Contributor Author)

Thanks for the work. A few comments on my end

Thanks for your comments. Resolved

@1485840691 1485840691 closed this Jul 25, 2025
@1485840691 1485840691 requested a review from LeonEricsson July 25, 2025 16:36
@1485840691 1485840691 reopened this Jul 25, 2025
@LeonEricsson (Collaborator) left a comment

few more comments

@@ -676,6 +720,18 @@ def __init__(
            raise NotImplementedError(
                "Liger Kernels don't currently support masking token positions based on entropy."
            )
        # Entropy loss weight
        self.ent_coef = max(args.ent_coef, 0.0)
Collaborator

I think we can allow the user to set a negative weight if they choose to. I don’t see a specific use case for it, but I don't see the harm in allowing it

Contributor Author

What if the user sets a negative weight for it? Should we directly multiply the negative weight into the entropy loss?

Collaborator

yes

Static coefficient of the entropy regularization term in the loss.
A positive coefficient adds an entropy bonus to encourage exploration.
It is also used as the initial entropy coefficient when using adaptive entropy control.
use_adapt_entropy (`bool`, *optional*, defaults to `False`):
Collaborator

Sorry for pettiness, but can we do

Suggested change
- use_adapt_entropy (`bool`, *optional*, defaults to `False`):
+ use_adaptive_entropy (`bool`, *optional*, defaults to `False`):

Comment on lines +320 to +325
self.use_adapt_ent = use_adapt_ent
self.ent_coef = ent_coef
self.min_ent_coef = min_ent_coef
self.max_ent_coef = max_ent_coef
self.delta_ent_coef = delta_ent_coef
self.target_ent = target_ent
Collaborator

Change these everywhere the same way we did in the config, e.g. entropy_coef_min. Also use entropy instead of ent throughout.
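Put together, the suggested renames would make the new fields read roughly like the following sketch (defaults here are placeholders, not values from the PR):

from dataclasses import dataclass

@dataclass
class AdaptiveEntropyArgs:
    """Illustrative sketch of the renamed fields, not the PR's actual config class."""
    use_adaptive_entropy: bool = False
    entropy_coef: float = 0.0         # static / initial coefficient
    entropy_coef_min: float = 0.0     # lower bound for the adaptive coefficient (placeholder)
    entropy_coef_max: float = 0.01    # upper bound (placeholder)
    entropy_coef_delta: float = 1e-4  # per-step adjustment (placeholder)
    entropy_target: float = 0.2       # target policy entropy (placeholder)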

@LeonEricsson (Collaborator)

LeonEricsson commented Jul 28, 2025

While reviewing the updated entropy controller I noted the following issues, which I should have realized sooner; apologies for that.

  1. Hidden mutable state
    The class keeps an internal entropy coefficient that mutates every call. Because that state lives outside the trainer/optimizer stack, it’s easy to miss in tests, logs, or checkpoints and makes debugging non-deterministic behaviour harder.

  2. Distributed training
    Right now every rank updates the coefficient from its local entropy, so the values drift apart. That means different GPUs are optimising slightly different objectives. The paper intends for a single global coefficient.

I suggest moving ownership of the entropy coefficient to GRPOTrainer and making the entropy controller a pure strategy object that only holds the logic to step the coefficient: rename __call__ to step(), rename the class to EntropyScheduler, and use the global entropy to step the coefficient. Also broadcast the coefficient to all ranks. Something like this (reduce() and broadcast() are placeholders):

entropy_loss = agg_loss(...)

world_entropy = reduce(entropy_loss.detach(), reduction="mean")

if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )

broadcast(self.entropy_coef, src=0)

loss = loss - self.entropy_coef * entropy_loss
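A minimal sketch of what such a pure-strategy EntropyScheduler.step() could look like, assuming the adaptive rule discussed in #3320 (raise the coefficient when entropy falls below a target, lower it otherwise, clamped to a range); all field names here are placeholders:

from dataclasses import dataclass

@dataclass
class EntropyScheduler:
    """Stateless strategy: computes the next coefficient, owns no running state."""
    entropy_target: float       # desired policy entropy
    entropy_coef_delta: float   # step size of each adjustment
    entropy_coef_min: float     # lower clamp
    entropy_coef_max: float     # upper clamp

    def step(self, entropy_coef: float, entropy: float) -> float:
        # Entropy below target -> increase the bonus to push toward exploration;
        # entropy above target -> decrease it.
        if entropy < self.entropy_target:
            entropy_coef += self.entropy_coef_delta
        else:
            entropy_coef -= self.entropy_coef_delta
        return min(max(entropy_coef, self.entropy_coef_min), self.entropy_coef_max)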

@1485840691 (Contributor Author)


Yes, I also think it would be better to use a global scheduler that updates the entropy coefficient based on the global entropy loss gathered from all ranks. I took a look at the original Skywork code and think it might be using a per-rank scheduler to control the entropy coefficient. If you have time, could you please help confirm this?

The entropy loss applies the entropy coefficient: https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/workers/actor/dp_actor.py#L234

The entropy controller is defined inside the trainer:

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L391

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L1097C25-L1098C1

@LeonEricsson (Collaborator)

@qgallouedec, would appreciate your thoughts on dealing with the stateful entropy coefficient. To recap, Adaptive Entropy Control maintains the entropy coefficient $\alpha_k$ as an adaptive (or running) coefficient, which is incrementally updated on each optimizer step based on the batch's entropy. Is something like this sufficient for maintaining a global entropy coefficient?

entropy_loss = agg_loss(...)

world_entropy = reduce(entropy_loss.detach(), reduction="mean")

if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )

broadcast(self.entropy_coef)

loss = loss - self.entropy_coef * entropy_loss
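One way the reduce()/broadcast() placeholders could be filled in is with Accelerate's collectives (accelerate.utils.reduce and accelerate.utils.broadcast); this is a sketch under that assumption, not the PR's implementation, and all function and argument names are illustrative:

import torch
from accelerate.utils import broadcast, reduce

def update_global_entropy_coef(accelerator, scheduler, entropy_coef, entropy_loss):
    """Step the coefficient on the main process and share it with every rank.

    `scheduler` is assumed to expose step(coef, entropy) -> float, as in the
    EntropyScheduler sketch above.
    """
    # Average the detached batch entropy across processes so the update is
    # driven by one global statistic rather than per-rank values.
    world_entropy = reduce(entropy_loss.detach(), reduction="mean")

    coef = torch.tensor(entropy_coef, device=accelerator.device)
    if accelerator.is_main_process:
        coef.fill_(scheduler.step(entropy_coef, world_entropy.item()))

    # Every rank adopts the main-process value, keeping the objective identical
    # across GPUs.
    coef = broadcast(coef, from_process=0)
    return coef.item()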
