
Add Entropy Control to GRPOTrainer #3628


Open
wants to merge 43 commits into main

Conversation

@1485840691 (Contributor)

What does this PR do?

Fixes #3320

The initial step is to support static entropy control.
The next step is to support adaptive entropy control.
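For context, static entropy control amounts to subtracting a weighted entropy bonus from the policy loss. A minimal sketch of the idea (names and shapes are illustrative, not the PR's actual code):

import torch

def add_entropy_bonus(loss, logits, completion_mask, ent_coef):
    """Illustrative sketch: subtract a weighted mean token entropy from the loss."""
    # Per-token entropy of the policy distribution: H = -sum_v p(v) * log p(v).
    logps = torch.log_softmax(logits, dim=-1)
    token_entropy = -(logps.exp() * logps).sum(dim=-1)  # (batch, seq_len)
    # Average over completion tokens only.
    entropy = (token_entropy * completion_mask).sum() / completion_mask.sum()
    # A positive ent_coef rewards higher entropy, encouraging exploration.
    return loss - ent_coef * entropy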

Before submitting

  • [N] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [Y] Did you read the contributor guideline, Pull Request section?
  • [Y] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • [Y] Did you make sure to update the documentation with your changes?
  • [N] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@1485840691 1485840691 marked this pull request as draft June 22, 2025 08:39
@LeonEricsson (Collaborator)

LeonEricsson commented Jun 22, 2025

Note that there is a parallel PR (#3563) working on entropy-based filtering; we're going to need to sync these.

@1485840691 1485840691 reopened this Jun 27, 2025
@1485840691 1485840691 marked this pull request as ready for review June 29, 2025 09:22
@1485840691 (Contributor Author)

@LeonEricsson @qgallouedec could you please help review the latest changes? Thanks

@1485840691 1485840691 requested a review from LeonEricsson July 24, 2025 06:58
@LeonEricsson (Collaborator) left a comment

Thanks for the work. A few comments on my end

@1485840691 (Contributor Author)

Thanks for the work. A few comments on my end

Thanks for your comments. Resolved

@1485840691 1485840691 closed this Jul 25, 2025
@1485840691 1485840691 requested a review from LeonEricsson July 25, 2025 16:36
@1485840691 1485840691 reopened this Jul 25, 2025
@LeonEricsson (Collaborator) left a comment

few more comments

@@ -676,6 +720,18 @@ def __init__(
            raise NotImplementedError(
                "Liger Kernels don't currently support masking token positions based on entropy."
            )
        # Entropy loss weight
        self.ent_coef = max(args.ent_coef, 0.0)
Collaborator

I think we can allow the user to set a negative weight if they choose to. I don’t see a specific use case for it, but I don't see the harm in allowing it

Contributor Author

What if the user sets a negative weight for it? Should we directly multiply the negative weight into the entropy loss?

Collaborator

yes

Static coefficient of the entropy regularization term in the loss.
A positive coefficient adds an entropy bonus to encourage exploration.
It is also used as the initial entropy coefficient when using adaptive entropy control.
use_adapt_entropy (`bool`, *optional*, defaults to `False`):
Collaborator

Sorry for pettiness, but can we do

Suggested change
- use_adapt_entropy (`bool`, *optional*, defaults to `False`):
+ use_adaptive_entropy (`bool`, *optional*, defaults to `False`):

Comment on lines +320 to +325
self.use_adapt_ent = use_adapt_ent
self.ent_coef = ent_coef
self.min_ent_coef = min_ent_coef
self.max_ent_coef = max_ent_coef
self.delta_ent_coef = delta_ent_coef
self.target_ent = target_ent
Collaborator

Change these everywhere the same way we did in the config, e.g. entropy_coef_min. Also use entropy instead of ent throughout.
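Put together, the suggested renames would make the new fields read roughly like the following sketch (defaults here are placeholders, not values from the PR):

from dataclasses import dataclass

@dataclass
class AdaptiveEntropyArgs:
    """Illustrative sketch of the renamed fields, not the PR's actual config class."""
    use_adaptive_entropy: bool = False
    entropy_coef: float = 0.0         # static / initial coefficient
    entropy_coef_min: float = 0.0     # lower bound for the adaptive coefficient (placeholder)
    entropy_coef_max: float = 0.01    # upper bound (placeholder)
    entropy_coef_delta: float = 1e-4  # per-step adjustment (placeholder)
    entropy_target: float = 0.2       # target policy entropy (placeholder)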

@LeonEricsson (Collaborator)

LeonEricsson commented Jul 28, 2025

While reviewing the updated entropy controller I noted the following issues, which I should have realized sooner; apologies for that.

  1. Hidden mutable state
    The class keeps an internal entropy coefficient that mutates every call. Because that state lives outside the trainer/optimizer stack, it’s easy to miss in tests, logs, or checkpoints and makes debugging non-deterministic behaviour harder.

  2. Distributed training
    Right now every rank updates the coefficient from its local entropy, so the values drift apart. That means different GPUs are optimising slightly different objectives. The paper intends for a single global coefficient.

I suggest moving ownership of the entropy coefficient to GRPOTrainer and making the entropy controller a pure strategy object that only holds the logic to step the coefficient: rename __call__ to step(), rename the class to EntropyScheduler, and use the global entropy to step the coefficient. Also broadcast the coefficient to all ranks. Something like this (reduce() and broadcast() are placeholders):

entropy_loss = agg_loss(...)

world_entropy = reduce(entropy_loss.detach(), reduction="mean")

if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )

broadcast(self.entropy_coef, src=0)

loss = loss - self.entropy_coef * entropy_loss
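A minimal sketch of what such a pure-strategy EntropyScheduler.step() could look like, assuming the adaptive rule discussed in #3320 (raise the coefficient when entropy falls below a target, lower it otherwise, clamped to a range); all field names here are placeholders:

from dataclasses import dataclass

@dataclass
class EntropyScheduler:
    """Stateless strategy: computes the next coefficient, owns no running state."""
    entropy_target: float       # desired policy entropy
    entropy_coef_delta: float   # step size of each adjustment
    entropy_coef_min: float     # lower clamp
    entropy_coef_max: float     # upper clamp

    def step(self, entropy_coef: float, entropy: float) -> float:
        # Entropy below target -> increase the bonus to push toward exploration;
        # entropy above target -> decrease it.
        if entropy < self.entropy_target:
            entropy_coef += self.entropy_coef_delta
        else:
            entropy_coef -= self.entropy_coef_delta
        return min(max(entropy_coef, self.entropy_coef_min), self.entropy_coef_max)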

@1485840691 (Contributor Author)


Yes, I also think it would be better to use a global scheduler that updates the entropy coefficient based on the global entropy loss gathered from all ranks. I took a look at the original Skywork code and think it might be using a per-rank scheduler to control the entropy coefficient. If you have time, could you please help confirm this?

The entropy loss applies the entropy coefficient: https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/workers/actor/dp_actor.py#L234

The entropy controller is defined inside the trainer:

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L391

https://github.com/SkyworkAI/Skywork-OR1/blob/64e96afa213ae89d0ad21932106d3b8aafe9ace2/verl/trainer/ppo/ray_trainer.py#L1097C25-L1098C1

@LeonEricsson (Collaborator)

@qgallouedec, would appreciate your thoughts on dealing with the stateful entropy coefficient. To recap, Adaptive Entropy Control maintains the entropy coefficient $\alpha_k$ as an adaptive (or running) coefficient, which is incrementally updated on each optimizer step based on the batch's entropy. Is something like this sufficient for maintaining a global entropy coefficient?

entropy_loss = agg_loss(...)

world_entropy = reduce(entropy_loss.detach(), reduction="mean")

if self.accelerator.is_main_process:
    self.entropy_coef = self.entropy_scheduler.step(
        self.entropy_coef, world_entropy
    )

broadcast(self.entropy_coef)

loss = loss - self.entropy_coef * entropy_loss
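One way the reduce()/broadcast() placeholders could be filled in is with Accelerate's collectives (accelerate.utils.reduce and accelerate.utils.broadcast); this is a sketch under that assumption, not the PR's implementation, and all function and argument names are illustrative:

import torch
from accelerate.utils import broadcast, reduce

def update_global_entropy_coef(accelerator, scheduler, entropy_coef, entropy_loss):
    """Step the coefficient on the main process and share it with every rank.

    `scheduler` is assumed to expose step(coef, entropy) -> float, as in the
    EntropyScheduler sketch above.
    """
    # Average the detached batch entropy across processes so the update is
    # driven by one global statistic rather than per-rank values.
    world_entropy = reduce(entropy_loss.detach(), reduction="mean")

    coef = torch.tensor(entropy_coef, device=accelerator.device)
    if accelerator.is_main_process:
        coef.fill_(scheduler.step(entropy_coef, world_entropy.item()))

    # Every rank adopts the main-process value, keeping the objective identical
    # across GPUs.
    coef = broadcast(coef, from_process=0)
    return coef.item()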
