
Conversation

Jintao-Huang (Collaborator)

No description provided.

gemini-code-assist bot (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Megatron-SWIFT framework by integrating support for Reward Model (RM) training. It expands the available RLHF training types to include RM, introduces a new parameter for controlling reward centering, and provides a dedicated trainer class for this purpose. The accompanying documentation has been thoroughly updated to guide users through the new RM training options.

Highlights

  • Reward Model (RM) Support: Added 'rm' as a new supported rlhf_type option, enabling the training of Reward Models within the framework.
  • New Configuration Parameter: Introduced center_rewards_coefficient, a new optional float parameter specifically for RM training, designed to incentivize the model to output rewards with a zero mean (a brief sketch of this term follows the list below).
  • Dedicated Reward Trainer: A new MegatronRewardTrainer class has been added, inheriting from MegatronRLHFTrainer, to encapsulate the specific training logic for Reward Models. It currently requires padding_free to be enabled.
  • Documentation Updates: Updated both Chinese and English documentation files to reflect the new 'rm' rlhf_type and the center_rewards_coefficient parameter, ensuring users are aware of the new capabilities and configurations.
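To make the centering term concrete, here is a minimal, hedged sketch of how such a coefficient is typically applied, following the convention used by TRL's RewardTrainer; the exact formulation added in this PR is not reproduced here. The pairwise loss receives an auxiliary penalty proportional to the squared mean of the rewards:

```python
import torch
import torch.nn.functional as F


def reward_loss(rewards_chosen, rewards_rejected, center_rewards_coefficient=None):
    """Pairwise reward-model loss with an optional centering penalty.

    Sketch only: mirrors the common TRL-style formulation, not necessarily
    the exact code introduced by this pull request.
    """
    loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    if center_rewards_coefficient is not None:
        # Penalize rewards whose mean drifts away from zero.
        loss = loss + center_rewards_coefficient * torch.mean(
            (rewards_chosen + rewards_rejected) ** 2)
    return loss
```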

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Reward Model (RM) training to Megatron-SWIFT. The changes include introducing new arguments, updating documentation in both English and Chinese, and adding a new MegatronRewardTrainer. My review has identified a critical bug that will cause a crash when using the new 'rm' rlhf_type, and another critical issue where the new trainer has unimplemented placeholder methods. I've also noted several medium-severity issues, including a potential unhandled error case, documentation typos, and unused imports that affect code quality.

Comment on lines 23 to 27:

```python
def loss_func(self, output_tensor):
    pass

def forward_step(self, data_iterator, model):
    pass
```

critical

The loss_func and forward_step methods are not implemented; they just contain pass. This will lead to incorrect behavior at runtime, as no loss will be computed and no forward step logic will be executed. These methods must be implemented for the reward model trainer to be functional.
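For reference, a minimal sketch of how these two placeholders could be filled in for pairwise reward-model training is shown below. It is an illustration under stated assumptions (a Bradley-Terry pairwise loss, chosen/rejected rewards packed along the batch dimension, and a Megatron-style forward_step that returns the output tensor plus a loss callable), not the implementation that eventually landed in the PR:

```python
from functools import partial

from torch import nn


class RewardTrainerSketch:
    """Hypothetical stand-in for MegatronRewardTrainer, illustration only."""

    def loss_func(self, output_tensor, margin=None):
        # Assumes chosen rewards are stacked before rejected rewards along the
        # batch dimension (an assumption, not confirmed by the PR).
        rewards_chosen, rewards_rejected = output_tensor.chunk(2, dim=0)
        reward_diff = rewards_chosen - rewards_rejected
        if margin is not None:
            reward_diff = reward_diff - margin
        loss = -nn.functional.logsigmoid(reward_diff).mean()
        # Detach the logged copy so the metrics dict does not hold the graph.
        return loss, {'loss': loss.detach().clone()}

    def forward_step(self, data_iterator, model):
        data = next(data_iterator)
        margin = data.pop('margin', None)
        output_tensor = model(**data)
        # Megatron-style contract: return the model output together with a
        # callable that computes the loss from it on the last pipeline stage.
        return output_tensor, partial(self.loss_func, margin=margin)
```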


medium

The punctuation in the list of rlhf_type options is inconsistent. It uses both a full-width comma and a half-width comma. For consistency, it's better to use half-width commas for all separators in the list.

Suggested change:

```diff
-- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo'，'kto', 'rm'。
+- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo', 'kto', 'rm'。
```

Comment on the English documentation diff:

```diff
 In addition to inheriting the training parameters, the following parameters are also supported:

-- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo' and 'kto' are available.
+- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', kto', and 'rm' are available.
```

medium

There is a typo in the list of available rlhf_type options. The option 'kto' is missing its leading single quote.

Suggested change:

```diff
-- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', kto', and 'rm' are available.
+- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'kto', and 'rm' are available.
```

Comment on lines 19 to 20:

```python
trainer_mapping = {'dpo': MegatronDPOTrainer, 'kto': MegatronKTOTrainer, 'rm': MegatronRewardTrainer}
trainer_cls = trainer_mapping[args.rlhf_type]
```

medium

While using a dictionary for mapping is a good refactoring, direct key access trainer_mapping[args.rlhf_type] will raise a KeyError if an unsupported rlhf_type is provided. The original implementation raised a more informative ValueError. It's better to use .get() and provide a clear error message for unhandled types.

Suggested change:

```diff
-trainer_mapping = {'dpo': MegatronDPOTrainer, 'kto': MegatronKTOTrainer, 'rm': MegatronRewardTrainer}
-trainer_cls = trainer_mapping[args.rlhf_type]
+trainer_mapping = {'dpo': MegatronDPOTrainer, 'kto': MegatronKTOTrainer, 'rm': MegatronRewardTrainer}
+trainer_cls = trainer_mapping.get(args.rlhf_type)
+if trainer_cls is None:
+    raise ValueError(f'The current Megatron-SWIFT does not support rlhf_type: {args.rlhf_type}.')
```
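As a quick illustration of the difference (hypothetical values, not code from the PR), direct indexing surfaces only the missing key, while the explicit check can name the supported options:

```python
# Illustration only: direct indexing vs. an explicit membership check.
trainer_mapping = {'dpo': 'MegatronDPOTrainer', 'kto': 'MegatronKTOTrainer', 'rm': 'MegatronRewardTrainer'}

try:
    trainer_mapping['grpo']
except KeyError as e:
    print('Direct indexing:', repr(e))  # KeyError('grpo') -- no hint about valid values

trainer_cls = trainer_mapping.get('grpo')
if trainer_cls is None:
    supported = ', '.join(trainer_mapping)
    print(f"Explicit check: unsupported rlhf_type 'grpo'. Supported: {supported}.")
```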

Comment on lines 2 to 11:

```python
from collections import namedtuple
from functools import partial
from typing import Literal

import torch
from megatron.core import mpu
from megatron.training import get_args, get_timers
from trl import KTOTrainer

from swift.utils import get_current_device, get_logger
```

medium

There are several unused imports in this file: namedtuple, partial, Literal, torch, mpu, get_args, get_timers, KTOTrainer, and get_current_device. Please remove them to improve code cleanliness.

@modelscope modelscope deleted a comment from gemini-code-assist bot Oct 12, 2025
@Jintao-Huang (Collaborator, Author):

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Reward Model (RM) training to Megatron-SWIFT. The changes are comprehensive, including updates to documentation in both English and Chinese, modifications to argument parsing to include RM-specific parameters like center_rewards_coefficient, and the addition of a new MegatronRewardTrainer. The implementation of the reward trainer and the refactoring of existing code to accommodate the new rm rlhf_type are well done. I have a few suggestions to improve code clarity and fix a potential bug.

Comment on lines +15 to +19:

```python
def __post_init__(self):
    if self.rlhf_type == 'rm':
        self.task_type = 'seq_cls'
        self.num_labels = 1
    super().__post_init__()
```

critical

While this __post_init__ logic correctly handles the 'rm' case, the rlhf_type argument for this class on line 10 is defined as Literal['dpo', 'kto']. This will prevent the 'rm' option from being used at all, as it will fail during argument parsing. Please update the Literal on line 10 to include 'rm' to make this new logic reachable.
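A minimal sketch of the fix being requested is shown below; the class and surrounding fields are illustrative placeholders, and only the Literal values quoted in the review are taken from the PR:

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class RLHFMegatronArgumentsSketch:
    # Widening the Literal makes the 'rm' branch in __post_init__ reachable.
    rlhf_type: Literal['dpo', 'kto', 'rm'] = 'dpo'
    # RM-specific knob introduced by this PR; None disables reward centering.
    center_rewards_coefficient: Optional[float] = None

    def __post_init__(self):
        if self.rlhf_type == 'rm':
            self.task_type = 'seq_cls'
            self.num_labels = 1
```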


medium

There's a mix of full-width (，) and half-width (,) commas in the list of options. For consistency, it's better to use only half-width commas within the code-formatted text to align with common coding style.

Suggested change:

```diff
-- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo'，'kto', 'rm'。
+- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo', 'kto', 'rm'。
```

Comment on lines +27 to +30:

```python
if margin is not None:
    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
else:
    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```

medium

The loss calculation logic is slightly duplicated for cases with and without a margin. You can simplify this by calculating the reward_diff first and then optionally subtracting the margin. This refactoring improves readability and reduces code duplication.

Suggested change:

```diff
-if margin is not None:
-    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
-else:
-    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
+reward_diff = rewards_chosen - rewards_rejected
+if margin is not None:
+    reward_diff = reward_diff - margin
+loss = -nn.functional.logsigmoid(reward_diff).mean()
```
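The two formulations are numerically identical when a margin is supplied; a quick standalone check (not part of the PR, values chosen arbitrarily) confirms it:

```python
import torch
from torch import nn

rewards_chosen = torch.tensor([1.2, 0.3])
rewards_rejected = torch.tensor([0.4, 0.9])
margin = torch.tensor([0.1, 0.1])

# Original branchy version vs. the refactored single-path version.
original = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
reward_diff = rewards_chosen - rewards_rejected - margin
refactored = -nn.functional.logsigmoid(reward_diff).mean()

assert torch.allclose(original, refactored)
```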

@Jintao-Huang (Collaborator, Author):

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Reward Model (RM) training within the Megatron-SWIFT framework. This includes adding new arguments, a new trainer (MegatronRewardTrainer), example scripts, and updating documentation. The changes are well-structured and follow existing patterns in the codebase. I have one suggestion to improve the metric collection logic for consistency and to prevent potential memory issues.

Comment on lines 43 to 44:

```python
if self.args.center_rewards_coefficient is not None:
    metric['center_rewards_loss'] = center_rewards_loss
```

medium

The center_rewards_loss tensor is added to the metric dictionary without being detached. This is inconsistent with how other metrics like loss are handled (loss.detach().clone()). While this might not cause an immediate issue if the downstream processing handles it, it's safer and better practice to detach tensors intended for logging to prevent them from being held in the computation graph, which could lead to increased memory usage.

Suggested change:

```diff
 if self.args.center_rewards_coefficient is not None:
-    metric['center_rewards_loss'] = center_rewards_loss
+    metric['center_rewards_loss'] = center_rewards_loss.detach().clone()
```
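A small standalone illustration of the point (not code from the PR): a tensor stored in a metrics dict without detaching still carries its autograd graph, while a detached clone is a plain value.

```python
import torch

x = torch.randn(4, requires_grad=True)
loss = (x ** 2).mean()

metric_attached = loss                    # still part of the graph; keeps it alive
metric_detached = loss.detach().clone()   # plain value, safe to stash for logging

print(metric_attached.requires_grad)  # True
print(metric_detached.requires_grad)  # False
```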


Development

Successfully merging this pull request may close these issues.

请问megatron swift支持reward model的训练吗，有计划支持吗 (Does Megatron-SWIFT support reward model training? Is there a plan to support it?)
