
Conversation

Jintao-Huang (Collaborator)

No description provided.

gemini-code-assist bot (Contributor)

Summary of Changes

Hello @Jintao-Huang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Megatron-SWIFT framework by integrating support for Reward Model (RM) training. It expands the available RLHF training types to include RM, introduces a new parameter for controlling reward centering, and provides a dedicated trainer class for this purpose. The accompanying documentation has been thoroughly updated to guide users through the new RM training options.

Highlights

  • Reward Model (RM) Support: Added 'rm' as a new supported rlhf_type option, enabling the training of Reward Models within the framework.
  • New Configuration Parameter: Introduced center_rewards_coefficient, a new optional float parameter specifically for RM training, designed to incentivize the model to output rewards with a zero mean (a brief sketch of this term follows the list below).
  • Dedicated Reward Trainer: A new MegatronRewardTrainer class has been added, inheriting from MegatronRLHFTrainer, to encapsulate the specific training logic for Reward Models. It currently requires padding_free to be enabled.
  • Documentation Updates: Updated both Chinese and English documentation files to reflect the new 'rm' rlhf_type and the center_rewards_coefficient parameter, ensuring users are aware of the new capabilities and configurations.
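To make the centering term concrete, here is a minimal, hedged sketch of how such a coefficient is typically applied, following the convention used by TRL's RewardTrainer; the exact formulation added in this PR is not reproduced here. The pairwise loss receives an auxiliary penalty proportional to the squared mean of the rewards:

```python
import torch
import torch.nn.functional as F


def reward_loss(rewards_chosen, rewards_rejected, center_rewards_coefficient=None):
    """Pairwise reward-model loss with an optional centering penalty.

    Sketch only: mirrors the common TRL-style formulation, not necessarily
    the exact code introduced by this pull request.
    """
    loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    if center_rewards_coefficient is not None:
        # Penalize rewards whose mean drifts away from zero.
        loss = loss + center_rewards_coefficient * torch.mean(
            (rewards_chosen + rewards_rejected) ** 2)
    return loss
```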

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Reward Model (RM) training to Megatron-SWIFT. The changes include introducing new arguments, updating documentation in both English and Chinese, and adding a new MegatronRewardTrainer. My review has identified a critical bug that will cause a crash when using the new 'rm' rlhf_type, and another critical issue where the new trainer has unimplemented placeholder methods. I've also noted several medium-severity issues, including a potential unhandled error case, documentation typos, and unused imports that affect code quality.

Comment on lines 23 to 27:

```python
def loss_func(self, output_tensor):
    pass

def forward_step(self, data_iterator, model):
    pass
```

critical

The loss_func and forward_step methods are not implemented; they just contain pass. This will lead to incorrect behavior at runtime, as no loss will be computed and no forward step logic will be executed. These methods must be implemented for the reward model trainer to be functional.
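For reference, a minimal sketch of how these two placeholders could be filled in for pairwise reward-model training is shown below. It is an illustration under stated assumptions (a Bradley-Terry pairwise loss, chosen/rejected rewards packed along the batch dimension, and a Megatron-style forward_step that returns the output tensor plus a loss callable), not the implementation that eventually landed in the PR:

```python
from functools import partial

from torch import nn


class RewardTrainerSketch:
    """Hypothetical stand-in for MegatronRewardTrainer, illustration only."""

    def loss_func(self, output_tensor, margin=None):
        # Assumes chosen rewards are stacked before rejected rewards along the
        # batch dimension (an assumption, not confirmed by the PR).
        rewards_chosen, rewards_rejected = output_tensor.chunk(2, dim=0)
        reward_diff = rewards_chosen - rewards_rejected
        if margin is not None:
            reward_diff = reward_diff - margin
        loss = -nn.functional.logsigmoid(reward_diff).mean()
        # Detach the logged copy so the metrics dict does not hold the graph.
        return loss, {'loss': loss.detach().clone()}

    def forward_step(self, data_iterator, model):
        data = next(data_iterator)
        margin = data.pop('margin', None)
        output_tensor = model(**data)
        # Megatron-style contract: return the model output together with a
        # callable that computes the loss from it on the last pipeline stage.
        return output_tensor, partial(self.loss_func, margin=margin)
```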


medium

The punctuation in the list of rlhf_type options is inconsistent. It uses both a full-width comma and a half-width comma. For consistency, it's better to use half-width commas for all separators in the list.

Suggested change:

```diff
-- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo'，'kto', 'rm'。
+- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo', 'kto', 'rm'。
```

Comment on the English documentation diff:

```diff
 In addition to inheriting the training parameters, the following parameters are also supported:

-- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo' and 'kto' are available.
+- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', kto', and 'rm' are available.
```

medium

There is a typo in the list of available rlhf_type options. The option 'kto' is missing its leading single quote.

Suggested change:

```diff
-- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', kto', and 'rm' are available.
+- 🔥rlhf_type: Default is 'dpo'. Currently, 'dpo', 'kto', and 'rm' are available.
```

Comment on lines 19 to 20:

```python
trainer_mapping = {'dpo': MegatronDPOTrainer, 'kto': MegatronKTOTrainer, 'rm': MegatronRewardTrainer}
trainer_cls = trainer_mapping[args.rlhf_type]
```

medium

While using a dictionary for mapping is a good refactoring, direct key access trainer_mapping[args.rlhf_type] will raise a KeyError if an unsupported rlhf_type is provided. The original implementation raised a more informative ValueError. It's better to use .get() and provide a clear error message for unhandled types.

Suggested change:

```diff
-trainer_mapping = {'dpo': MegatronDPOTrainer, 'kto': MegatronKTOTrainer, 'rm': MegatronRewardTrainer}
-trainer_cls = trainer_mapping[args.rlhf_type]
+trainer_mapping = {'dpo': MegatronDPOTrainer, 'kto': MegatronKTOTrainer, 'rm': MegatronRewardTrainer}
+trainer_cls = trainer_mapping.get(args.rlhf_type)
+if trainer_cls is None:
+    raise ValueError(f'The current Megatron-SWIFT does not support rlhf_type: {args.rlhf_type}.')
```
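As a quick illustration of the difference (hypothetical values, not code from the PR), direct indexing surfaces only the missing key, while the explicit check can name the supported options:

```python
# Illustration only: direct indexing vs. an explicit membership check.
trainer_mapping = {'dpo': 'MegatronDPOTrainer', 'kto': 'MegatronKTOTrainer', 'rm': 'MegatronRewardTrainer'}

try:
    trainer_mapping['grpo']
except KeyError as e:
    print('Direct indexing:', repr(e))  # KeyError('grpo') -- no hint about valid values

trainer_cls = trainer_mapping.get('grpo')
if trainer_cls is None:
    supported = ', '.join(trainer_mapping)
    print(f"Explicit check: unsupported rlhf_type 'grpo'. Supported: {supported}.")
```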

Comment on lines 2 to 11:

```python
from collections import namedtuple
from functools import partial
from typing import Literal

import torch
from megatron.core import mpu
from megatron.training import get_args, get_timers
from trl import KTOTrainer

from swift.utils import get_current_device, get_logger
```

medium

There are several unused imports in this file: namedtuple, partial, Literal, torch, mpu, get_args, get_timers, KTOTrainer, and get_current_device. Please remove them to improve code cleanliness.

@modelscope modelscope deleted a comment from gemini-code-assist bot Oct 12, 2025
@Jintao-Huang (Collaborator, Author):

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Reward Model (RM) training to Megatron-SWIFT. The changes are comprehensive, including updates to documentation in both English and Chinese, modifications to argument parsing to include RM-specific parameters like center_rewards_coefficient, and the addition of a new MegatronRewardTrainer. The implementation of the reward trainer and the refactoring of existing code to accommodate the new rm rlhf_type are well done. I have a few suggestions to improve code clarity and fix a potential bug.

Comment on lines +15 to +19:

```python
def __post_init__(self):
    if self.rlhf_type == 'rm':
        self.task_type = 'seq_cls'
        self.num_labels = 1
    super().__post_init__()
```

critical

While this __post_init__ logic correctly handles the 'rm' case, the rlhf_type argument for this class on line 10 is defined as Literal['dpo', 'kto']. This will prevent the 'rm' option from being used at all, as it will fail during argument parsing. Please update the Literal on line 10 to include 'rm' to make this new logic reachable.
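A minimal sketch of the fix being requested is shown below; the class and surrounding fields are illustrative placeholders, and only the Literal values quoted in the review are taken from the PR:

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class RLHFMegatronArgumentsSketch:
    # Widening the Literal makes the 'rm' branch in __post_init__ reachable.
    rlhf_type: Literal['dpo', 'kto', 'rm'] = 'dpo'
    # RM-specific knob introduced by this PR; None disables reward centering.
    center_rewards_coefficient: Optional[float] = None

    def __post_init__(self):
        if self.rlhf_type == 'rm':
            self.task_type = 'seq_cls'
            self.num_labels = 1
```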


medium

There's a mix of full-width (，) and half-width (,) commas in the list of options. For consistency, it's better to use only half-width commas within the code-formatted text to align with common coding style.

Suggested change:

```diff
-- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo'，'kto', 'rm'。
+- 🔥rlhf_type: 默认为'dpo'。目前可选择为'dpo', 'kto', 'rm'。
```

Comment on lines +27 to +30:

```python
if margin is not None:
    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
else:
    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```

medium

The loss calculation logic is slightly duplicated for cases with and without a margin. You can simplify this by calculating the reward_diff first and then optionally subtracting the margin. This refactoring improves readability and reduces code duplication.

Suggested change:

```diff
-if margin is not None:
-    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
-else:
-    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
+reward_diff = rewards_chosen - rewards_rejected
+if margin is not None:
+    reward_diff = reward_diff - margin
+loss = -nn.functional.logsigmoid(reward_diff).mean()
```
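The two formulations are numerically identical when a margin is supplied; a quick standalone check (not part of the PR, values chosen arbitrarily) confirms it:

```python
import torch
from torch import nn

rewards_chosen = torch.tensor([1.2, 0.3])
rewards_rejected = torch.tensor([0.4, 0.9])
margin = torch.tensor([0.1, 0.1])

# Original branchy version vs. the refactored single-path version.
original = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
reward_diff = rewards_chosen - rewards_rejected - margin
refactored = -nn.functional.logsigmoid(reward_diff).mean()

assert torch.allclose(original, refactored)
```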

@Jintao-Huang (Collaborator, Author):

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Reward Model (RM) training within the Megatron-SWIFT framework. This includes adding new arguments, a new trainer (MegatronRewardTrainer), example scripts, and updating documentation. The changes are well-structured and follow existing patterns in the codebase. I have one suggestion to improve the metric collection logic for consistency and to prevent potential memory issues.

Comment on lines 43 to 44:

```python
if self.args.center_rewards_coefficient is not None:
    metric['center_rewards_loss'] = center_rewards_loss
```

medium

The center_rewards_loss tensor is added to the metric dictionary without being detached. This is inconsistent with how other metrics like loss are handled (loss.detach().clone()). While this might not cause an immediate issue if the downstream processing handles it, it's safer and better practice to detach tensors intended for logging to prevent them from being held in the computation graph, which could lead to increased memory usage.

Suggested change:

```diff
 if self.args.center_rewards_coefficient is not None:
-    metric['center_rewards_loss'] = center_rewards_loss
+    metric['center_rewards_loss'] = center_rewards_loss.detach().clone()
```
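A small standalone illustration of the point (not code from the PR): a tensor stored in a metrics dict without detaching still carries its autograd graph, while a detached clone is a plain value.

```python
import torch

x = torch.randn(4, requires_grad=True)
loss = (x ** 2).mean()

metric_attached = loss                    # still part of the graph; keeps it alive
metric_detached = loss.detach().clone()   # plain value, safe to stash for logging

print(metric_attached.requires_grad)  # True
print(metric_detached.requires_grad)  # False
```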


Development

Successfully merging this pull request may close these issues.

请问megatron swift支持reward model的训练吗，有计划支持吗 (Does Megatron-SWIFT support reward model training? Is there a plan to support it?)
