2 changes: 1 addition & 1 deletion docs/source/BestPractices/Qwen3最佳实践.md
@@ -328,7 +328,7 @@ swift rlhf \

Best-practice reference for single-node 8xH20 LoRA training of Qwen3-235B-A22B-Instruct-250718: [https://github.com/modelscope/ms-swift/pull/5033](https://github.com/modelscope/ms-swift/pull/5033)

ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/KTO for large models. Supported models can be found in the [supported models documentation](../Instruction/支持的模型和数据集.md).
ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/KTO/RM for large models. Supported models can be found in the [supported models documentation](../Instruction/支持的模型和数据集.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT training documentation](../Megatron-SWIFT/快速开始.md).

1 change: 1 addition & 0 deletions docs/source/Customization/自定义数据集.md
@@ -188,6 +188,7 @@ alpaca format:
- videos: video, videos.
- audios: audio, audios.
- If you need to pass base64-encoded data instead of file paths, use the following form: `"videos": ['data:video/mp4;base64,{base64_encoded}']`, `"images": ['data:image/jpg;base64,{base64_encoded}']`.
- If you want to pass video frames directly instead of a video file, you can use the following format (requires `ms-swift>=3.8.3`): `"videos": [["/xxx/x.png", "/xxx/y.png"], ["/xxx/a.png", "/xxx/b.png", "/xxx/c.png"]]`. Only some models support this format, including Qwen2/2.5/3-VL, Qwen2.5/3-Omni, and their derivative models (a sample sketch follows below).

The data format for RLHF and sequence classification with multimodal models can follow the plain-text LLM format, with fields such as `images` added on top.
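
For concreteness, a hypothetical JSONL sample is sketched below. It assumes the standard `messages` layout used throughout this document; the paths, prompts, and the `rejected_response` field (for RLHF-style preference data) are illustrative placeholders rather than content verified against a specific release.

```bash
# Minimal sketch (hypothetical content): the first sample passes video frames
# directly (requires ms-swift>=3.8.3); the second is an RLHF-style sample with images.
cat > demo.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "<video>What happens in this clip?"}, {"role": "assistant", "content": "A cat jumps onto the table."}], "videos": [["/xxx/x.png", "/xxx/y.png", "/xxx/z.png"]]}
{"messages": [{"role": "user", "content": "<image>Describe the image."}, {"role": "assistant", "content": "A red bicycle leaning against a wall."}], "rejected_response": "A blue car parked outside.", "images": ["/xxx/a.png"]}
EOF
```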

2 changes: 1 addition & 1 deletion docs/source/GetStarted/快速开始.md
@@ -10,7 +10,7 @@ ms-swift is a training and deployment framework for large models and multimodal large models provided by the ModelScope community
- Quantization training: supports training quantized models such as BNB, AWQ, GPTQ, AQLM, HQQ, and EETQ.
- 🍊 RLHF training: supports human-alignment training methods such as DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, and ORPO for both plain-text and multimodal large models.
- 🍓 Multimodal training: supports training models across image, video, and audio modalities, covering VQA, captioning, OCR, and grounding tasks.
- 🥥 Megatron parallelism: supports accelerating CPT/SFT/DPO/KTO with Megatron parallelism techniques; currently supports 200+ large language models.
- 🥥 Megatron parallelism: supports accelerating CPT/SFT/DPO/KTO/RM with Megatron parallelism techniques; currently supports 200+ large language models.
- UI-based training: provides training, inference, evaluation, and quantization through a web interface, covering the full workflow for large models.
- Plugins and extensions: supports custom models and datasets, as well as customization of components such as loss, metric, trainer, loss-scale, callback, and optimizer.
- 🍉 Toolbox capabilities: beyond training large models and multimodal large models, also covers the full pipeline of inference, evaluation, quantization, and deployment.
1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -467,6 +467,7 @@ RLHF parameters inherit from the [training parameters](#训练参数).
- simpo_gamma: the reward margin term in the SimPO algorithm. The paper recommends 0.5-1.5; default is `1.`.
- desirable_weight: in the KTO algorithm, compensates for the imbalance between the numbers of desirable and undesirable samples by weighting the desirable loss with this coefficient. Default is `1.`.
- undesirable_weight: in the KTO algorithm, compensates for the imbalance between the numbers of desirable and undesirable samples by weighting the undesirable loss with this coefficient. Default is `1.`.
- center_rewards_coefficient: used in RM training. A coefficient that incentivizes the reward model to output rewards with zero mean; see this [paper](https://huggingface.co/papers/2312.09244) for details. Recommended value: 0.01.
- loss_scale: overrides the template parameter. Defaults to 'last_round' for RLHF training.
- temperature: default is 0.9; used in PPO, GRPO, and GKD.
- lmbda: default is 0.5. Used in GKD. The lambda parameter controlling the proportion of student data (i.e., the fraction of on-policy, student-generated outputs). If lmbda is 0, no student-generated data is used.
12 changes: 8 additions & 4 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -109,7 +109,7 @@
- 🔥overlap_param_gather: enables overlapping of the parameter all-gather in the distributed optimizer (reduces DP communication time). Default is False.
- distributed_timeout_minutes: timeout for torch.distributed in minutes. This parameter has no effect; use ddp_timeout from the [basic parameters](../Instruction/命令行参数.md#基本参数) instead. Default is 300000 minutes.
- num_layers_per_virtual_pipeline_stage: number of layers per virtual pipeline stage. Default is None. Either this parameter or `--num_virtual_stages_per_pipeline_rank` can be used to configure VPP parallelism.
- num_virtual_stages_per_pipeline_rank: number of virtual pipeline stages per pipeline-parallel rank. Default is None. VPP parallelism reduces the computation bubbles of PP parallelism and improves GPU utilization.
- num_virtual_stages_per_pipeline_rank: number of virtual pipeline stages per pipeline-parallel rank. Default is None. VPP parallelism reduces the computation bubbles of PP parallelism and improves GPU utilization, at the cost of slightly higher communication volume.
- microbatch_group_size_per_virtual_pipeline_stage: number of consecutive micro-batches processed per virtual pipeline stage. Default is None, which equals pipeline_model_parallel_size.
- 🔥pipeline_model_parallel_layout: a string describing a custom pipeline (PP/VPP) model-parallel layout, e.g. "E|(t|)*3,m|m||L", where E, L, t, and m denote the embedding layer, the loss layer, Transformer decoder layers, and MTP layers, respectively. Stages are separated by "|". Repeated stages or layers can be expressed via multiplication. Commas are for readability only (no syntactic meaning). Default is None, meaning the layout is not configured via this parameter.
  - This parameter is typically used on heterogeneous GPU clusters.
@@ -192,7 +192,7 @@
- moe_token_dispatcher_type: the token dispatcher type to use. Options are 'allgather', 'alltoall', 'flex', and 'alltoall_seq'. Default is 'alltoall'.
- 🔥moe_grouped_gemm: when each rank holds multiple experts, launches multiple local GEMM kernels across multiple streams and uses GroupedLinear from TransformerEngine to improve utilization and performance. Default is False.
- 🔥moe_permute_fusion: fuses token permutation operations during token dispatch. Default is False.
- 🔥moe_aux_loss_coeff: default is 0, i.e., the aux loss is not used.
- 🔥moe_aux_loss_coeff: default is 0, i.e., the aux loss is not used. **In general, the larger this value, the worse the training quality but the more balanced the MoE load.** Choose an appropriate value based on your experiments (a reference formula is sketched after this list).
  - Note: in `ms-swift<3.7.1`, the default was None, read automatically from config.json.
- moe_z_loss_coeff: scaling coefficient for the z-loss. Default is None.
- 🔥moe_shared_expert_overlap: enables overlap between shared-expert computation and dispatcher communication. If not enabled, shared experts run after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.
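
For reference, a common form of the router load-balancing auxiliary loss that this coefficient scales is sketched below (Switch-Transformer style; the exact Megatron implementation may differ in details such as top-k normalization and sequence- versus batch-level aggregation):

$$
\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \, P_i
$$

Here $N$ is the number of experts, $f_i$ is the fraction of tokens dispatched to expert $i$, $P_i$ is the mean router probability assigned to expert $i$, and $\alpha$ corresponds to `moe_aux_loss_coeff`. A larger $\alpha$ pushes the $f_i$ toward the uniform $1/N$, balancing the load while constraining the router, which matches the trade-off noted above.
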
Expand Down Expand Up @@ -254,6 +254,10 @@ lora训练:
- desirable_weight: compensates for the imbalance between the numbers of desirable and undesirable samples by weighting the desirable loss with this coefficient. Default is `1.`.
- undesirable_weight: compensates for the imbalance between the numbers of desirable and undesirable samples by weighting the undesirable loss with this coefficient. Default is `1.`.

**RM parameters**:
- center_rewards_coefficient: a coefficient that incentivizes the reward model to output rewards with zero mean; see this [paper](https://huggingface.co/papers/2312.09244) for details. Recommended value: 0.01 (a loss sketch follows below).
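
As a sketch of how such a centering term is typically combined with the pairwise reward-model loss (this mirrors TRL's `RewardTrainer`; the exact Megatron-SWIFT implementation may differ):

$$
\mathcal{L}_{\text{RM}} = -\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right) + \lambda \left(r_\theta(x, y_w) + r_\theta(x, y_l)\right)^2
$$

Here $r_\theta$ is the reward model, $y_w$/$y_l$ are the chosen/rejected responses, and $\lambda$ is `center_rewards_coefficient`. The quadratic term penalizes reward pairs whose sum drifts away from zero, keeping the reward scale centered without changing the ranking objective.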


## Training parameters

Megatron training parameters inherit from the Megatron parameters and the basic parameters (**dataset, template, and other parameters are shared with ms-swift, and ms-swift's model-specific parameters are also supported**). The basic parameters are documented [here](../Instruction/命令行参数.md#基本参数). The following additional parameters are also included:
@@ -265,7 +269,7 @@ Megatron training parameters inherit from the Megatron parameters and the basic parameters (**shared with ms-swift
- mlp_padding_free: default is False. When padding_free is set to false, applies padding-free optimization to the MLP. This improves training speed and reduces memory usage while still allowing a custom attention_mask.
- vit_gradient_checkpointing: whether to enable gradient checkpointing for the ViT part when training multimodal models. Default is True. (**The ViT part of Megatron-SWIFT uses the transformers implementation.**)
- gradient_checkpointing_kwargs: arguments passed to `torch.utils.checkpoint`, e.g. `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Default is None. This parameter only takes effect for `vit_gradient_checkpointing`.
- 🔥packing: whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes and higher GPU utilization, at the cost of extra preprocessing time) and stabilize memory usage. Default is False. Currently supports CPT/SFT/DPO/KTO.
- 🔥packing: whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes and higher GPU utilization, at the cost of extra preprocessing time) and stabilize memory usage. Default is False. Currently supports CPT/SFT/DPO/KTO/RM.
  - Note: **different sequences within the same batch remain invisible to each other**, except for Qwen3-Next.
- packing_length: the length used for packing. Default is None, in which case it is set to max_length.
- streaming: stream-read and process the dataset. Default is False.
@@ -285,6 +289,6 @@ Megatron training parameters inherit from the Megatron parameters and the basic parameters (**shared with ms-swift

## RLHF parameters
In addition to inheriting the training parameters, the following parameters are also supported:
- 🔥rlhf_type: default is 'dpo'. Currently 'dpo' and 'kto' can be selected.
- 🔥rlhf_type: default is 'dpo'. Currently 'dpo', 'kto', and 'rm' can be selected (a usage sketch follows this list).
- loss_scale: overrides loss_scale from the [basic parameters](../Instruction/命令行参数.md). Default is 'last_round'.
- calculate_per_token_loss: overrides the Megatron parameter; default is False.
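
A minimal invocation sketch for the newly supported RM path is shown below. The `megatron rlhf` entry point, the placeholder checkpoint and dataset values, and the batch-size flags are assumptions for illustration; `--rlhf_type`, `--center_rewards_coefficient`, and `--packing` follow the parameter descriptions in this document.

```bash
# Hypothetical sketch: reward-model training with Megatron-SWIFT.
# Replace the placeholders with a converted MCore checkpoint and a preference dataset.
megatron rlhf \
    --rlhf_type rm \
    --load <mcore-checkpoint-path> \
    --dataset <preference-dataset> \
    --center_rewards_coefficient 0.01 \
    --packing true \
    --micro_batch_size 1 \
    --global_batch_size 16
```
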
2 changes: 1 addition & 1 deletion docs/source/Megatron-SWIFT/多模态模型.md
@@ -1,6 +1,6 @@
# Multimodal models

ms-swift introduces Megatron parallelism techniques to accelerate the training of multimodal large models. CPT/SFT/DPO/KTO is currently supported for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, and Kimi-VL. The full list of supported models is available in the [supported models and datasets documentation](../Instruction/支持的模型和数据集.md).
ms-swift introduces Megatron parallelism techniques to accelerate the training of multimodal large models. CPT/SFT/DPO/KTO/RM is currently supported for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, and Kimi-VL. The full list of supported models is available in the [supported models and datasets documentation](../Instruction/支持的模型和数据集.md).

For environment setup, refer to the Megatron-SWIFT [quick start documentation](./快速开始.md).

3 changes: 2 additions & 1 deletion docs/source/Megatron-SWIFT/快速开始.md
@@ -10,6 +10,7 @@ ms-swift introduces Megatron parallelism techniques to accelerate the training of large models, including data
| Supervised fine-tuning |||||
| DPO |||||
| KTO |||||
| RM |||||
| Classification tasks |||||


@@ -162,7 +163,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I


## Training tips
- Ways to increase training throughput: use packing, increase DP, reduce recomputation, and increase computation-communication overlap.
- Ways to increase training throughput: use packing, increase DP, reduce recomputation, and increase computation-communication overlap. For MoE models, dropping tokens can provide additional speedup (a flag sketch follows this list).
- Choosing parallelism techniques:
  - Megatron-SWIFT's parallelism combines ZeRO-1 (use_distributed_optimizer is enabled by default) with the various parallelism techniques.
  - DP is the fastest but uses more memory; use other parallelism techniques to reduce memory usage.
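
As a rough illustration of the throughput levers above, a hypothetical flag combination is sketched below (assuming the `megatron sft` entry point; flag names follow the Megatron-SWIFT parameter documentation, and the values are illustrative rather than a tuned recipe):

```bash
# Hypothetical sketch: a throughput-oriented flag combination.
# --packing enables sequence packing; --overlap_param_gather overlaps the
# distributed-optimizer parameter all-gather with computation.
megatron sft \
    --load <mcore-checkpoint-path> \
    --dataset <your-dataset> \
    --packing true \
    --overlap_param_gather true \
    --micro_batch_size 1 \
    --global_batch_size 64
```
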
2 changes: 1 addition & 1 deletion docs/source_en/BestPractices/Qwen3-Best-Practice.md
@@ -332,7 +332,7 @@ swift rlhf \

Best practice reference for single-node 8xH20 LoRA training with Qwen3-235B-A22B-Instruct-250718: https://github.com/modelscope/ms-swift/pull/5033.

ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/KTO for large models. Supported models can be found in the [Supported Models and Datasets Document](../Instruction/Supported-models-and-datasets.md).
ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/KTO/RM for large models. Supported models can be found in the [Supported Models and Datasets Document](../Instruction/Supported-models-and-datasets.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT Training Documentation](../Megatron-SWIFT/Quick-start.md).

1 change: 1 addition & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -197,6 +197,7 @@ Supervised Fine-tuning:
- videos: video, videos.
- audios: audio, audios.
- If you need to pass base64-encoded data instead of file paths, use the following form: `"videos": ['data:video/mp4;base64,{base64_encoded}']`, `"images": ['data:image/jpg;base64,{base64_encoded}']`.
- If you wish to directly pass in video frames instead of a video file, you can use the following format (requires `ms-swift>=3.8.3`): `"videos": [["/xxx/x.png", "/xxx/y.png"], ["/xxx/a.png", "/xxx/b.png", "/xxx/c.png"]]`. This format is supported only by certain models, including Qwen2/2.5/3-VL, Qwen2.5/3-Omni, and their derivative models.

The data format for RLHF and sequence classification of multimodal models can reference the format of pure text large models, with additional fields such as `images` added on top of that.

2 changes: 1 addition & 1 deletion docs/source_en/GetStarted/Quick-start.md
@@ -10,7 +10,7 @@ ms-swift is a comprehensive training and deployment framework for large language
- Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
- 🍊 RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
- 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
- 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO/KTO using Megatron parallelism techniques, currently compatible with 200+ large language models.
- 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO/KTO/RM using Megatron parallelism techniques, currently compatible with 200+ large language models.
- Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
- Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.
- 🍉 Toolbox Capabilities: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -475,6 +475,7 @@ RLHF arguments inherit from the [training arguments](#training-arguments).
- simpo_gamma: Reward margin term in the SimPO algorithm, with a paper-suggested setting of 0.5-1.5, default is `1.`.
- desirable_weight: In the KTO algorithm, this weight compensates for the imbalance between the number of desirable and undesirable samples by scaling the desirable loss. Default is `1.0`.
- undesirable_weight: In the KTO algorithm, this weight compensates for the imbalance between desirable and undesirable samples by scaling the undesirable loss. Default is `1.0`.
- center_rewards_coefficient: A coefficient used in reward model (RM) training to incentivize the model to output rewards with zero mean. See this [paper](https://huggingface.co/papers/2312.09244) for details. Recommended value: 0.01.
- loss_scale: Overrides the template parameter. During RLHF training, the default is `'last_round'`.
- temperature: Default is 0.9; this parameter will be used in PPO, GRPO and GKD.
- lmbda: Default is 0.5. This parameter is used in GKD. It controls the lambda parameter for the proportion of student data (i.e., the fraction of on-policy, student-generated outputs). If lmbda is 0, student-generated data is not used.