
Commit 1cd2f15

update link (#60)
Co-authored-by: Ren Xuancheng <[email protected]>
1 parent 6bdd48f commit 1cd2f15

5 files changed: +12, -11 lines changed

content/blog/gspo/clipping.jpg

-72.3 KB
Binary file not shown.

content/blog/gspo/index.md

Lines changed: 7 additions & 6 deletions
@@ -60,9 +60,9 @@ Let $x$ be a query, $\pi_{\theta_\mathrm{old}}$ be the old policy that generates

{{< rawhtml >}}
$$
-\mathcal{J}_\text{GSPO} (\theta)
-=
-\mathbb{E}_{ x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}( \cdot | x) }
+\mathcal{J}_\text{GSPO} (\theta)
+=\,
+\mathbb{E}_{ x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\theta_\mathrm{old}}( \cdot | x) }
\left[
\frac{1}{G} \sum_{i=1}^{G}
\min \left( s_{i}(\theta) \widehat{A}_{i}, \, \mathrm{clip} \left( s_{i}(\theta), 1 - {\varepsilon}, 1 + {\varepsilon} \right) \widehat{A}_{i} \right)
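
For readers of this diff, a brief illustrative aside (not part of the commit): the sequence-level objective in the hunk above, together with the length-normalized importance ratio $s_i(\theta)$ that the surrounding blog text defines, can be sketched in PyTorch roughly as follows. Tensor names, shapes, and the loss/monitoring conventions are assumptions of this sketch, not the authors' implementation.

```python
import torch

def gspo_objective(logp_new, logp_old, advantages, response_mask, eps):
    """Sketch of the sequence-level clipped objective shown above.

    logp_new / logp_old: (G, T) per-token log-probabilities of the G sampled
    responses under pi_theta and pi_theta_old; response_mask: (G, T), 1.0 for
    response tokens and 0.0 for padding; advantages: (G,) group-relative
    advantages; eps: the clipping range. All names and shapes are assumed.
    """
    lengths = response_mask.sum(dim=-1)                      # |y_i|
    # Length-normalized sequence importance ratio s_i(theta), in log space:
    # (pi_theta(y_i|x) / pi_theta_old(y_i|x)) ** (1 / |y_i|).
    log_ratio = ((logp_new - logp_old.detach()) * response_mask).sum(dim=-1)
    s = torch.exp(log_ratio / lengths)
    s_clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps)
    # Fraction of clipped responses, handy for monitoring (cf. the clipping
    # discussion later in this diff).
    clip_frac = ((s < 1.0 - eps) | (s > 1.0 + eps)).float().mean()
    # PPO-style pessimistic objective, averaged over the group of G responses;
    # negate it to obtain a loss to minimize.
    objective = torch.minimum(s * advantages, s_clipped * advantages).mean()
    return -objective, clip_frac
```

Note that because $s_i(\theta)$ is a single scalar per response, clipping here acts on whole responses rather than on individual tokens, which is consistent with the clipping-fraction discussion later in this diff.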
@@ -83,26 +83,27 @@ $$
{{< /rawhtml >}}


+
Here, $s_i(\theta)$ is **the importance ratio defined based on sequence likelihood** in GSPO, where we perform length normalization to reduce variance and unify the numerical range of $s_i(\theta)$.

## Training Efficiency and Performance

We experiment with a cold-start model fine-tuned from Qwen3-30B-A3B-Base and report its training reward curves as well as performance curves on the AIME'24, LiveCodeBench, and CodeForces benchmarks. We compare against GRPO as the baseline. Note that GRPO necessitates the Routing Replay training strategy for the normal convergence of MoE RL (which we will discuss later), while **GSPO has obviated the need for this strategy**.

-{{< figure src="results.jpg#center" title="Results">}}
+{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/results.jpg#center" title="Experimental results">}}

As shown in the figure above, GSPO demonstrates **significantly higher training efficiency** than GRPO, achieving better performance under the same training cost. Particularly, we observe that **GSPO can deliver continuous performance improvement through increasing the training compute, regularly updating the query set, and extending the generation length** — this is exactly the **scalability** we expect from an algorithm. Ultimately, we successfully applied GSPO to the large-scale RL training of the latest Qwen3 models, further unleashing the potential of RL scaling!

An interesting observation is that the fraction of tokens clipped in GSPO is two orders of magnitude higher than that in GRPO (as shown in the figure below), while GSPO still achieves higher training efficiency. This further demonstrates that GRPO's token-level optimization objective is noisy and inefficient, while GSPO's sequence-level approach provides a more reliable and effective learning signal.

-{{< figure src="clipping.jpg#center" title="Clipping">}}
+{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/clipping.jpg" title="Fractions of clipped tokens">}}


## Benefits for MoE RL and Infrastructure

We found that when adopting the GRPO algorithm, the expert activation volatility of MoE models prevents RL training from converging properly. To address this challenge, we previously employed the **Routing Replay** training strategy, which caches the activated experts in $\pi_{\theta_\text{old}}$ and "replays" these routing patterns in $\pi_\theta$ when computing importance ratios. As shown in the figure below, Routing Replay is crucial for normal convergence of GRPO training on MoE models. However, the Routing Replay strategy incurs additional memory and communication overhead and may limit the actual capacity of MoE models.

-{{< figure src="routing_replay.jpg#center" title="Routing Replay">}}
+{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/routing_replay.jpg" title="Effect of Routing Replay in the GRPO training of MoE models">}}

The notable advantage of GSPO lies in **completely eliminating the dependency on Routing Replay**. The key insight is that GSPO only focuses on sequence-level likelihood (i.e., $\pi_\theta(y_i|x)$) and is not sensitive to individual token likelihood (i.e., $\pi_\theta(y_{i,t}|x,y_{i,<t})$). Therefore, it does not require infrastructure-heavy workarounds like Routing Replay, both simplifying and stabilizing the training process while allowing models to maximize their capacity.

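To make the Routing Replay strategy described in the hunk above more concrete, here is a toy sketch of a top-k MoE gate that can cache the experts activated under $\pi_{\theta_\text{old}}$ and replay them when re-scoring the sequence under $\pi_\theta$. Module and argument names are assumptions of this illustration, not the Qwen training infrastructure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReplayableTopKGate(nn.Module):
    """Toy MoE router illustrating the Routing Replay idea (illustrative only)."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        # x: (num_tokens, d_model) -> per-token expert indices and gate weights.
        logits = self.router(x)                                  # (num_tokens, n_experts)
        if replay_indices is None:
            # Normal routing: each token picks its top-k experts.
            _, indices = torch.topk(logits, self.top_k, dim=-1)  # (num_tokens, top_k)
        else:
            # Routing Replay: reuse the expert choices cached from the rollout,
            # so the updated policy scores the sequence through the same experts
            # when importance ratios are computed.
            indices = replay_indices
        weights = F.softmax(logits.gather(-1, indices), dim=-1)  # renormalized gates
        return indices, weights

# Cache routing during rollout, then replay it when recomputing likelihoods.
# (The same module stands in for both policies here, purely for illustration.)
gate = ReplayableTopKGate(d_model=16, n_experts=8, top_k=2)
tokens = torch.randn(5, 16)
cached_indices, _ = gate(tokens)                    # routing under pi_theta_old
_, replayed_weights = gate(tokens, cached_indices)  # same experts under pi_theta
```

As the hunk above notes, GSPO removes the need for this machinery altogether, since only the sequence-level likelihood enters its objective.
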
content/blog/gspo/index.zh.md

Lines changed: 5 additions & 5 deletions
@@ -59,7 +59,7 @@ show_word_count: true
Let $x$ be a query, $\pi_{\theta_\mathrm{old}}$ the policy used to sample responses, $\\{y_i\\}\_{i=1}^G$ the group of sampled responses, $\widehat{A}\_{i}$ the within-group relative advantage of each response, and $\pi_\theta$ the current policy to be optimized. GSPO adopts the following optimization objective:

{{< rawhtml >}}
-$$f
+$$
\mathcal{J}_\text{GSPO} (\theta)
=
\mathbb{E}_{ x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\theta_\text{old}}( \cdot | x) }
@@ -88,21 +88,21 @@ $$

We experiment with a cold-start model fine-tuned from Qwen3-30B-A3B-Base and report its training reward curves as well as its performance curves on benchmarks such as AIME'24, LiveCodeBench, and CodeForces. We compare against GRPO as the baseline. Note that GRPO requires the Routing Replay training strategy to converge normally (discussed later in this post), whereas **GSPO needs no such strategy**.

-{{< figure src="results.jpg#center" title="Results">}}
+{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/results.jpg#center" title="Experimental results">}}


As shown in the figure above, GSPO exhibits **significantly higher training efficiency** than GRPO, i.e., it achieves better performance under the same computational cost. In particular, we observe that GSPO can **deliver continuous performance improvement by scaling up the training compute**; this is exactly the **scalability** we expect from an algorithm. Ultimately, we successfully applied GSPO to the large-scale RL training of the latest Qwen3 models, further unleashing the potential of RL scaling!

An interesting observation is that the fraction of tokens clipped by GSPO is two orders of magnitude higher than that of GRPO (as shown in the figure below), yet GSPO achieves higher training efficiency. This further indicates that GRPO's token-level optimization objective is noisy and inefficient, whereas GSPO's sequence-level objective provides a more reliable and effective learning signal.

-{{< figure src="clipping.jpg#center" title="Clipping">}}
+{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/clipping.jpg#center" title="Fraction of clipped tokens">}}


## Benefits for MoE RL and Infrastructure

We found that when adopting the GRPO algorithm, the expert-activation volatility of MoE models prevents RL training from converging normally. To address this challenge, we previously adopted the **Routing Replay** training strategy, which caches the experts activated in $\pi_{\theta_\text{old}}$ and "replays" these routing patterns in $\pi_\theta$ when computing importance ratios. As shown in the figure below, Routing Replay is crucial for the normal convergence of GRPO training on MoE models. However, Routing Replay incurs additional memory and communication overhead and may limit the actually usable capacity of MoE models.

-{{< figure src="routing_replay.jpg#center" title="Routing Replay">}}
+{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/routing_replay.jpg#center" title="Effect of Routing Replay on GRPO training of MoE models">}}

A notable advantage of GSPO is that it **completely eliminates the dependency on Routing Replay**. The key insight is that GSPO only concerns the sequence-level likelihood (i.e., $\pi_\theta(y_i|x)$) and is insensitive to individual token likelihoods (i.e., $\pi_\theta(y_{i,t}|x,y_{i,<t})$). Therefore, it needs no infrastructure-heavy workarounds such as Routing Replay, which both simplifies and stabilizes training while allowing the model to fully realize its capacity and potential.

@@ -118,7 +118,7 @@ A notable advantage of GSPO is that it **completely eliminates the dependency on Routing Replay**

```tex
@article{gspo,
-title={Group Sequence Policy Optimization,
+title={Group Sequence Policy Optimization},
author={
Chujie Zheng and Shixuan Liu and Mingze Li and Xiong-Hui Chen and Bowen Yu and
Chang Gao and Kai Dang and Yuqiong Liu and Rui Men and An Yang and Jingren Zhou and

content/blog/gspo/results.jpg

-285 KB
Binary file not shown.

content/blog/gspo/routing_replay.jpg

-157 KB
Binary file not shown.
