Here, $s_i(\theta)$ is **the importance ratio defined based on sequence likelihood** in GSPO, where we perform length normalization to reduce variance and unify the numerical range of $s_i(\theta)$.
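
For intuition, here is a minimal PyTorch sketch of how such a length-normalized, sequence-level importance ratio can be computed from per-token log-probabilities; the function name and tensor layout are illustrative and not taken from the actual training implementation.

```python
import torch

def sequence_importance_ratio(logp_new, logp_old, mask):
    """Length-normalized sequence-level importance ratio s_i(theta).

    logp_new, logp_old: (batch, seq_len) per-token log-probabilities of the
        sampled responses under pi_theta and pi_theta_old, respectively.
    mask: (batch, seq_len) with 1 for response tokens and 0 for padding.
    """
    # Log of the sequence-likelihood ratio pi_theta(y_i|x) / pi_theta_old(y_i|x),
    # i.e. the sum of per-token log-ratios over the response.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1)
    # Length normalization: dividing by |y_i| before exponentiating reduces
    # variance and keeps s_i(theta) in a comparable numerical range across
    # responses of different lengths.
    lengths = mask.sum(dim=-1).clamp(min=1)
    return torch.exp(log_ratio / lengths)
```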
## Training Efficiency and Performance
We experiment with a cold-start model fine-tuned from Qwen3-30B-A3B-Base and report its training reward curves as well as its performance curves on the AIME'24, LiveCodeBench, and CodeForces benchmarks, comparing against GRPO as the baseline. Note that GRPO requires the Routing Replay training strategy for MoE RL to converge normally (discussed later), whereas **GSPO obviates the need for this strategy entirely**.
As shown in the figure above, GSPO demonstrates **significantly higher training efficiency** than GRPO, achieving better performance under the same training cost. In particular, we observe that **GSPO delivers continuous performance improvement as we increase the training compute, regularly update the query set, and extend the generation length**, which is exactly the **scalability** we expect from an algorithm. Ultimately, we successfully applied GSPO to the large-scale RL training of the latest Qwen3 models, further unleashing the potential of RL scaling!

An interesting observation is that the fraction of tokens clipped in GSPO is two orders of magnitude higher than that in GRPO (as shown in the figure below), while GSPO still achieves higher training efficiency. This further demonstrates that GRPO's token-level optimization objective is noisy and inefficient, while GSPO's sequence-level approach provides a more reliable and effective learning signal.
{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/clipping.jpg" title="Fractions of clipped tokens">}}
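
To make the counting concrete, the hedged sketch below shows one way the clipped-token fractions could be tallied under the two schemes: token-level clipping flags each token whose own importance ratio leaves the clipping range, whereas sequence-level clipping drops an entire response at once, so all of its tokens are counted as clipped. The clipping thresholds and tensor layouts here are illustrative, not the values used in our training runs.

```python
import torch

def clipped_fraction_token_level(token_ratio, mask, eps=0.2):
    # GRPO-style: each token carries its own importance ratio and is
    # clipped independently when that ratio leaves [1 - eps, 1 + eps].
    # token_ratio, mask: (batch, seq_len)
    clipped = (token_ratio < 1 - eps) | (token_ratio > 1 + eps)
    return (clipped.float() * mask).sum() / mask.sum()

def clipped_fraction_sequence_level(seq_ratio, mask, eps=0.01):
    # GSPO-style: one length-normalized ratio per response; when it is
    # clipped, the entire response drops out of the gradient, so all of
    # its tokens are counted as clipped.
    # seq_ratio: (batch,), mask: (batch, seq_len)
    clipped_seq = (seq_ratio < 1 - eps) | (seq_ratio > 1 + eps)
    tokens_per_seq = mask.sum(dim=-1)
    return (clipped_seq.float() * tokens_per_seq).sum() / mask.sum()
```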
## Benefits for MoE RL and Infrastructure
We found that, when adopting the GRPO algorithm, the expert-activation volatility of MoE models prevents RL training from converging properly. To address this challenge, we previously employed the **Routing Replay** training strategy, which caches the experts activated in $\pi_{\theta_\text{old}}$ and "replays" these routing patterns in $\pi_\theta$ when computing the importance ratios. As shown in the figure below, Routing Replay is crucial for the normal convergence of GRPO training on MoE models. However, it incurs additional memory and communication overhead and may limit the actual capacity of MoE models.
{{< figure src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/routing_replay.jpg" title="Effect of Routing Replay in the GRPO training of MoE models">}}
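
For illustration only, here is a heavily simplified, hypothetical top-k router sketching the mechanism above: during rollout, the indices selected under $\pi_{\theta_\text{old}}$ would be cached per token and passed back as `replay_indices` when recomputing likelihoods under $\pi_\theta$. This is not our actual MoE training infrastructure, and the class and argument names are made up for this sketch.

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Toy top-k MoE router with optional Routing Replay (illustrative only)."""

    def __init__(self, hidden_dim, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.k = k

    def forward(self, hidden_states, replay_indices=None):
        # hidden_states: (num_tokens, hidden_dim)
        logits = self.gate(hidden_states)                 # (num_tokens, num_experts)
        if replay_indices is None:
            # Normal routing: select the top-k experts under the current policy.
            topk_logits, topk_indices = logits.topk(self.k, dim=-1)
        else:
            # Routing Replay: reuse the expert indices cached from pi_theta_old
            # so that pi_theta activates the same experts when the importance
            # ratios are recomputed.
            topk_indices = replay_indices                 # (num_tokens, k)
            topk_logits = logits.gather(-1, topk_indices)
        gate_weights = topk_logits.softmax(dim=-1)
        return gate_weights, topk_indices
```

Caching and shipping these per-token expert indices between rollout and training is precisely the extra memory and communication overhead noted above.
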
A notable advantage of GSPO is that it **completely eliminates the dependency on Routing Replay**. The key insight is that GSPO relies only on the sequence-level likelihood (i.e., $\pi_\theta(y_i|x)$) and is not sensitive to individual token likelihoods (i.e., $\pi_\theta(y_{i,t}|x,y_{i,<t})$). It therefore does not require infrastructure-heavy workarounds like Routing Replay, which both simplifies and stabilizes the training process and allows MoE models to realize their full capacity.
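
One way to see this insensitivity, using the definitions above: the sequence likelihood that $s_i(\theta)$ is built on factorizes over tokens,

$$
\pi_\theta(y_i \mid x) = \prod_{t=1}^{|y_i|} \pi_\theta\left(y_{i,t} \mid x, y_{i,<t}\right),
$$

so GSPO depends on token likelihoods only through this length-normalized aggregate. A routing-induced fluctuation in any single token's likelihood shifts just one factor and is smoothed out over the whole sequence, whereas token-level importance ratios expose each such fluctuation directly.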