
@MiniMax_AI (Twitter):

Q: Why choose CISPO instead of GSPO or GRPO? How well does CISPO adapt to MoE, and does changing the RL algorithm require architectural refactoring?

A: GRPO predates both, but in our attempts to reproduce R1-Zero it proved unreliable: PPO-style clipping caused token-level gradients to vanish, leading to unstable learning. GSPO can be a reasonable alternative in some settings, and the Meta paper provides a useful comparison. We chose CISPO primarily for its empirical stability and favorable bias–variance trade-off.

Regarding MoE compatibility, our observations so far indicate that CISPO behaves similarly on MoE and dense models; at the algorithmic level, we do not see major discrepancies introduced by MoE when using CISPO.

As for architecture changes, switching RL algorithms does not require refactoring the core model architecture. That said, MoE models do introduce additional considerations during RL training, mainly due to the router mechanism. Recent approaches (such as R3 with fixed routing) aim to improve MoE stability under RL. These are largely lower-level implementation choices and are mostly orthogonal to higher-level RL algorithms like CISPO, which operate at the level of optimization dynamics rather than architectural design.
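The vanishing-gradient point can be made concrete. Below is a minimal sketch (not MiniMax's implementation; the function names and the illustrative clip thresholds are assumptions) contrasting the per-token gradient coefficient under a PPO-style clipped surrogate, as used in GRPO, with a CISPO-style objective where the clipped importance-sampling weight is treated as a stop-gradient constant:

```python
def ppo_grad_coeff(ratio, adv, eps=0.2):
    """Per-token gradient coefficient of the PPO clipped objective,
    L = min(r * A, clip(r, 1-eps, 1+eps) * A), w.r.t. log pi.
    When the clipped branch is active, the term is constant in the
    policy parameters, so the token's gradient is exactly zero."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    if ratio * adv <= clipped * adv:
        # Unclipped branch active: gradient coefficient is r * A.
        return ratio * adv
    # Clipped branch active: the token contributes no gradient.
    return 0.0


def cispo_grad_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    """CISPO-style coefficient: the IS ratio is clipped but detached
    (stop-gradient), multiplying the log-prob gradient directly, so
    every token keeps a nonzero gradient coefficient."""
    clipped = max(1 - eps_low, min(1 + eps_high, ratio))
    return clipped * adv
```

For example, with advantage +1 and ratio 1.5 (eps = 0.2), the PPO coefficient is 0 (the token's gradient vanishes), while the CISPO coefficient is 1.2: the update is bounded but the token still contributes, which is the stability property described above.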
