Nvidia's new open-weight model reached gold medal-level performance at three elite global competitions — using only 3 billion active parameters — and the company has published the post-training method that made it possible.
Nemotron-Cascade 2 is a 30B parameter Mixture-of-Experts model that activates just 3B parameters at inference time. According to the announcement, it achieved gold medal scores at the 2025 International Mathematical Olympiad, the International Olympiad in Informatics, and the ICPC World Finals. It is only the second open model to reach that tier, after DeepSeek-V3.2-Speciale — a model with 20 times more active parameters.
The model starts from the same base as Nvidia's existing Nemotron-3-Nano. The technical report states it outperforms that model on nearly every benchmark and, in many cases, outperforms Nvidia's own Nemotron-3-Super, which carries four times the active parameters. No new base model was required. The performance gap comes entirely from what happens after pre-training.
The Cascade RL Pipeline
Reinforcement learning has become the standard technique for improving reasoning in large language models, but applying it across multiple domains simultaneously creates a well-documented problem: gains in one area erode performance in another. The Cascade RL pipeline addresses this through strict sequential training — one domain at a time, in a deliberate order.
For Nemotron-Cascade 2, that sequence runs as follows: instruction-following RL first; then multi-domain RL covering STEM questions, tool calling, and structured output; followed by on-policy distillation; then RLHF for human preference alignment; then long-context RL; then code RL; and finally software engineering RL. Instruction-following comes first because it can conflict with human preference alignment, and the report says any instruction-following ability lost along the way can be recovered later in the sequence. The code and software engineering stages are placed last because they perform best in that position.
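The strictly sequential structure can be sketched as a simple ordered schedule. This is a hypothetical illustration, not Nvidia's code: the stage names paraphrase the report's sequence, and `train_stage` stands in for whatever domain-specific RL routine a team would plug in.

```python
# Illustrative sketch of a Cascade RL schedule: one domain at a time,
# in a fixed order, each stage starting from the previous stage's output.
# Stage names follow the article's description; train_stage() is a placeholder.

CASCADE_STAGES = [
    "instruction_following_rl",
    "multi_domain_rl",         # STEM questions, tool calling, structured output
    "on_policy_distillation",  # MOPD rebalancing step
    "rlhf",                    # human preference alignment
    "long_context_rl",
    "code_rl",
    "software_engineering_rl",
]

def run_cascade(model, train_stage):
    """Apply each RL stage sequentially; per-stage hyperparameters
    and curricula would be tuned inside train_stage for that domain."""
    history = []
    for stage in CASCADE_STAGES:
        model = train_stage(model, stage)
        history.append(stage)
    return model, history
```

The point of the structure is that every stage consumes the checkpoint produced by the one before it, which is what makes the ordering itself a tunable design decision.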
The technical report identifies three practical advantages to this approach. Domain-specific RL stages prove resistant to catastrophic forgetting — code training rarely degrades math performance and sometimes improves it. Single-domain stages allow hyperparameters and training curricula to be tuned precisely for each domain’s characteristics. And because responses within one domain tend to be similar in length and verification cost, compute utilization runs substantially more efficiently than mixed-domain training.
The ordering is not universal. The report states it depends on the specific model’s behavior, meaning teams applying this method would need to evaluate sequencing for their own use cases.
MOPD: Rebalancing Mid-Pipeline
Sequential training solves the interference problem but introduces another: as a model passes through many RL stages, performance on earlier domains can drift. Nvidia's solution is Multi-Domain On-Policy Distillation, or MOPD, inserted partway through the pipeline.
MOPD identifies the best-performing checkpoint for each individual domain across all intermediate training states — the math checkpoint might peak after supervised fine-tuning, the instruction-following checkpoint after its dedicated RL stage — then uses those checkpoints collectively to regenerate a balanced dataset and retrain the current model against it. The effect is a recalibration of capabilities without discarding the gains already made downstream.
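The selection-and-regeneration step described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the report's implementation: the score table, `generate` function, and prompt sets are all hypothetical stand-ins.

```python
# Illustrative sketch of MOPD's core idea: pick, per domain, the
# intermediate checkpoint that scores best on that domain, then pool
# responses from those per-domain teachers into one balanced dataset
# for retraining the current model. All names here are hypothetical.

def best_checkpoints(scores):
    """scores: {checkpoint: {domain: score}} -> {domain: best checkpoint}"""
    domains = {d for per_domain in scores.values() for d in per_domain}
    return {
        d: max(scores, key=lambda ckpt: scores[ckpt].get(d, float("-inf")))
        for d in domains
    }

def build_mopd_dataset(scores, generate, prompts_by_domain):
    """Regenerate training data using each domain's best checkpoint
    as the teacher for that domain's prompts."""
    teachers = best_checkpoints(scores)
    dataset = []
    for domain, prompts in prompts_by_domain.items():
        for prompt in prompts:
            dataset.append((prompt, generate(teachers[domain], prompt)))
    return dataset
```

The design choice worth noting is that the teachers are earlier states of the same training run, so the distillation targets stay on-policy relative to the model family rather than coming from an external teacher.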
Both Cascade RL and MOPD are detailed in Nvidia's published technical report, which the company describes as a reproducible blueprint. Pre-training a frontier model from scratch costs tens to potentially hundreds of millions of dollars. The report's core implication is that teams working with existing base models may achieve substantially better results by investing in post-training methodology rather than model scale.
This article is a curated summary based on third-party sources.