Nvidia Nemotron 3 Super: 120B Hybrid Model Beats GPT-OSS

By alex2404

Nvidia released Nemotron 3 Super, a 120-billion-parameter hybrid model, on March 11, 2026, with weights published on Hugging Face under mostly open terms for commercial use.

The release targets a specific cost problem. Multi-agent systems handling long-horizon tasks — software engineering, cybersecurity triage — can generate up to 15 times the token volume of standard chat interactions, eroding the economics of enterprise deployment.

Three Architectures in One Model

The model fuses three distinct architectural approaches. A Hybrid Mamba-Transformer backbone interleaves Mamba-2 layers with selective Transformer attention layers. The Mamba-2 layers handle sequence processing at linear-time complexity, supporting a 1-million-token context window without the memory cost of a standard KV cache. Transformer layers are inserted as precision anchors for associative recall — retrieving specific facts from deep within codebases or financial documents.
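The interleaving can be pictured as a simple layer schedule. The sketch below is a toy illustration under an assumed ratio (one attention layer per six blocks); the article does not state Nemotron's actual layout.

```python
# Toy hybrid layer schedule: mostly linear-time Mamba-2 blocks, with a
# Transformer attention layer inserted periodically as a "precision anchor".
# The attn_every ratio is a hypothetical choice for illustration.
def hybrid_schedule(n_layers: int, attn_every: int = 6) -> list[str]:
    """Return a layer-type list: mostly 'mamba2', with periodic 'attention'."""
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba2"
        for i in range(n_layers)
    ]

layers = hybrid_schedule(24, attn_every=6)
print(layers.count("mamba2"), layers.count("attention"))  # 20 4
```

Because only the sparse attention layers keep a KV cache, the memory cost of long contexts grows with the handful of attention layers rather than with all 24.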

On top of that backbone sits Latent Mixture-of-Experts (LatentMoE). Standard MoE routes tokens to specialists at full hidden dimension, creating bottlenecks at scale. LatentMoE compresses tokens before routing, allowing the model to consult four times as many specialists at the same computational cost.
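The compress-then-route idea can be sketched in a few lines of NumPy. All sizes here are hypothetical (a 4x compression from a 1024-dim hidden state to a 256-dim latent, 32 experts, top-4 routing); the point is only that the router operates in the smaller latent space, so the same routing budget covers more specialists.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 1024, 256          # assumed 4x compression
n_experts, top_k = 32, 4               # hypothetical expert pool

# Down-projection into the latent space, and a router scoring experts there.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_router = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)

def route(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compress a token, then pick its top-k experts by router score."""
    z = x @ W_down                     # (d_latent,) compressed token
    logits = z @ W_router              # (n_experts,) one score per expert
    top = np.argsort(logits)[-top_k:]  # indices of the chosen specialists
    return top, z

x = rng.standard_normal(d_model)
experts, z = route(x)
print(len(experts), z.shape)  # 4 (256,)
```

With the latent at one quarter of the hidden width, routing and per-expert math cost roughly a quarter of their full-dimension versions, which is the arithmetic behind consulting four times as many experts for the same compute.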

The third component is Multi-Token Prediction. Rather than predicting one token at a time, the model predicts several future tokens simultaneously, functioning as a built-in draft model. According to the announcement, this delivers up to 3x wall-clock speedups on structured generation tasks like code and tool calls.
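The mechanics that let a built-in draft head pay off resemble speculative decoding: accept the drafted tokens up to the first point where a verification pass disagrees. The acceptance rule below is a plain-Python illustration, not Nvidia's implementation, and the tokens are made up.

```python
def accept_draft(draft: list[str], verify: list[str]) -> list[str]:
    """Keep the longest prefix of the draft the verifier agrees with,
    plus the verifier's first divergent token (one speculative round)."""
    accepted = []
    for d, v in zip(draft, verify):
        if d == v:
            accepted.append(d)       # draft token confirmed for free
        else:
            accepted.append(v)       # verifier's correction ends the round
            break
    return accepted

# Toy round: 4 tokens drafted at once; the verifier agrees with three.
print(accept_draft(["def", "foo", "(", ")"], ["def", "foo", "(", "x"]))
```

Structured output like code and tool calls is highly predictable, so most drafted tokens are confirmed, which is why the reported speedups concentrate on those workloads.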

Blackwell-Native Performance

Nvidia pre-trained the model natively in NVFP4 (4-bit floating point), optimized for the Blackwell GPU platform. On Blackwell hardware, the model delivers 4x faster inference than 8-bit models running on the previous Hopper architecture, with no reported accuracy loss.
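A rough picture of what 4-bit floating point means: each weight snaps to the small E2M1 value grid, rescaled per block. The sketch below is a simplified stand-in, assuming the standard E2M1 magnitudes and one shared scale per block, whereas NVFP4 uses finer-grained block scaling.

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a weight block to FP4 with one shared scale
    (a simplified stand-in for NVFP4's per-block scaling)."""
    scale = np.abs(block).max() / FP4_GRID[-1] or 1.0
    scaled = block / scale
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scale, scale          # dequantized values, block scale

w = np.array([0.03, -0.9, 1.2, -0.1])
wq, s = quantize_fp4(w)
print(np.round(wq, 3))
```

Storing 4 bits per weight plus a scale per block is what halves memory traffic relative to 8-bit formats, which Blackwell's tensor cores then exploit natively.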

The model currently holds the top position on the DeepResearch Bench, which measures multi-step research capability across large document sets.

Benchmark results show a mixed picture against Qwen3.5-122B-A10B and GPT-OSS-120B. Nemotron 3 Super leads on several tests: HMMT Feb25 with tools (94.73) versus Qwen’s 89.55, RULER @ 512k (95.67) against GPT-OSS’s 52.30, SWE-Bench via OpenHands (60.47) against GPT-OSS’s 41.90, and SWE-Bench Multilingual via OpenHands (45.78) compared to GPT-OSS’s 30.80.

Qwen outperforms it on several others — including GPQA no tools (86.60 vs 79.23), HLE no tools (25.30 vs 18.26), and TauBench V2 Telecom (95.00 vs 64.36). GPT-OSS leads on Arena-Hard-V2 (90.26) against Nemotron’s 73.88.

The throughput advantage is the primary argument Nvidia makes for the model’s enterprise fit — not across-the-board benchmark dominance, but inference speed and token efficiency at the scale agentic pipelines demand.


This article is a curated summary based on third-party sources.
