TL;DR
NVIDIA's Nemotron 3 Super combines a latent mixture-of-experts design with a hybrid Mamba architecture: 120B total parameters, about 12B active per token, a 1M-token context window, and up to 4x more experts at the same compute cost.
NVIDIA released Nemotron 3 Super, and the architecture is worth paying attention to. It is a 120B-parameter mixture-of-experts model, but only about 12B parameters are active per token; that ratio alone makes it interesting for inference costs. What sets it apart from standard MoE is the "latent" approach: instead of routing raw token representations to experts, the model first compresses them into a smaller latent representation. Experts process these compressed inputs, so you can run up to four times more experts at the same computational cost as a traditional MoE setup.
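To make the routing idea concrete, here is a minimal NumPy sketch of latent MoE. Everything here is illustrative: the dimensions, the 4x compression ratio, the top-2 routing, and all weight shapes are assumptions for the example, not NVIDIA's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 1024, 256      # assumed 4x compression into latent space
n_experts, top_k = 16, 2           # assumed expert count and routing fan-out

# Hypothetical weights, randomly initialized for illustration
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compress tokens
W_up   = rng.standard_normal((d_latent, d_model)) * 0.02   # expand back
router = rng.standard_normal((d_latent, n_experts)) * 0.02
experts = rng.standard_normal((n_experts, d_latent, d_latent)) * 0.02

def latent_moe(x):
    """x: (tokens, d_model) -> (tokens, d_model)."""
    z = x @ W_down                                 # route AND compute in latent space
    logits = z @ router
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k experts per token
    w = np.take_along_axis(logits, top, axis=-1)
    w = np.exp(w - w.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # softmax over selected experts
    out = np.zeros_like(z)
    for t in range(x.shape[0]):
        for k in range(top_k):
            out[t] += w[t, k] * (z[t] @ experts[top[t, k]])
    return out @ W_up                              # back to model width

y = latent_moe(rng.standard_normal((4, d_model)))
print(y.shape)   # (4, 1024)
```

Because each expert operates on `d_latent`-sized vectors rather than `d_model`-sized ones, per-expert FLOPs shrink with the compression factor, which is what leaves headroom to run more experts at the same overall cost.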
The other architectural piece is the hybrid Mamba integration. NVIDIA blends transformer attention layers with Mamba state-space layers, getting transformer-quality reasoning with Mamba's linear scaling on long sequences. The result is a model that handles its full 1M token context window efficiently, especially in multi-user serving scenarios where throughput matters more than single-request latency.
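The long-context argument comes down to scaling: attention cost grows with the square of sequence length, while a state-space scan grows linearly. The back-of-the-envelope comparison below uses assumed numbers (hidden width, state size) purely to show the shape of the curves, not Nemotron's real layer configuration.

```python
# Rough FLOP comparison: quadratic attention vs. linear state-space scan.
# All constants here are order-of-magnitude assumptions for illustration.

def attention_flops(seq_len, d_model):
    # QK^T and the attention-weighted sum each cost ~seq_len^2 * d_model
    return 2 * seq_len**2 * d_model

def ssm_flops(seq_len, d_model, d_state=16):
    # A linear recurrence touches each token once: ~seq_len * d_model * d_state
    return 2 * seq_len * d_model * d_state

n, d = 1_000_000, 4096   # 1M-token context, assumed hidden width
ratio = attention_flops(n, d) / ssm_flops(n, d)
print(f"attention costs ~{ratio:,.0f}x the FLOPs at 1M tokens")
```

At short contexts the gap is negligible, which is why interleaving some full-attention layers for reasoning quality costs little; at 1M tokens the Mamba layers are doing the heavy lifting.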
One of the more notable aspects of Nemotron 3 Super is how NVIDIA handled the release. You can download the weights, self-host, fine-tune, and commercialize. The training documentation is published. This is the kind of openness that actually matters for developers - not just a model card and an API endpoint, but the full package that lets you build on top of it.
NVIDIA positions this as a balance between openness and capability. Many open models sacrifice intelligence for permissive licensing, or gate the best checkpoints behind restrictive terms. Nemotron 3 Super ships competitive benchmarks alongside genuinely permissive access. For teams evaluating sub-250B models for production use, that combination narrows the field significantly.
The model is available today through several channels. Perplexity has it integrated. Hugging Face hosts the weights for self-hosting. Major cloud providers offer managed inference. NVIDIA's own developer tools and build platform provide direct access for testing before you commit to infrastructure.
Benchmark results show improved throughput and coding performance versus prior Nemotron releases and other models in the sub-250B class. The latent MoE architecture pays off most visibly in multi-user scenarios - the compressed expert routing means you serve more concurrent requests before hitting memory or compute ceilings. For teams running inference at scale, the 12B active parameter footprint per token translates directly to lower cost per query while maintaining the quality of a much larger model.
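The cost claim is simple arithmetic. Treating per-token compute as roughly proportional to active parameters (a standard approximation of ~2 FLOPs per active weight), the sparse model does about a tenth of the work of a dense model the same size:

```python
# Back-of-the-envelope serving math from the stated parameter counts.
total_params, active_params = 120e9, 12e9

active_frac = active_params / total_params        # fraction of weights used per token
flops_moe   = 2 * active_params                   # ~2 FLOPs per active parameter
flops_dense = 2 * total_params                    # hypothetical dense 120B baseline

print(active_frac, flops_dense / flops_moe)       # 0.1, 10x cheaper per token
```

The full 120B weights still have to sit in memory, so the savings show up in compute per query and concurrent throughput, not in the memory footprint.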
Check out the full breakdown in the video above, or grab the weights from Hugging Face and try it yourself.