TL;DR
NVIDIA's Nemotron 3 Super combines a latent mixture-of-experts design with a hybrid Mamba architecture: 120B total parameters, about 12B active per token, a 1M-token context window, and up to 4x more experts at the same compute cost.
NVIDIA released Nemotron 3 Super, and the architecture is worth paying attention to. It is a 120B-parameter mixture-of-experts model, but only about 12B parameters are active per token; that ratio alone makes it interesting for inference costs. What sets it apart from standard MoE is the "latent" approach: instead of routing raw token representations to experts, the model first compresses them into a smaller latent representation. Experts process these compressed inputs, so you can run up to four times more experts at the same computational cost as a traditional MoE setup.
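To make the routing idea concrete, here is a minimal NumPy sketch of latent MoE. Everything here is illustrative: the dimensions, the 4x compression ratio, the top-2 routing, and all weight shapes are assumptions for the example, not NVIDIA's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent = 1024, 256      # assumed 4x compression into latent space
n_experts, top_k = 16, 2           # assumed expert count and routing fan-out

# Hypothetical weights, randomly initialized for illustration
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compress tokens
W_up   = rng.standard_normal((d_latent, d_model)) * 0.02   # expand back
router = rng.standard_normal((d_latent, n_experts)) * 0.02
experts = rng.standard_normal((n_experts, d_latent, d_latent)) * 0.02

def latent_moe(x):
    """x: (tokens, d_model) -> (tokens, d_model)."""
    z = x @ W_down                                 # route AND compute in latent space
    logits = z @ router
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k experts per token
    w = np.take_along_axis(logits, top, axis=-1)
    w = np.exp(w - w.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                  # softmax over selected experts
    out = np.zeros_like(z)
    for t in range(x.shape[0]):
        for k in range(top_k):
            out[t] += w[t, k] * (z[t] @ experts[top[t, k]])
    return out @ W_up                              # back to model width

y = latent_moe(rng.standard_normal((4, d_model)))
print(y.shape)   # (4, 1024)
```

Because each expert operates on `d_latent`-sized vectors rather than `d_model`-sized ones, per-expert FLOPs shrink with the compression factor, which is what leaves headroom to run more experts at the same overall cost.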
The other architectural piece is the hybrid Mamba integration. NVIDIA blends transformer attention layers with Mamba state-space layers, getting transformer-quality reasoning with Mamba's linear scaling on long sequences. The result is a model that handles its full 1M token context window efficiently, especially in multi-user serving scenarios where throughput matters more than single-request latency.
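The long-context argument comes down to scaling: attention cost grows with the square of sequence length, while a state-space scan grows linearly. The back-of-the-envelope comparison below uses assumed numbers (hidden width, state size) purely to show the shape of the curves, not Nemotron's real layer configuration.

```python
# Rough FLOP comparison: quadratic attention vs. linear state-space scan.
# All constants here are order-of-magnitude assumptions for illustration.

def attention_flops(seq_len, d_model):
    # QK^T and the attention-weighted sum each cost ~seq_len^2 * d_model
    return 2 * seq_len**2 * d_model

def ssm_flops(seq_len, d_model, d_state=16):
    # A linear recurrence touches each token once: ~seq_len * d_model * d_state
    return 2 * seq_len * d_model * d_state

n, d = 1_000_000, 4096   # 1M-token context, assumed hidden width
ratio = attention_flops(n, d) / ssm_flops(n, d)
print(f"attention costs ~{ratio:,.0f}x the FLOPs at 1M tokens")
```

At short contexts the gap is negligible, which is why interleaving some full-attention layers for reasoning quality costs little; at 1M tokens the Mamba layers are doing the heavy lifting.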
One of the more notable aspects of Nemotron 3 Super is how NVIDIA handled the release. You can download the weights, self-host, fine-tune, and commercialize. The training documentation is published. This is the kind of openness that actually matters for developers - not just a model card and an API endpoint, but the full package that lets you build on top of it.
NVIDIA positions this as a balance between openness and capability. Many open models sacrifice intelligence for permissive licensing, or gate the best checkpoints behind restrictive terms. Nemotron 3 Super ships competitive benchmarks alongside genuinely permissive access. For teams evaluating sub-250B models for production use, that combination narrows the field significantly.
The model is available today through several channels. Perplexity has it integrated. Hugging Face hosts the weights for self-hosting. Major cloud providers offer managed inference. NVIDIA's own developer tools and build platform provide direct access for testing before you commit to infrastructure.
Benchmark results show improved throughput and coding performance versus prior Nemotron releases and other models in the sub-250B class. The latent MoE architecture pays off most visibly in multi-user scenarios - the compressed expert routing means you serve more concurrent requests before hitting memory or compute ceilings. For teams running inference at scale, the 12B active parameter footprint per token translates directly to lower cost per query while maintaining the quality of a much larger model.
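The cost claim is simple arithmetic. Treating per-token compute as roughly proportional to active parameters (a standard approximation of ~2 FLOPs per active weight), the sparse model does about a tenth of the work of a dense model the same size:

```python
# Back-of-the-envelope serving math from the stated parameter counts.
total_params, active_params = 120e9, 12e9

active_frac = active_params / total_params        # fraction of weights used per token
flops_moe   = 2 * active_params                   # ~2 FLOPs per active parameter
flops_dense = 2 * total_params                    # hypothetical dense 120B baseline

print(active_frac, flops_dense / flops_moe)       # 0.1, 10x cheaper per token
```

The full 120B weights still have to sit in memory, so the savings show up in compute per query and concurrent throughput, not in the memory footprint.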
Check out the full breakdown in the video above, or grab the weights from Hugging Face and try it yourself.