TL;DR
NVIDIA's Nemotron Nano 9B V2 delivers something rare: a small language model that doesn't trade capability for speed. This 9B-parameter model outperforms Qwen3-8B across instruction following, math, science, coding, and tool use - while delivering up to 6.3x higher throughput.
The secret is a hybrid architecture combining Mamba 2 with transformer layers. Four attention layers handle the heavy reasoning lifting, while MLP layers and the Mamba state space model handle everything else. You get transformer accuracy with Mamba speed.
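To make the hybrid idea concrete, here is a toy sketch of such a layer stack: a few attention layers interleaved with Mamba-2 and MLP blocks. The layer count and attention positions below are invented for illustration - the real pattern lives in the model config on HuggingFace.

```python
# Illustrative hybrid layer stack: mostly Mamba-2 and MLP blocks, with a
# small number of attention layers mixed in. Counts and positions are
# placeholders, not the model's actual configuration.

def hybrid_pattern(total: int = 56,
                   attention_positions=(14, 28, 42, 55)) -> list[str]:
    """Return a layer-type sequence with exactly four attention layers."""
    return [
        "attention" if i in attention_positions
        else ("mamba2" if i % 2 == 0 else "mlp")
        for i in range(total)
    ]

stack = hybrid_pattern()
# Attention stays rare: the state-space and MLP layers do the bulk of the
# work, which is where the throughput advantage comes from.
```

The design intuition: attention cost grows with context length, while Mamba's state-space scan is linear, so keeping attention layers scarce preserves speed at long contexts.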

At 9B parameters, this model lands in a sweet spot. It runs on consumer hardware - your gaming GPU can handle it. The edge deployment story actually works here.
NVIDIA released more than just model weights. The NeMo pre-training dataset V1 is available on HuggingFace, giving you the foundation data if you want to build derivatives. The model itself is on HuggingFace with a permissive license, or you can test it immediately on build.nvidia.com.
Training leveraged Megatron LM and NeMo for reinforcement learning. The model supports six languages: English, German, Spanish, French, Italian, and Japanese - improved through cross-pollination with the Qwen ecosystem.
Most reasoning models force you into their pace. Nemotron Nano gives you control through system prompts. Tag hard questions with /think to engage full reasoning, or use /no_think for instant responses on simple queries.
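The toggle can be sketched as a small helper that sets the system prompt. The `/think` and `/no_think` controls come from the model card; how you send the messages (e.g. via NVIDIA's OpenAI-compatible API) is up to your client.

```python
# Minimal sketch of toggling Nemotron Nano's reasoning mode through the
# system prompt, per the model card's /think and /no_think controls.

def make_messages(question: str, think: bool) -> list[dict]:
    """Build a chat payload; the system prompt toggles reasoning mode."""
    system = "/think" if think else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Hard question: engage full reasoning.
hard = make_messages("Prove that sqrt(2) is irrational.", think=True)

# Simple query: skip the reasoning trace for an instant answer.
easy = make_messages("What is the capital of France?", think=False)
```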

The reasoning budget goes deeper. During inference, you can set minimum thinking tokens. Dial it up for AIME 2025 problems - where the model shows dramatic gains - or down for straightforward tasks. The correlation is clear: more thinking tokens yield better results, particularly on MATH-500 where accuracy reaches the mid-90s with sufficient budget.
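A per-request budget policy might look like the sketch below. The field name `min_thinking_tokens` is a placeholder, not a confirmed API parameter, and the model id is assumed - check the model card on build.nvidia.com for the exact knob.

```python
# Sketch of a task-dependent reasoning budget. The "min_thinking_tokens"
# field and the model id are assumptions for illustration only.

def budget_for(task: str) -> int:
    """Toy policy: spend more thinking tokens on harder task classes."""
    budgets = {"aime": 8192, "math500": 4096, "chat": 0}
    return budgets.get(task, 1024)

def make_request(question: str, task: str) -> dict:
    return {
        "model": "nvidia/nemotron-nano-9b-v2",  # assumed model id
        "messages": [{"role": "user", "content": question}],
        # Placeholder field: dial the budget up for AIME-style problems,
        # down (or to zero) for straightforward queries.
        "min_thinking_tokens": budget_for(task),
    }

req = make_request("Find the remainder of 7^100 mod 13.", "aime")
```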
The technical report reveals how NVIDIA evolved their data mixture across three training phases. Phase one was code-heavy with crawled content and academic material. By phase three, the composition shifted dramatically toward STEM, with code and crawled content reduced significantly. This deliberate progression from broad to specialized data likely contributes to the model's strong reasoning performance.

Testing on build.nvidia.com demonstrates both speed and capability. The classic "how many Rs in strawberry" problem - one that tripped up many larger models - gets solved in under a second with full reasoning shown: the model breaks down letter positions, counts occurrences, and returns the correct answer of three.
Tool use works seamlessly. Ask for Harry Potter facts, and the model identifies the need for the character description tool, invokes it with correct arguments, processes the response, and formats five coherent facts. The reasoning trace shows active reflection: "this is actually six points... let me check them more carefully."
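The flow above follows standard OpenAI-style function calling, which is how build.nvidia.com serves the model. The tool name and schema below are illustrative, loosely mirroring the character-description tool in the demo rather than reproducing it.

```python
# Sketch of an OpenAI-style tool definition and the kind of tool call the
# model emits. The tool name and fields are illustrative assumptions.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_character_description",
            "description": "Look up facts about a fictional character.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string",
                             "description": "Character name"},
                },
                "required": ["name"],
            },
        },
    }
]

# A typical tool-call message from the model, with arguments filled in:
tool_call = {
    "name": "get_character_description",
    "arguments": {"name": "Harry Potter"},
}
```

The model's job is to pick the tool, fill the arguments, then turn the tool's response into the final answer - the reflection quoted above happens in that last step.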
With reasoning disabled, ten paragraphs on Mamba architecture generate almost instantly. The model adapts to the constraint rather than forcing unnecessary computation.
Nemotron Nano 9B V2 combines a hybrid Mamba-transformer architecture, controllable reasoning budgets, multilingual support, and openly released weights and training data.
NVIDIA continues to strengthen both sides of the AI equation - hardware dominance plus increasingly capable open-source models. The Nemotron Nano 9B V2 proves you don't need massive parameter counts for serious performance. You need the right architecture and training approach.
