TL;DR
NVIDIA's Nemotron Nano 9B V2 delivers something rare: a small language model that doesn't trade capability for speed. This 9B parameter model outperforms Qwen 3B across instruction following, math,...
Read next
NVIDIA's Nemotron 3 Super combines latent mixture of experts with hybrid Mamba architecture - 120B total parameters, 12B active per token, 1M context, and up to 4x more experts at the same cost.
5 min readNVIDIA's Nemotron Nano 2 VL delivers vision-language capabilities at a fraction of the computational cost. This 12-billion-parameter open-source model processes videos, analyzes documents, and reas...
5 min readDeepSeek's R1 and V3 models deliver frontier-level performance under an MIT license. Here's how to use them through the API, run them locally with Ollama, and decide when they beat closed-source alternatives.
9 min readNVIDIA's Nemotron Nano 9B V2 delivers something rare: a small language model that doesn't trade capability for speed. This 9B parameter model outperforms Qwen 3B across instruction following, math, science, coding, and tool use - while delivering up to 6.3x faster throughput.
For model-selection context, compare this with Claude vs GPT for Coding: Which Model Writes Better TypeScript? and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience; the useful question is not only benchmark quality, but where the model fits in a real developer workflow.
The secret is a hybrid architecture combining Mamba 2 with transformer layers. Four attention layers handle the heavy reasoning lifting, while MLP layers and the Mamba state space model handle everything else. You get transformer accuracy with Mamba speed.

At 9B parameters, this model lands in a sweet spot. It runs on consumer hardware - your gaming GPU can handle it. The edge deployment story actually works here.
NVIDIA released more than just model weights. The NeMo pre-training dataset V1 is available on HuggingFace, giving you the foundation data if you want to build derivatives. The model itself is on HuggingFace with a permissive license, or you can test it immediately on build.nvidia.com.
Training leveraged Megatron LM and NeMo for reinforcement learning. The model supports six languages: English, German, Spanish, French, Italian, and Japanese - improved through cross-pollination with the Qwen ecosystem.
Get the weekly deep dive
Tutorials on Claude Code, AI agents, and dev tools - delivered free every week.
From the archive
Most reasoning models force you into their pace. Nemotron Nano gives you control through system prompts. Tag hard questions with /think to engage full reasoning, or use /no_think for instant responses on simple queries.

The reasoning budget goes deeper. During inference, you can set minimum thinking tokens. Dial it up for AIME 2025 problems - where the model shows dramatic gains - or down for straightforward tasks. The correlation is clear: more thinking tokens yield better results, particularly on MATH-500 where accuracy reaches the mid-90s with sufficient budget.
The technical report reveals how NVIDIA evolved their data mixture across three training phases. Phase one was code-heavy with crawled content and academic material. By phase three, the composition shifted dramatically toward STEM, with code and crawled content reduced significantly. This deliberate progression from broad to specialized data likely contributes to the model's strong reasoning performance.

Testing on build.nvidia.com demonstrates both speed and capability. The classic "how many Rs in strawberry" problem - one that tripped up many larger models - gets solved in under a second with full reasoning shown: the model breaks down letter positions, counts occurrences, and returns the correct answer of three.
Tool use works seamlessly. Ask for Harry Potter facts, and the model identifies the need for the character description tool, invokes it with correct arguments, processes the response, and formats five coherent facts. The reasoning trace shows active reflection: "this is actually six points... let me check them more carefully."
With reasoning disabled, ten paragraphs on Mamba architecture generate almost instantly. The model adapts to the constraint rather than forcing unnecessary computation.
Nemotron Nano 9B V2 combines:
NVIDIA continues to strengthen both sides of the AI equation - hardware dominance plus increasingly capable open-source models. The Nemotron Nano 9B V2 proves you don't need massive parameter counts for serious performance. You need the right architecture and training approach.
Technical content at the intersection of AI and development. Building with AI agents, Claude Code, and modern dev tools - then showing you exactly how it works.
Open-source ChatGPT alternative that runs 100% offline. Desktop app with local models, cloud API connections, custom ass...
View ToolThe easiest way to run LLMs locally. One command to pull and run any model. OpenAI-compatible API. 52M+ monthly download...
View ToolDesktop app for discovering, downloading, and running local LLMs. Clean chat UI, OpenAI-compatible API server, and autom...
View ToolPrivate local AI chatbot by Nomic. 250K+ monthly users, 65K GitHub stars. LocalDocs feature lets you chat with your own...
View ToolInstall Ollama and LM Studio, pull your first model, and run AI locally for coding, chat, and automation - with zero cloud dependency.
Getting StartedA concrete step-by-step guide to moving your development workflow from Cursor to Claude Code - settings, rules, keybindings, and the habits that transfer.
Getting StartedPath-specific rules that only load for matching files.
Claude Code
NVIDIA's Nemotron 3 Super combines latent mixture of experts with hybrid Mamba architecture - 120B total parameters, 12B...

NVIDIA's Nemotron Nano 2 VL delivers vision-language capabilities at a fraction of the computational cost. This 12-billi...

DeepSeek's R1 and V3 models deliver frontier-level performance under an MIT license. Here's how to use them through the...

Meta's Llama 4 family brings mixture-of-experts to open source with Scout and Maverick. Here's how to run them locally,...

Microsoft's PHI-4 is an MIT-licensed 14 billion parameter model that matches Llama 3.3 70B and Qwen 2.5 72B on key bench...

Alibaba's newest Qwen release claims flagship-level coding in a 27B dense model. Here is why dense matters, where it fit...

New tutorials, open-source projects, and deep dives on coding agents - delivered weekly.