TL;DR
Inception Labs shipped the first reasoning model built on diffusion instead of autoregressive generation. Over 1,000 tokens per second, competitive benchmarks, and a fundamentally different approach to how AI generates text.
Every LLM you use today is a typewriter. One token at a time, left to right, each keystroke permanent. If the reasoning drifts early, tough luck. It can only move forward.
Mercury 2 is an editor. It starts with a rough draft and sharpens the whole thing with each pass. And it does this at over 1,000 tokens per second.
Inception Labs just shipped the first reasoning model built on diffusion instead of autoregressive generation. The same fundamental approach that already won in image and video generation, now applied to language. And the results are real.
Remember when Groq hit the scene? Raw inference speed got everyone excited. But the models that could run that fast were limited. They couldn't do tool calling well. They struggled with complex reasoning. Lower benchmark scores across the board. Speed at a real cost.
The entire industry has been racing to solve this since. OpenAI, NVIDIA, Fireworks, Baseten. Billions spent on better hardware, better kernels, quantization, distillation. Real gains, but all incremental. Everyone squeezing more out of the same autoregressive paradigm.
Mercury 2 took a different path. The speed comes from the model itself, not infrastructure optimization.

Autoregressive generation: token one locks before token two begins. Sequential. Permanent. If you make a mistake early, it cascades through everything that follows.
Diffusion generation: start with noise, iteratively refine the entire output in parallel. Multiple tokens per forward pass. Built-in error correction because the model revisits and refines as it goes.
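The contrast can be sketched with a toy example. This is not the actual denoising math, just the scheduling difference: an autoregressive decoder commits one position per step, strictly left to right, while a diffusion-style decoder starts fully masked and resolves several positions per pass, in any order.

```python
import random

random.seed(0)
TARGET = list("hello world")  # stand-in for the "correct" output tokens

def autoregressive(target):
    """Commit one token per step, strictly left to right."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)
        steps += 1
    return "".join(out), steps

def diffusion_style(target, per_step=4):
    """Start fully masked; each pass resolves several positions in
    parallel, in any order -- a toy stand-in for iterative denoising.
    (A real model predicts the tokens; here we just copy the target.)"""
    out, steps = ["_"] * len(target), 0
    while "_" in out:
        masked = [i for i, t in enumerate(out) if t == "_"]
        for i in random.sample(masked, min(per_step, len(masked))):
            out[i] = target[i]
        steps += 1
    return "".join(out), steps

print(autoregressive(TARGET))    # 11 steps for 11 tokens
print(diffusion_style(TARGET))   # 3 passes for the same 11 tokens
```

Same output, a fraction of the sequential steps. That's the whole speed argument in miniature: fewer forward passes for the same number of tokens.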
This is actually closer to how humans think. You don't reason word by word. You hold the whole idea, draft, revise, reconsider, then commit. CMU researchers found in September 2025 that diffusion models are "significantly more robust to data repetition" than autoregressive models, especially in data-constrained settings. The academic community is taking this architecture seriously: the LLaDA paper introduced diffusion as a viable alternative to autoregressive text generation and has been gaining traction.
The throughput numbers tell the story:
| Model | Output Throughput |
|---|---|
| Mercury 2 | 1,008 tok/s |
| Claude Haiku 4.5 | ~89 tok/s |
| GPT-5 mini | ~71 tok/s |
That's more than 10x the throughput. On reasoning tasks specifically, Mercury 2 is about 5x faster than speed-optimized autoregressive models.
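Ignoring time-to-first-token and network overhead, those rates translate directly into wall-clock time. A rough back-of-the-envelope for an arbitrary 1,500-token answer:

```python
# Throughput figures from the table above; 1,500 tokens is an
# illustrative response length, not a benchmark.
rates = {"Mercury 2": 1008, "Claude Haiku 4.5": 89, "GPT-5 mini": 71}
tokens = 1500

for name, tok_per_s in rates.items():
    print(f"{name}: {tokens / tok_per_s:.1f}s")
# Mercury 2 finishes in ~1.5s; the others take ~17s and ~21s
```

A second and a half versus twenty seconds is the difference between an interaction and a wait.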
Speed without quality is just fast garbage. Mercury 2 holds up:
| Benchmark | Mercury 2 | GPT-5 mini |
|---|---|---|
| AIME 2025 | 91.1 | 91.1 |
| GPQA | 73.6 | Competitive |
| LiveCodeBench | 67.3 | Competitive |
| IFBench | 71.3 | -- |
| SciCode | 38.4 | -- |
Important context: these comparisons are against speed-optimized models, not frontier models. Mercury 2 plays in the speed + reasoning lane. It's not trying to beat Opus on raw intelligence. It's trying to give you reasoning-grade quality at speeds that unlock entirely new application patterns.
Worth noting: Mercury v1 (early 2025) had real limitations. ACI.dev's beta review flagged hallucination issues and a 16K context ceiling. Mercury 2 is a significant leap: 128K context, native tool use, and tunable reasoning. The gap between v1 and v2 is large enough that early criticism doesn't map cleanly to the current model.

Three use cases where this speed changes what you can build:
Latency compounds across multi-step workflows. Every tool call, every reasoning step adds wait time. In a demo app built for the video, Mercury 2 ran search, scrape, and summarize before most models would finish their first response. Code agents, browser automation, IT triage: more steps, tighter feedback cycles. Skyvern is already using it in production and reports Mercury 2 is "at least twice as fast as GPT-5.2."
p95 latency determines if a voice interface feels natural or robotic. Support agents, voice bots, real-time translation. When you need reasoning inside tight SLAs, speed isn't a nice-to-have. Companies like Wispr Flow (real-time transcript cleanup), OpenCall (voice agents), and Happyverse AI (real-time voice/video avatars) are already shipping with Mercury under the hood.
The prompt-review-tweak loop. Rapid succession iteration. The faster the model responds, the more you stay in flow. Zed, the code editor, integrated Mercury and described it as "suggestions land fast enough to feel like part of your own thinking." JetBrains published research arguing diffusion models "better reflect how developers think" because they edit and refine rather than writing left-to-right.
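To see why the first use case compounds, sketch a multi-step agent loop with illustrative numbers (nothing here is measured from Mercury 2 or any specific framework): 300 output tokens per step, a fixed half second of tool/network overhead per step, six steps total.

```python
# Illustrative numbers only -- not measurements from any real workload.
steps = 6             # tool calls + reasoning steps in the workflow
tokens_per_step = 300
tool_seconds = 0.5    # fixed tool/network overhead per step

def workflow_seconds(tok_per_s):
    """Total wall-clock time: generation time plus tool time, per step."""
    return steps * (tokens_per_step / tok_per_s + tool_seconds)

print(f"at 1008 tok/s: {workflow_seconds(1008):.1f}s total")
print(f"at   89 tok/s: {workflow_seconds(89):.1f}s total")
```

Under these assumptions the fast model finishes the whole six-step workflow in under five seconds while the slow one takes over twenty; every extra step widens the gap.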
Mercury 2 is OpenAI API compatible. Swap the base URL, model string, and API key. Works with any framework that supports OpenAI's format.
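In practice the swap looks like this. A stdlib-only sketch of the request shape; the base URL and model string below are assumptions, so confirm both against Inception Labs' docs. The payload itself is the standard OpenAI chat-completions format, which is why existing clients work unchanged.

```python
import json
import os
import urllib.request

# Assumed values -- verify against Inception Labs' documentation.
BASE_URL = "https://api.inceptionlabs.ai/v1"   # was https://api.openai.com/v1
MODEL = "mercury"                              # was e.g. "gpt-5-mini"
API_KEY = os.environ.get("INCEPTION_API_KEY", "sk-placeholder")

# Standard OpenAI chat-completions payload -- nothing Mercury-specific.
payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Explain diffusion LLMs in one line."}
    ],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
print(req.full_url)  # only the three values at the top differ from an OpenAI call
# urllib.request.urlopen(req) would send it; any OpenAI-compatible SDK
# works the same way by pointing its base_url at BASE_URL.
```

The same three-value swap applies to LangChain, the official OpenAI SDKs, or anything else that speaks OpenAI's format.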
Pricing is aggressive too, making Mercury 2 one of the most cost-competitive reasoning models available. For high-volume agent workloads where you're making hundreds of calls per session, the economics are compelling.

Inception Labs isn't a random startup. CEO Stefano Ermon is a Stanford CS associate professor who co-authored DDIM (the denoising method powering Stable Diffusion and Midjourney). His co-founders Aditya Grover (UCLA) and Volodymyr Kuleshov (Cornell) are both former students. The team includes veterans from DeepMind, Meta, OpenAI, Microsoft, and HashiCorp.
Backed by $50M from Menlo Ventures, M12 (Microsoft), NVentures (NVIDIA), Snowflake Ventures, and Databricks. Individual investors include Andrew Ng and Andrej Karpathy. Fortune 100 companies (unnamed) are already running Mercury in production. Available on Azure AI Foundry.
The people who proved diffusion works for pixels are now proving it works for tokens.
Whether diffusion becomes the future of how all LLMs work is an open question. But the trajectory is clear. Autoregressive generation has a fundamental speed ceiling that no amount of hardware can fully overcome. Diffusion solves that at the model level.
Mercury 2 is the proof point. Fast enough to change what you can build. Cheap enough to actually use at scale. And backed by the people who literally wrote the math.

Try it yourself:
This article is based on a Developers Digest video sponsored by Inception Labs. All technical claims are sourced from third-party benchmarks and direct testing.