
TL;DR
OpenAI is sunsetting the Assistants API in 2026. Here is a tested migration plan to the Responses API - code, state, threads, tools, every cliff I hit, in order.
OpenAI confirmed the Assistants API sunset in the developer changelog: new endpoints frozen now, full shutdown in 2026. Threads, runs, run-steps, and the assistant resource itself all go away. Files and vector stores survive (they moved into the Responses API surface). Function calling survives but the schema is slightly different. The Code Interpreter and File Search tools survive as built-in tools on Responses.
For the design side of the same problem, read OpenAI Codex: Cloud AI Coding With GPT-5.3 and OpenAI vs Anthropic in 2026 - Models, Tools, and Developer Experience; they show how agent-generated interfaces fail and how to give coding agents better visual constraints.
If you are running production code against client.beta.threads.* today, you have homework. I had a 14-month-old Assistants codebase running newsletter automation, customer support triage, and a chunk of internal ops. Last weekend I migrated all of it. This is the field guide - every cliff I hit, in order, with the code diffs that worked.
For the visual walkthrough including the eval harness I used to gate the cutover, see the DevDigest YouTube channel.
The Assistants API was server-stateful. You created a thread, posted messages, kicked off runs, polled for completion, and OpenAI held the conversation history. Your code did not own the state.
The Responses API is client-stateful by default, server-stateful by opt-in. Each call returns a response.id. You pass previous_response_id on the next call to get continuity. The server stores the chain for 30 days. After that, you reconstruct from your own DB or pass the message array explicitly.
This is the right design - server-only state was a footgun for compliance, debugging, and multi-region - but it changes how you think about every conversation:
| Assistants | Responses |
|---|---|
| threads.create() | nothing - just call responses.create |
| threads.messages.create() | include in input array |
| runs.create() + poll | responses.create() returns synchronously or streams |
| run.required_action | response.required_action (similar but flatter) |
| assistants.create() | prompts + system messages + tools per call |
The big mental shift: there is no assistant object anymore. The "assistant" is your prompt template + tool list + model config, which you supply per call. This is why I version mine in Promptlock - the prompt is now a first-class artifact in your repo, not a row in OpenAI's database.
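Promptlock is one way to do that, but even a plain module in your repo gets you the property that matters: the assistant config becomes diffable and reviewable. A minimal sketch, with every name here mine rather than from any SDK:
// prompts/supportTriage.js - the "assistant" is now a repo artifact
export const supportTriage = {
  version: "2026-05-01", // bump on every prompt change so evals can pin it
  model: "gpt-5.5",
  instructions: "You are a support triage agent. Classify, route, draft replies.",
  tools: [], // your function tool definitions
};
// per call: const { version, ...config } = supportTriage;
// await client.responses.create({ ...config, input: userMessage });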
Here is the minimal-diff before/after for a single conversation turn. The "before" is the standard Assistants pattern most of us wrote in 2024:
// BEFORE - Assistants API
const thread = await client.beta.threads.create();
await client.beta.threads.messages.create(thread.id, {
role: "user",
content: userMessage,
});
const run = await client.beta.threads.runs.createAndPoll(thread.id, {
assistant_id: ASSISTANT_ID,
});
const messages = await client.beta.threads.messages.list(thread.id);
const reply = messages.data[0].content[0].text.value;
// AFTER - Responses API
const response = await client.responses.create({
model: "gpt-5.5",
instructions: SYSTEM_PROMPT,
input: userMessage,
tools: TOOLS,
previous_response_id: priorResponseId, // null on first turn
store: true, // 30-day server retention
});
const reply = response.output_text;
const newResponseId = response.id; // persist for next turn
The "after" version is shorter, synchronous on the happy path, and the conversation chain lives in two places you control: your DB row (the response.id) and your prompt repo (SYSTEM_PROMPT).
Managing conversation state is where I lost the most time. Three patterns I now use:
Pattern 1: Short-lived chains (default). Persist previous_response_id against your conversation row. On each turn, pass it. Trust OpenAI's 30-day retention. This is what most apps want.
await db.conversation.update({
where: { id: convId },
data: { lastResponseId: response.id },
});
Pattern 2: Long-lived or compliance-bound chains. Do not rely on server retention. Store every message in your DB and pass them explicitly:
const response = await client.responses.create({
model: "gpt-5.5",
instructions: SYSTEM_PROMPT,
input: messages.map((m) => ({ role: m.role, content: m.content })),
store: false, // do not retain server-side
});
Pattern 3: Hybrid. Short-lived state via previous_response_id, but you also write every input/output to your DB for replay and eval purposes. This is what I run in production. It is the only pattern that gives you both ergonomic continuity and full-control debugging.
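Here is a sketch of that hybrid turn. It assumes the db.conversation row from Pattern 1 plus a db.turn history table; both are my schema, not anything the SDK provides:
// Pattern 3 - chain via previous_response_id, mirror every turn to your DB
async function hybridTurn(convId, userMessage) {
  const conv = await db.conversation.findUnique({ where: { id: convId } });
  const response = await client.responses.create({
    model: "gpt-5.5",
    instructions: SYSTEM_PROMPT,
    input: userMessage,
    previous_response_id: conv.lastResponseId, // null on first turn
    store: true,
  });
  // the mirror write is what makes replay and evals possible later
  await db.turn.createMany({
    data: [
      { convId, role: "user", content: userMessage },
      { convId, role: "assistant", content: response.output_text },
    ],
  });
  await db.conversation.update({
    where: { id: convId },
    data: { lastResponseId: response.id },
  });
  return response.output_text;
}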
The cliff I hit: I assumed previous_response_id would still work after 31 days. It does not - the server returns a 404. Wrap every call in a fallback that reconstructs from your DB if the chain is missing.
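The fallback looks roughly like this - the 404 check is the load-bearing part, and db.turn is the same hypothetical history table from the Pattern 3 sketch:
// fall back to explicit history when the server-side chain has expired
async function safeTurn(convId, userMessage) {
  const conv = await db.conversation.findUnique({ where: { id: convId } });
  try {
    return await client.responses.create({
      model: "gpt-5.5",
      instructions: SYSTEM_PROMPT,
      input: userMessage,
      previous_response_id: conv.lastResponseId,
      store: true,
    });
  } catch (err) {
    if (err.status !== 404) throw err; // only handle the expired-chain case
    const history = await db.turn.findMany({
      where: { convId },
      orderBy: { createdAt: "asc" },
    });
    return await client.responses.create({
      model: "gpt-5.5",
      instructions: SYSTEM_PROMPT,
      input: [
        ...history.map((m) => ({ role: m.role, content: m.content })),
        { role: "user", content: userMessage },
      ],
      store: true, // starts a fresh chain for future turns
    });
  }
}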
Function calling works, with a flatter schema. The tools array is the same shape. The big differences:
- code_interpreter and file_search are now first-class tools you enable per call. No more attaching them to an assistant. A sketch of the per-call wiring follows this list.
- requires_action now hangs off the response instead of the run, with a flatter shape (see the mapping table above).
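Per-call enablement looks roughly like this. The file_search and code_interpreter shapes follow the Responses docs as I know them, but verify the container options against your SDK version; VECTOR_STORE_ID and TOOLS are stand-ins for your own values:
// built-in tools are enabled per call, next to your own function tools
const response = await client.responses.create({
  model: "gpt-5.5",
  instructions: SYSTEM_PROMPT,
  input: userMessage,
  tools: [
    { type: "file_search", vector_store_ids: [VECTOR_STORE_ID] },
    { type: "code_interpreter", container: { type: "auto" } },
    ...TOOLS, // your function tools, unchanged
  ],
});
Here is the parallel-tool gotcha. In Assistants, this code was safe: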
// Assistants - implicit serial
const outputs = [];
for (const call of run.required_action.submit_tool_outputs.tool_calls) {
  outputs.push({
    tool_call_id: call.id,
    output: await runTool(call), // safe, one at a time
  });
}
In Responses, the model now expects you to handle multiple tool calls concurrently. If runTool is not idempotent or hits a rate-limited downstream, batch your calls or Promise.all them with a concurrency cap:
import pLimit from "p-limit";
const limit = pLimit(3);
const outputs = await Promise.all(
response.required_action.submit_tool_outputs.tool_calls.map((call) =>
limit(() => runTool(call))
)
);
I missed this on my first migration. The customer-support agent fired four parallel ticket-update calls to a legacy CRM and got rate-limited into oblivion within an hour.
The migration is mechanical but the behavior is not always identical. Different default temperatures, different tool-call patterns, different message-formatting quirks. I would not cut over without a regression eval.
My harness: a flag-gated rollout where 10% of traffic goes to Responses, 90% to Assistants, both runs are logged with the same input, and a nightly job scores the diffs. I open-sourced the bones of this as Agent Eval Bench - input replay, output diff, automated grading via a stronger model.
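The router itself is small. In this sketch, hashToPercent, assistantsTurn, responsesTurn, and logForEval are my stand-ins for whatever flag and logging plumbing you already run:
// flag-gated split - sticky per conversation so a chain stays on one API
async function routedTurn(convId, userMessage) {
  const useResponses = hashToPercent(convId) < 10; // 10% to Responses
  const output = useResponses
    ? await responsesTurn(convId, userMessage)
    : await assistantsTurn(convId, userMessage);
  // both arms log the same shape; the nightly job diffs and grades them
  await logForEval({
    convId,
    api: useResponses ? "responses" : "assistants",
    input: userMessage,
    output,
  });
  return output;
}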
The cutover schedule that worked for me: build the Responses path behind the flag, run the 10% split until the nightly evals come back clean, cut the stateless endpoints over first, migrate the long-lived stateful chains last, and keep @deprecated comments on the old Assistants call sites for one more month, then delete. Burn-down looked roughly like this in my logs:
Day 1: 47 endpoints calling Assistants
Day 7: 47 (built path, no traffic yet)
Day 9: 47 → 47 (10% rollout, both alive)
Day 14: 47 → 12 (cut the safe ones, kept stateful chains on assistants)
Day 21: 12 → 3 (long-lived chain edge cases)
Day 28: 0
The last three were the long-lived stateful chains where I needed Pattern 2 above (explicit history). They took longer because I had to backfill DB writes for conversations that had been server-stateful for months.
If you are starting this migration now, in priority order:
- Audit your runTool implementations for shared mutable state. The parallel-by-default behavior will find every race condition you have - a cheap fix is sketched after this list.
- OpenAI gave us through 2026, which sounds generous until you remember every other library you depend on is also moving. Do not be the one team migrating in October.
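The race audit usually ends in an idempotency key. A tool call's id is stable across retries, so deduping on it means parallel dispatch cannot double-apply a write. A minimal in-memory sketch - swap the Map for a DB unique constraint in production:
// dedupe tool execution by tool_call id
const inFlight = new Map();
function runToolOnce(call) {
  let pending = inFlight.get(call.id);
  if (!pending) {
    pending = runTool(call); // your existing executor
    inFlight.set(call.id, pending);
  }
  return pending;
}
Swap runToolOnce in for runTool inside the p-limit snippet above and the double-apply class of races disappears.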
The Responses API is the better primitive. It is simpler, more honest about state, and the streaming model finally feels native. The migration is a weekend of work for a small codebase and two weeks for a complex one. Worth it.