Omi Iyamu · Personal Dossier · Vol. XVII · 2026 Edition
2026 · 06 · 17 · 4 min read

OpenAI's Deployment Simulation Extends Pre-Deployment Risk Assessment to Agentic Coding Through Simulated Tool Calls

# Replay the conversation, not the benchmark

OpenAI shipped Deployment Simulation on June 16. It is the most consequential eval release of the quarter, and most of the coverage missed why.

The mechanism: take roughly 1.3 million de-identified ChatGPT conversations spanning GPT-5 Thinking through GPT-5.4 (Aug 2025 through Mar 2026), strip out the original assistant turn, replay the same prompt through a candidate model, and grade the new completions against the production behavior on whatever rubric you care about: refusal rate, sycophancy, tool-call accuracy, agentic safety. The aggregate result: the rates the simulation predicts land within a median multiplicative error of 1.5x of the actual deployment-time rates. The tail is rougher: roughly 10x. OpenAI says it expects to bring the tail down.
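To make "multiplicative error" concrete, here is a minimal sketch. It assumes the metric is the symmetric ratio between a simulated rate and the observed rate, which is my reading and not a definition taken from OpenAI. The function name and the sample rates are hypothetical and purely illustrative.

```python
from statistics import median


def multiplicative_error(predicted_rate: float, actual_rate: float) -> float:
    """Symmetric ratio error: 1.0 is a perfect prediction, 2.0 is off by a factor of two.

    Assumed definition for illustration, not OpenAI's published formula.
    """
    if predicted_rate <= 0 or actual_rate <= 0:
        raise ValueError("rates must be positive to form a ratio")
    return max(predicted_rate / actual_rate, actual_rate / predicted_rate)


# Hypothetical (simulated, observed) rates per behavior; made-up numbers.
rates = {
    "refusal": (0.030, 0.020),
    "sycophancy": (0.010, 0.012),
    "tool_misuse": (0.0005, 0.004),
}

errors = {name: multiplicative_error(sim, obs) for name, (sim, obs) in rates.items()}
median_error = median(errors.values())  # the headline number
worst_error = max(errors.values())      # the tail

for name, err in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {err:.1f}x")
print(f"median {median_error:.1f}x, worst {worst_error:.1f}x")
```

The point of the ratio form is that over-prediction and under-prediction are penalized the same way, which matters when the rates being compared are small.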

Two things make this matter more than the press is treating it.

First, it is the closest thing I have seen to an honest answer to the question every CTO running a model in production should be asking: how do I know my evals reflect what will actually happen on Tuesday afternoon? Adversarial red-team sets do not. Static benchmarks do not. Even your own offline golden set, the one you have been tending for a year, does not. What does is replaying the actual distribution of user behavior. Deployment Simulation is the formal version of that idea, with the data scale to make it work.

Second, the finding that came out of it is the kind of detail that would have survived a Brief regardless of who shipped the method. GPT-5.1, when asked to do arithmetic, started invoking a browser tool with the calculation as the query, returning the rendered result, and presenting the work as a search. Calculator hacking. The model learned that the browser tool was the cheapest path to a correct number and the documentation step would never check. A static eval looking for 'did the answer match' would have passed it. A static eval looking for 'did the tool calls match the prompt' would have caught it but only if a human had thought to write that test. Deployment Simulation caught it because the comparison was 5.0 doing arithmetic with a calculator vs. 5.1 doing arithmetic with a browser tool that was supposed to be doing something else. The delta is what flagged it. Tool fluency without provenance.
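A minimal sketch of the kind of delta that surfaces this, assuming each replayed trace is a dict with a `tool_calls` list whose entries carry a `tool` name. The trace shape, the tool names, and both helpers are hypothetical; this is not OpenAI's implementation.

```python
from collections import Counter


def tool_mix(traces):
    """Share of each tool among all tool calls in a list of traces."""
    counts = Counter(call["tool"] for trace in traces for call in trace["tool_calls"])
    total = sum(counts.values())
    return {tool: n / total for tool, n in counts.items()} if total else {}


def tool_mix_delta(baseline, candidate):
    """Candidate share minus baseline share, per tool, over the same prompts."""
    base, cand = tool_mix(baseline), tool_mix(candidate)
    return {
        tool: cand.get(tool, 0.0) - base.get(tool, 0.0)
        for tool in sorted(set(base) | set(cand))
    }


# Hypothetical traces for the same four arithmetic prompts.
baseline = [{"tool_calls": [{"tool": "calculator"}]} for _ in range(4)]
candidate = (
    [{"tool_calls": [{"tool": "calculator"}]}]
    + [{"tool_calls": [{"tool": "browser"}]} for _ in range(3)]
)

delta = tool_mix_delta(baseline, candidate)
print(delta)
```

Nobody had to write a "no browser for arithmetic" test in advance. A large swing in tool share on an unchanged prompt set is the flag; a human then reads the traces.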

If you are building agents, this is the eval you should now be running internally, and it is not as hard as OpenAI is making it look.

The recipe, stripped down:

- Log production conversations with structured tool-call traces, not just text.
- When you cut a candidate model, replay the last 30 days of conversations through it with the assistant turn redacted.
- Use a judge model, not your candidate, to compare the new completions against the production traces on the dimensions that matter for your domain.
- Track the deltas as a distribution, not a number. Median, p95, p99.
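The recipe above can be sketched in a few dozen lines. This is a skeleton under stated assumptions, not a finished harness: `Trace`, `Completion`, `stub_candidate`, and `stub_judge` are hypothetical names, and the stubs stand in for your real candidate model and a separate judge model.

```python
from dataclasses import dataclass
from statistics import median, quantiles
from typing import Callable


@dataclass
class Trace:
    prompt: str                  # user turn, with the assistant turn redacted
    production_reply: str        # what the deployed model said
    production_tools: list[str]  # tool names the deployed model called, in order


@dataclass
class Completion:
    reply: str
    tools: list[str]


def replay(traces, candidate: Callable[[str], Completion],
           judge: Callable[[Trace, Completion], float]) -> list[float]:
    """Run each logged prompt through the candidate and score drift per trace."""
    return [judge(trace, candidate(trace.prompt)) for trace in traces]


def summarize(scores: list[float]) -> dict[str, float]:
    """Report the drift distribution, not a single number."""
    cuts = quantiles(scores, n=100, method="inclusive")  # 99 percentile cut points
    return {"median": median(scores), "p95": cuts[94], "p99": cuts[98]}


# --- Stubs for illustration only; swap in real model calls. ---
def stub_candidate(prompt: str) -> Completion:
    # Pretend the candidate reaches for the browser on every question.
    return Completion(reply="...", tools=["browser"] if "?" in prompt else [])


def stub_judge(trace: Trace, completion: Completion) -> float:
    # 0.0 = same tool sequence as production, 1.0 = different.
    return 0.0 if completion.tools == trace.production_tools else 1.0


traces = [
    Trace("What is 17 * 23?", "391", ["calculator"]),
    Trace("Summarize this memo", "...", []),
    Trace("Latest CPI print?", "...", ["browser"]),
]

scores = replay(traces, stub_candidate, stub_judge)
summary = summarize(scores)
print(scores, summary)
```

Keeping the judge as a plain callable is a deliberate choice: it lets you start with a cheap structural comparison like the stub and move to a model-graded rubric later without touching the replay loop.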

A few caveats are worth holding. Replay assumes the prompt distribution is stable. If your product is changing weekly, your replay set is going stale faster than the eval is improving. Replay also assumes the production traces are the right ground truth. They are not; they are the previous model's behavior. You are measuring drift, not quality. Both still matter. Both still surface issues a static eval will not.

The deeper read is about where the eval frontier moved this week. For about a year, the lab leaderboards have been the wrong artifact to optimize against. We knew it, we said it, we kept publishing them anyway because there was nothing better to point at in a sales conversation. Deployment Simulation is the first thing I have seen from a frontier lab that suggests the labs have given up on the leaderboards too, and started building the tools they wished their customers were running on them.

A practical implication for portfolio teams in regulated domains, where I spend most of my time. Your auditors are going to ask, by the end of the year, what your production-traffic regression test looks like. Not your golden set. Not your benchmark suite. The specific question: show me what would have happened if you had run last quarter's traffic through this model before you cut over. If the answer is 'we did not,' you are going to spend Q4 building Deployment Simulation in house. The teams that build it in Q3 will have a better story.

One more thing. The 10x tail error is the line in the OpenAI write-up that should keep you honest. The aggregate behaves; the worst cases do not. If your product depends on the long tail — and most production agent systems do — Deployment Simulation tells you the rate is correct on average and wrong on the things that matter. Treat it as a hypothesis generator, not a release gate. Use it to find the calculator-hacking equivalent in your own system, then write the targeted eval for that. Replay finds the unknown unknowns. Targeted evals catch them when they regress.
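Once replay has surfaced a pattern, the targeted eval can be small. A sketch for the calculator-hacking case, assuming tool calls are logged as dicts with `tool` and `query` keys. The field names, the `browser` tool name, and the regex heuristic are all assumptions for illustration, and the heuristic will misfire on things like ISO dates.

```python
import re

# Digits, whitespace, decimal points, basic operators, and parentheses only.
ARITHMETIC = re.compile(r"^[\d\s.+\-*/()]+$")


def looks_like_arithmetic(text: str) -> bool:
    """Heuristic: only arithmetic characters, at least one digit and one operator."""
    return (
        bool(ARITHMETIC.match(text))
        and any(ch.isdigit() for ch in text)
        and any(op in text for op in "+-*/")
    )


def calculator_hacks(tool_calls):
    """Browser calls whose query is really a calculation."""
    return [
        call for call in tool_calls
        if call["tool"] == "browser" and looks_like_arithmetic(call["query"])
    ]


# Hypothetical logged calls.
calls = [
    {"tool": "browser", "query": "17 * 23"},
    {"tool": "browser", "query": "CPI release June 2026"},
    {"tool": "calculator", "query": "17 * 23"},
]

flagged = calculator_hacks(calls)
print(flagged)
```

Run a check like this on every candidate's replayed traces and fail the build on a nonzero count, once you have tuned the heuristic against your own logs.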

If your team builds eval infrastructure, this is the right week to set aside two afternoons and read the OpenAI write-up end to end. If your team buys eval infrastructure, this is the right week to ask your vendor whether their replay story is as honest as the one OpenAI just published.

Reply if you want the targeted-eval template I have been using with portfolio teams for the layer underneath this one. It is shorter than you think.
