
Apr 2026 · AI Strategy

Spud: What We Actually Know About OpenAI's Next Model, and What's Speculation

Built by an AI engineer. Not a journalist.


Pre-training finished March 24. Sam Altman says “a few weeks.” Greg Brockman says “big model feel.” Everything else you've read is somebody's guess dressed up as a spec sheet.

The sequence matters as much as the facts. Here's what's confirmed about Spud, how it lines up against Claude Mythos, Opus 4.7, and Gemini 3.1 Pro, and what actually matters before Google I/O on May 19.


The confirmed facts

There are five. Every other claim about parameter counts, context windows, pricing, or launch dates is inference.

1. The codename is Spud. Internal placeholder, first reported by The Information. Whether it ships as GPT-5.5 or GPT-6 hasn't been decided publicly - reportedly that depends on how significant the performance gap is relative to GPT-5.4.

2. Pre-training completed March 24, 2026. At OpenAI's Stargate data center in Abilene, Texas. Publicly confirmed by Sam Altman. OpenAI's typical pre-training-to-release window is 3-6 weeks, which puts the highest-probability launch window in May.

3. Sora was shut down the same day. Not coincidence. Sora was burning an estimated $1 million per day in compute against $2.1 million in lifetime in-app revenue. OpenAI walked away from the $1 billion Disney licensing deal built around it, reassigned the team to world models and robotics, and removed video generation from the product roadmap entirely.

4. Leadership language is unusually charged. Altman described Spud internally as “a very strong model” that could “accelerate the economy.” Brockman called it the product of “two years of research” with a “big model feel.” These aren't specs - they're tells. OpenAI doesn't use “big model feel” for incremental updates.

5. It's still in safety evaluation. No official date, no model card, no API announcement as of mid-April. Polymarket has trimmed “GPT-6 by April 30” from 78% to around 72%.

To put these facts in sequence: Google shipped Gemini 3.1 Pro on February 19. OpenAI released GPT-5.4 on March 5. Spud pre-training completed March 24, Sora shut down the same day, and OpenAI closed its $122B round on March 31. Anthropic announced Mythos and Project Glasswing on April 7. Opus 4.7 shipped April 16. Anthropic's $100B AWS commitment landed April 20. Three months of escalation, compressed into what I'd call the most consequential stretch in frontier AI since ChatGPT launched.

[Mind map: Spud - the full knowledge map tracing confirmed facts, speculation, competitive context, and engineering implications. Generated via NotebookLM.]

Why OpenAI is under this much pressure

GPT-5.4, OpenAI's current flagship, launched March 5. Native computer use, a 1 million token context window, 33% fewer hallucinations than GPT-5.2. It scores 57.7% on SWE-bench Pro - credible, but not leading.

Then two things changed the shape of the field.

Google shipped Gemini 3.1 Pro on February 19. It scored 77.1% on ARC-AGI-2 - more than double Gemini 3 Pro. 80.6% on SWE-Bench Verified. 94.3% on GPQA Diamond. It leads 13 of 16 tracked benchmarks. At $2 per 1M input tokens and $12 per 1M output, it's 7.5× cheaper than Claude Opus 4.6 on input. For enterprises running high-volume workloads, the pricing is as important as the capability.

Then Anthropic revealed Claude Mythos - a model that doesn't sit in the Opus/Sonnet/Haiku hierarchy. It sits above it. The coding and reasoning numbers are in the benchmark grid below, but the one that matters most structurally is Anthropic's internal fuzzing result: on ~7,000 entry points where Opus 4.6 managed a single tier-3 crash, Mythos achieved 595 crashes at tiers 1 and 2, with full control-flow hijack on ten separate, fully patched targets. This is the capability that made Anthropic decide against a broad release.

On April 7, Anthropic announced Mythos would not be generally available - rolling it out instead through Project Glasswing to nine partners: AWS, Apple, Cisco, CrowdStrike, Google, JPMorganChase, Microsoft, Nvidia, and Broadcom, restricted to defensive cybersecurity use cases.

Last Thursday, Anthropic shipped Claude Opus 4.7. It beats Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across Anthropic's selected benchmarks, but still trails Mythos Preview. The release doubles as a live test of the safeguards Anthropic intends to apply to Mythos-class models at broader scale.

So here's OpenAI's position heading into Q2: one competitor has a publicly benchmarked model that makes GPT-5.4 look like a previous generation on cyber and coding, and another has a cheaper, higher-context Pro-tier model leading the leaderboards on capability. Spud is the response.


What's speculated (but reasonable)

None of this is confirmed. All of it is inference from pattern, pressure, and process of elimination.

2M token context window. GPT-5.4 is already at 1M. Gemini 3.1 Pro ships at 1M input. Matching the field at 1M buys OpenAI nothing, so a doubling to 2M is the natural floor for a launch framed this aggressively.

Persistent memory as the headline feature. Altman has pointed at it publicly as the next major breakthrough - not the shallow “remember my name” layer that shipped last year, but cross-session state that survives between conversations. That kind of shift is what “two years of research” framing is written for.

Unified super-app architecture. OpenAI is building a combined platform: ChatGPT, Codex, Atlas, and other tools in one interface, with Spud powering the whole thing. The goal is users moving between conversation, code, web research, and agentic execution without switching contexts.

Stealth testing through GPT-5.4 in production. Multiple reports suggest OpenAI is running A/B tests of Spud outputs through existing GPT-5.4 Pro endpoints. Leaked outputs circulating include a one-shot playable VoxelCraft clone, a fully interactive Pokémon battle game, and SVG work accurate enough to look hand-traced. Treat these as signal, not proof.


The competitive matrix, as of April 20

| Model | Available? | Context | Pricing (per 1M in/out) |
|---|---|---|---|
| Spud / GPT-6 | No - safety eval | Speculated 2M | Unknown |
| Claude Mythos Preview | Glasswing partners only | Not disclosed | ~5× Opus 4.7 |
| Claude Opus 4.7 | Yes, all channels | 200K | Same as Opus 4.6 |
| Gemini 3.1 Pro | Yes, preview | 1M in / 64K out | $2 / $12 |
| GPT-5.4 | Yes | 1M | - |

The strategic read, in one line: Mythos is the defensive-cyber outlier you can't use. Opus 4.7 is the coding leader you can. Gemini 3.1 Pro wins on price-performance for high-volume loads. Spud is the empty column that reshapes all three positions the moment it ships.


The benchmark picture, apples-to-apples

| Benchmark | Claude Mythos | Claude Opus 4.7 | Gemini 3.1 Pro | GPT-5.4 |
|---|---|---|---|---|
| SWE-bench Verified | 93.9% | 87.6% | 80.6% | 72.8% |
| SWE-bench Pro | - | 64.3% | 54.2% | 57.7% |
| GPQA Diamond | 94.6% | 94.2% | 94.3% | 94.4% |
| Humanity's Last Exam | 64.7% | 54.7% | 51.4% | 58.7% |
| ARC-AGI-2 | - | - | 77.1% | - |
| OSWorld | 79.6% | 78.0% | - | 75.0% |
| Cybench | 100% | - | - | - |

GPQA Diamond is saturating. All four models are within 0.4 points. This benchmark has stopped discriminating between frontier models. If you see a provider leading with GPQA Diamond in Q2, that's a tell.

HLE has a tool-use story worth pulling out. GPT-5.4 leads Opus 4.7 without tools, but Opus 4.7 takes it with tools. That gap - model capability versus model-plus-scaffolding capability - is where real-world performance gets decided.

Spud is the missing column. When it ships, those blanks fill in.


Google I/O on May 19 is the timing pressure point

Google I/O 2026 runs May 19-20. A Gemini 4 reveal is widely expected - prediction markets put roughly a 15% probability on a public ship before June 30. Google's pattern at I/O is to announce frontier models, not ship them. The most likely scenario: Gemini 4 announced May 19, staged rollout through Q2 and Q3.

This is the deadline OpenAI is racing. If Spud ships before May 19, OpenAI owns the narrative for at least a week before Google takes the stage. If Spud slips past I/O, Google defines what frontier capability looks like in Q2, and Spud launches into a conversation that's already moved on.

That's why Altman's “a few weeks” from March 24 matters. The window that preserves OpenAI's narrative advantage closes around May 18.


The other story that dropped today: Anthropic's $100B+ Amazon commitment

Anthropic announced a new compute agreement with Amazon on April 20 that changes how to read the whole landscape.

Anthropic committed more than $100 billion over ten years to AWS, securing up to 5GW of new capacity to train and run Claude. Significant Trainium2 capacity comes online in Q2; Trainium3 later this year; nearly 1GW in total before end of 2026. Amazon is investing $5 billion today, with up to $20 billion more to come, on top of the $8 billion previously committed.

The number that actually tells you why this happened: Anthropic's run-rate revenue has surpassed $30 billion, up from approximately $9 billion at the end of 2025.

The Claude regression complaints have a structural cause. Anthropic explicitly acknowledged that unprecedented consumer growth has hit reliability and performance for free, Pro, Max, and Team users during peak hours. Their answer: they're capacity-constrained, not quietly degrading the model. The $100B commitment is how they make that argument credible.

This is the AWS answer to Stargate. OpenAI's Stargate is a $500B infrastructure bet with Microsoft and Oracle, running on Nvidia silicon. Anthropic is betting on Amazon's custom chips - Trainium2, 3, and 4 - locked in through at least 2035. Two different hardware philosophies. If Trainium3 delivers on the price-performance Amazon claims, Anthropic's marginal cost per token stays structurally lower than OpenAI's.

[Infographic: the full competitive picture - Spud as the missing column against Mythos, Opus 4.7, and Gemini 3.1 Pro, plus the infrastructure bets and the Google I/O timing window. Generated via NotebookLM.]

An engineer's read on what Spud actually changes

I've been building on these APIs long enough to know the difference between a model launch that changes what you write and a model launch that changes what you deploy. Spud is looking like the second kind.

Persistent memory is an architecture shift, not a feature

If Spud ships with genuine long-term memory - actual persistent state across conversations, not session-scoped layers - every RAG pipeline built in the last 18 months gets less valuable overnight. The whole reason we pay the complexity tax on vector stores, chunking strategies, and re-ranking is because the model forgets. If the model stops forgetting, a significant chunk of the LangChain/LlamaIndex ecosystem becomes legacy tooling for a problem that got solved upstream.

The catch: persistent memory introduces its own failure modes. Stale context, identity confusion across users, prompt-injection attacks that survive across sessions instead of dying at the turn boundary. Day-one persistent memory will probably have the same reliability profile as early function calling - impressive in demos, brittle in production.
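Here's the shape of the guardrail I'd wrap around day-one persistent memory, as a sketch. Nothing below is OpenAI's API - ScopedMemoryStore and everything on it is hypothetical - but the two defenses it encodes, per-user namespacing and provenance-tagged quarantine, map directly onto the failure modes above.

```python
# Hypothetical guardrail layer for the persistent-memory failure modes above.
# No Spud API exists yet; this class and its methods are illustrative stand-ins.
import hashlib
import time

class ScopedMemoryStore:
    """Namespace memories per user and tag each entry with provenance,
    so stale or cross-user state can be filtered before it reaches the model."""

    def __init__(self, max_age_days: float = 30.0):
        self.max_age = max_age_days * 86400
        self._store: dict[str, list[dict]] = {}

    def _key(self, user_id: str) -> str:
        # Hash the user ID so one user's memories can never be addressed
        # by another's identifier, even on a typo in application code.
        return hashlib.sha256(user_id.encode()).hexdigest()

    def write(self, user_id: str, text: str, source: str) -> None:
        # Record where the memory came from. Memories derived from tool
        # output are the prompt-injection vector that survives across
        # sessions, so they're marked untrusted and quarantined by default.
        self._store.setdefault(self._key(user_id), []).append({
            "text": text,
            "source": source,
            "ts": time.time(),
            "trusted": source == "user_message",
        })

    def read(self, user_id: str, include_untrusted: bool = False) -> list[str]:
        now = time.time()
        entries = self._store.get(self._key(user_id), [])
        return [
            e["text"] for e in entries
            if now - e["ts"] < self.max_age           # drop stale context
            and (e["trusted"] or include_untrusted)   # quarantine tool-derived memory
        ]
```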

The 2M context window matters less than people think

Everyone fixates on the number. What actually matters is what the model does at 80% fill. Gemini 3.1 Pro advertises 1M input tokens, but real-world retrieval quality degrades meaningfully past ~200K. If Spud ships 2M with flat retrieval quality across the whole window, that's a genuine capability jump. If it ships 2M with the usual mid-context rot, it's marketing.

The test I'll run on day one: needle-in-a-haystack at 500K, 1M, and 1.5M tokens, with the needle at the 40% mark - the hardest position. That number, not the max window, tells you whether you can actually use the context.
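The harness fits in a page. A minimal sketch, not a benchmark suite - the filler text, the crude tokens-per-repetition estimate, and the model string are placeholders to swap for a real tokenizer and whatever ID Spud actually ships under:

```python
# Needle-in-a-haystack at 500K / 1M / 1.5M tokens, needle at the 40% mark.
from openai import OpenAI

client = OpenAI()
FILLER = "The sky was gray and nothing of note happened that day. "  # ~12 tokens
NEEDLE = "The access code for the vault is 7319."

def build_haystack(total_tokens: int, needle_position: float = 0.4) -> str:
    # Approximate token count by filler repetitions; close enough to
    # place the needle, not good enough for billing estimates.
    docs = [FILLER] * (total_tokens // 12)
    docs.insert(int(len(docs) * needle_position), NEEDLE)
    return "".join(docs)

def needle_test(total_tokens: int) -> bool:
    prompt = build_haystack(total_tokens) + "\n\nWhat is the access code for the vault?"
    resp = client.chat.completions.create(
        model="gpt-5.4",  # placeholder: swap in Spud's model ID on day one
        messages=[{"role": "user", "content": prompt}],
    )
    return "7319" in resp.choices[0].message.content

for size in (500_000, 1_000_000, 1_500_000):
    print(size, "tokens:", "retrieved" if needle_test(size) else "MISSED")
```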

The stealth testing signal

If OpenAI is A/B-testing Spud outputs through GPT-5.4 Pro endpoints - and multiple reports suggest they are - the model is already running at production scale on real user traffic. That's further along than “safety evaluation” implies.

If your evals against GPT-5.4 have been weirdly inconsistent over the past two weeks, that's probably not your prompt.
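One way to turn that hunch into a number: log a fixed eval's daily pass rate and flag when the recent mean departs from the trailing baseline. A rough sketch - the window, the threshold, and the sample history below are illustrative, not anyone's published methodology:

```python
# Flags a backend model swap under a stable endpoint: the pass rate
# moves before your prompts do.
import statistics

def detect_drift(daily_pass_rates: list[float],
                 threshold_sigmas: float = 2.0) -> bool:
    """True when the last 3 days depart from the trailing baseline."""
    if len(daily_pass_rates) < 10:
        return False  # not enough history to call anything drift
    baseline = daily_pass_rates[:-3]
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9
    recent = statistics.mean(daily_pass_rates[-3:])
    return abs(recent - mu) > threshold_sigmas * sigma

# Example: stable for eleven days, then a sudden shift.
history = [0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.84,
           0.85, 0.83, 0.84, 0.86, 0.92, 0.93, 0.91]
print(detect_drift(history))  # True: something upstream changed
```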

What I'd bet on

Spud's coding capability beats Opus 4.7 on SWE-bench Verified by 2-4 points, not 10+. Mythos is the outlier. Persistent memory ships, and it's broken enough that most serious developers won't trust it until v2. The super-app rollout is staged - Codex + ChatGPT first, Atlas integration lagging. Launch pricing closes the gap with Gemini 3.1 Pro on output tokens, though matching Gemini outright requires TPU-level margins OpenAI doesn't have.

The bet underneath the bet: Spud vs. Mythos is a model comparison. Stargate vs. Trainium is a ten-year infrastructure race. The first determines who wins Q2. The second determines who's still here in 2030.


What to actually do with this information

Stop building against specific model IDs. A hardcoded gpt-5.4 or claude-opus-4-6 string in application code is a migration liability every 6-8 weeks at this cadence.

Five things that age well:

  1. Config-flag your model selection. Swap at deploy time, not in code (see the sketch after this list).
  2. Build an eval set. A dozen representative tasks you can run against any model. When Spud ships, you'll know within an hour whether to switch.
  3. Use the previous-response-ID pattern on the Responses API for better cache hits and lower latency on multi-turn flows.
  4. Put human approval on any agentic tool use that touches external systems. This gets more important as models get better, not less.
  5. Watch the 272K-token pricing cliff on GPT-5.4. Prompts over that threshold trigger a 2× input / 1.5× output multiplier.
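Here's a compact sketch of items 1, 2, and 5. The env var name, the eval tasks, and the rate arguments are my own examples rather than anything from OpenAI's docs; the cliff multipliers are the ones stated above:

```python
import os

# Item 1: model ID comes from the environment, swapped at deploy time.
MODEL = os.environ.get("LLM_MODEL", "gpt-5.4")

# Item 2: a few representative tasks with checkable answers. A real set
# would have a dozen drawn from your actual workload.
EVAL_SET = [
    ("Summarize: 'The deal closed March 31.' In one word, what closed?", "deal"),
    ("What is 17 * 23?", "391"),
]

def run_evals(call_model) -> float:
    """Pass rate over the eval set; call_model is any fn(prompt) -> str."""
    hits = sum(expected in call_model(q).lower() for q, expected in EVAL_SET)
    return hits / len(EVAL_SET)

def estimate_cost(in_tokens: int, out_tokens: int,
                  in_rate: float, out_rate: float) -> float:
    # Item 5: GPT-5.4's stated pricing cliff - prompts over 272K tokens
    # pay 2x input / 1.5x output. Rates are dollars per 1M tokens.
    if in_tokens > 272_000:
        in_rate, out_rate = in_rate * 2, out_rate * 1.5
    return (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
```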

The skill that transfers across model generations isn't knowing which one is best this week. It's knowing how to evaluate, route, and fail gracefully between them.


What to watch over the next 30 days

Spud launches before May 19. Most likely based on the confirmed pre-training date and Altman's stated timeline. The cleanest outcome for OpenAI's narrative.

Google I/O, May 19-20. Gemini 4 announced, likely with staged availability. Watch the live demos with unscripted inputs - those tell you whether benchmark numbers translate to real-world reliability.

Spud launches after I/O. Less likely, but possible. If it happens, the framing becomes “response to Google” rather than “lead the field.”

The model landscape you're building against on July 1 will not be the one you're building against today. Two decade-long infrastructure bets just got made in the same week. Plan for what's coming, not what's current.


Follow along for more AI research breakdowns.
