May 2026 · AI Infrastructure

The TPU Moment: A Supply Chain Story as Much as a Silicon Story

On November 24, 2025, The Information reported that Meta was in talks to deploy Google's TPUs at scale. Nvidia's stock dropped roughly 3% that day. Not because Google's chip is faster than an H100, not because CUDA is suddenly irrelevant, but because for the first time, Nvidia's largest single customer was looking – seriously, not hypothetically – at running a billion-dollar training workload somewhere else.

That's the signal. Not the benchmark. The signal.

My thesis

I'm going to say something that sounds counterintuitive: TPUs are not winning because they beat Nvidia at the chip level. They're winning because the AI industry is no longer willing to bet a generation of model development on a single-vendor supply chain, and Google is the only company with the silicon, the cloud capacity, and a co-design partner capable of delivering an alternative at hyperscale.

The TPU story is a supply chain story as much as a silicon story.

Everything else – the Ironwood specs, the JAX compiler, the systolic array architecture – flows from that. Understanding why Anthropic reserved a million TPUs, and why Meta is quietly testing them, requires understanding procurement logic before it requires understanding chip architecture.

Let's start with the deals.

Mind map of The TPU Moment: major deals (Anthropic, Meta, Broadcom partnership), technical architecture (chip design, pod-level performance, hardware generations), software ecosystem (frameworks, engineering workflow), and strategic advantages (supply chain optionality, lower TCO, scaling reliability, inference cost-optimization). — The full landscape: deals, architecture, software, strategic advantages. Generated via NotebookLM.

The deal landscape

What's confirmed

The first major anchor is the Anthropic–Google compute agreement, signed October 23, 2025. The headline number is up to one million TPUs. Per SemiAnalysis's breakdown, that splits into two tranches: approximately 400,000 Ironwood units purchased directly via Broadcom – roughly $10 billion – deployed in Anthropic's own facilities; plus 600,000 units accessed through Google Cloud at an estimated $42 billion. Total capacity committed for 2026: greater than one gigawatt.
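A quick sanity check on those tranche numbers, as a sketch. The inputs are the figures quoted above; the per-unit values are simple division, not reported prices, and the two tranches are not like-for-like (my assumption is that the cloud figure bundles hosting, power, and networking on top of the silicon).

```python
# Figures quoted above from the SemiAnalysis breakdown.
direct_units = 400_000
direct_cost_usd = 10e9
cloud_units = 600_000
cloud_cost_usd = 42e9

total_units = direct_units + cloud_units

# Implied cost per unit: plain division, not a reported price.
direct_per_unit = direct_cost_usd / direct_units
cloud_per_unit = cloud_cost_usd / cloud_units

print(f"total units:     {total_units:,}")
print(f"direct purchase: ${direct_per_unit:,.0f} per unit")
print(f"cloud access:    ${cloud_per_unit:,.0f} per unit")
```

The gap between the two implied figures is what you would expect if one side is hardware only and the other is a fully operated service.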

On April 24, 2026, Google announced a $40 billion investment in Anthropic at a $350 billion post-money valuation – one of the largest venture investments on record. Embedded in that deal is an additional five gigawatts of TPU compute over five years, layered on top of the October agreement. Add in Anthropic's separate AWS Trainium reservation, also five gigawatts, and you get to a number that stops sounding like a chip procurement and starts sounding like utility infrastructure planning: ten gigawatts of reserved frontier compute, split across two hyperscalers.

Meta's deal is structurally different but strategically just as important. On February 26, 2026, Meta signed a multi-billion-dollar TPU rental agreement with Google for 2026 workloads, with intent to purchase directly in 2027. The size of that eventual direct purchase isn't confirmed – the figure in circulation, which traces back to an AceCamp expert interview, is 500,000 to 800,000 chips. I'll flag that as speculative: it's directionally credible but not a committed number.

In April 2026, at Google Cloud Next, Ironwood (TPU v7) reached general availability.

The silicon supply chain underneath

The silicon partner behind the Ironwood chip – and the upcoming 8th-generation TPU – is Broadcom. Not a contract chipmaker in the traditional sense: Broadcom co-designed the silicon, runs the high-speed networking layer, and is under a contract worth approximately $46 billion running through 2031. That's not a supplier relationship. That's a strategic partnership with a six-year horizon baked in.

The 8th-generation roadmap, previewed but not committed, splits into two chips: TPU 8t Sunfish, a training-optimized part co-designed by Broadcom; and TPU 8i Zebrafish, an inference-optimized part designed by MediaTek. Both are targeting TSMC's 2-nanometer process, with a target window of late 2027. This bifurcation – separate training and inference silicon – is new for the TPU line, and I'll explain why it matters when we get to the engineering section.

AI Engineer's Take: When you see chip procurement at the gigawatt level and six-year co-design contracts, you're not reading a chip story anymore. You're reading an infrastructure story. The relevant comparison isn't TPU vs. H100 per FLOP. It's whether you can get a hundred thousand chips delivered on a roadmap. Right now, the honest answer for Nvidia is: sometimes. For Google, at the scale of a frontier lab, it's becoming: usually.

What is a TPU, actually

Before we get into the engineering depths, let me give you the version that makes the architecture click intuitively.

A CPU is one very talented chef who can cook anything on any menu. Complex dishes, improvised specials, six different orders at once – but it's one chef, and there's a ceiling on throughput. A GPU is a thousand line cooks working in parallel. Great for high-volume repetitive work, but you still need to coordinate them and the kitchen is built for flexibility.

A TPU is a factory assembly line built to produce one product at enormous scale. Matrix multiplications. That's it. The station is fixed, the tooling is fixed, the throughput is extraordinary, and you would absolutely not order a bespoke omelette from it.

This turns out to matter enormously: a transformer model is, mechanically, a deep stack of matrix multiplications. Attention is a series of matrix products. The feed-forward layers are matrix products. The embeddings are matrix lookups and projections. When you strip away the mathematical abstraction, what you're running is a very expensive sequence of multiply-and-add operations on large rectangular arrays of numbers.

TPUs are built to make exactly that pattern fast and cheap.
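To make "a deep stack of matrix multiplications" concrete, here is a minimal single-head attention and feed-forward sketch in JAX. The shapes and the names `attention` and `feed_forward` are illustrative choices of mine, not taken from any particular model; masking, multiple heads, and normalization are left out.

```python
import jax
import jax.numpy as jnp

def attention(q, k, v):
    """Single-head attention: two matmuls around a softmax."""
    d = q.shape[-1]
    scores = (q @ k.T) / jnp.sqrt(d)            # matmul 1: (seq, d) x (d, seq)
    weights = jax.nn.softmax(scores, axis=-1)   # elementwise work plus a row reduction
    return weights @ v                          # matmul 2: (seq, seq) x (seq, d)

def feed_forward(x, w_in, w_out):
    """Feed-forward block: matmul, elementwise activation, matmul."""
    return jax.nn.gelu(x @ w_in) @ w_out

seq, d_model, d_ff = 16, 128, 512               # illustrative sizes
kq, kk, kv, k1, k2 = jax.random.split(jax.random.PRNGKey(0), 5)
q = jax.random.normal(kq, (seq, d_model))
k = jax.random.normal(kk, (seq, d_model))
v = jax.random.normal(kv, (seq, d_model))
w_in = jax.random.normal(k1, (d_model, d_ff)) * 0.02
w_out = jax.random.normal(k2, (d_ff, d_model)) * 0.02

attn_out = attention(q, k, v)
block_out = feed_forward(attn_out, w_in, w_out)
print(attn_out.shape, block_out.shape)
```

Four of the six operations here are matmuls, and they account for nearly all of the arithmetic. The rest is elementwise.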

The hardware mechanism that enables this is called a systolic array. Imagine data flowing through a grid of processing cells in a coordinated wave – each cell multiplies two numbers, adds the result to a running sum, and passes it to the next cell. No memory bus thrash. No waiting for data to arrive. The rhythm is set at design time, and then it just runs. The MXU (Matrix Multiply Unit) inside a TPU operates as a systolic array: 128×128 multiply-accumulate cells on generations through v5p, 256×256 on Trillium. Operands are laid out in 8×128 tiles, so dimensions that are multiples of 128 map cleanly onto the hardware. That alignment detail matters practically – we'll get to it in the engineering section.
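The wave is easier to see in code. Below is a toy, pure-Python simulation of an output-stationary systolic array: cell (i, j) holds a running sum and, at step t, receives the operand pair whose index is t - i - j. This is a teaching sketch under my own assumptions, not a description of how the MXU is actually wired; the real dataflow and timing are hardware details the text above does not specify.

```python
def systolic_matmul(a, b):
    """Multiply a (m x k) by b (k x n) using a skewed, step-by-step schedule.

    Rows of `a` enter from the left and columns of `b` from the top, each
    delayed by one step per row or column. Cell (i, j) therefore sees the
    pair (a[i][kk], b[kk][j]) at step t = i + j + kk and adds the product
    to its own running sum.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must match"
    acc = [[0] * n for _ in range(m)]      # one accumulator per cell
    steps = m + n + k - 2                  # steps until the last cell finishes
    for t in range(steps):
        for i in range(m):
            for j in range(n):
                kk = t - i - j             # which operand pair arrives now
                if 0 <= kk < k:
                    acc[i][j] += a[i][kk] * b[kk][j]
    return acc, steps

a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]
result, steps = systolic_matmul(a, b)
print(result)   # [[58, 64], [139, 154]]
print(steps)    # 5
```

Note that no cell ever asks memory for anything. The schedule alone determines which operands arrive where, which is why the throughput is so predictable.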

AI Engineer's Take: The useful mental model isn't “TPU vs. GPU.” It's specialized hardware that wins when your workload matches the specialization. If your workload is a transformer – and in 2026, most production ML workloads are – you are leaving efficiency on the table running on hardware designed to be general-purpose.

How TPUs actually work

The Ironwood chip

The per-chip numbers tell a clearer story than the marketing does. Here's how the recent TPU generations stack up:

Metric	TPU v5p	Trillium (v6e)	Ironwood (v7)
FP8 TFLOPs	459	918	4,614
BF16 TFLOPs	459	918	2,307
HBM capacity	95 GiB	32 GiB	192 GiB
HBM bandwidth	2.77 TB/s	1.6 TB/s	7.4 TB/s
ICI bandwidth (bidir.)	1.2 TB/s	0.8 TB/s	1.2 TB/s
TensorCores per chip	2	1	2
SparseCores per chip	4	2	4

For context: the Nvidia B200 delivers approximately 4.5 petaFLOPs FP8 per chip with 192 GB HBM at 8 TB/s bandwidth. On a per-chip basis, Ironwood and the B200 are peers. That's the point – Google isn't claiming to beat Nvidia at the chip level. At the chip level, they've closed the gap.

TPU Generations: The Ironwood Leap. Per-chip specs comparing TPU v5p, Trillium (v6e), Ironwood (v7), and Nvidia B200. By the table's numbers, Ironwood delivers about 10x the FP8 throughput of v5p and about 5x that of Trillium, in roughly the same per-chip ballpark as Nvidia's B200. — Ironwood closes the per-chip gap with Nvidia's B200. The story isn't the chip – it's what comes next.

The chiplet design

Ironwood is a dual-chiplet design. Each chip contains two TensorCores, and each TensorCore bundles a Matrix Multiply Unit (MXU) with a Vector Processing Unit (VPU). The MXU handles the dense matrix math. The VPU handles elementwise operations – activations, normalization, residuals – that don't benefit from the systolic array.

Ironwood also ships with four SparseCores per chip. SparseCores are purpose-built for sparse and irregular access patterns: mixture-of-experts routing, embedding lookups, recommendation system operations. If you're running dense-only transformers, SparseCores sit mostly idle. If you're running MoE architectures – which the frontier labs increasingly are – they start pulling weight.

The 128-wide tiling is a practical detail that bites you early in a TPU port. Arrays are laid out in 8×128 tiles and fed to an MXU that is at least 128 cells wide, so if your weight matrices don't align to multiples of 128 in the relevant dimension, XLA will pad them – and you'll pay for the padding in both memory and compute. Standard practice is to round embedding dimensions and hidden sizes to a multiple of 128 before targeting TPU. This is not a bug. It's the cost of the specialized architecture.
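Here is a rough way to estimate what misalignment costs. The helper names `round_up` and `padding_overhead` are mine, and the model is deliberately pessimistic: it assumes both dimensions of a weight matrix get padded to a multiple of 128, which is a simplification of what the compiler actually does.

```python
def round_up(n, multiple=128):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

def padding_overhead(rows, cols, multiple=128):
    """Fraction of extra elements if both dims are padded (simplified model)."""
    padded = round_up(rows, multiple) * round_up(cols, multiple)
    return padded / (rows * cols) - 1.0

# Illustrative hidden sizes: two aligned, one not.
for hidden in (4096, 5000, 5120):
    padded = round_up(hidden)
    overhead = padding_overhead(hidden, hidden)
    print(f"hidden={hidden:5d} -> padded={padded:5d}, overhead={overhead:.2%}")
```

A few percent sounds small until it applies to every weight matrix in the model, on every step. Picking 5120 instead of 5000 up front makes the waste useful capacity instead.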

The pod is the computer

Here's where the TPU architecture stops looking like a chip story and starts looking like a network architecture story.

An Ironwood pod is 9,216 chips, liquid-cooled to handle approximately ten megawatts of power, connected via a three-dimensional torus topology using optical circuit switching with RDMA across chips. The total HBM addressable across a full Ironwood pod is 1.77 petabytes. Not the memory capacity of one chip. The memory capacity of the entire pod, addressable as a unified resource by the compiler.

The Pod IS the Computer: how Ironwood scales from one chip (4.6 PFLOPs FP8, 192 GiB HBM) to a tray of 4 chips, to a rack of 64 chips, to a 3D torus cube, to a full 9,216-chip Superpod delivering 42.5 ExaFLOPs and 1.77 PB of HBM. — From one chip to 9,216: the pod scales as a single coherent machine. The abstraction is the product.

That is the argument for TPUs at scale. Not the chip. The pod. Nvidia's NVLink domain tops out well below that coherence boundary: 72 GPUs in a GB200 NVL72 rack, with anything larger going over the scale-out network. When you're training a model with trillions of parameters, the question stops being “how fast is one chip?” and becomes “how efficiently can I wire ten thousand chips together?” TPUs win at that question.
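The pod-level headline numbers fall straight out of the per-chip ones. A small check, using only figures from this section. One assumption to flag: reproducing 1.77 PB requires treating the 192 per chip as decimal gigabytes, and the per-chip power figure is an all-in average that includes cooling and networking, not a chip TDP.

```python
chips_per_pod = 9_216
fp8_tflops_per_chip = 4_614      # from the per-chip table
hbm_gb_per_chip = 192            # treated as decimal GB (assumption)
pod_power_mw = 10                # "approximately ten megawatts"

pod_exaflops = chips_per_pod * fp8_tflops_per_chip / 1e6   # TFLOPs -> EFLOPs
pod_hbm_pb = chips_per_pod * hbm_gb_per_chip / 1e6         # GB -> PB
watts_per_chip = pod_power_mw * 1e6 / chips_per_pod        # all-in average

print(f"pod FP8 compute: {pod_exaflops:.1f} EFLOPs")
print(f"pod HBM:         {pod_hbm_pb:.2f} PB")
print(f"power per chip:  {watts_per_chip:.0f} W (all-in average)")
```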

Model FLOPs Utilization

One practical metric that matters more than peak FLOPs is MFU – Model FLOPs Utilization. It measures how much of the hardware's theoretical peak you actually use during training.

GPUs, in practice, land between 30% and 50% MFU for large transformer training workloads. TPUs routinely hit 50% to 70%. Google Research has reported 95% scaling efficiency at 32,768 TPUs for specific transformer workloads – meaning near-linear throughput increase as chip count scales. No GPU cluster comes close to that at that scale.

The reason is the compilation model. XLA compiles the full computation graph ahead of time. It knows the data flow, the communication pattern, and the memory layout before anything executes. That allows it to schedule collectives and eliminate gaps that an eager runtime can't avoid.
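If you want to compute MFU for your own run, a common back-of-envelope version looks like this. It assumes the widely used approximation of about 6 FLOPs per parameter per token for dense transformer training (forward plus backward, attention FLOPs ignored). The model size, chip count, and throughput below are hypothetical; only the peak figure comes from the table above.

```python
def mfu(params, tokens_per_second, num_chips, peak_flops_per_chip):
    """Model FLOPs Utilization: achieved model FLOPs/s over theoretical peak.

    Uses the ~6 * params FLOPs-per-token approximation for dense
    transformer training, which is an assumption, not an exact count.
    """
    achieved_flops_per_second = 6 * params * tokens_per_second
    peak_flops_per_second = num_chips * peak_flops_per_chip
    return achieved_flops_per_second / peak_flops_per_second

# Hypothetical run: 70B parameters on 256 chips at 800k tokens/s,
# measured against Ironwood's BF16 peak of 2,307 TFLOPs per chip.
example = mfu(
    params=70e9,
    tokens_per_second=800_000,
    num_chips=256,
    peak_flops_per_chip=2_307e12,
)
print(f"MFU: {example:.1%}")
```

The useful habit is measuring tokens per second on the real job and plugging it in, rather than trusting anyone's peak numbers.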

The 8th-gen split

The decision to bifurcate the 8th-generation TPU line into Sunfish (training) and Zebrafish (inference) is the most architecturally interesting signal in the roadmap. Training and inference have divergent requirements: training needs high-bandwidth memory, maximum FP8 throughput, and tight collective communication. Inference cares more about latency, token throughput, and the ability to handle variable-length sequences efficiently.

Nvidia builds one chip and tunes the software stack. Google is betting that purpose-built silicon for each regime delivers better price/performance at scale. If they're right – and the emergence of inference-optimized chips from every major vendor suggests they are – this is where the market is heading.

AI Engineer's Take: The 1.77-petabyte memory address space of an Ironwood pod isn't a marketing line. It's an architectural property that changes how you think about model sharding. On GPUs, parallelism strategy is largely a manual exercise – you choose tensor parallelism, pipeline parallelism, and data parallelism, and you implement the collectives. On TPUs, the compiler handles much of this automatically once you specify the sharding constraints. That's the real efficiency story.

Working with TPUs as an AI engineer

The software stack

The compilation path looks like this: your model code, written in JAX or PyTorch, is lowered to StableHLO – a stable, portable IR that captures the computation graph without hardware specifics. StableHLO is then compiled by XLA (Accelerated Linear Algebra) into a TPU executable. XLA is the layer that knows about the physical chip, handles collective scheduling, and performs the memory layout optimizations.

JAX is the native path. JAX was built around XLA from day one – functional transforms like jit, vmap, and pmap are thin wrappers that generate XLA computation graphs. When you write JAX, you're essentially writing in a language XLA was designed to consume.
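A small sketch of what those transforms look like in use, plus a peek at what gets handed to the compiler. The function `layer` and the shapes are made up for illustration. The exact text printed by `.lower(...).as_text()` depends on your JAX version; on recent releases I would expect it to be StableHLO, but treat that as an assumption.

```python
import jax
import jax.numpy as jnp

def layer(w, x):
    """One dense layer on a single example: x has shape (4,), w has shape (4, 3)."""
    return jnp.tanh(x @ w)

w = jnp.ones((4, 3))
x = jnp.ones((4,))
xs = jnp.ones((5, 4))            # a batch of 5 examples

# jit: trace once, compile with XLA, reuse the compiled executable.
layer_jit = jax.jit(layer)
y = layer_jit(w, x)

# vmap: add a batch dimension to x without rewriting the function.
layer_batched = jax.vmap(layer, in_axes=(None, 0))
ys = layer_batched(w, xs)

# What JAX traced (its own IR), and what it lowers to for the compiler.
jaxpr = jax.make_jaxpr(layer)(w, x)
lowered_text = layer_jit.lower(w, x).as_text()

print(y.shape, ys.shape)
print(jaxpr)
print(lowered_text[:300])
```

For new multi-device code, the sharding APIs shown later in this piece are generally the better starting point than `pmap`.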

PyTorch is also supported via PyTorch/XLA, which translates PyTorch operations into XLA operations at runtime. The PyTorch/XLA 2.7 release introduced a Pallas-based ragged paged attention kernel that delivers up to 5× speedups on TPUs for variable-length sequences – which is significant for inference workloads where sequence lengths are unpredictable.

A detail worth knowing: vLLM's TPU backend now uses JAX→XLA as the lowering path for all models, including PyTorch-defined ones. The reason is straightforward – JAX is the more mature stack for parallelism primitives on TPU. Even if your model is defined in PyTorch, the inference serving layer may ultimately run through JAX internals.

The compilation model is a mental shift

On GPU, PyTorch runs eagerly. Operations execute immediately. You can put a print statement in the middle of a forward pass and see the tensor values. Debugging is tactile.

JAX on TPU compiles ahead of time. The first call to a JIT-compiled function traces the computation graph, compiles it to a TPU executable, and caches the result. Subsequent calls with the same input shapes and dtypes execute the compiled artifact. A new input shape misses the cache and triggers recompilation, so variable-length sequences require padding or carefully structured bucketing. Python control flow that branches on tensor values can't be traced, because the values don't exist at compile time; it has to be expressed with primitives like jax.lax.cond, jax.lax.while_loop, or jnp.where.

The transition from eager PyTorch to compiled JAX is the biggest adjustment for engineers coming to TPUs from a GPU-first background. The payoff is that once your model compiles cleanly, the runtime is extremely predictable – no recompilation spikes after the first call, no garbage collection pauses, no surprise allocation failures mid-training run.
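You can watch the cache behavior directly. The sketch below relies on the fact that Python side effects inside a jitted function only run while JAX is tracing, so appending to a list counts traces. The bucket sizes and helper names are hypothetical, and zero-padding is only safe here because zeros don't change a sum of squares; real models need masking.

```python
import jax
import jax.numpy as jnp

trace_log = []   # one entry per trace, i.e. per compilation

@jax.jit
def sum_of_squares(x):
    trace_log.append(x.shape)      # Python side effect: runs only during tracing
    return jnp.sum(x * x)

sum_of_squares(jnp.ones(8))
sum_of_squares(jnp.ones(8))        # same shape and dtype: cache hit
traces_same_shape = len(trace_log)

sum_of_squares(jnp.ones(9))        # new shape: retrace and recompile
traces_new_shape = len(trace_log)

BUCKETS = (16, 32, 64)             # hypothetical bucket sizes

def pad_to_bucket(x):
    """Zero-pad a 1-D array up to the smallest bucket that fits it."""
    n = x.shape[0]
    bucket = next(b for b in BUCKETS if b >= n)
    return jnp.pad(x, (0, bucket - n))

before = len(trace_log)
results = [float(sum_of_squares(pad_to_bucket(jnp.ones(n)))) for n in (9, 12, 15)]
traces_for_bucketed = len(trace_log) - before   # three lengths, one compile

@jax.jit
def clipped_double(x):
    # A Python `if x > 0:` on a traced value would raise an error under jit.
    # Expressing the branch as data keeps the graph static.
    return jnp.where(x > 0, 2.0 * x, 0.0)

print(traces_same_shape, traces_new_shape, traces_for_bucketed)
print(results)
print(clipped_double(jnp.array([-1.0, 0.5, 2.0])))
```

Three different sequence lengths, one compilation. That is the whole argument for bucketing.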

JAX sharding in practice

Distributed training in JAX centers on device meshes and sharding constraints. Here's the minimal version:

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Define a 2D device mesh: data parallelism x model parallelism.
# The reshape assumes exactly 8 devices are visible.
devices = np.array(jax.devices()).reshape(4, 2)  # 8 devices total
mesh = Mesh(devices, ('data', 'model'))

# Tell the compiler how a weight matrix should be sharded:
# split dimension 0 across the 'model' mesh axis and leave dimension 1
# unsplit, so each shard is replicated across the 'data' axis.
weight_sharding = NamedSharding(mesh, PartitionSpec('model', None))

# Apply sharding constraint inside a jitted function
@jax.jit
def forward(params, x):
    w = jax.lax.with_sharding_constraint(params['w'], weight_sharding)
    return x @ w

The compiler inspects the sharding constraints and generates the collective communications – all-reduces, all-gathers – automatically. You describe the intended layout. XLA figures out how to move the data.
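If you don't have eight accelerators handy, the same idea can be exercised on whatever devices JAX can see, including a single CPU. This is a sketch under my own assumptions: the mesh shape is derived from the device count, the array sizes are arbitrary, and on one device the "sharding" is trivially a no-op. The point is that the model code does not change when the mesh does.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Build a (data, model) mesh from however many devices are available.
devices = jax.devices()
n = len(devices)
model_axis = 2 if n % 2 == 0 else 1          # illustrative choice
mesh = Mesh(np.array(devices).reshape(n // model_axis, model_axis),
            ('data', 'model'))

weight_sharding = NamedSharding(mesh, PartitionSpec('model', None))
batch_sharding = NamedSharding(mesh, PartitionSpec('data', None))

@jax.jit
def forward(params, x):
    # Constrain the weight layout; XLA inserts any collectives needed.
    w = jax.lax.with_sharding_constraint(params['w'], weight_sharding)
    return x @ w

# Sizes chosen so every sharded dimension divides evenly across the mesh.
batch = 8 * (n // model_axis)
w = jnp.arange(128 * 64, dtype=jnp.float32).reshape(128, 64) / 1000.0
x = jax.device_put(jnp.ones((batch, 128), dtype=jnp.float32), batch_sharding)

y = forward({'w': w}, x)
expected = np.ones((batch, 128), dtype=np.float32) @ np.asarray(w)

print(mesh.shape)
print(y.shape, y.sharding)
```

Run it on a laptop, then on an 8-chip host, and the only thing that changes is the mesh.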

PyTorch's DTensor is explicitly modeled on this design. It's less mature, and lacks the same depth of compiler integration, but the mental model is converging. If you understand JAX sharding, PyTorch DTensor will feel familiar.

TPU vs. GPU: the practical decision framework

Reach for TPUs when:

You're training or serving at more than 1,000 chips
Your workload is dense-matmul-dominant – transformers, embeddings, large-batch matmul
You're willing to invest in the JAX workflow and accept the compilation model
You care about TCO and power efficiency at scale
Your workloads are stable enough that compilation cost amortizes
You're running large-batch inference or MoE decoding

Stay on GPUs when:

You're doing research with dynamic control flow, graph dependencies, or models that don't fit the static graph model
You need the broadest kernel ecosystem – FlashAttention, custom CUDA, Triton kernels
Your team is small and eager-mode productivity is a real bottleneck
You're deploying on-prem or at the edge
You're building something that doesn't exist yet and need fast iteration

AI Engineer's Take: For most engineers reading this – the ones not training at frontier scale – PyTorch on H100 or H200 is still the right default. TPUs become genuinely compelling when the scale itself is the challenge. At 10,000 chips, the compilation cost looks like a rounding error, and the MFU difference starts showing up in real training costs. Below that threshold, the friction is harder to justify. The honest position is: know JAX, understand the stack, but don't force the migration until the scale demands it.

Why these deals were made: the AI engineer's read

Why Anthropic went a million deep

Anthropic's stated compute strategy is three-platform diversification: Google TPUs, AWS Trainium, and some continued Nvidia GPU access. That's not hedging. That's explicit supply chain architecture.

The reasoning is straightforward. When you're planning a training run that will consume more than a gigawatt of compute for months, the risk of a single-vendor disruption isn't theoretical – it's a planning input. A six-month delay on a chip delivery at that scale doesn't cost you money. It costs you competitive position.

The second factor is inference economics. Google has been explicit that Ironwood is designed for “the age of inference” – high-throughput, low-latency token generation at sustained load. James Bradbury, Anthropic's head of compute, has spoken to this, and Anthropic's announcement flagged both Ironwood's inference performance and its training scalability. The logic is simple: as inference costs come to dominate the P&L, the chip optimized for that regime wins on TCO. At the volumes Anthropic is running Claude, that math moves the needle.

The third factor is that Google is the only vendor who can commit five gigawatts of dedicated compute on a multi-year timeline. That's not a sales pitch. That's a physical capacity constraint. The infrastructure to deliver that doesn't exist at most vendors in 2026.

Why Meta is testing TPUs

Meta is Nvidia's largest single customer. It is not looking at TPUs because Llama trains better on them. It is looking at TPUs because a credible second-source supply at scale is the only thing that forces a dominant supplier to compete on price.

Meta already signed a $60 billion deal with AMD – another second-source signal. The TPU evaluation serves the same strategic function: demonstrate to Nvidia, with real engineering investment, that the alternative is viable. The moment the alternative is viable enough to run production workloads, the negotiation changes.

At volumes that move Nvidia's quarterly earnings – hundreds of thousands of chips per year – even a 5% price reduction compounds into hundreds of millions of dollars in margin. The cost of the engineering investment to make TPUs work is a rounding error against that.
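To put a rough number on that leverage, here is the arithmetic as a sketch. The 5% discount comes from the paragraph above; the chip volume and the per-chip price are hypothetical placeholders I picked for illustration, not reported figures.

```python
chips_per_year = 300_000         # hypothetical: "hundreds of thousands"
price_per_chip_usd = 30_000      # hypothetical average price
discount = 0.05                  # the 5% reduction discussed above

annual_spend = chips_per_year * price_per_chip_usd
annual_saving = annual_spend * discount

print(f"annual spend:  ${annual_spend / 1e9:.1f}B")
print(f"annual saving: ${annual_saving / 1e6:.0f}M")
```

Swap in your own assumptions. Any plausible combination at this volume lands in the hundreds of millions per year, which is the scale a porting team's cost has to be weighed against.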

What Google actually wants

Google is not trying to dethrone Nvidia. That's a media framing, not a strategic goal. Google's actual target is capturing 10–15% of frontier training workloads and a meaningfully larger share of inference – because inference is where the sustained compute demand lives post-deployment.

The 8th-generation architectural split – training-class Sunfish, inference-class Zebrafish – tells you exactly where they see the market going. They're not building one chip to rule everything. They're building purpose-built silicon for the two regimes that matter at scale.

Broadcom is the underappreciated actor in this story. Co-designing the silicon, running the high-speed interconnect layer, and locked in through a $46 billion contract to 2031 – Broadcom is the manufacturing and networking partner that lets Google scale without building a semiconductor fab. That's the partnership that makes the supply chain story coherent.

AI Engineer's Take: The most interesting procurement decision in this landscape isn't Anthropic buying a million TPUs. It's Meta – the company that arguably benefited most from Nvidia's dominance, with the engineering resources to optimize any hardware – putting real engineering investment into a TPU evaluation. That's a signal. When the biggest buyer starts seriously qualifying alternatives, the market structure changes.

What this means for the broader ecosystem

CUDA's moat is real but no longer infinite

CUDA has a fifteen-year head start on kernel development, tooling, and engineering familiarity. That doesn't disappear. What's changed is the top of the stack: frontier labs are porting to JAX, vLLM is consolidating on JAX→XLA for TPU inference, and the new capability releases from Google DeepMind are often published in JAX first.

The moat is real at the midmarket level – teams of ten engineers, research workflows, fine-tuning pipelines. It's softening at the frontier, because the organizations with the resources to absorb the transition cost are the ones with the most to gain from the TCO improvement.

The frontier is bifurcating

Training-class chips and inference-class chips are diverging – not just in architecture, but in procurement, pricing, and deployment model. Nvidia saw this coming with H100/H200 for training and the Blackwell line's inference-tuned variants. AMD is following. Google's Sunfish/Zebrafish split is the clearest articulation of the bifurcation yet.

For engineers, this means the question “what chip should we run?” is increasingly two questions: “what chip should we train on?” and “what chip should we serve on?” The answers may differ, and the cost structures definitely do.

Power is the binding constraint

Ten gigawatts of reserved compute is the headline that should matter most to anyone planning infrastructure at scale.

The binding constraint in AI infrastructure is no longer chip availability – lead times have improved as TSMC and the packaging supply chain have expanded. The constraint is power. Site selection, power purchase agreements, grid interconnection timelines – these are the variables that determine whether a compute commitment can actually be delivered. The labs that move earliest on securing power capacity have a structural advantage that's hard to replicate.

If you're an engineer at a company planning for training at scale, the conversation that matters most isn't with your chip vendor. It's with whoever owns your data center power contracts.

Walled gardens are softening

Every major hyperscaler with custom silicon is now selling it externally. AWS Trainium is publicly available. Google TPUs have been on Cloud for years, and Ironwood just hit general availability. Microsoft's Maia is in preview. The distinction between internal silicon and merchant silicon is blurring.

This is good for the ecosystem. It means frontier hardware is available to teams that aren't named Google or Amazon. It means the software stacks – JAX, PyTorch/XLA – have external users driving quality. And it means the competition for inference workloads specifically will be intense, which generally moves prices in one direction.

AI Engineer's Take: The shift to purpose-built inference silicon is the change that will matter most for production engineers over the next three years. Training runs happen once per major model version. Inference runs continuously, forever, at scale. The chip that's 40% cheaper per token for inference wins more than the chip that's 20% faster for training.
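A toy comparison shows why the inference discount dominates. Every dollar figure below is hypothetical. The two percentages come from the take above, and I'm assuming training cost scales with wall-clock time, so "20% faster" means the same work in 1/1.2 of the time.

```python
training_cost_usd = 100e6             # hypothetical one-time training bill
inference_cost_per_year_usd = 200e6   # hypothetical serving bill per year
years = 3                             # the horizon mentioned above

# 20% faster training: same work in 1/1.2 of the time (assumed cost ~ time).
training_saving = training_cost_usd * (1 - 1 / 1.2)

# 40% cheaper per token, applied to every year of serving.
inference_saving = inference_cost_per_year_usd * years * 0.40

print(f"training saving:  ${training_saving / 1e6:.1f}M (once)")
print(f"inference saving: ${inference_saving / 1e6:.1f}M over {years} years")
```

The exact ratio depends entirely on the inputs, but the shape holds whenever serving spend outlasts and outweighs the training run.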

The practical call

For most workloads at most companies, the switch from PyTorch isn't worth it yet. PyTorch isn't going away – the ecosystem is too large and productive to be displaced quickly. The tooling, the debugging experience, the kernel library: PyTorch on GPU still wins for most teams most of the time. But “most teams most of the time” is a shrinking category. The scale at which TPUs become the right answer keeps moving down, and the software stack keeps improving.

The path in is not steep. JAX's functional style is initially disorienting if you're coming from PyTorch, but the mental model is consistent: pure functions, explicit state management, ahead-of-time compilation. Once it clicks, it's elegant. The sharding primitives, in particular, are better designed than anything in the PyTorch ecosystem today.

The companies training the next generation of frontier models have already made the bet. The infrastructure they're building on runs on JAX and XLA. That's the terrain that's going to produce the models the rest of the industry fine-tunes and serves.

You want to know how that terrain works.

Learn JAX.

Built by an AI Engineer. Not a journalist.
