<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Upcurious]]></title><description><![CDATA[We track where capital is going and what AI tools can actually do. Research for investors. Guides for builders]]></description><link>https://upcurious.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Pkhj!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ddf3a53-a048-4f0e-bb10-96f06dbfe75f_160x160.jpeg</url><title>Upcurious</title><link>https://upcurious.substack.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 09:38:55 GMT</lastBuildDate><atom:link href="https://upcurious.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Upcurious]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[upcurious@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[upcurious@substack.com]]></itunes:email><itunes:name><![CDATA[Upcurious]]></itunes:name></itunes:owner><itunes:author><![CDATA[Upcurious]]></itunes:author><googleplay:owner><![CDATA[upcurious@substack.com]]></googleplay:owner><googleplay:email><![CDATA[upcurious@substack.com]]></googleplay:email><googleplay:author><![CDATA[Upcurious]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How Agentic Work Broke Token Pricing]]></title><description><![CDATA[Seven frontier AI models in 18 days. 
The price page no longer tells you what a task costs.]]></description><link>https://upcurious.substack.com/p/how-agentic-work-broke-token-pricing</link><guid isPermaLink="false">https://upcurious.substack.com/p/how-agentic-work-broke-token-pricing</guid><dc:creator><![CDATA[Upcurious]]></dc:creator><pubDate>Mon, 27 Apr 2026 09:50:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/afc9a547-4dca-4fbe-aa76-b62020181b04_1200x634.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>The instinct is to rank AI models on price-per-token and SWE-Bench score. That ranking no longer tracks what users actually pay.</p><p>Three signals broke at once:</p><ul><li><p><strong>Tokenizers became a hidden price lever.</strong> Opus 4.7 raised bills up to 35% without touching the sticker.</p></li><li><p><strong>Benchmarks became an optimization target.</strong> Three Chinese open-weights labs cleared SWE-Bench Pro inside the 18-day window.</p></li><li><p><strong>The harness layer now drives more cost-per-task variance than the model name does.</strong> SemiAnalysis measured Claude Code consuming roughly 25% more input context per output token than Codex on equivalent coding work, on the same underlying model.</p></li></ul><p>The unit enterprises actually care about is <strong>cost per successful task</strong>: token price &#215; tokenizer expansion &#215; harness token behaviour &#215; retries, divided by task success rate. 
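That unit can be written down directly. A minimal sketch, with retries modeled as the expected 1 / success_rate attempts; every number in the comparison is hypothetical, for illustration only, not a measured result:

```python
def cost_per_successful_task(
    input_price,          # sticker price, $ per million input tokens
    output_price,         # sticker price, $ per million output tokens
    input_mtok,           # million input tokens per attempt, pre-expansion
    output_mtok,          # million output tokens per attempt, pre-expansion
    tokenizer_expansion,  # 1.35 = new tokenizer emits 35% more tokens for the same text
    harness_overhead,     # 1.25 = harness loads 25% more input context per output token
    success_rate,         # fraction of attempts that actually finish the task
):
    """Expected $ spend per *successful* task.

    Retries enter as 1 / success_rate: with independent attempts, that is
    the expected number of attempts before one succeeds.
    """
    per_attempt = (
        input_price * input_mtok * tokenizer_expansion * harness_overhead
        + output_price * output_mtok * tokenizer_expansion
    )
    return per_attempt / success_rate

# Hypothetical comparison: a $5/$30 closed model with a lean harness and a
# high success rate vs. a $1/$3 open-weights model with a chattier harness
# and more retries.
closed = cost_per_successful_task(5, 30, 2.0, 0.1, 1.0, 1.0, 0.80)   # $16.25
open_w = cost_per_successful_task(1, 3, 2.0, 0.1, 1.0, 1.25, 0.65)   # ~$4.31
print(f"closed ${closed:.2f}, open ${open_w:.2f}, gap {closed / open_w:.1f}x")
```

Under these assumed multipliers the 5x sticker gap compresses to under 4x per successful task; different harness and retry assumptions move the gap in either direction, which is the point.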
</p><div><hr></div><h2>What Landed in 18 Days</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rkG1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rkG1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 424w, https://substackcdn.com/image/fetch/$s_!rkG1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 848w, https://substackcdn.com/image/fetch/$s_!rkG1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 1272w, https://substackcdn.com/image/fetch/$s_!rkG1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rkG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png" width="1456" height="876" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4751290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/195602700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rkG1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 424w, https://substackcdn.com/image/fetch/$s_!rkG1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 848w, https://substackcdn.com/image/fetch/$s_!rkG1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 1272w, https://substackcdn.com/image/fetch/$s_!rkG1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce0f6f6b-e5ee-48d2-ac52-c868021ffb91_2450x1474.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Three patterns are visible across the cluster.</p><p><strong>Closed labs are raising prices, sometimes invisibly.</strong> Opus 4.7 took the hidden route: the sticker did not move, the tokenizer did. GPT-5.5 took the visible route, doubling per-token pricing against GPT-5.4 to $5 / $30, with a Pro tier at $30 / $180. Anthropic&#8217;s other play, Mythos, an unreleased frontier model restricted to a small set of launch partners and not slated for general release, has a reported SWE-Bench Pro score of 77.8%, well above anything publicly comparable. 
The price moves and the Mythos restriction are two angles on the same play: protect margin where capability is genuinely above the open-weights wave; raise prices where it is not.</p><p><strong>Open-weights labs are pricing for adoption, not yield.</strong> MiMo V2.5 Pro at $1 / $3 per million tokens is the load-bearing data point on this side, roughly a fifth of GPT-5.5 at frontier-adjacent capability on the public benchmarks. That is not a sustainable yield-curve price. It is a market-entry price: get production workloads onto the platform, accumulate harness-tuning data, price up later. Z.ai and Moonshot are running similar economics. Real per-task gaps are smaller than the 5x sticker once harness behaviour is accounted for, still large enough to move procurement.</p><p><strong>Meta added a closed consumer model at the open-weights peak.</strong> Muse Spark landed two days after GLM-5.1 became the first open-weights model to top SWE-Bench Pro. That does not mean Meta has abandoned open source. Zuckerberg has said Meta still plans new open-source models. It does mean Meta is separating its consumer-product strategy from the raw model-access strategy: frontier capability can be routed first into Meta AI, with planned rollout across WhatsApp, Instagram, Facebook, Messenger, and Meta smart glasses, without being released as weights on day one. 
It is a different vertical-integration move from Anthropic&#8217;s, on a different end of the stack.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zKgS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zKgS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zKgS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 848w, https://substackcdn.com/image/fetch/$s_!zKgS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zKgS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zKgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png" width="1456" height="772" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:444838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/195602700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zKgS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 424w, https://substackcdn.com/image/fetch/$s_!zKgS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 848w, https://substackcdn.com/image/fetch/$s_!zKgS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!zKgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7151c3a6-79db-4d0f-993a-6d6919c99073_3440x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This is the cost-per-task argument in action. Running the full Artificial Analysis Intelligence Index evaluation costs $4,811 on Opus 4.7 max-config and $3,357 on GPT-5.5 xhigh, against $1,071 on DeepSeek V4 Pro Max and $462 on MiMo V2.5 Pro. The closed-source max configurations sit at roughly 10x the open-weights frontier on the same benchmark suite, and reasoning cost dominates that bill: 60&#8211;75% of the total on closed-source max configurations, close to zero on the open-weights side. Configuration is itself a cost lever within a single model. GPT-5.5 ranges from $1,199 (medium) to $3,357 (xhigh), a 2.8x spread on the same weights. 
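Those published totals reduce to the two ratios this paragraph leans on; a quick sketch, using only the figures quoted above:

```python
# Full-suite evaluation costs quoted above (USD per run of the
# Artificial Analysis Intelligence Index).
eval_cost = {
    "Opus 4.7 max":        4811,
    "GPT-5.5 xhigh":       3357,
    "GPT-5.5 medium":      1199,
    "DeepSeek V4 Pro Max": 1071,
    "MiMo V2.5 Pro":        462,
}

# Closed-source max config vs. the cheapest open-weights frontier entry.
closed_vs_open = eval_cost["Opus 4.7 max"] / eval_cost["MiMo V2.5 Pro"]
# Same weights, different reasoning configuration.
config_spread = eval_cost["GPT-5.5 xhigh"] / eval_cost["GPT-5.5 medium"]

print(f"closed vs open: {closed_vs_open:.1f}x, config spread: {config_spread:.1f}x")
```

The same arithmetic applied to DeepSeek V4 Pro Max instead of MiMo gives a smaller but still material gap, which is why "roughly 10x" above refers to the cheapest entry in the open-weights cluster.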
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZnnR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZnnR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 424w, https://substackcdn.com/image/fetch/$s_!ZnnR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 848w, https://substackcdn.com/image/fetch/$s_!ZnnR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!ZnnR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZnnR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png" width="1456" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1542032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/195602700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZnnR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 424w, https://substackcdn.com/image/fetch/$s_!ZnnR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 848w, https://substackcdn.com/image/fetch/$s_!ZnnR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 1272w, https://substackcdn.com/image/fetch/$s_!ZnnR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64ce943c-d557-4d2c-874b-776a290a803a_3126x1288.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>The Public Signal Is Now Broken</h2><p>Five things have gone wrong with the signals at the same time.</p><p><strong>Benchmark optimization has saturated the open-weights cluster.</strong> SWE-Bench Pro was meant to be the contamination-resistant successor to SWE-Bench, with private repositories used to keep training data clean. Within months of the benchmark becoming the industry reference, three Chinese open-weights labs have published models that occupy the top of the leaderboard, all within an 18-day window. That is not impossible to achieve honestly, and the underlying capability is real. It also fits the pattern of a benchmark that has become the optimization target rather than the measurement instrument. 
The contamination asymmetry runs in one direction: open-weights labs face fewer institutional checks on training data composition than closed labs do, and the cost of getting caught is lower. The more defensible read is not that the scores are fake; it is that leaderboard position now needs production evidence before it should be treated as durable capability.</p><p><strong>Closed labs curate benchmark framing, too.</strong> OpenAI publicly campaigned for SWE-Bench Pro as the right industry benchmark in February 2026, and the GPT-5.5 announcement did publish a SWE-Bench Pro score: 58.6%, with a memorization caveat. That is better than omission, but it does not remove the disclosure problem. Closed labs still choose which benchmarks get headline treatment, which internal evals stay private, and how caveats are framed. SemiAnalysis&#8217;s broader read is that Mythos and Opus 4.7 remain stronger in some important zones even where GPT-5.5 looks better on selected public numbers. The benchmark game is asymmetric on both sides. Open-weights labs optimize for the score; closed-source labs optimize the disclosure frame. Both compromise the public signal.</p><p><strong>Opus 4.6 and 4.7 had visible production trust issues.</strong> Anthropic&#8217;s Claude Code product carried three production bugs that affected many users before the fix posted. Engineers using Claude Code through the 4.6 and 4.7 cycle reported behaviour changes that did not match the clean release-note story: more brittle instruction-following, unexpected regressions, and tool-use issues that varied by harness and effort level. The important point is not intent. It is that the closed-source trust premium (the reason an enterprise pays $5 / $25 for Opus rather than $1 / $3 for MiMo) depends on stability, disclosure, and predictable migration behaviour. Anthropic spent some of that premium in this cycle.</p><p><strong>Tokenizer drift, in detail.</strong> Opus 4.7&#8217;s change is the cleanest case. 
A workload migrated from 4.6 to 4.7 today can cost up to 35% more for the same inputs and outputs, with all of the change driven by how the model encodes text rather than what the model does with it. </p><p><strong>Harness as the dominant cost variable.</strong> A coding agent does not run a single forward pass. It plans, reads files, drafts, runs tests, reads errors, redrafts. Each step is a token decision driven by the harness: what context to load, when to compact, what tools to call, when to stop. Codex and Claude Code on the same underlying model produce materially different per-task costs. Cursor and OpenCode on the same model produce materially different costs again. The harness is now part of the product, and it is where most of the cost-per-task delta lives. Two of the three inputs to real cost &#8212; tokenizer behaviour and harness behaviour &#8212; sit outside the standard pricing comparison.</p><div><hr></div><h2>Updating the Demand Engine Thesis</h2><p>Two weeks ago, Anthropic&#8217;s Managed Agents launch read as a vertical-integration play: Anthropic capturing the runtime moat by selling a managed agent platform on top of its model rather than letting customers wire up their own. That framing is intact. The April cluster bounds the upside of that thesis tighter than the prior piece suggested.</p><p>Even discounting the benchmark-gaming concern from the prior section, the floor under managed pricing is bounded by two things that do not depend on leaderboard truth: open-weights pricing and architectural memory efficiency. MiMo at $1 / $3 versus GPT-5.5 at $5 / $30 is a procurement decision regardless of who tops SWE-Bench Pro. DeepSeek V4 cutting KV cache by an order of magnitude is structural inference economics regardless of any capability claim. The floor is no longer one open-weights model. It is a wave. GLM-5.1 cleared the closed-source coding leaderboard at zero per-token cost beyond inference compute. Kimi K2.6 followed within two weeks. 
An enterprise that wanted to run a long-running agent workflow on closed infrastructure two weeks ago was paying to keep HBM pinned for the length of the session. That same enterprise today has at least four credible self-hosting options, three with open weights, one with first-party pricing that approaches commodity.</p><p>This does not collapse the managed-runtime business. It bounds the pricing power. Anthropic and OpenAI can charge a premium where reliability, security review, accountability, and integrated tooling matter more than unit cost, which is most regulated enterprises today. They cannot charge an unbounded premium, because the cost of the alternative just dropped sharply on multiple axes simultaneously. The premium is bounded above by what an enterprise can replicate with open weights and competent infrastructure, and that ceiling moved down materially.</p><p>DeepSeek V4&#8217;s reception is the cleanest case of the broken signal. Every public signal interprets this release as underwhelming. 
The architecture says the opposite.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r-8-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r-8-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 424w, https://substackcdn.com/image/fetch/$s_!r-8-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 848w, https://substackcdn.com/image/fetch/$s_!r-8-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 1272w, https://substackcdn.com/image/fetch/$s_!r-8-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r-8-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png" width="1412" height="434" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:434,&quot;width&quot;:1412,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102824,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/195602700?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r-8-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 424w, https://substackcdn.com/image/fetch/$s_!r-8-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 848w, https://substackcdn.com/image/fetch/$s_!r-8-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 1272w, https://substackcdn.com/image/fetch/$s_!r-8-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb135764e-faa3-4607-9fb7-4481f230d9f7_1412x434.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The disappointment narrative is the public signal failing to capture the structural one.</p><div><hr></div><h2>Efficiency, Jevons, and the Demand Engine</h2><p>The compression wave is real. The demand engine is also real. The empirical question is whether efficiency outpaces usage growth, or usage growth absorbs the efficiency gain.</p><p>The OpenRouter scoreboard is the cleanest public signal on this. Weekly token volume on the platform is up roughly four times year-over-year, from approximately 5 trillion in April 2025 to over 20 trillion in April 2026. The fastest-growing behaviour is agentic inference: single requests pushed from thousands of tokens to 100K-1M tokens, run in long sequences with tool calls. 
That growth happened against continuous efficiency improvement: better quantization, sparse attention, prompt caching, KV compaction. None of it slowed the demand curve.</p><p>The composition shift on OpenRouter is the second signal worth holding. Chinese-origin models now account for over 45% of platform traffic, up from under 2% in late 2024. That number reflects two things at once: real-world quality adoption (engineers running production workloads on cheaper open-weights models that perform well enough) and price-driven traffic that does not reflect quality choices. Both interpretations advance the cost-per-task argument: enterprises are increasingly willing to migrate workloads onto cheaper models, either because the quality holds or because the cost differential is large enough to absorb the quality risk.</p><p>DeepSeek V4 will follow the same pattern. This is the Jevons shape: cutting long-context memory by an order of magnitude does not deflate inference demand; it expands the set of workloads where million-token context is economically justifiable, which means more million-token sessions, which means more sustained HBM demand in aggregate even as each session uses less HBM per unit of work. Total memory load goes up unless capability ceilings hold demand back, and the demand ceiling has been receding for three years.</p><p>The counter-evidence worth holding: capability dominance. If Mythos, Opus 4.7, and GPT-5.5 widen the outcome-quality gap meaningfully against open weights (and if benchmark-optimization on the open-weights side reflects gaming rather than real capability), cost-per-task comparisons matter less because only the frontier closed model finishes the job reliably. SemiAnalysis&#8217;s read is that the capability gap still applies in narrow zones: Mythos on cybersecurity, Opus on long open-ended tasks, GPT-5.5 on dense reasoning. Outside those zones, the cost frame increasingly drives selection. 
Leaderboard dominance from the open-weights cluster does not by itself prove the cost frame has won. It does confirm the cost frame is now being seriously tested.</p><div><hr></div><h2>What This Means for Where Value Sits</h2><p>The thesis from this series holds, with the following updates.</p><p><strong>Memory, packaging, and power names: thesis intact and reinforced.</strong> Seven frontier-class products land on the same constrained infrastructure. SK Hynix sold out through 2026, Micron sold out through 2026, TSMC CoWoS reserved by NVIDIA. The DeepSeek and MiMo efficiency improvements expand the demand pool rather than relieve the supply constraint. The bottleneck names remain the cleanest public expression of this trade.</p><p><strong>Harness layer: analytically real, structurally private.</strong> Cursor, Cognition, OpenCode, and the first-party harnesses inside frontier labs are where cost-per-task is determined. There is no clean public-equity expression of this layer today. Watch for acquisitions, IPO filings, and any disclosure of session-level economics. Anthropic&#8217;s eventual S-1, rumoured for October 2026, will be the first published view of harness economics from the inside.</p><p><strong>Managed-runtime thesis: bounded above by an open-weights wave, not one model.</strong> The prior piece&#8217;s read on Managed Agents was directionally right. The ceiling on that thesis is now visible and lower than two weeks ago. The product is real, the moat is real, the pricing power is bounded by what an enterprise can replicate with open-weights infrastructure at frontier-adjacent capability, which is now four models, including one (MiMo) with first-party pricing at roughly a fifth of the closed-source frontier.</p><p><strong>At risk: API-resold middleware.</strong> Companies that built features on top of frontier APIs without owning either the model or the harness layer face compression from both directions. Frontier labs are descending into the harness. 
Open weights are rising into capability ranges that previously required closed APIs. The middle of the stack is where the squeeze lands first.</p><p><strong>Trust premium: under explicit pressure.</strong> Open-weights pricing is now low enough and capability close enough that enterprises will require predictable migration behaviour, stable tokenizers, and a low-babysitting harness to justify the closed-source premium. If closed labs spend that capital faster than Mythos-tier capability replenishes it, the cost-per-task argument starts winning workloads where it currently does not.</p><p><strong>Subscription pricing is breaking too, for the same reason.</strong> Cursor shifted from &#8220;unlimited&#8221; to capped and usage-metered tiers in 2025 after power users running coding agents burned through subscription compute orders of magnitude faster than the plans were designed to absorb. Claude Pro and Max have weekly limits and throttling on heavy users. ChatGPT Plus has had message caps from the start. GitHub Copilot, Replit, and Cognition&#8217;s Devin have all moved toward usage-metered or &#8220;premium request&#8221; tiers. Per-token pricing breaks the procurement comparison; subscription pricing breaks the consumer plan. Both were structures designed for an earlier consumption pattern, and both are compressing under the same agentic load.</p><div><hr></div><h2>What We Are Watching</h2><p><strong>Big Tech earnings, in flight now.</strong> Meta, Google, Microsoft, and Amazon report through late April and early May. The number that matters is whether 2026 capex guidance moves up against the agent-driven demand signal. Hyperscaler capex revisions are the next confirmation point.</p><p><strong>Real-world adoption of the open-weights cluster, and migration patterns inside the closed-source pair.</strong> OpenRouter share is the cleanest leading indicator on the open-weights side. 
If GLM-5.1, Kimi K2.6, and MiMo V2.5 Pro accumulate share on production workloads, not just price-driven test traffic, the benchmark-optimization skepticism gets resolved one way. The companion signal sits inside the closed-source pair: if teams using Claude Code start switching to Codex on GPT-5.5 for execution work, the cost-per-task argument is being priced in real time. If migration stalls because the harness gap closes more slowly than the model gap, the harness layer just got more valuable.</p><p><strong>Mythos release timeline.</strong> A general-availability date for Mythos would be a meaningful re-rating event for both inference demand and capability competition. It would also be the cleanest counter evidence for the cost-per-task thesis, if Mythos&#8217;s capability lead is large enough to make the cost frame irrelevant in the workloads that matter most.</p><p><strong>Anthropic S-1 filing.</strong> Rumoured for October 2026. The single most informative document this thesis is waiting on. It will resolve the gross-versus-net revenue question, expose session economics, and put the managed-runtime moat on a balance sheet the market can price.</p><div><hr></div><p><em>This is part of an ongoing series on AI infrastructure economics. Earlier installments covered the bottleneck sequence, the memory crunch and the compression wave moving against it, and the model race as the demand engine behind the infrastructure thesis. 
This piece updates the demand engine read for the seven-release April cluster, and bounds the managed-runtime thesis against the open-weights wave.</em></p><p><em>For more on the AI stack and where value flows, visit <a href="https://www.theupcurious.com/">theupcurious.com</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[The Model Race Is Heating Up]]></title><description><![CDATA[What Anthropic vs. OpenAI means for enterprise software, compute demand, and how you work]]></description><link>https://upcurious.substack.com/p/the-model-race-is-heating-up</link><guid isPermaLink="false">https://upcurious.substack.com/p/the-model-race-is-heating-up</guid><dc:creator><![CDATA[Upcurious]]></dc:creator><pubDate>Thu, 16 Apr 2026 06:02:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/70ce00ad-9e9f-4df4-bb47-957e87f268ec_1200x634.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Anthropic reported $30 billion in annualised revenue, launched a fully managed agent infrastructure platform, and quietly distributed its most capable model to a restricted group of partners without releasing it to the public. 
OpenAI&#8217;s chief revenue officer responded with a four-page internal memo arguing Anthropic is gaining momentum in enterprise &#8212; and separately, Sam Altman confirmed that pre-training for OpenAI&#8217;s next frontier model finished on March 24, with a release expected within weeks.</p><p>More enterprise customers means more sustained inference load on the same constrained stack. The model race is the demand engine.</p><div><hr></div><h2>What the Revenue Numbers Actually Say</h2><p>Anthropic said on April 6 that its run-rate revenue had surpassed $30 billion in 2026 &#8212; the first time any rival has publicly claimed to match or surpass OpenAI&#8217;s run rate of $25 billion. There is a caveat.</p><p>OpenAI&#8217;s CRO pushed back in an internal memo, first reported by CNBC and later published in full by The Verge, arguing Anthropic&#8217;s number is inflated by roughly $8 billion. The accounting difference: Anthropic counts gross revenue from cloud partners, booking the full amount billed through AWS and Google Cloud before those platforms take their cut. OpenAI reports net of Microsoft&#8217;s share. Adjusted for that, the real comparison is closer to $22 billion versus $25 billion &#8212; OpenAI still ahead.</p><p>The accounting dispute matters less than the enterprise momentum in Ramp&#8217;s data. In Ramp&#8217;s April 15 AI Index update &#8212; built from corporate card and bill-pay flows across 50,000+ US businesses &#8212; Anthropic reached 30.6% of AI-paying businesses versus OpenAI&#8217;s 35.2%, cutting the gap to 4.6 percentage points. That is not a direct measure of token demand, but it is one of the clearest public indicators of enterprise customer momentum. OpenAI&#8217;s own CRO considered that momentum serious enough to write a four-page memo.</p><p>The revenue composition gap reinforces it. 
Anthropic&#8217;s mix skews heavily toward enterprise &#8212; CEO Dario Amodei has cited roughly 80% of revenue from business customers &#8212; while OpenAI carries a larger consumer share. Enterprise workloads run longer, burn more tokens per session, and sustain higher memory loads than chatbot interactions. As Anthropic gains enterprise share, the per-customer infrastructure demand rises with it.</p><div><hr></div><h2>What Anthropic Is Not Releasing &#8212; and Why</h2><p>On April 7, Anthropic announced it had built what it described as the most capable model it has ever created. Then it did not release it.</p><p>Anthropic&#8217;s Project Glasswing gives Mythos Preview to a restricted set of launch partners and additional critical-infrastructure organizations for defensive cybersecurity work, underscoring both the model&#8217;s capability and Anthropic&#8217;s decision not to release it broadly. Twelve named launch partners &#8212; including AWS, Apple, Google, Microsoft, and NVIDIA &#8212; received access, alongside more than 40 additional organisations that build or maintain critical software infrastructure.</p><p>The benchmark numbers explain the caution. Mythos leads 17 of 18 benchmarks Anthropic measured. On SWE-Bench Verified &#8212; real-world software engineering &#8212; it scores 93.9%, up from 80.8% for the prior model. A 13-point gain in a domain where 2-3 points was previously considered meaningful progress. On CyberGym, the cybersecurity capability benchmark, it scores 83.1%. In one test, it developed 181 working browser exploits from vulnerabilities it autonomously identified, versus 2 for the prior model.</p><p>That last number is the reason for restricted release, stated plainly. The same capability that finds vulnerabilities at scale can be redirected to weaponise them.</p><p>The strategic implication is less obvious. 
Anthropic is holding a capability tier above what is publicly available &#8212; deliberately, with named enterprise partners, as a demonstration of safety discipline. This is not just caution. It is positioning. The enterprises deploying AI in regulated, high-stakes environments have a procurement story to tell their boards. Anthropic is building that story one announcement at a time.</p><p>Whether Mythos eventually reaches general availability is not confirmed. What the restricted access does establish is a demand signal: when a model at this capability level deploys at broader scale, inference load goes up again. That timeline is unknown, but the direction is not.</p><div><hr></div><h2>The Inference Supercycle</h2><p>In 2023, training dominated AI compute. Building the models consumed most of the resources. Running them was secondary.</p><p>That has reversed. Inference now accounts for roughly two-thirds of all AI compute in 2026, up from one-third in 2023 and half in 2025, according to Deloitte&#8217;s 2026 TMT Predictions report. The shift is structural. Enterprises are not experimenting anymore &#8212; they are running AI in production, continuously, at scale.</p><p>Claude Managed Agents, launched in public beta on April 8, moves Anthropic further from selling model access alone toward operating agent infrastructure on behalf of customers. Define the agent in natural language or YAML (a configuration file), set the guardrails, and Anthropic handles hosting, scaling, and monitoring. Sessions are persistent. Context is maintained across the full task lifecycle. Runtime is billed at $0.08 per session hour, on top of standard token costs.</p><p>What Managed Agents actually represents is a shift in the unit of compute. A chatbot interaction is a round trip &#8212; a few seconds, a few thousand tokens, done. A managed agent session working through a codebase or a complex document runs for minutes or hours, consuming 100K to 500K tokens before it finishes. 
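</p><p>The billing shape of one such session is easy to sketch. The $0.08 per session hour runtime rate is the published figure quoted above; the per-million-token prices in this sketch are illustrative assumptions, not quoted rates:</p>

```python
RUNTIME_PER_HOUR = 0.08  # published Managed Agents runtime rate, $ per session hour

def session_cost(hours: float, input_tokens: int, output_tokens: int,
                 in_price_per_mtok: float = 3.0,    # assumed input price, $/Mtok
                 out_price_per_mtok: float = 15.0   # assumed output price, $/Mtok
                 ) -> float:
    """Runtime fee plus token charges for one managed agent session."""
    runtime = hours * RUNTIME_PER_HOUR
    tokens = (input_tokens * in_price_per_mtok
              + output_tokens * out_price_per_mtok) / 1e6
    return runtime + tokens

# A two-hour session consuming 400K input and 50K output tokens:
print(f"${session_cost(2.0, 400_000, 50_000):.2f}")  # the runtime fee is $0.16 of this
```

<p>Under those assumed prices the runtime fee is a small slice of the bill and the token line dominates, which is why the unit of compute, not the hourly rate, is the story.</p><p>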
Previously, deploying that kind of session in production required your team to build and operate the infrastructure. Anthropic is now absorbing that entirely. The compute demand per customer goes up. The barrier to creating it just dropped by an order of magnitude.</p><p>Jevons paradox is already visible &#8212; the idea that making something cheaper and easier to use tends to grow total demand, not shrink it. Managed Agents reduces the fixed engineering cost and operational friction of running an agent workflow. The runtime fee is only part of the bill; the bigger change is that Anthropic is absorbing the sandboxing, orchestration, session management, and monitoring work that previously took teams months to build. When agent infrastructure gets easier to deploy, more companies deploy agents. When more companies deploy agents, total token volume rises. That is the pattern that has defined every prior technology supercycle &#8212; efficiency enabling more usage, not less.</p><div><hr></div><h2>What This Means for How You Work</h2><p>The practitioner shift is already visible in the tooling. Ninety days ago, AI coding tools mostly behaved like assistants: you wrote the prompt, they returned a draft, and you reviewed it. Now the workflow is increasingly supervisory. Tools like Claude Code can break work into subtasks, operate over longer sessions, and come back with code, tests, and follow-up actions. The job is moving from writing better prompts to defining the task, setting constraints, and reviewing the result.</p><p>That changes the economics as well as the workflow. A short assistant interaction might use a few thousand tokens; an agentic session working through a real codebase can run far longer and consume materially more compute before it is done. Managed runtimes extend that pattern beyond code into research, operations, and back-office workflows. 
The human is still in the loop, but later in the process: less step-by-step execution, more specification and review. From the infrastructure side, that means more compute consumed per completed task, not less.</p><div><hr></div><h2>Where Value Accretes in the Stack</h2><p>The model race has one clear beneficiary at the infrastructure layer: NVIDIA. More inference sessions mean more GPU utilisation. Longer, persistent sessions mean more HBM (high-bandwidth memory) demand per GPU. Neither Anthropic nor OpenAI is building infrastructure that reduces demand for Blackwell or Vera Rubin. They are filling it.</p><p>The underlying supply picture hasn&#8217;t changed &#8212; HBM is sold out through end of 2026, CoWoS (TSMC&#8217;s advanced chip packaging) capacity is dominated by NVIDIA reservations, and neither constraint eases because the demand driver changed form. Managed Agents is a new demand signal layered on top of an already constrained stack. Watch for capex guidance in Big Tech earnings starting April 22 &#8212; any upward revision in inference infrastructure spend is the first public confirmation of the agent-driven shift.</p><p>The more interesting question is what happens in the middle. Enterprise software companies that built their value proposition on &#8220;we&#8217;ll handle the AI complexity for you&#8221; face a different competitive environment when Anthropic is running the agents directly. Notion is an early Managed Agents adopter &#8212; which means Notion is now a customer of Anthropic&#8217;s infrastructure layer, not just its models. The orchestration, session management, and monitoring that middleware vendors and integration platforms previously sold as differentiation is now a platform feature at $0.08 per hour.</p><p>The vertical integration runs in both directions. As the labs move down toward infrastructure, the value of the application layer becomes harder to defend on anything except proprietary data, workflow depth, or distribution. 
Building on top of a lab that is also building your product&#8217;s core infrastructure is a different strategic position than it was eighteen months ago. Companies that assumed the frontier labs would stay in their lane are finding that assumption was wrong.</p><div><hr></div><h2>What Could Break This</h2><p>The inference demand thesis weakens if efficiency compounds faster than usage.</p><p>Managed Agents supports built-in prompt caching, compaction, and other context optimizations. If those systems deliver high cache hit rates or reduce repeated long-context reads, the effective compute footprint per workflow could come in below the raw token math. Anthropic has not published workload composition or realized cache-hit data. Until it does, this is an unknown variable with real weight.</p><p>The enterprise crossover story deflates if OpenAI&#8217;s push accelerates faster than Anthropic can convert. OpenAI is already at 40% enterprise revenue, up from 30% a year ago, and is not standing still. The Ramp data shows Anthropic momentum, not a completed crossover.</p><p>Open-weights models are the structural floor under both. Meta&#8217;s Llama 4 supports a ten-million token context window and runs on commodity infrastructure. If open-source frontier models reach enterprise-grade reliability, enterprises running their own inference infrastructure do not add demand to the constrained GPU clusters that Anthropic and OpenAI are filling. They move the load elsewhere &#8212; and the managed inference moat both companies are building becomes harder to price above cost.</p><div><hr></div><h2>What We Are Watching</h2><p><strong>Enterprise customer share crossover.</strong> The Ramp AI Index has Anthropic closing fast &#8212; from 24.4% to 30.6% in the latest monthly update. 
The next Ramp release will either confirm the trajectory or show it stalling.</p><p><strong>Managed Agents adoption velocity.</strong> The list of enterprise names announcing integrations over the next 90 days will indicate whether the managed-runtime model is clearing the market.</p><p><strong>Big Tech earnings, starting April 22.</strong> TSMC&#8217;s Q1 results confirmed 35% revenue growth and CoWoS capacity dominated by NVIDIA reservations. The question for Meta, Google, Microsoft, and Amazon is whether capex guidance reflects agent-driven inference load. Any upward revision in inference infrastructure spend is direct confirmation of the demand signal we are tracking.</p><p><strong>Next frontier models: Mythos and Spud.</strong> Mythos Preview is now accessible through Amazon Bedrock as a gated research preview &#8212; allowlisted organisations, cybersecurity use cases only, US East region. That is a narrower channel than general availability, but it is no longer purely internal. Watch how quickly the allowlist expands and whether use cases beyond cybersecurity get cleared. On the OpenAI side, Spud remains a rumour with no public benchmarks or confirmed release date, despite pre-training completing on March 24. Altman described it as a model that could &#8220;really accelerate the economy.&#8221; If Spud launches with a step-change in capability, it raises the inference demand ceiling and resets the competitive framing &#8212; in both directions.</p><p><strong>IPO filings.</strong> Anthropic is rumoured for October 2026. The S-1 (IPO prospectus) will resolve the gross versus net revenue question definitively, and will be the first public view of actual token consumption, session economics, and infrastructure cost structure. 
That filing will contain more useful data about the inference supercycle than any estimate produced before it.</p><div><hr></div><p><em>This is part of an ongoing series on AI infrastructure economics. Part 1 covered the full bottleneck sequence &#8212; power, packaging, fabs, and EUV. 
Part 2 covered the memory crunch and the compression wave building against it.</em></p><p><em>For more on the AI stack and where value flows, visit <a href="https://www.theupcurious.com/">theupcurious.com</a>.</em></p><div><hr></div><h2>Sources</h2><p><strong>Revenue and enterprise share</strong></p><ul><li><p>Anthropic $30B ARR (Google/Broadcom compute partnership announcement, April 2026): <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute">https://www.anthropic.com/news/google-broadcom-partnership-compute</a></p></li><li><p>OpenAI CRO internal memo: The Verge &#8212; <a href="https://www.theverge.com/ai-artificial-intelligence/911118/openai-memo-cro-ai-competition-anthropic">https://www.theverge.com/ai-artificial-intelligence/911118/openai-memo-cro-ai-competition-anthropic</a></p></li><li><p>Ramp AI Index April 2026 &#8212; &#8220;As AI adoption crosses 50%, the tokenmaxxing economy splits off and up&#8221; (Anthropic 30.6% / OpenAI 35.2%): <a href="https://ramp.com/leading-indicators/the-tokenmaxxing-economy-splits-off-and-up">https://ramp.com/leading-indicators/the-tokenmaxxing-economy-splits-off-and-up</a></p></li></ul><p><strong>Mythos and Project Glasswing</strong></p><ul><li><p>Project Glasswing launch (April 7, 2026): <a href="https://www.anthropic.com/glasswing">https://www.anthropic.com/glasswing</a></p></li><li><p>Mythos Preview benchmarks and technical write-up: <a href="https://red.anthropic.com/2026/mythos-preview/">https://red.anthropic.com/2026/mythos-preview/</a></p></li><li><p>Mythos Preview on Amazon Bedrock (gated research preview): <a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-bedrock-claude-mythos/">https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-bedrock-claude-mythos/</a></p></li></ul><p><strong>OpenAI Spud</strong></p><ul><li><p>Leaked memo and Altman pre-training confirmation: <a 
href="https://the-decoder.com/openais-leaked-memo-says-new-spud-model-will-make-all-its-products-significantly-better/">https://the-decoder.com/openais-leaked-memo-says-new-spud-model-will-make-all-its-products-significantly-better/</a></p></li></ul><p><strong>Inference supercycle</strong></p><ul><li><p>Inference = 2/3 of AI compute: Deloitte 2026 TMT Predictions &#8212; <a href="https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/compute-power-ai.html">https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2026/compute-power-ai.html</a></p></li><li><p>Computerworld CES 2026 supporting coverage: <a href="https://www.computerworld.com/article/4114579/ces-2026-ai-compute-sees-a-shift-from-training-to-inference.html">https://www.computerworld.com/article/4114579/ces-2026-ai-compute-sees-a-shift-from-training-to-inference.html</a></p></li><li><p>Claude Managed Agents overview: <a href="https://platform.claude.com/docs/en/managed-agents/overview">https://platform.claude.com/docs/en/managed-agents/overview</a></p></li><li><p>Claude Managed Agents pricing: <a href="https://platform.claude.com/docs/en/about-claude/pricing">https://platform.claude.com/docs/en/about-claude/pricing</a></p></li></ul><p><strong>Infrastructure</strong></p><ul><li><p>SK Hynix HBM sold out through 2026: <a href="https://www.techspot.com/news/110058-sk-hynix-completely-sells-out-semiconductor-supply-ai.html">https://www.techspot.com/news/110058-sk-hynix-completely-sells-out-semiconductor-supply-ai.html</a></p></li><li><p>Micron HBM sell-out confirmation: <a href="https://www.trendforce.com/news/2025/08/13/news-hbm-battle-heats-up-micron-reportedly-hints-2026-sell-out-sk-hynix-yet-to-confirm/">https://www.trendforce.com/news/2025/08/13/news-hbm-battle-heats-up-micron-reportedly-hints-2026-sell-out-sk-hynix-yet-to-confirm/</a></p></li><li><p>TSMC CoWoS / NVIDIA capacity dominance: <a 
href="https://www.digitimes.com/news/a20260410VL204/packaging-capacity-tsmc-nvidia-demand.html">https://www.digitimes.com/news/a20260410VL204/packaging-capacity-tsmc-nvidia-demand.html</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[The Quiet Force Reshaping AI Infrastructure]]></title><description><![CDATA[Why the most important cost driver in AI isn't compute &#8212; and what happens when it gets compressed]]></description><link>https://upcurious.substack.com/p/the-quiet-force-reshaping-ai-infrastructure</link><guid isPermaLink="false">https://upcurious.substack.com/p/the-quiet-force-reshaping-ai-infrastructure</guid><dc:creator><![CDATA[Upcurious]]></dc:creator><pubDate>Thu, 26 Mar 2026 04:45:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4a6bfa55-d829-4832-a6f7-3cdfd3b76bd1_2162x1158.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most AI infrastructure conversations revolve around compute: how many H100s a hyperscaler has ordered, what NVIDIA&#8217;s next chip roadmap looks like, how many exaflops a data centre can deliver. This framing is understandable but incomplete. As AI models get deployed at scale for longer and longer conversations, a different bottleneck is quietly becoming the dominant cost driver in production inference: memory.</p><p>Specifically, a component called the KV cache &#8212; and the expensive, constrained chip technology required to hold it.</p><p>The memory crunch is no longer a niche concern. Micron just reported the strongest quarter in its history. HBM is sold out through 2026 across all three producers. Lead times are years, and the supply side cannot catch up. Everyone knows this part of the story. 
The harder question is why demand broke so far past what anyone modeled &#8212; and whether the efficiency wave building against it will move fast enough to change the investment math.</p><div><hr></div><h2>The Demand Math Broke</h2><p>The memory demand math was based on chatbots. A user sends a message. The model responds. A few thousand tokens. The GPU moves on. Under that model, memory was a commodity input &#8212; important but predictable.</p><p>Two things broke that assumption simultaneously.</p><p><strong>The KV cache explosion.</strong> Every transformer stores intermediate results &#8212; key and value matrices &#8212; for every token in the conversation. Think of it as the model keeping a detailed notepad of everything it has already read, so it only needs to process what is genuinely new. This KV cache lives in HBM, the high-bandwidth memory stacked on the GPU, and it must stay there for the entire session.</p><p>The size scales linearly with context length. And context windows have gone from 8K to 200K to 1M in three years. 
Here is what that does to a single session on a large model:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gd63!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gd63!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 424w, https://substackcdn.com/image/fetch/$s_!gd63!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 848w, https://substackcdn.com/image/fetch/$s_!gd63!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 1272w, https://substackcdn.com/image/fetch/$s_!gd63!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gd63!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png" width="1456" height="640" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1993034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/192170383?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gd63!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 424w, https://substackcdn.com/image/fetch/$s_!gd63!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 848w, https://substackcdn.com/image/fetch/$s_!gd63!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 1272w, https://substackcdn.com/image/fetch/$s_!gd63!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67cb9ca7-ca0b-448a-8a2d-02a19e6f529c_2198x966.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The industry is building toward the bottom of that table. Every row down doubles the memory bill.</p><p><strong>Agents changed the unit economics.</strong> A chatbot interaction costs under 2,000 tokens. An agentic task &#8212; reading a codebase, writing code, running tests, debugging, iterating &#8212; costs 100,000 to 500,000 tokens compressed into a single growing context, with the model reasoning internally at every step. The model sees the full history on every forward pass. Run two agents in parallel and you hold two full KV caches in HBM simultaneously. The multiplier versus chatbot economics is not 5x. It is 50 to 100x per unit of user-facing work.</p><p>This is not hypothetical. 
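</p><p>The full-history mechanics behind those numbers are simple to sketch: a toy session whose context grows by a fixed chunk each step, where step size and step count are assumptions, not measurements:</p>

```python
# Each step appends a fixed chunk of context; without caching, every forward
# pass re-reads the entire history accumulated so far.
STEP_TOKENS, STEPS = 10_000, 50

final_context = STEP_TOKENS * STEPS                                  # 500,000 tokens
uncached_reads = sum(STEP_TOKENS * i for i in range(1, STEPS + 1))   # 12,750,000 tokens
print(final_context, uncached_reads, uncached_reads / final_context)
```

<p>A session that ends at a 500K-token context has, absent caching, read roughly 25 times that many tokens of history along the way. That re-reading is the load prompt caching and KV reuse exist to attack.</p><p>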
OpenClaw &#8212; an autonomous agent that taps multiple models including ChatGPT, Gemini, and Claude, running them continuously and in parallel &#8212; has become the hottest productivity trend around the world. OpenRouter data shows weekly token consumption jumping from 1.62 trillion in March 2025 to 20.4 trillion in March 2026. A 12x increase in one year. A human programmer might use several million tokens a day. OpenClaw can burn through billions, with costs reaching thousands of dollars per session.</p><p>Jensen Huang made the implication explicit at GTC 2026 on March 16: NVIDIA is reviewing a plan to provide engineers with token allocations equal to half their base salary &#8212; roughly $187,500 per engineer per year in token resources. When the CEO of the company that makes the GPUs starts treating tokens as compensation, the demand curve has moved past &#8220;early adoption.&#8221;</p><p>The infrastructure models everyone built before 2024 assumed chatbot economics. The agent era broke that assumption, and nothing about the trajectory suggests it is slowing down.</p><div><hr></div><h2>Micron Just Showed You What a Supply Ceiling Looks Like</h2><p>We covered Micron&#8217;s Q2 FY2026 results in detail last time. The short version: flat volumes, exploding prices. DRAM prices up mid-60s percent sequentially while bit shipments grew only mid-single digits. That is what a physical ceiling looks like &#8212; producers cannot ship more, so every dollar of incremental demand shows up in price.</p><p>The supply picture has not changed. HBM is sold out through 2026 across all three producers. Micron guided Q3 revenue to $33.5 billion at 81% gross margins, and raised FY2026 capex above $25 billion &#8212; money that produces wafers in 2027 and 2028, not now. 
The hyperscalers have pre-bought the available supply: Microsoft&#8217;s 2026 capex is projected above $120 billion, and the Stargate joint venture &#8212; backed by SoftBank, OpenAI, Oracle, and others &#8212; is a $500 billion, four-year commitment.</p><p>The supply crunch is real. But this is not primarily a chip story. It is a software story &#8212; agents, reasoning, long context &#8212; that shows up in the hardware. Micron&#8217;s income statement is the proof.</p><div><hr></div><h2>Now the Software Is Attacking the Problem It Created</h2><p>Here is where it gets interesting.</p><p>On March 24, Google Research published TurboQuant. If you only read one AI paper this quarter, make it this one.</p><p>The KV cache &#8212; the notepad that consumes all that expensive HBM &#8212; is traditionally stored at 16-bit floating point precision. TurboQuant compresses it to 3 bits per value, with no retraining or fine-tuning required. That is a roughly 6x reduction in KV cache size. The 70B model that needed 305 GiB of KV cache at 1 million tokens? Now it needs about 50 GiB.
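</p><p>That headline figure is reproducible with back-of-the-envelope arithmetic. A sketch, assuming a Llama-3-70B-class attention layout (80 layers, 8 KV heads under grouped-query attention, head dimension 128; the article does not name the exact model):</p>

```python
# KV cache size = 2 (keys + values) x layers x KV heads x head_dim
# x bytes per value x tokens. The model dimensions are an assumption
# (a Llama-3-70B-class layout), not taken from the article.
def kv_cache_gib(tokens, n_layers=80, n_kv_heads=8, head_dim=128, bits=16):
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * (bits / 8) * tokens
    return total_bytes / 2**30

fp16 = kv_cache_gib(1_000_000)        # ~305 GiB at 1M tokens
q3 = kv_cache_gib(1_000_000, bits=3)  # ~57 GiB at a strict 3/16 ratio
print(f"{fp16:.0f} GiB -> {q3:.0f} GiB")
```

<p>A strict 3/16 bit ratio lands nearer 57 GiB; the article&#8217;s &#8220;about 50 GiB&#8221; rounds the reduction up to 6x.</p><p>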
A workload that required six H100s can potentially be served on two.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ClDv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ClDv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 424w, https://substackcdn.com/image/fetch/$s_!ClDv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 848w, https://substackcdn.com/image/fetch/$s_!ClDv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 1272w, https://substackcdn.com/image/fetch/$s_!ClDv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ClDv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png" width="1456" height="650" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:650,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2116115,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/192170383?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ClDv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 424w, https://substackcdn.com/image/fetch/$s_!ClDv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 848w, https://substackcdn.com/image/fetch/$s_!ClDv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 1272w, https://substackcdn.com/image/fetch/$s_!ClDv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ebbbac-5551-4d51-afc2-1cb85cf43ba5_2234x998.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The mechanism is elegant. TurboQuant applies a random rotation to the data vectors (a technique called PolarQuant), which simplifies the geometry enough to apply aggressive quantization. Then it uses a 1-bit error correction algorithm (QJL) to eliminate the bias that aggressive compression normally introduces. The result: 3-bit KV cache that scores identically to full precision across every standard benchmark Google tested &#8212; LongBench, Needle in a Haystack, ZeroSCROLLS, RULER. No quality loss. No retraining. Drop-in deployment.</p><p>Google tested it on Gemma and Mistral. At 4 bits, TurboQuant achieves up to 8x speedup in computing attention on H100 GPUs versus unquantized keys. The paper will be presented at ICLR 2026.</p><p>The &#8220;training-free&#8221; property is what makes this different from prior KV compression work.
Most techniques &#8212; KIVI, KVQuant &#8212; require model fine-tuning or custom infrastructure. TurboQuant applies at inference time. Any model that already exists can benefit. That means adoption can move fast.</p><div><hr></div><h2>TurboQuant Is Part of a Larger Wave</h2><p>TurboQuant matters. But zoom out. It is one of three vectors attacking the same problem, and they compound rather than add.</p><p><strong>Quantization compresses the bytes.</strong> TurboQuant, KIVI, KVQuant, NVIDIA&#8217;s NVFP4 &#8212; all reduce how many bits each cached value occupies in HBM. This directly shrinks capacity demand.</p><p><strong>Sparse attention reduces the reads.</strong> Methods like DeepSeek&#8217;s NSA architecture and SparQ Attention do not shrink the cache &#8212; they reduce how much of it the model reads on each generation step. In long-context inference, the bottleneck is often bandwidth, not capacity. Sparse attention can cut data read per step by up to 8x.</p><p><strong>CXL tiering moves cold data off-chip.</strong> CXL lets accelerators address large pools of external DRAM at low latency. KV cache that is not actively needed can spill from expensive HBM to cheap rack-level memory. Only the recent tokens stay on-device.</p><p>Combine all three: quantize the cache to 3 bits, read only the relevant fraction per step, and offload the cold tail to CXL memory. The maths gets aggressive fast. And none of these techniques existed in production two years ago.</p><div><hr></div><h2>The Market Is Already Repricing This</h2><p>The tension between demand and compression is not theoretical. The market started pricing it in the week TurboQuant dropped.</p><p>Micron hit an all-time high of $471 on March 18, the day it reported the strongest quarter in its history. By March 25 &#8212; the day after Google published TurboQuant &#8212; the stock had fallen to roughly $382. A 19% drawdown from peak, despite record revenue, record margins, and sold-out HBM through 2026. 
Western Digital dropped 4.7%. SanDisk fell 5.7%. Samsung and SK Hynix weakened at the same time. Meanwhile, CPU stocks &#8212; Intel, AMD &#8212; surged as investors recalculated whether &#8220;insatiable&#8221; HBM demand might be tempered by algorithmic efficiency.</p><p>The sell-the-news reaction after earnings was predictable. The acceleration after TurboQuant was not. The market is telling you it heard the compression thesis &#8212; and it moved before most analysts had read the paper.</p><p>The standard memory cycle ends when demand cools or supply catches up. What makes this cycle different is that both sides are moving simultaneously &#8212; and in tension.</p><p>On the demand side, the forces are structural, not cyclical. KV cache requirements grow with every context window expansion, and context windows keep expanding. Agents consume more tokens as the technology matures, not fewer &#8212; longer loops, larger codebases, more parallel sessions. These are new floors that compound.</p><p>On the compression side, TurboQuant-class innovations are also structural. Training-free, drop-in deployment means adoption can move faster than traditional semiconductor cycles. And the compression vectors compound with each other.</p><p>So the question becomes: which force wins, and when does it flip?</p><p><strong>Through 2026&#8211;2027</strong>, the demand side wins. Supply is already sold out. Compression techniques take time to deploy at scale across production inference stacks. Micron trades at roughly 10x forward earnings &#8212; a multiple that prices a cyclical peak, not a supercycle. The memory names still have the better near-term visibility.</p><p><strong>From 2028 onwards</strong>, the compression side gets heavier. If TurboQuant-class techniques deploy broadly and CXL tiering becomes production-standard, the per-token HBM demand equation shifts. 
Total HBM demand probably still grows &#8212; AI workloads are still compounding &#8212; but the growth rate decelerates. SK Hynix&#8217;s pricing power compresses. Samsung, which has lagged on HBM yield and quality, gets hit at the worst time.</p><div><hr></div><h2>So Is This a Buy?</h2><p>The memory crunch thesis is not broken. But it is no longer open-ended.</p><p>Start with what has not changed. HBM is sold out through 2026. Micron just guided $33.5 billion in Q3 revenue with 81% gross margins. The hyperscalers have pre-committed hundreds of billions in infrastructure capex against agentic workloads that are still ramping. TurboQuant is a research paper accepted to a conference &#8212; it is not yet running in production inference at any major provider. The gap between &#8220;published&#8221; and &#8220;deployed at scale across Google, Microsoft, Amazon, and Meta&#8217;s inference fleets&#8221; is measured in quarters, probably years.</p><p>Micron at ~$382, down 19% from its all-time high with the strongest earnings in its history behind it and sold-out supply in front of it, looks like a buy on a 12-month view. The market is pricing TurboQuant&#8217;s impact as if it will compress HBM demand tomorrow. It will not. The production deployment timeline gives the memory names at least four to six more quarters of pricing power before compression can realistically dent contract economics.</p><p>The compression wave introduces a time dimension that did not exist six months ago. The near-term supply picture is clear; the 2028 picture is not. Trade accordingly.</p><p>Here is how I think about Micron at this juncture:</p><p><strong>Micron (MU)</strong> &#8212; a tactical trade, not a thesis reset. The fundamentals have not deteriorated: 10x forward earnings, Q3 guidance implying acceleration, sold-out supply through 2026. The 19% drawdown looks like the market front-running a risk that is real but not imminent.
The compression wave will matter &#8212; just not in the next four to six quarters. That window is the trade. The risk is that the re-rating sticks regardless of earnings, and the multiple stays compressed even as the business keeps growing. Size accordingly.</p><p><strong>The names that win regardless.</strong> The more interesting positioning might be the companies that benefit whether demand or compression dominates:</p><p><strong>Teradyne (TER)</strong> &#8212; semiconductor test equipment. AI chips require ~10x more test time per die than prior generations. HBM4 is a substantially more complex package. Whether HBM demand keeps surging or the industry transitions to more complex, lower-volume HBM at higher ASPs, test intensity per wafer goes up. Teradyne&#8217;s demand is driven by complexity, not volume alone.</p><p><strong>NVIDIA</strong> &#8212; the quiet winner of this entire dynamic. Memory efficiency increases GPU utilisation. A more efficient model runs at higher batch sizes and larger contexts on the same hardware. Compression opens markets that were previously too expensive to serve. NVIDIA sells the compute. Cheaper memory makes that compute more useful, not less.</p><p><strong>Broadcom and Marvell</strong> &#8212; the CXL plays. If CXL tiered memory becomes the standard way to handle KV cache overflow, both companies benefit from switching fabric, interconnect, and custom silicon. CXL grows faster if HBM supply stays constrained, because CXL is part of the solution. This is a hedge that does not require the memory crunch to resolve in either direction.</p><div><hr></div><h2>What Could Break This</h2><p>The demand thesis weakens if:</p><ul><li><p><strong>KV cache compression deploys faster than expected.</strong> TurboQuant is training-free and drop-in. If the major inference providers adopt it within quarters rather than years, the per-session HBM demand curve flattens sooner. 
Watch for API pricing changes at the major labs &#8212; falling per-token costs at long context lengths would be the leading signal.</p></li><li><p><strong>Agent token efficiency improves.</strong> The 12x token growth from OpenRouter is real &#8212; but a meaningful share of it is architectural slop. OpenClaw sends full conversation history on every request, triggers four to five background API calls per visible response, and pulls unranked context that forces the model to process thousands of tokens it does not need. Engineers are already cutting bills by 90% with basic configuration changes. Prompt caching, model routing, and smarter context management could eliminate 60&#8211;80% of current agent token spend without sacrificing quality. The counterargument is Jevons paradox &#8212; cheaper tokens historically mean more usage, not less. But the efficiency headroom is large enough to temper any straight-line extrapolation of current demand.</p></li><li><p><strong>Alternative architectures gain traction.</strong> Groq&#8217;s SRAM-based LPU design sidesteps HBM entirely for inference. If architectures like this reach commercial scale, they structurally reduce HBM addressable market.</p></li><li><p><strong>DRAM pricing plateaus in Q2&#8211;Q3 2026.</strong> If memory prices stop climbing despite supply remaining tight, it signals demand absorption is slowing. This is the simplest and most direct disconfirming indicator.</p></li></ul><p>The compression thesis weakens if:</p><ul><li><p><strong>Hyperscalers reinvest savings into longer contexts rather than banking them.</strong> This is what has historically happened &#8212; efficiency gains get consumed by expanded capability, not passed through as cost reduction. Google is already testing 10 million token contexts. Llama 4 Scout supports 10 million tokens. 
If the ceiling keeps rising, compression just enables more demand rather than reducing it.</p></li><li><p><strong>Quality degradation at extreme compression proves worse in production than in benchmarks.</strong> TurboQuant reports zero quality loss at 3 bits in controlled tests. Production workloads at scale, with adversarial edge cases and long tails, may be less forgiving.</p></li></ul><div><hr></div><h2>What We Are Watching</h2><p><strong>Agentic token consumption.</strong> The number to watch is what share of total tokens are consumed by agentic workloads versus simple chat. No lab has published that breakdown yet. When agents cross 50% of token volume at a major provider, the per-session memory profile of the entire fleet shifts dramatically. The first lab to disclose workload composition data will be a significant data point.</p><p><strong>TurboQuant adoption velocity.</strong> The paper is at ICLR 2026. The question is how fast it moves from research to production inference at Google, and whether NVIDIA integrates similar techniques into TensorRT. If NVIDIA ships native KV cache quantization in its inference stack, adoption accelerates across the entire GPU ecosystem.</p><p><strong>DRAM spot pricing &#8212; already flattening.</strong> This one is worth watching closely. InSpectrum spot data through late March 2026 shows DDR4 16Gb spot prices went near-vertical from October to December 2025 &#8212; roughly $5 to $70 &#8212; then flatlined from January through March. DDR5 shows the same pattern: stepped up through late 2025, sideways since January at around $32. Three months of zero price momentum after a parabolic move. NAND is lagging by about a quarter &#8212; still stepping up, but the slope is flattening. Meanwhile, Micron&#8217;s contract prices are still surging (mid-60s% sequential DRAM increases in Q2). 
That divergence between flat spot and rising contracts is the kind of signal that resolves in one of two directions: either spot is a leading indicator and contract pricing follows down with a lag, or spot is simply too thin to matter because the hyperscalers have pre-bought everything on contract. If spot stays flat through Q2 while contract re-rates lower in Q3, the demand-compression balance may be shifting sooner than the bull case assumes. <em>(Source: inSpectrum Tech Inc DRAM/NAND spot pricing, March 2026)</em></p><p><strong>CXL deployment timelines.</strong> Production CXL memory tiering at hyperscaler scale would be a structural shift in how KV cache is served. Broadcom and Marvell benefit directly. HBM-per-GPU requirements decline.</p><div><hr></div><p><em>This is part of an ongoing series on AI infrastructure economics.
The previous piece covered the full bottleneck sequence: power, packaging, fabs, and EUV.</em></p><p><em>For more on the AI stack and where value flows, visit <a href="https://www.theupcurious.com/">theupcurious.com</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Which AI Bottleneck Matters Most Right Now?]]></title><description><![CDATA[The most important question in AI investing right now is not who has the best model. It is what the system can physically support]]></description><link>https://upcurious.substack.com/p/which-ai-bottleneck-matters-most</link><guid isPermaLink="false">https://upcurious.substack.com/p/which-ai-bottleneck-matters-most</guid><dc:creator><![CDATA[Upcurious]]></dc:creator><pubDate>Wed, 18 Mar 2026 07:30:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/55e59c49-d8c6-4e93-8f1b-f22048204053_1424x752.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The most important question in AI investing right now is not who has the best model. It is what the system can physically support &#8212; and which constraint is most underpriced.</p><p>Dylan Patel, founder of SemiAnalysis, lays out this framework on the Dwarkesh Podcast. His argument: AI does not have one bottleneck. It has a sequence of them &#8212; and the important question is always which one matters most right now, which one is coming next, and which one sets the outer limit on the whole system.</p><p>This is still an infrastructure story. But in Patel&#8217;s framing, it is becoming more of a semiconductor manufacturing story than a construction story.</p><div><hr></div><h2>The Scale of the Buildout</h2><p>Hyperscaler capex is running at $660-750B in 2026 &#8212; up 67% year-over-year. Cloud revenue growth is 25-30%. The capex-to-revenue ratio has hit 45-57%, which is historically unsustainable. Microsoft reports $80B in Azure backlog. The market keeps asking whether capex will slow down.
So far, the answer is no.</p><p>The demand side is not in question. OpenAI hit $25B in annual recurring revenue (ARR) by February 2026, up from $6B in 2024. Anthropic surged to $19B ARR, up from $9B at the end of 2025. These are the fastest-scaling software businesses in history, and they are the primary drivers of compute demand going forward.</p><p>That is what makes Patel&#8217;s framework useful. The question is not whether AI demand is strong. It is which part of the system is failing to keep up with that demand &#8212; and who owns the bottleneck.</p><div><hr></div><h2>The Bottleneck Sequence</h2><h3>Now: The Deployment Constraints</h3><p>These are the bottlenecks the market is focused on: CoWoS packaging, data center construction, power availability. They are serious &#8212; TSMC&#8217;s CoWoS capacity remains sold out through 2026; Vertiv&#8217;s organic orders surged 252% in Q4 2025. But Patel&#8217;s argument is that the binding bottleneck is already shifting past them.</p><p>His point is that these sit higher in the stack, have simpler supply chains, and can be expanded with enough capital and execution. That does not mean they are easy, and it does not mean they stop being bottlenecks.
It means he views them as more tractable than semiconductor manufacturing itself.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KZ-i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KZ-i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 424w, https://substackcdn.com/image/fetch/$s_!KZ-i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 848w, https://substackcdn.com/image/fetch/$s_!KZ-i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 1272w, https://substackcdn.com/image/fetch/$s_!KZ-i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KZ-i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png" width="1456" height="671" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:671,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:860603,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcuriousinvest.substack.com/i/191330543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KZ-i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 424w, https://substackcdn.com/image/fetch/$s_!KZ-i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 848w, https://substackcdn.com/image/fetch/$s_!KZ-i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 1272w, https://substackcdn.com/image/fetch/$s_!KZ-i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f75736-ca51-413f-bdba-87af9fcdbfce_1527x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Data centers can be built in less than a year. Power &#8212; where Patel parts company with the &#8220;AI runs out of watts&#8221; crowd &#8212; can be sourced flexibly through industrial gas turbines, solar, nuclear, grid expansion, even ship engines. GE Vernova booked $2B+ in data center electrification orders in 2025 (3x the prior year) and has only addressed roughly 10% of its addressable market. Constellation Energy has nuclear PPAs contracted directly with hyperscalers. GPUs are sitting idle due to power constraints, not demand constraints. These are real bottlenecks with real pricing power &#8212; but in Patel&#8217;s framing, they are deployment constraints, not structural ceilings.
Difficult but addressable with enough capital, risk-taking, and engineering effort.</p><p><strong>The trade:</strong> Two clusters, different risk profiles.</p><p>The packaging and inspection names &#8212; TSMC (TSM), KLA (KLAC), Camtek (CAMT) &#8212; own the CoWoS chokepoint directly. TSMC controls roughly 60% of advanced packaging capacity and is embedded in every leading GPU shipment. KLA and Camtek sit on the yield management side: as CoWoS complexity increases, inspection becomes more critical and more recurring. These are not cyclical equipment names &#8212; they are process-critical infrastructure with multi-year demand visibility.</p><p>The power and electrical names &#8212; GE Vernova (GEV), Eaton (ETN), Vertiv (VRT), Constellation Energy (CEG) &#8212; own the constraint between &#8220;GPU purchased&#8221; and &#8220;GPU producing revenue.&#8221; Vertiv&#8217;s $15B backlog is up 109% year-over-year. Eaton&#8217;s backlog equals 11 years of 2025-level construction. The market still prices some of these like industrials. The backlog data says otherwise.</p><p>The honest risk: Patel views these deployment constraints as solvable. As power and packaging bottlenecks ease, the market will rotate toward the deeper constraints. These names have durable cash flows for now, but the long-term compounding story sits further down the stack.</p><h3>Next: Fabs, Clean Rooms, and the Memory Crunch</h3><p>The system has shifted from deployment chokepoints into something harder: physical semiconductor capacity. Fabs are the most complex buildings humans make, with lead times of two to three years. And the industry underbuilt when memory economics were weak &#8212; vendors halted new fab construction in 2023 when they were losing money. 
Micron&#8217;s own supply roadmap shows how long the lag really is: first wafer output from Idaho is expected in mid-calendar 2027, the Tongluo site supports meaningful shipments in fiscal 2028, Singapore HBM packaging is expected to contribute meaningfully in calendar 2027 (per prior guidance), and a new Singapore NAND fab is not expected to produce wafers until the second half of 2028.</p><p>This is what is actually driving the memory crunch. It is not just that HBM is tight. There is literally nowhere to place the manufacturing tools needed to scale production.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vlf3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vlf3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 424w, https://substackcdn.com/image/fetch/$s_!vlf3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 848w, https://substackcdn.com/image/fetch/$s_!vlf3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 1272w, https://substackcdn.com/image/fetch/$s_!vlf3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vlf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png" width="1456" height="685" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2685647,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcuriousinvest.substack.com/i/191330543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vlf3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 424w, https://substackcdn.com/image/fetch/$s_!vlf3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 848w, https://substackcdn.com/image/fetch/$s_!vlf3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vlf3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e9e21af-9a63-4858-b0b3-f5d3b2147a2a_2042x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One gigawatt of Rubin-scale AI compute requires approximately 170,000 DRAM wafers (Memory), plus roughly 55,000 wafers of 3nm logic (GPU). Memory is not riding alongside the GPU story. It is the GPU story &#8212; just measured in a different unit. 
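</p><p>The per-gigawatt wafer figures above can be turned into a back-of-envelope sketch (illustrative only; the constants come straight from the numbers quoted here, and the function name is ours, not from any source):</p>

```python
# Back-of-envelope wafer math, using the article's figures:
# ~170,000 DRAM wafers and ~55,000 wafers of 3nm logic
# per gigawatt of Rubin-scale AI compute.
DRAM_WAFERS_PER_GW = 170_000
LOGIC_WAFERS_PER_GW = 55_000

def wafer_demand(gigawatts: float) -> dict:
    """Wafer starts implied by a given AI compute buildout."""
    return {
        "dram_wafers": int(gigawatts * DRAM_WAFERS_PER_GW),
        "logic_3nm_wafers": int(gigawatts * LOGIC_WAFERS_PER_GW),
    }

# A 10 GW buildout implies 1.7M DRAM wafers against 550k logic wafers,
# roughly three memory wafers for every logic wafer.
print(wafer_demand(10))
```

<p>At any buildout size the split holds at roughly 3:1 in memory&#8217;s favor, which is the sense in which memory is the GPU story in a different unit.</p><p>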
And the supply side is an oligopoly:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mjUb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mjUb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 424w, https://substackcdn.com/image/fetch/$s_!mjUb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 848w, https://substackcdn.com/image/fetch/$s_!mjUb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 1272w, https://substackcdn.com/image/fetch/$s_!mjUb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mjUb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png" width="1240" height="254" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:254,&quot;width&quot;:1240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcurious.substack.com/i/191342106?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mjUb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 424w, https://substackcdn.com/image/fetch/$s_!mjUb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 848w, https://substackcdn.com/image/fetch/$s_!mjUb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 1272w, https://substackcdn.com/image/fetch/$s_!mjUb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338e6219-4c40-4504-8b8b-04f1c943f02b_1240x254.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>HBM consumes roughly 3x the wafer capacity per gigabyte versus commodity DRAM &#8212; widening toward 4x with HBM4. 
AI is expected to consume 20% of total global DRAM wafer capacity in 2026. DRAM contract prices are up 40-70%.</p><p>Micron&#8217;s latest quarter makes Patel&#8217;s point more concrete. The company now says both DRAM and NAND supply-demand conditions should remain tight beyond calendar 2026. More importantly, management explicitly tied the DRAM constraint to cleanroom shortages, long construction lead times, a higher HBM trade ratio, and declining bits per wafer from node migrations. That is almost a public-company version of Patel&#8217;s framework.</p><p>We are looking at a memory supercycle, not a normal cyclical upswing. The demand curve rises hard and fast if compute roadmaps hold. Supply is concentrated in three players maintaining discipline. And the missing fabs mean this crunch cannot be solved quickly &#8212; the capacity decisions that would have eased it needed to happen two years ago.</p><p><strong>The trade:</strong> Three names, three different ways to own the same thesis.</p><p>Micron (MU) is the cleanest US public exposure, and the latest quarter materially strengthened the case. Fiscal Q2 revenue reached $23.86B with non-GAAP gross margin at 74.9%, and Q3 guidance points to roughly $33.5B of revenue with gross margin around 81%. DRAM inventory days remain especially tight at below 120 days, and Micron now expects fiscal 2026 capex to come in above $25B as it pushes harder on cleanrooms and long-duration capacity. MU now trades around 10x forward earnings, and closer to 5-6x if you annualize the current Q3 run rate.</p><p>The most important detail is what drove the quarter. This was not primarily a units story. DRAM bit shipments were up only mid-single digits sequentially, while DRAM prices rose in the mid-60s percentage range. NAND bit shipments were up low-single digits, while NAND prices increased in the mid-to-high 70s percentage range.
That is what a bottleneck looks like in financial form: constrained supply showing up first in price, margin, and backlog urgency rather than just shipment volume.</p><p>SK Hynix is the dominant player &#8212; ~57% of global HBM supply, entire 2026 allocation sold out. The forward P/E sits around 5x, which prices in Korea market risk and some skepticism about cycle duration. If you believe the supercycle extends past 2026, that discount is the opportunity. Accessible partly via the EWY ETF for US investors, or directly on the Korean exchange.</p><p>Samsung is the wildcard. They are planning a ~50% capacity increase in 2026. Historically, Samsung has been willing to flood supply to gain share in NAND and commodity DRAM. But so far in this cycle, all three players are raising prices &#8212; Samsung and SK Hynix both hiked HBM3E prices ~20% for 2026, and server DRAM is up 60-70%. Samsung&#8217;s own exec said demand has already outpaced even the expanded supply. The risk is that discipline breaks as new capacity comes online. It hasn&#8217;t yet. Accessible partly via EWY alongside SK Hynix.</p><h3>Later: The Hard Ceiling &#8212; EUV and ASML</h3><p>By the end of the decade, Patel&#8217;s framework says the bottleneck drops to the deepest layer of the supply chain: the production of EUV lithography tools.
These are the most complicated machines humans make, built by a single company in Veldhoven, relying on an artisanal supply chain of thousands of specialized suppliers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ifrt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ifrt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!ifrt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!ifrt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!ifrt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ifrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1401833,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://upcuriousinvest.substack.com/i/191330543?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ifrt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!ifrt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!ifrt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!ifrt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c82916-c9d8-4b74-aa7a-0362c1511551_1376x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>ASML is expected to produce 60-70 EUV tools in 2026 &#8212; and will only reach a little over 100 per year by 2030. Each gigawatt of Rubin-scale AI compute requires around 3.5 EUV tools. That math implies an estimated upper bound of roughly 200 gigawatts of AI chip capacity by the end of the decade if enough of that tool base is allocated to AI.</p><p>Other constraints can ease. You can build more fabs. You can add power generation. You can expand packaging lines. 
But you can&#8217;t meaningfully accelerate EUV tool production &#8212; the ramp is measured in years, not quarters.</p><p>High-NA EUV &#8212; the next-generation TWINSCAN EXE:5200B at $350M per unit &#8212; has shipped only 8 tools in total, with high-volume manufacturing not expected until 2027-2028.</p><p><strong>The trade:</strong> The 10-year compounder.</p><p>ASML trades at ~40x forward earnings with a EUR 38.8B backlog. The market understands the monopoly. What it may not fully price is that ASML does not just supply the system &#8212; it defines the outer limit of how large the system can get. It doesn&#8217;t matter which chip architecture wins, which hyperscaler spends the most, or which AI model is the best. Every leading-edge chip on earth goes through an ASML tool. The question isn&#8217;t whether you own it &#8212; it&#8217;s how much.</p><p>One caveat. ASML&#8217;s upside is constrained by physics. They can only produce ~60-70 EUV tools per year, ramping to ~100 by 2030. Raising prices helps at the margin &#8212; High-NA tools are $370-400M versus $200M for standard EUV &#8212; but it does not change the unit constraint that governs the system. Gross margins sit at 52-53% today, guided to 56-60% by 2030. That is gradual expansion, not a step function. At ~40x forward earnings, you are paying for a quality compounder where volume is physically capped and margin expansion is slow. The backlog is durable, but the growth rate has a ceiling built into the manufacturing process itself. Size accordingly.</p><div><hr></div><h2>The H100 Argument</h2><p>One of Patel&#8217;s more provocative claims: an H100 may be worth more today than it was several years ago.</p><p>Sounds backwards. But think about what an H100 actually is right now &#8212; not a depreciating asset on a balance sheet, but productive capital inside a world where models are getting better, inference demand is increasing, and useful AI work per installed accelerator keeps rising.
If the software improves faster than supply expands, older hardware appreciates in economic usefulness even as newer hardware arrives.</p><p>Jensen Huang&#8217;s GTC 2026 keynote reinforces this. The market has shifted from training-heavy to inference-heavy. Every installed GPU is doing more valuable work than it was a year ago &#8212; not less.</p><p>Traditional compute depreciates. AI compute, in a supply-constrained world with improving models, may not. The installed base matters as much as the new shipment number.</p><div><hr></div><h2>Custom Silicon Is Part of the Map</h2><p>Google&#8217;s TPUs, Amazon&#8217;s Trainium, and the custom ASICs that Broadcom and Marvell design for hyperscalers all run through the same physical constraints &#8212; CoWoS packaging, DRAM wafers, EUV tools. They change the competitive dynamics within each layer, but they do not escape the constraints.</p><p>Broadcom&#8217;s AI semiconductor revenue is up 74% year-over-year, with Google and Amazon representing roughly 60% of that AI revenue. Marvell is growing AI silicon revenue 22%. These are not NVIDIA competitors in the traditional sense &#8212; they are parallel demand streams through the same constrained supply chain.</p><p>The bottleneck framework is not just a bull case for NVIDIA. It is a bull case for the physical supply chain regardless of whether the marginal AI dollar goes to GPUs or custom silicon. TSMC, ASML, Micron, the equipment names &#8212; they win either way. The GPU vs. ASIC debate is a distraction from the deeper question, which is whether the fabs and tools exist to build any of it fast enough.</p><div><hr></div><h2>China Is a Question of Timing</h2><p>Patel does not make a flat claim that the West wins and China loses. His view is more conditional, and it connects directly to the bottleneck framework.</p><p>Fast AI progress widens the Western advantage &#8212; the leading labs and infrastructure stack pull further ahead. 
Slower progress gives China time to catch up through manufacturing iteration and system-level adaptation.</p><p>The policy landscape adds texture. H200 exports were shifted to case-by-case review in January 2026 &#8212; partially reopened, but with conditions. Blackwell and more advanced architectures remain blocked. The AI Overwatch Act would tighten this further, treating AI chips as weapons exports.</p><p>The competitive map is shifting. China is making real progress in memory (CXMT), mature-node logic (SMIC), and system-level AI deployment (Huawei Ascend). Where it is not closing the gap is leading-edge logic &#8212; and that is exactly where ASML&#8217;s EUV monopoly and export controls create the deepest moat.</p><p>The faster the frontier moves, the more valuable the bottleneck owners become. Speed is what determines whether the Western infrastructure advantage compounds or narrows.</p><div><hr></div><h2>What Could Break the Thesis</h2><p>Every thesis has failure modes. Here is where this one breaks:</p><p><strong>Capex slows before application revenue catches up.</strong> Hyperscaler capex-to-revenue ratios at 45-57% are historically unsustainable. If even one major hyperscaler signals a capex pause, the infrastructure layers reprice quickly.</p><p><strong>The inference inflection disappoints.</strong> Jensen Huang&#8217;s $1T projection covers cumulative chip sales through 2027 across Blackwell and Vera Rubin, underpinned by his view that agentic AI requires significantly more compute and that inference demand is at an inflection point. 
If agentic workloads do not scale that aggressively, the demand extension that supports the HBM supercycle and NVIDIA&#8217;s $1T outlook loses its foundation.</p><p><strong>Deployment constraints ease faster than expected.</strong> If power, packaging, and construction bottlenecks resolve quickly &#8212; as Patel expects they will &#8212; the market may reprice those names before the deeper semiconductor constraints become fully visible. The transition from deployment bottleneck to manufacturing bottleneck is where positioning risk lives.</p><p><strong>Fab capacity arrives faster than expected.</strong> If the fabs halted in 2023 come online ahead of schedule, or if Samsung and SK Hynix accelerate clean room buildouts, the memory crunch eases sooner than the 2027-2028 timeline suggests. Samsung&#8217;s planned 50% HBM capacity surge in 2026 is the first test. If DRAM pricing momentum fades materially in mid-2026, the supercycle narrative needs revisiting.</p><p><strong>China closes the gap faster than expected.</strong> CXMT scaling DRAM production, SMIC advancing to N+2, or Huawei&#8217;s Ascend chips gaining meaningful inference share would compress the Western infrastructure premium. The lower-probability but higher-impact version: a genuine breakthrough in domestic EUV or alternative lithography &#8212; China has been investing heavily in DUV multi-patterning and homegrown lithography development &#8212; that reduces dependence on ASML tools at the leading edge.</p><div><hr></div><h2>What We Are Watching Next</h2><p>Five items tied to the next reporting cycle:</p><ol><li><p><strong>Micron Q3 FY26 execution against guidance</strong> &#8212; Management guided to $33.5B of revenue and roughly 81% gross margin. The next question is whether pricing, inventory tightness, and capex commentary keep confirming the supercycle thesis.</p></li><li><p><strong>ASML Q1 2026 earnings</strong> &#8212; EUV shipment count and High-NA EUV delivery timeline. 
Any acceleration in tool shipments changes the logic ceiling math.</p></li><li><p><strong>Hyperscaler capex commentary (Q1 2026 earnings cycle)</strong> &#8212; Watch for any signal of capex moderation from Microsoft, Google, Amazon, or Meta. The $660-750B 2026 estimate needs reconfirmation.</p></li><li><p><strong>DRAM contract pricing in Q2-Q3 2026</strong> &#8212; If HBM and DRAM prices plateau, the memory scarcity thesis weakens. If they re-accelerate, the supercycle extends.</p></li><li><p><strong>NVIDIA Vera Rubin shipment timeline</strong> &#8212; Any delay or acceleration reshapes the entire demand curve downstream.</p></li></ol><div><hr></div><p><em>Based on Dylan Patel&#8217;s interview on the Dwarkesh Podcast. Source: <a href="https://www.dwarkesh.com/p/dylan-patel">dwarkesh.com/p/dylan-patel</a></em></p><p><em>For more on AI and how we track the stack, visit <a href="https://www.theupcurious.com/">theupcurious.com</a>.</em></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://upcurious.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Upcurious! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item></channel></rss>