There is a piece of this I agree with: you do not need to be a deep technical expert to notice that a company is burning cash by overcommitting to capex or relying on heroic revenue projections that may or may not come to pass.
But that is not the full argument he is making. If the claim is that the labs will not be able to pay their creditors because inference is structurally incapable of becoming profitable, then he absolutely needs to be right about the technical economics of inference.
One part of that is the balance-sheet argument (which already shows insanely good margins). But it also depends on how inference-time compute actually works: routing, batching, KV cache reuse, model segmentation, different latency tiers, etc. He's been straight up wrong about many of those details in his writing, so as a result I have to call into question the rest of his reasoning as well (in part to avoid Gell-Mann amnesia).
Doesn't this kinda imply its own smoke and mirrors though? Like if the name of the game with inference is already routing things around and caching so you can make money, why is the newest biggest model always the most important critical thing? How does this square with any of their press about it? Also wouldn't that just add more inference? Because you need to pre-judge every prompt to know where to route it?
Also, if there are significant gains from caching, then like.. what are we even doing here? Inputting something and then reading cached pieces of text based on their similarity to the input? Kinda like a search engine?
I don't think it's smoke and mirrors, though I do have plenty of gripes with how the labs market this product landscape generally speaking.
The newest biggest model can still matter even if you do not run every prompt through it. You'll always have some tasks where even a small drop in output quality is unacceptable, and for those you need to make sure frontier intelligence is used.
On the router point, yes, routing has some overhead. But the router does not need to run the biggest model to decide which model to use. We've been using tiny classifiers for recommendation engines for ages now, usually on CPU. If routing saves you from sending a large fraction of traffic to the expensive reasoning model, the routing overhead can easily be worth it.
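To make the overhead argument concrete, here is a rough sketch of what I mean. Everything in it is made up for illustration: the tier names, the relative costs, the `ROUTER_COST` figure, and the keyword heuristic standing in for a real trained classifier. The point is only that the router is cheap relative to what it saves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_request: float  # hypothetical relative cost units, not real prices


SMALL = Tier("small", 1.0)
FRONTIER = Tier("frontier", 20.0)
ROUTER_COST = 0.01  # hypothetical per-request overhead of the tiny classifier

# Stand-in for a trained classifier: a few keyword hints plus a length check.
HARD_HINTS = ("prove", "refactor", "debug", "step by step")


def route(prompt: str) -> Tier:
    """Pick a tier without ever calling the expensive model."""
    text = prompt.lower()
    if len(text) > 2000 or any(hint in text for hint in HARD_HINTS):
        return FRONTIER
    return SMALL


def total_cost(prompts: list[str], use_router: bool) -> float:
    """Total cost of serving the prompts, with or without the router."""
    if not use_router:
        return FRONTIER.cost_per_request * len(prompts)
    return sum(route(p).cost_per_request + ROUTER_COST for p in prompts)


if __name__ == "__main__":
    traffic = [
        "What is the capital of France?",
        "Translate hello to Spanish",
        "Debug this segfault in my allocator",
        "Summarize this email",
    ]
    print("no router:", total_cost(traffic, use_router=False))
    print("router:   ", total_cost(traffic, use_router=True))
```

With these toy numbers, only one of the four prompts goes to the expensive tier, and the router overhead is a rounding error next to the saving. A real router would misroute some fraction of traffic, which this sketch ignores.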
> Also, if there are significant gains from caching, then like.. what are we even doing here? Inputting something and then reading cached pieces of text based on their similarity to the input? Kinda like a search engine?
The caching I'm talking about is explicitly the attention/KV cache, so it's not input similarity retrieval (that would be more like what you'd use in a RAG/IR system). Prompt caching is generally about reusing the already-computed key/value tensors for repeated prompt prefixes. The idea is that you don't recompute the same static system prompt, tool definitions, schemas, long shared context, or repeated boilerplate every time. The model still generates a fresh response; only the prefill work for the shared prefix is skipped. In more sophisticated systems, you usually store multiple checkpoints so that a small prompt change doesn't result in an all-or-nothing hit/miss scenario.
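Here is a toy sketch of the checkpoint idea, in case it helps. It is not any lab's actual implementation: the `PrefixCache` class, the block size of 4, and the string placeholders standing in for KV tensors are all assumptions for illustration. Each full block of tokens is keyed by a hash of the entire prefix up to and including that block, so a lookup matches on exact token prefixes, not on similarity.

```python
import hashlib

BLOCK = 4  # hypothetical block size in tokens; real systems use larger blocks


class PrefixCache:
    """Toy block-level prefix cache. Values are placeholders for KV tensors."""

    def __init__(self) -> None:
        self._blocks: dict[str, str] = {}

    @staticmethod
    def _key(prefix_tokens: list[int]) -> str:
        # The key covers the whole prefix, so a block only hits when every
        # token before it is identical too.
        return hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()

    def process(self, tokens: list[int]) -> tuple[int, int]:
        """Return (reused, computed) token counts for one request."""
        n_full = len(tokens) // BLOCK * BLOCK
        reused = 0
        # Walk full blocks from the start and stop at the first miss.
        while reused < n_full:
            end = reused + BLOCK
            if self._key(tokens[:end]) not in self._blocks:
                break
            reused = end
        # "Compute" the rest and store a checkpoint for each new full block.
        for end in range(reused + BLOCK, n_full + 1, BLOCK):
            self._blocks[self._key(tokens[:end])] = f"kv[{end - BLOCK}:{end}]"
        return reused, len(tokens) - reused


if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = list(range(12))  # stand-in for a shared system prompt
    print(cache.process(system_prompt + [100, 101]))       # cold cache
    print(cache.process(system_prompt + [200, 201, 202]))  # shared prefix
    edited = system_prompt[:8] + [999] + system_prompt[9:] + [300]
    print(cache.process(edited))                           # partial hit
```

The third request changes one token in the last block of the system prompt, and the first two blocks are still reused. That is the "not all-or-nothing" behavior: the cache degrades to the longest matching prefix.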