There is a piece of this I agree with. That you do not need to be a deep technic...

beepbooptheory · 2026-06-09T00:40:44 1780965644

Doesn't this kinda imply its own smoke and mirrors though? Like if the name of the game with inference is already routing things around and caching so you can make money, why is the newest biggest model always the most important critical thing? How does this square with any of their press about it? Also wouldn't that just add more inference? Because you need to pre-judge every prompt to know where to route it?

Also, if there are significant gains from caching, then like.. what are we even doing here? Inputting something and then reading cached pieces of text based on their similarity to the input? Kinda like a search engine?

spmurrayzzz · 2026-06-09T17:11:42 1781025102

I don't think it's smoke and mirrors, though I do have plenty of gripes with how the labs market this product landscape generally speaking.

The newest biggest model can still matter even if you do not run every prompt through it. You'll always have some task where even small amounts of loss are unacceptable and thus you need to make sure frontier intelligence is used for it.

On the router point, yes, routing has some overhead. But the router does not need to run the biggest model to decide which model to use. We've been using tiny classifiers for recommendation engines for ages now, usually on CPU. If routing saves you from sending a large fraction of traffic to the expensive reasoning model, the routing overhead can easily be worth it.
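To make the cost argument concrete, here is a minimal sketch of the idea, not any lab's actual router. The model names, the relative costs, the `HARD_HINTS` keywords, and the threshold are all made-up assumptions, and `difficulty_score` is a crude heuristic standing in for a small trained classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_request: float  # hypothetical relative cost units


# All numbers below are illustrative assumptions, not real pricing.
SMALL = Model("small-model", 1.0)
FRONTIER = Model("frontier-model", 20.0)
ROUTER_COST = 0.01  # assumed cost of one cheap CPU classifier call

# Hypothetical keywords that suggest a harder task.
HARD_HINTS = ("prove", "refactor", "debug", "multi-step", "legal")


def difficulty_score(prompt: str) -> float:
    """Stand-in for a tiny classifier: returns a score in [0, 1]."""
    text = prompt.lower()
    hint_hits = sum(hint in text for hint in HARD_HINTS)
    length_signal = min(len(text.split()) / 200, 1.0)
    return min(0.3 * hint_hits + 0.5 * length_signal, 1.0)


def route(prompt: str, threshold: float = 0.3) -> Model:
    """Send a prompt to the expensive model only if it scores as hard."""
    return FRONTIER if difficulty_score(prompt) >= threshold else SMALL


def total_cost(prompts: list[str], use_router: bool) -> float:
    """Compare 'everything goes to the frontier model' against routing."""
    if not use_router:
        return len(prompts) * FRONTIER.cost_per_request
    return sum(ROUTER_COST + route(p).cost_per_request for p in prompts)
```

Under these assumed numbers the router pays for itself as soon as it diverts even a small share of traffic, since one routing call costs a tiny fraction of one frontier call. The real tradeoff is misrouting a hard prompt to the small model, which this toy does not measure.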

> Also, if there are significant gains from caching, then like.. what are we even doing here? Inputting something and then reading cached pieces of text based on their similarity to the input? Kinda like a search engine?

The caching I'm talking about is explicitly the attention/KV cache, so it's not input-similarity retrieval (that would be more like what you'd use in a RAG/IR system). Prompt caching is generally about reusing the already-computed key/value tensors for repeated prompt prefixes. The idea is that you don't recompute the same static system prompt, tool definitions, schemas, long shared context, or repeated boilerplate every time. In more sophisticated systems, you usually store multiple checkpoints so that a small prompt change doesn't result in an all-or-nothing hit/miss scenario.
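A toy sketch of the checkpoint idea, under stated assumptions: tokens are plain strings, the cached value is a placeholder string rather than real key/value tensors, and `PrefixCache` and `BLOCK_SIZE` are hypothetical names. It only shows the bookkeeping: each checkpoint is keyed by a hash chained over all earlier tokens, so a lookup is an exact prefix match, not a similarity search.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per checkpoint; an arbitrary illustrative value


class PrefixCache:
    """Toy prefix cache with one checkpoint per full block of tokens."""

    def __init__(self) -> None:
        # chained block hash -> placeholder for the cached KV state
        self._store: dict[str, str] = {}

    @staticmethod
    def _block_hashes(tokens: list[str]):
        """Yield (end_index, hash) per full block, chained over earlier blocks."""
        running = hashlib.sha256()
        for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            block = tokens[start:start + BLOCK_SIZE]
            running.update("\x00".join(block).encode() + b"\x01")
            yield start + BLOCK_SIZE, running.hexdigest()

    def longest_cached_prefix(self, tokens: list[str]) -> int:
        """Return how many leading tokens can reuse cached state."""
        matched = 0
        for end, block_hash in self._block_hashes(tokens):
            if block_hash not in self._store:
                break  # the first miss ends the reusable prefix
            matched = end
        return matched

    def insert(self, tokens: list[str]) -> None:
        """Record a checkpoint at every full block boundary."""
        for end, block_hash in self._block_hashes(tokens):
            self._store.setdefault(block_hash, f"kv-state-for-first-{end}-tokens")
```

With checkpoints at every block boundary, a request that shares a long system prompt but differs near the end still reuses everything up to the last matching block, and only the tail is recomputed. A change early in the prompt invalidates every checkpoint after it, which is why static content tends to go first.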