There are many providers (Fireworks, Groq, Cerebras, Google Vertex), some using rather common hardware from Nvidia, etc., others their own solutions focused solely on high throughput inference. They often tend to be faster, cheaper and/or more reliable than what the lab that trained the model is charging [0], simply because there is some competition, unlike with US frontier models which at best can be hosted by Azure, AWS or GCloud at the same price as the first party.
You just click "buy" with another provider? They just host the model for money, nothing more.
To be clear: The models are open weights everyone can simply download, because many labs publish them as such. The providers in question are generic hosters. It's the same as if you would get some managed wordpress hosting somewhere.
I don’t really understand your point. A company offering only inference will inherently always have lower costs and require less hardware than one that both provides inference and does training of new models.
The point is it doesn't work like that. Check out all the special purpose vehicles, loans and fund raises going around in that space. Deals claiming for $x billion in funds only got millions + GPU rentals in special deals. And the offering of those deals still hinges on the top 3 said customers buying to float things up.
i.e. whether you go for "only inference" or "both" it'll work out similar somewhere.
If you are referencing Nvidia, note that Vertex, Groq, Cerebras, etc. all do not rely on GPUs for much or all of their inference and are all the better for it.
Kind of why Nvidia see themselves forced to make such deals, to lock labs in before they loose another to objectively superior inference.
Staying purely with US labs, they just lost Meta and both Anthropic and Gemini models have been available on Vertex without relying on GPUs for a while. OpenAI equally is turning towards Cerebras so yeah, those deals will mean little for inference.
Those circular deals aren’t a thing for the inference providers I tend to go for and they don’t change the fact that using a full GPU purely for inference is like using a pickup truck for a track day. It can be done and the flexibility has advantages, but once you focus on one task, there are better tools for the job.
[0] https://openrouter.ai/moonshotai/kimi-k2-thinking