Self-hosted LLM cost vs the API: the utilization number your board never sees

2026-06-28

A monitoring dashboard showing a GPU utilization graph that sits low and flat overnight and on weekends, next to a rising cloud API cost line, illustrating the idle-GPU problem in self-hosted LLM economics.

GPU rental prices barely moved this year, so the build-vs-buy decision is not really about the hourly rate. It is about one number most teams never measure: how busy the card actually stays.

TLDR

GPU rental rates barely moved this year, so build-vs-buy is not really a question about the hourly price. It is a question about sustained utilization: a self-hosted card only beats the API on cost when it stays busy, and a typical cluster sits idle most nights and weekends. Before approving any self-hosting plan, ask one thing: what is our measured GPU utilization, and can we hold it above 60 percent.

I pulled the live GPU rental numbers this morning, half-expecting another panic chart. The opposite was true. As of June 28, GetDeploying’s tracker has the H100 cloud average at $3.13 an hour across 47 providers, up about 3 percent from a year ago. The H200 sits at $4.06 across 33 providers, up roughly 4 percent. The B200 averages $4.86 across 26 providers, up about 6 percent. After all the memory-shortage headlines, the thing a board worries about most, the rental rate, is almost exactly where it was last June.

That stability is the story, because it quietly moves the build-vs-buy argument onto different ground.

Why the rental rate is not where the money is decided

When the hourly price of a rented GPU drifts only 3 to 6 percent in a year, it stops being the variable that decides the bill. Two companies can rent the identical H100 at the identical $3.13 an hour and end up with wildly different cost per token. The rate is the same. What differs is how much work the card does per hour it is billed.

A GPU costs the same whether it serves zero tokens or four billion. Rent one and idle it overnight, and you paid full price for nothing. That is the part the “we will self-host to save money” deck always leaves out. The slide has the model, the quant, and the hardware. It almost never has a line for utilization, which is the one number that actually sets the unit economics.

So the honest framing of self-hosted LLM cost vs the API is not “is the GPU cheap.” It is “can we keep the GPU busy enough to beat a service that charges us nothing when we are quiet.”

22%

average GPU utilization in a typical self-hosted inference cluster, per Spheron's April 2026 FinOps writeup, before any batching work

What utilization actually does to cost per token

Here the recent analysis gets concrete, and to be clear these numbers are background, dated April and May 2026, not a fresh finding from this week. They are well corroborated, and they are the part that matters.

In a GPU FinOps writeup from early April, Spheron put a typical self-hosted cluster at 22 percent average utilization, idle from 7pm to 8am and all weekend, then showed what happens when continuous batching is switched on properly. As Spheron put it:

"Continuous batching (as implemented in vLLM and SGLang) can raise GPU utilization from under 20% to over 70%, cutting effective cost per token by 3-4x."

Spheron, April 2026

Read that twice, because it reframes the whole decision. The difference between a self-hosting plan that saves money and one that quietly loses it is not the model and not the chip. It is whether the serving stack keeps the card full. Their worked example: an 8x H100 pod at $19.20 an hour, serving Llama 3.1 70B and pushing 2,800 tokens a second through vLLM, lands near $1.90 per million tokens. Genuinely competitive, but it depends entirely on the card running near capacity, which is exactly what a bursty, business-hours workload does not do on its own.

A break-even guide from Rene Zander in mid-April set the floor in plain language: if you cannot keep utilization above 60 percent, the per-token economics look worse than the brochure suggests. A GPU at rest costs the same as a GPU at 100 percent. That single sentence is the whole reason so many self-hosting projects underdeliver.

Local inference is cheaper per token only above a utilization line, and most teams have no idea where their line is.

Three questions your board will ask

Is self-hosting cheaper, yes or no? Only above a volume and utilization line, and the line moves depending on what is being replaced. Against a frontier API like Claude Sonnet 4.6, self-hosting an open 70B model breaks even around 2 to 5 million tokens a day if the open model meets the quality bar, and it is clearly ahead past roughly 100 million a day. Against a cheap API tier, the break-even climbs to 15 to 25 million tokens a day, and on a single card it can be physically out of reach. Below those lines, the API wins, because it charges nothing while idle and a GPU never stops billing.

Why does our infra lead keep saying the API is fine? Because they are probably right for the current volume. The teams that save money self-hosting are the ones running steady, high traffic on a workload where an open model is good enough. The teams that lose money are the ones with spiky, business-hours traffic and a card that sleeps two-thirds of the month. Both can be true in the same company at the same time.

If we commit, what is the real cost? More than the GPU. The recent analysis is consistent that self-hosting runs 3 to 5 times the raw rental price once you count the engineering time, the redundancy, and the monitoring. GMI Cloud’s late-May writeup noted that continuous batching alone delivers 2 to 3 times the throughput of static batching, and that teams report 60 to 80 percent bill reductions when caching, batching, and routing all apply. That upside is real, and it is also a list of things someone has to build and keep running. The savings are not in the download. They are in the operations.

Key Insight

The build-vs-buy decision is usually argued as rent-vs-own or open-model-vs-API. The fresh data says neither axis is where the money moves. The money moves with sustained utilization, the one number that rarely makes it onto a slide.

The 60-second brief

Rental prices barely changed this year, so the self-host question is no longer about the hourly rate. It is about whether we can keep a card busy. A typical cluster runs around 22 percent utilization and needs to clear 60 percent before the math works. Continuous batching, caching, and smart routing are what get it there, and they are operational work, not a one-time setup. We should self-host only where volume is steady and high and an open model is good enough, and keep the bursty, low-volume work on the API where idle time is free. If nobody in the room can state our current measured utilization, the self-host case is not ready to approve yet.

What to watch

Two things. First, whether the workload is steady enough to hold a high utilization number, or whether half of it is honestly fine staying on the API while local handles the dense, predictable load. A hybrid split is not a failure, it is usually the right answer. Second, keep an eye on the frontier-open releases, but with clear eyes about what they cost to run. The strongest open model of the moment still needs roughly eight high-end cards to self-host, against a hosted-API floor of a few tens of dollars a month, so the open option is often not even a single-box decision. None of that is a reason to rush. The rental rate will probably still be sitting near $3 an hour next quarter, which means the question worth answering calmly this week is the one about measured utilization, not the one about the price of the chip.

Sources

H100 Cloud Pricing: Compare 47+ Providers (2026) - GetDeploying, 2026-06-28
H200 Cloud Pricing: Compare 33+ Providers (2026) - GetDeploying, 2026-06-28
B200 Cloud Pricing: Compare 26+ Providers (2026) - GetDeploying, 2026-06-28
AI Inference Cost Economics in 2026: GPU FinOps Playbook - Spheron, 2026-04-04
Self-Hosted LLM vs API Cost: Break-Even Analysis 2026 - Rene Zander, 2026-04-16
Cutting LLM Inference Costs in 2026: Caching, Batching, Smart Routing - GMI Cloud, 2026-05-28

Back to all insights