Bare metal or cloud GPU: the cost-per-token reality

For validators we almost always end up on bare metal: the latency tail and slashing exposure outweigh everything else. For GPU inference the calculus is different, and here we split a client's load between bare metal and cloud far more evenly. Below we cover why a GPU task is costed differently, where cloud honestly wins, and where bare metal beats it on cost-per-token.

Why a GPU is not costed like a validator

A validator's load is flat and predictable: it signs around the clock with a nearly constant profile. GPU inference is almost never flat: daytime traffic jumps, some cards sit idle at night, and a feature launch can double the load in a week. The card is also the most expensive unit in the rack, so every idle hour goes straight into cost-per-token.

So the key GPU question is not "does cloud cost more than metal per hour" (it does, always) but "what fraction of the GPU-time you paid for actually turned into tokens." A card at 80 percent utilization on your own metal and a card at 30 percent in cloud give completely different cost-per-token, even when their hourly prices look close.
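
To make the utilization argument concrete, here is a minimal sketch of the arithmetic. The function name, the prices, and the throughput figure are all hypothetical, chosen only for illustration; the point is that utilization sits in the denominator.

```python
def cost_per_million_tokens(hourly_cost, peak_tokens_per_sec, utilization):
    """Cost of one million generated tokens for a card paid for by the hour.

    hourly_cost: what one card-hour costs (any currency).
    peak_tokens_per_sec: throughput when the card is fully busy.
    utilization: fraction of paid time spent generating tokens, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical figures: same card, cloud only 25 percent pricier per hour.
metal = cost_per_million_tokens(hourly_cost=2.0, peak_tokens_per_sec=1000, utilization=0.80)
cloud = cost_per_million_tokens(hourly_cost=2.5, peak_tokens_per_sec=1000, utilization=0.30)
print(f"metal: {metal:.2f}  cloud: {cloud:.2f}  ratio: {cloud / metal:.1f}x")
```

With these assumed inputs a 25 percent gap in hourly price turns into a gap of more than three times per token, purely because of utilization.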

Cloud GPU: where it honestly wins

In GPU tasks cloud solves what metal does not:

The card you need, right now. When a client needs an H100 this week, not after a quarter of lead time, cloud on-demand brings it up in minutes. At peak demand for a specific card this is sometimes the only way to start at all.
Burst for uneven traffic. If load swings several times over between day and night, keeping peak card count on metal around the clock is wasteful. Cloud lets you add cards for the peak and drop them in the trough, paying for time actually used.
Short windows: experiments, finetuning, model evaluation. Running a new model for a couple of weeks and drawing conclusions is cheaper in cloud than buying cards for a task with a horizon under a year.
Spot for interruptible work. Batch inference and offline processing that survive interruption get radically cheaper on spot instances, if you handle preemption properly.

If we had only metal, we would lose clients with uneven traffic and with urgent deadlines on a scarce card.
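
On the spot item above: "handling preemption properly" mostly means a batch job can be killed at any moment and pick up where it stopped. Below is a minimal sketch of that idea, assuming a local checkpoint file; `run_batch`, `process`, and the `Preempted` exception are hypothetical names for illustration, and a real job would also write its results to durable storage rather than keep them in memory.

```python
import json
import os
import tempfile


def run_batch(items, process, checkpoint_path):
    """Process items in order, recording progress so a restart can resume.

    process: hypothetical callable that runs inference on one item.
    checkpoint_path: file holding the index of the next unprocessed item.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    results = []
    for i in range(start, len(items)):
        results.append(process(items[i]))
        # Write to a temp file and rename it, so an interruption mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp, checkpoint_path)
    return results


class Preempted(Exception):
    """Stands in for the instance being reclaimed (simulation only)."""


def make_worker(fail_after):
    calls = {"n": 0}

    def worker(x):
        if calls["n"] == fail_after:
            raise Preempted()
        calls["n"] += 1
        return x * 2

    return worker


ckpt = os.path.join(tempfile.mkdtemp(), "progress.json")
items = list(range(10))
try:
    run_batch(items, make_worker(fail_after=4), ckpt)  # "preempted" at item 4
except Preempted:
    pass
resumed = run_batch(items, make_worker(fail_after=100), ckpt)  # restarts at item 4
print(resumed)
```

The second run processes only the items the first run never reached, so the work lost to a preemption is bounded by one item rather than the whole batch.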

Bare metal GPU: where it wins on cost-per-token

On flat, constant, predictable load, metal wins on cost-per-token by a wide margin. A card in your own rack costs a fixed sum per month regardless of how many tokens went through it, so at high utilization the cost of a token drops to a level hourly cloud does not reach.

Beyond the per-hour price, metal has two more levers. The first is control over topology: NVLink between cards, local NVMe for weights and cache, no virtualization layer between the model and the GPU. The second is egress: on metal, outbound traffic is typically not billed per gigabyte, while on heavy inference with long answers cloud egress adds up to a noticeable line. A baseline serving workload with constant traffic is almost always cheaper on metal, and the higher the card utilization, the larger the gap.
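
A rough way to find where the line falls is to solve for the break-even utilization. The sketch below is a simplification under stated assumptions: the cloud card is billed only for the hours it is actually busy, both cards have the same throughput, and egress is ignored. The function name and the prices are hypothetical.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def break_even_utilization(metal_monthly_cost, cloud_hourly_cost):
    """Utilization above which a fixed-price card beats paying cloud per busy hour.

    Metal costs metal_monthly_cost regardless of load; cloud is assumed to
    cost cloud_hourly_cost for each busy hour and nothing when idle.
    """
    return metal_monthly_cost / (HOURS_PER_MONTH * cloud_hourly_cost)


# Hypothetical prices, for illustration only.
u = break_even_utilization(metal_monthly_cost=1460.0, cloud_hourly_cost=4.0)
print(f"metal is cheaper above {u:.0%} utilization")
```

A result above 1 would mean metal never pays off at those prices; a result well below the utilization you actually sustain means the baseline belongs on metal.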

The cloud costs that hide

The cloud's per-hour price is not the whole cost. What lands in cost-per-token is what does not show up large on the provider's calculator:

An idle reserved card. You pay for a reserved instance in full whether it is busy or not. At low utilization the fixed payment spreads across fewer tokens, and cost-per-token rises just like on metal, only from a higher starting price.
Egress on long answers. Generative traffic goes out, and at volume it is a real line that metal does not have.
A scarcity premium. At peak demand for a specific card, on-demand pricing creeps up, and budgeting a quarter ahead becomes guesswork.

Cloud in GPU is a tool for the uneven, the scarce, and the short-lived. For a flat baseline load it almost always loses to metal on cost-per-token; the loss just arrives through utilization and egress rather than through the per-hour price.

How we choose for a client

On our test bench we cost both options on the client's real traffic profile, with their sequence-length distribution and their daily load curve. It usually comes out a hybrid: a flat base on metal, peaks and experiments in cloud, interruptible batch on spot. From there it becomes a layout: what lives where, how cards are added for the peak, which alerts watch utilization and VRAM.
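
As an illustration of how a hybrid falls out of a load curve, here is a minimal sketch, not our actual costing tool. It assumes cloud cards are billed per whole card-hour and only when needed, ignores spot and egress, and uses a made-up 24-hour curve and made-up prices; `daily_cost` and `best_base` are hypothetical names.

```python
import math


def daily_cost(load_curve, base_cards, metal_card_hour, cloud_card_hour):
    """Daily cost of serving an hourly load curve with a metal base plus cloud burst.

    load_curve: cards needed in each hour of the day.
    base_cards: cards kept on metal around the clock.
    """
    metal = base_cards * metal_card_hour * len(load_curve)
    burst_hours = sum(max(0, math.ceil(need) - base_cards) for need in load_curve)
    return metal + burst_hours * cloud_card_hour


def best_base(load_curve, metal_card_hour, cloud_card_hour):
    """Number of metal cards that minimizes daily cost for this curve."""
    peak = math.ceil(max(load_curve))
    return min(
        range(peak + 1),
        key=lambda b: daily_cost(load_curve, b, metal_card_hour, cloud_card_hour),
    )


# Hypothetical day: 8 quiet hours, 12 working hours, a 4-hour evening peak.
curve = [2] * 8 + [6] * 12 + [10] * 4
base = best_base(curve, metal_card_hour=2.0, cloud_card_hour=5.0)
print("metal base:", base)
print("hybrid:   ", daily_cost(curve, base, 2.0, 5.0))
print("all metal:", daily_cost(curve, 10, 2.0, 5.0))
print("all cloud:", daily_cost(curve, 0, 2.0, 5.0))
```

With these assumed numbers the cheapest layout keeps the working-hours level on metal and rents only the evening peak, which is cheaper than either sizing metal for the peak or running everything in cloud.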

If you need to understand what your inference will actually cost rather than comparing per-hour price tags, that is what we run in ai/llm and cover through deploy and scale. If you want cost-per-token worked out for your load on metal and in cloud, get in touch.