GPU inference: why vLLM is our default

When a client shows up with "deploy our LLM in production," the first engineering question is not the model or the GPU but the inference engine. It decides how many tokens per second you get off the same card, and therefore the cost-per-token the whole service lives on. Our default is vLLM. Not because of hype, but because on our polygon it consistently gives the best cost-per-token on typical serving loads. Below we explain what it actually wins and where we step away from it.

What we optimize: cost-per-token, not QPS

"Requests per second" is the wrong primary metric for an LLM service. Requests of different lengths cost radically different amounts: a 4,000-token prefill and a 50-token decode are two incomparable loads on the GPU. We measure cost-per-token separately for prefill and decode, plus p99 on time-to-first-token and on inter-token latency.

A GPU costs money every second, whether it is doing useful work or waiting. So the question is: what fraction of GPU-time turns into a real bill for tokens rather than idle between requests. An engine that keeps the card busy wins on cost-per-token, even if a synthetic single-request benchmark shows it is not the latency leader.

Why vLLM by default

Two mechanisms make vLLM the default for serving loads.

Continuous batching. A naive server assembles a batch, runs it whole, and only then takes the next one: short requests in the batch wait for the longest. vLLM adds and removes requests from the batch on every decode step, so a finished request immediately frees its slot for a new one. On mixed traffic, where request lengths jump around, this lifts real card utilization several times over static batching.

PagedAttention. The KV-cache is the main consumer of GPU memory at inference, and naively it is allocated as a contiguous block sized for the maximum length, which fragments memory and leaves it idle. vLLM lays the KV-cache out in pages, like an OS virtual memory. Fragmentation drops, more concurrent sequences fit on the same card, throughput rises, cost-per-token falls.

On top of that vLLM handles the plumbing that would otherwise land on us: tensor parallelism across several cards, an OpenAI-compatible endpoint, token streaming. Less of our code between the client and the GPU means less of our code that breaks at 3 a.m.

Where vLLM does not win

The default is not dogma. We move off vLLM in a few cases:

A hard single-request latency ceiling. When a client needs the absolute minimum time-to-first-token at low concurrency rather than throughput at high concurrency, TensorRT-LLM on an engine built for the specific card squeezes out latency vLLM does not reach. The price is lost flexibility: the engine must be rebuilt per model and per GPU.
A zoo of models and modalities on one server. When one piece of infrastructure runs a heterogeneous set of models plus preprocessing, Triton Inference Server as an orchestrator beats a single engine: it routes several backends and pipelines behind one front.
Exotic or heavily quantized weights that vLLM on the version we need does not yet handle cleanly. Then model support dictates the engine choice, not our preference.

The rule is simple: vLLM by default for serving loads with variable length and high concurrency, a specialized engine where the task has a narrow requirement the default does not cover.

The operational side

Engine choice is the start, not the end. In production it is usually not vLLM itself that breaks but what surrounds it: the KV-cache hits the memory ceiling and requests start getting preempted, OOM under growing context arrives not immediately but on the tail of the length distribution, tensor parallelism is sensitive to NVLink topology. So we alert not on "the process is alive" but on p99 inter-token latency, on the share of preempted requests, and on KV-cache occupancy. Those signals catch degradation before the client's users do.

What it looks like for a client

On the polygon we run the engines against the client's real traffic profile, not synthetics: we look at cost-per-token and p99 on their length distribution, and pick the default or the exception from the numbers. From there it becomes deployment and operation: which engine, which layout across cards, which alerts, who is on call.

If you need to stand up LLM inference with cost-per-token worked out rather than left to chance, that is what we run in ai/llm and cover through deploy and operate. Want to estimate cost-per-token for your model and traffic: get in touch.