Case study: vLLM inference across 3 regions, -60% cost per token

This post: an anonymized case we ran through spring and early summer 2025. The client name and the model name are under NDA, but the numbers and technical timeline are real, shared with the client's permission. The specific moment worth writing about: a 36-hour hunt for the 30 percent of throughput that "should have been there" per hardware spec and was not. The fix is now permanently in our week-1 checklist for any multi-GPU host.

Setup

Client: YC-backed LLM startup, eight engineers, one fine-tuned 70B checkpoint on top of an open base. At the time we engaged, they were burning around 140 thousand dollars a month on a managed inference API from one of the large vendors, and that number was growing 18-22 percent month over month. They closed a seed round, and a visible chunk of it was earmarked for "bring inference in-house."

The hardware was sourced from two bare-metal vendors across three regions. Frankfurt and Ashburn: 8×H100 PCIe each (PCIe specifically, not SXM, because HGX chassis were not available in their delivery window). Singapore: 16×L40S, for the smaller model variants and for speculative decoding. vLLM 0.4.x as the primary server, Triton Inference Server as a fallback for a few legacy endpoints they did not want to rewrite at the start.

Our scope: 4 months end to end. Stand up vLLM, route traffic via geo-DNS, ship observability with per-region cost-per-token, and run per-region autoscaling that accounts for each vendor's spot vs on-demand mix. No work on the client's model code itself, only the inference layer and everything around it.

Headline outcome after 4 months: cost per token down 60 percent against their managed-API baseline, p99 latency unchanged from what the managed API had been delivering, throughput per host 28 percent above what we had planned in the project budget (more on this below).

Week 1: hardware, kernel, vLLM up

Week 1 went the way week 1 should go on bare-metal inference. By day 2 all three regions were up at the OS and Kubernetes level. By day 4 the Frankfurt cluster was running single-host vLLM on one H100, and we ran a benchmark against the vendor's published numbers for this model profile.

Single-host single-GPU result: within 4 percent of the vendor reference on tokens/sec, KV-cache eviction <1 percent, p95 first-token latency inside the SLO. This was within expectations, and the client team exhaled for the first time since the round closed. We exhaled too, but with a caveat: the full-host benchmark was still ahead, and that one is what shows real capacity, not one GPU out of eight.

Week 2: scaling out to full multi-GPU

Days 5-7 went to bringing up the tensor-parallel configuration across all 8 H100s on a single host, plus prefix caching, plus pinning NCCL to the correct interfaces.

Day 8, morning: first full-host benchmark, same model, same prompt dataset as in the single-GPU test, only now tensor-parallel=8. We expected roughly 7.2x scaling from the single-GPU number (a realistic estimate, not the naive 8x). We got 5.0x. Roughly 30 percent of throughput was missing relative to what the spec and the single-host single-GPU extrapolation said it should be.

This is the point where the case gets interesting. A 30 percent benchmark delta during validation is not "boring tuning," it is something structurally wrong. We opened a Sev-3 incident (our internal trigger: vendor-number delta >10 percent during validation), assigned two engineers as duty leads, and started walking through the layers.
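The day-8 arithmetic and the incident trigger fit in a few lines. This is a minimal sketch: the function names are ours for illustration, and only the 7.2x, 5.0x, and 10 percent figures come from the case.

```python
def scaling_shortfall(expected_x: float, measured_x: float) -> float:
    """Fraction of the expected throughput that is missing (0.0 = on target)."""
    if expected_x <= 0:
        raise ValueError("expected_x must be positive")
    return (expected_x - measured_x) / expected_x


def opens_sev3(expected_x: float, measured_x: float, threshold: float = 0.10) -> bool:
    """True when the shortfall exceeds the validation threshold."""
    return scaling_shortfall(expected_x, measured_x) > threshold


shortfall = scaling_shortfall(7.2, 5.0)
print(f"missing: {shortfall:.1%}")  # missing: 30.6%
print(opens_sev3(7.2, 5.0))         # True
```

The same check applies to any benchmark-versus-reference pair, not only to scaling factors.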

Hours 0-8 (day 8 of the project): first hypotheses

The standard suspect list, in decreasing order of likelihood:

vLLM version. Checked: 0.4.x current, release notes for our model and parallelism combination mention nothing problematic. Bumped one minor in a dev environment: same delta. Dropped.
CUDA / NCCL. Versions matched what the H100 vendor recommended in their reference config for PCIe chassis. NCCL_DEBUG=INFO confirmed collectives came up on the expected interfaces, no TCP fallbacks.
PCIe gen and lane width. nvidia-smi topo -m and lspci -vv confirmed: gen5 x16 on all eight slots, no throttle down to gen4, no slots linked at x8. That ruled out the simplest physical cause.
Thermal throttling. Telemetry clean: 75-78°C under load, throttle flags in nvidia-smi -q empty, fan curve nominal per the vendor.
KV-cache eviction. vLLM metrics showed eviction rate <2 percent on the full load profile. Not enough to explain a 30 percent delta, not even close.
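The PCIe link check is easy to automate, so nobody has to eyeball eight LnkSta lines. A sketch, assuming the text comes from lspci -vv restricted to the GPUs (for example with -d 10de: for the NVIDIA vendor ID); the helper names and the sample lines are ours, not captured from the hosts in this case.

```python
import re

# PCIe per-lane signalling rate in GT/s -> generation.
GEN_BY_SPEED = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}

# Matches e.g. "LnkSta: Speed 16GT/s (downgraded), Width x16".
LNKSTA = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)")


def link_status(lspci_text: str) -> list[tuple[int, int]]:
    """Return (generation, lane width) for every LnkSta line found."""
    return [(GEN_BY_SPEED[float(speed)], int(width))
            for speed, width in LNKSTA.findall(lspci_text)]


def degraded(links, want_gen: int = 5, want_width: int = 16) -> list[int]:
    """Indices of links running below the expected generation or width."""
    return [i for i, (gen, width) in enumerate(links)
            if gen < want_gen or width < want_width]


# Hand-written sample lines for illustration only.
sample = """
    LnkSta: Speed 32GT/s, Width x16
    LnkSta: Speed 16GT/s (downgraded), Width x16
    LnkSta: Speed 32GT/s, Width x8 (downgraded)
"""
links = link_status(sample)
print(links)            # [(5, 16), (4, 16), (5, 8)]
print(degraded(links))  # [1, 2]
```

An empty list from degraded() is the "gen5 x16 on all eight slots" result described above.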

By hour 8 we had a clean list of not-causes. Useful, but not the answer.

Hours 8-24: deeper profiling

With the surface layers eliminated, we built per-layer latency by hand. Wrapped each transformer block in a pair of torch.cuda.synchronize() calls plus CUDA events before and after, ran 500 requests through the profiled build, and aggregated the per-block timing distributions.

The picture cleared up. The forward pass inside a single transformer block was within expectations. But the between-block latency, the part that gets eaten by tensor-parallel collectives over NCCL, jumped from the typical 2-3 µs per GPU pair to 12-18 µs on some pairs. Not on all of them. On roughly half.

Quick arithmetic: take an extra 10-15 µs across each of 80 transformer blocks, applied on a subset of GPU pairs inside the tensor-parallel group, and you get a throughput loss close to 25 percent, which is most of what was missing. We had finally located where it was leaking. The question now was why.
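The shape of that estimate, as a sketch. Only the 80 blocks and the 10-15 µs range come from the case. The two collectives per block (the usual count for Megatron-style tensor parallelism) and the 6 ms baseline step time are placeholders we picked to make the arithmetic easy to follow, and the model assumes the extra latency is fully exposed rather than overlapped with compute.

```python
def throughput_loss(blocks: int, collectives_per_block: int,
                    extra_us: float, baseline_step_ms: float) -> float:
    """Fraction of throughput lost when every collective gets extra_us slower.

    Assumes throughput is inversely proportional to step time.
    """
    extra_ms = blocks * collectives_per_block * extra_us / 1000.0
    return extra_ms / (baseline_step_ms + extra_ms)


# 80 blocks is from the case; 2 collectives per block and 6.0 ms are placeholders.
for extra_us in (10.0, 15.0):
    loss = throughput_loss(80, 2, extra_us, baseline_step_ms=6.0)
    print(f"+{extra_us:.0f} us per collective -> {loss:.0%} lost")
```

The result moves a lot with the assumed baseline, so treat it as a sanity check on the order of magnitude, not as a measurement.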

Working hypothesis at this point: something in the physical host topology was making peer-to-peer transfer between the "bad" GPU pairs more expensive than between the "good" ones. In theory all 8 H100 PCIe cards on a single motherboard are equivalent. In practice a host of this class is almost always dual-socket.

Hour 28: the right command

The command we should have read on day 1: nvidia-smi topo -p2p w, side by side with the nvidia-smi topo -m matrix. topo -m reports the connection type between every GPU pair; topo -p2p w adds whether peer-to-peer writes are actually supported between them.

The output was unambiguous: PIX (at most one PCIe bridge between the two GPUs, the cheap straight path) only within the NUMA group of each socket. Cross-NUMA pairs reported SYS, meaning the traffic was crossing the CPU root complex and then the UPI link between sockets. For GPU peer-to-peer traffic, that UPI hop costs significantly more latency than a direct PCIe switch path.
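Reading the matrix can be made mechanical. A sketch that takes the text of the nvidia-smi topo -m connection matrix and lists the GPU pairs reported as SYS; the function names and the 4-GPU sample are ours, written in the tool's matrix layout for illustration and not taken from the logs in this case.

```python
import re
from itertools import combinations

ANSI = re.compile(r"\x1b\[[0-9;]*m")   # strip terminal formatting, if any
GPU_NAME = re.compile(r"GPU\d+")


def parse_topo_matrix(text: str) -> dict[tuple[str, str], str]:
    """Map each ordered GPU pair to its connection type (PIX, PXB, NODE, SYS, ...)."""
    rows = [line.split() for line in ANSI.sub("", text).splitlines() if line.strip()]
    gpus = [tok for tok in rows[0] if GPU_NAME.fullmatch(tok)]
    links = {}
    for row in rows[1:]:
        if row[0] in gpus:
            for other, kind in zip(gpus, row[1:1 + len(gpus)]):
                if other != row[0]:
                    links[(row[0], other)] = kind
    return links


def cross_numa_pairs(links) -> list[tuple[str, str]]:
    """GPU pairs whose traffic crosses the inter-socket link (reported as SYS)."""
    gpus = sorted({a for a, _ in links})
    return [(a, b) for a, b in combinations(gpus, 2) if links[(a, b)] == "SYS"]


# Hand-written sample for illustration only.
sample = """
        GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0     X    PIX   SYS   SYS   0-31          0
GPU1    PIX    X    SYS   SYS   0-31          0
GPU2    SYS   SYS    X    PIX   32-63         1
GPU3    SYS   SYS   PIX    X    32-63         1
"""
print(cross_numa_pairs(parse_topo_matrix(sample)))
```

A non-empty result is the "cross-NUMA GPU pairs present: yes" case, and the list says which ones.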

The frustrating part: the output of this command was in our setup logs from day 1. We had run it as part of standard hardware fingerprinting. Nobody read it carefully, because the single-host single-GPU benchmark "looked fine" in aggregate, and the team had moved on to bringing up the other regions in parallel.

This is the exact mistake the case is worth writing about. The data was there, nobody read it.

Hours 28-36: the fix

Three steps in sequence, each in its own change so we could roll back any single piece if something broke.

Hour 28-30: NUMA pinning of vLLM workers. Wrapped the vLLM launch in numactl --cpunodebind=N --membind=N so that the worker serving GPUs 0-3 sat on the CPU and memory of NUMA node 0, and the worker on GPUs 4-7 on node 1. This solves half the problem: it removes cross-NUMA memory access but does not yet remove cross-NUMA P2P between GPUs if they end up in the same tensor-parallel group.

Hour 30-33: rebuilt the vLLM unit with the right CUDA_VISIBLE_DEVICES. The default config launched one tensor-parallel=8 process that greedily took all 8 GPUs and had no concept of NUMA boundaries. We split it into two tensor-parallel=4 processes, each seeing exactly the 4 GPUs that sit on its NUMA node. At the service level we added request routing between the two processes on one machine. The routing sits in front of them: a vLLM scheduler works within a single engine and does not balance across separate server processes.
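The first two steps combine into one launch line per NUMA node. A sketch that only builds the commands and does not start anything. Assumptions on our side: the OpenAI-compatible entrypoint, the ports, and the model path are placeholders, and CUDA_DEVICE_ORDER=PCI_BUS_ID is added so that the GPU indices match what nvidia-smi reports.

```python
import shlex


def launch_plan(numa_groups: dict[int, list[int]], model: str, base_port: int = 8000):
    """One (env, argv) pair per NUMA node: numactl-pinned, GPU-restricted vLLM server."""
    plan = []
    for i, (node, gpus) in enumerate(sorted(numa_groups.items())):
        env = {
            "CUDA_DEVICE_ORDER": "PCI_BUS_ID",  # keep indices aligned with nvidia-smi
            "CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus),
        }
        argv = [
            "numactl", f"--cpunodebind={node}", f"--membind={node}",
            "python", "-m", "vllm.entrypoints.openai.api_server",
            "--model", model,
            "--tensor-parallel-size", str(len(gpus)),
            "--port", str(base_port + i),
        ]
        plan.append((env, argv))
    return plan


# GPU-to-node mapping as described in the text; the model path is hypothetical.
for env, argv in launch_plan({0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}, "/models/finetuned-70b"):
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    print(prefix, shlex.join(argv))
```

The GPU-to-node mapping should come from the topology readout of the actual host, not from an assumption that GPUs 0-3 share a socket.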

Hour 33-36: Kubernetes node affinity and pod anti-affinity. So this would not break on the next restart or scale event, we wrote the rules: tensor-parallel pods get NUMA-local GPU groups through the device plugin with topology hints, and cross-NUMA tensor-parallel configurations are simply disallowed at the scheduler level.
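Independent of how Kubernetes enforces it, the rule itself is small enough to check in CI or at service start. A sketch with a hypothetical helper name; the GPU-to-node map mirrors the host layout described in this post.

```python
def validate_tp_groups(gpu_numa: dict[int, int], groups: list[list[int]]) -> None:
    """Raise if a tensor-parallel group spans NUMA nodes or a GPU is assigned twice."""
    seen: set[int] = set()
    for group in groups:
        nodes = {gpu_numa[g] for g in group}
        if len(nodes) != 1:
            raise ValueError(f"group {group} spans NUMA nodes {sorted(nodes)}")
        overlap = seen & set(group)
        if overlap:
            raise ValueError(f"GPUs {sorted(overlap)} assigned to more than one group")
        seen |= set(group)


# GPUs 0-3 on node 0, GPUs 4-7 on node 1.
gpu_numa = {g: 0 if g < 4 else 1 for g in range(8)}

validate_tp_groups(gpu_numa, [[0, 1, 2, 3], [4, 5, 6, 7]])  # passes silently

try:
    validate_tp_groups(gpu_numa, [list(range(8))])          # the old default
except ValueError as err:
    print(err)  # group [0, 1, 2, 3, 4, 5, 6, 7] spans NUMA nodes [0, 1]
```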

Hour 36: throughput +28%

After all three steps we reran the same full-host benchmark. Scaling came in at 7.4x of the single-GPU number, up from the previous 5.0x and slightly past the 7.2x we had expected. Throughput per host landed 28 percent above what we had planned in the project budget. Cost per token, recomputed against actual throughput, dropped another 12 percent on top of the savings already coming from leaving the managed API.

End-to-end incident window: roughly 36 hours from the first failed benchmark to the locked-in production configuration. No client traffic was on the hosts at this point. That is what the validation week is for.

What went into the runbook

Four entries, added to our internal week-1 checklist and to our project runbook templates.

NUMA topology inspection is mandatory. On every new multi-GPU host on day 1 we run nvidia-smi topo -m, nvidia-smi topo -p2p w, lscpu, and numactl --hardware, and link the output into the project runbook. Running it is not enough: the output is explicitly read and annotated by a human: "cross-NUMA GPU pairs present: yes/no, which ones."
Vendor-number benchmark delta is a Sev-3 trigger if greater than 10 percent. Previously this was an informal "let's discuss if it looks weird." Now it is an explicit threshold that opens an incident, assigns an owner, and goes through the standard incident flow. MTTA on validation-class incidents at this severity is currently 12-18 minutes for us.
Tensor-parallel placement is declared explicitly. No "vLLM will figure it out" on split-NUMA hosts. The service config spells out NUMA-bound GPU groups, and the Kubernetes scheduler knows the topology through the device plugin with topology hints.
Observability sidecar reports numa_misses. An extra metric from numastat on every node, exported to Prometheus, with an alert that fires if the miss percentage on an inference worker drifts above the baseline profile.
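For the last entry, the miss percentage can be derived from two snapshots of the kernel's per-node counters. A sketch, assuming the sidecar reads /sys/devices/system/node/nodeN/numastat and defines the percentage as numa_miss over numa_hit plus numa_miss for the interval; the post does not spell out either detail, and the function names are ours.

```python
def parse_numastat(text: str) -> dict[str, int]:
    """Parse 'name value' lines such as 'numa_hit 12345' into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def miss_percent(before: dict[str, int], after: dict[str, int]) -> float:
    """numa_miss as a percentage of hit + miss allocations between two snapshots."""
    hits = after["numa_hit"] - before["numa_hit"]
    misses = after["numa_miss"] - before["numa_miss"]
    total = hits + misses
    return 100.0 * misses / total if total else 0.0


# Made-up counter values for illustration only.
t0 = parse_numastat("numa_hit 1000\nnuma_miss 10\n")
t1 = parse_numastat("numa_hit 1900\nnuma_miss 110\n")
print(f"{miss_percent(t0, t1):.1f}%")  # 10.0%
```

These counters are per node, not per process. For a single inference worker, numastat -p with the worker's PID gives the per-process view.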

The lesson

Big-iron defaults assume a single-NUMA host. On a dual-socket machine with GPUs split across root complexes, the framework does not try to guess your topology: it picks the convenient default (one big tensor-parallel process across all visible GPUs), and that default quietly pays the cross-NUMA penalty on every collective.

This is not a vLLM bug and not an H100 PCIe bug. It is a place where there is no automation, and the engineer is expected to declare intent explicitly.

A second, more general lesson: "looks fine in aggregate" is the most expensive sentence in inference operations. The single-GPU benchmark looked fine. The setup logs looked fine. The thermal picture looked fine. The delta was visible only in one specific command, run on day 1 and not read until hour 28, and in the per-layer profiling done in hours 8-24 of the incident. Everything else was noise.

And the third, the most important one for us as a contractor: this story happened during validation week, not in production. No client traffic, no SLA window, no midnight emails. Validation exists precisely so stories like this happen on it rather than on live inference with real load. When we sell 4 months on standing up an inference cluster, that week is not in the budget for decoration.

The XIMTRX team