You raised, you bought GPUs, and now the bill is bleeding. Your ML team can train models but doesn't want to babysit vLLM, OOMs, and Triton at 3 AM. We run the inference layer: autoscaling, tracing, cost-per-token tracking. Your researchers stop being on-call.
We arrive with a working stack: vLLM and TensorRT-LLM for serving, Triton for multi-model endpoints, NVIDIA H100 / A100 as the baseline, Ray and Kubeflow for distributed fine-tuning. For each workload we wire up continuous batching, KV-cache offload to NVMe, and tensor parallelism sized to the model.
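As a rough illustration of what "tensor parallelism sized to the model" can mean, here is a minimal sketch that picks the smallest tensor-parallel degree at which the model weights fit on each GPU. It is an assumption-laden estimate, not our sizing procedure: the function names, the 60% share of GPU memory reserved for weights, and the candidate degrees are all hypothetical, and real sizing also depends on KV-cache demand, context length, and attention-head divisibility.

```python
from typing import Optional, Sequence


def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


def min_tensor_parallel(
    params_billion: float,
    bytes_per_param: float,
    gpu_mem_gib: float,
    weight_fraction: float = 0.6,             # assumed share of VRAM given to weights
    candidates: Sequence[int] = (1, 2, 4, 8),  # assumed allowed TP degrees
) -> Optional[int]:
    """Smallest TP degree whose per-GPU weight shard fits the budget, else None."""
    budget = gpu_mem_gib * weight_fraction
    need = weight_gib(params_billion, bytes_per_param)
    for tp in candidates:
        if need / tp <= budget:
            return tp
    return None  # does not fit on one node under these assumptions


if __name__ == "__main__":
    # 70B parameters on 80 GiB cards: FP16 (2 bytes) vs FP8 (1 byte).
    print(min_tensor_parallel(70, 2.0, 80))
    print(min_tensor_parallel(70, 1.0, 80))
```

Under these assumptions, halving the bytes per parameter also halves the number of GPUs needed for the weights, which is why quantization and parallelism are sized together.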
We bring hardware via our supplier contracts for H100/H200/B200, or we operate on top of your cloud account. Supply window for a dedicated GPU pool: 7-14 days depending on region and card model.
This is the AI / LLM subset of the stack. The platform layer is identical for every customer profile.
Concrete deliverables for AI / LLM teams. Each ships end-to-end with repo, IaC and runbooks.
H100 / A100 with continuous batching, paged attention, autoscaling on queue depth and p95 TTFT, request tracing via OpenTelemetry.
Several models behind one endpoint: dynamic batching, model-mesh for cold start <5s, A/B routing by header.
Per-GPU compilation (FP8, FP16, AWQ quantization), per-case benchmarks for latency and throughput, artifact cache in S3.
Multi-node DDP / FSDP / DeepSpeed-ZeRO, checkpoint store, auto-retry on preempt, GPU-utilization dashboards.
Per-model, per-tenant, per-region breakdown. Alerts on budget burn, savings recommendations across the spot/on-demand mix.
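To make "autoscaling on queue depth and p95 TTFT" from the deliverables above concrete, here is a minimal sketch of a scaling decision that combines both signals. Everything specific in it is an assumption for illustration: the function names, the target of 8 queued requests per replica, the 1-second TTFT objective, and the replica bounds are hypothetical, and a production controller would add cooldowns and smoothing.

```python
import math
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def desired_replicas(
    current: int,
    queue_depth: int,
    ttft_samples_s: Sequence[float],
    target_queue_per_replica: int = 8,  # assumed target
    ttft_slo_s: float = 1.0,            # assumed p95 TTFT objective
    min_replicas: int = 1,
    max_replicas: int = 16,
) -> int:
    """Scale to cover the queue; add one replica if p95 TTFT breaches the objective."""
    by_queue = math.ceil(queue_depth / target_queue_per_replica)
    slow = bool(ttft_samples_s) and p95(ttft_samples_s) > ttft_slo_s
    by_latency = current + 1 if slow else 0
    wanted = max(by_queue, by_latency)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(current=2, queue_depth=40, ttft_samples_s=[0.2] * 20))
```

Taking the maximum of the two signals means a deep queue scales out even while latency still looks healthy, and a latency breach scales out even when the queue is short.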
After handoff the pager lives with us. Coverage tuned for LLM workloads:
What we move without downtime for production inference.
Inference shifted off spot onto dedicated H100/H200: typical 60% reduction in cost per token, p99 latency stabilizes.
Offload traffic from a managed API onto your cluster: shadow mode, gradual per-tenant cutover, fallback to the proxy on incident.
GPU fleet moved from AWS p4d/p5 onto bare metal at Latitude.sh / DataPacket: 40% cost reduction, controlled artifact sync.
Model rebuild to FP8 or AWQ: 2x VRAM reduction, quality benchmark (perplexity, harness metrics) at each step.
Inference split across 3+ regions for latency and failover: geo-routing, model replication, tenant-sticky sessions.
Parallel shadow inference, output-matching quality control, gradual traffic cutover by cohort.
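The shadow-and-cutover pattern in the migrations above can be sketched in a few lines: assign each tenant to a stable cohort bucket, route buckets below the rollout percentage to the new stack, and compare shadow outputs against the live ones. This is a simplified illustration under stated assumptions: the function names are hypothetical, and exact string matching only makes sense with deterministic (greedy) decoding; real quality control usually needs a softer similarity metric.

```python
import hashlib
from typing import Iterable, Tuple


def cohort_bucket(tenant_id: str, buckets: int = 100) -> int:
    """Stable bucket in [0, buckets) derived from the tenant id."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(tenant_id: str, rollout_percent: int) -> str:
    """Send a tenant to the new stack once its bucket falls under the rollout."""
    return "new" if cohort_bucket(tenant_id) < rollout_percent else "old"


def match_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of (live, shadow) output pairs that agree after whitespace trimming."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    same = sum(1 for live, shadow in pairs if live.strip() == shadow.strip())
    return same / len(pairs)


if __name__ == "__main__":
    print(route("tenant-42", rollout_percent=25))
    print(match_rate([("Paris", "Paris "), ("4", "four")]))
```

Because the bucket depends only on the tenant id, a tenant that has moved to the new stack stays there as the rollout percentage grows, and rolling back is a matter of lowering one number.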
Anonymized. NDAs cover names; the numbers are real.
Three coverage levels. For production inference with user traffic we recommend Silver or higher: an OOM at 3 AM won't wait until morning.
| Tier | Response p95 (Sev-1) | Coverage | Incident report | Engineer hours / mo |
|---|---|---|---|---|
| Bronze | 30 min | Business hours, 5×8 | Within 48h | 40 |
| Silver | 15 min | 24/7 on-call rotation | Within 24h | 80 |
| Gold | 5 min | 24/7 with dedicated engineer | Within 12h | 160+ |
We come in on top of your hardware as a DevOps team. We bring up Kubernetes on GPU, MIG partitions, scheduling, and observability. In 2-3 weeks you have an inference cluster; in 4 weeks you have a signed SLA with 24/7 coverage. Provider billing stays with you.
Yes. We have open relationships with Latitude.sh, DataPacket, OpenMetal, and regional bare-metal operators. Supply window for H100 / H200: 7-14 days. Spot access to A100s usually within 72h. Send the spec; we reply with a concrete window in 24h.
Any open-weight transformer via vLLM or TensorRT-LLM: Llama, Qwen, Mistral, Mixtral, DeepSeek, Phi, Gemma. Custom architectures via Triton Python backend. Audio / vision models via Triton with an ensemble config.
We run the infrastructure for fine-tuning, not the fine-tuning itself. That means: distributed cluster on Ray / Kubeflow, checkpoint store, retry mechanics, GPU-utilization dashboards. The actual ML work (LoRA / SFT / DPO recipes) stays on your side. If you need an ML expert, we can bring in a partner.
We don't sign a fixed number (depends on model, context length, batching). We do sign: a per-tenant cost dashboard, a monthly perf review with concrete optimizations, alerts on >10% drift. Typical economics: 40-60% cost reduction in the first 2 months via batching, quantization, and a spot/on-demand mix.
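For a sense of what a per-tenant cost dashboard with a >10% drift alert computes, here is a minimal sketch. It assumes the tracked metric is cost per million tokens and that cost is attributed from GPU-seconds at a flat hourly rate; the function names, tenant names, and the $2 per GPU-hour figure are hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Usage = Tuple[str, float, int]  # (tenant, gpu_seconds, tokens_served)


def cost_per_million(usage: Iterable[Usage], gpu_hour_usd: float) -> Dict[str, float]:
    """USD per one million tokens, aggregated per tenant."""
    cost: Dict[str, float] = defaultdict(float)
    tokens: Dict[str, int] = defaultdict(int)
    for tenant, gpu_seconds, n in usage:
        cost[tenant] += gpu_seconds / 3600.0 * gpu_hour_usd
        tokens[tenant] += n
    return {t: cost[t] / tokens[t] * 1_000_000 for t in cost if tokens[t] > 0}


def drifted(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.10,
) -> Dict[str, float]:
    """Tenants whose metric moved by more than the threshold, with the relative change."""
    alerts = {}
    for tenant, base in baseline.items():
        if tenant in current and base > 0:
            change = (current[tenant] - base) / base
            if abs(change) > threshold:
                alerts[tenant] = change
    return alerts


if __name__ == "__main__":
    last_month = [("acme", 3600, 1_000_000), ("globex", 3600, 2_000_000)]
    this_month = [("acme", 3600, 800_000), ("globex", 3600, 1_950_000)]
    before = cost_per_million(last_month, gpu_hour_usd=2.0)
    after = cost_per_million(this_month, gpu_hour_usd=2.0)
    print(drifted(before, after))
```

In this example the same GPU time served fewer tokens for one tenant, so its cost per token rose by about a quarter and it is the only one flagged.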