
Managed infrastructure for LLM inference and fine-tuning

You raised, you bought GPUs, and now the bill is bleeding you. Your ML team can train but doesn't want to babysit vLLM, OOMs and Triton at 3 AM. We run the inference layer: autoscaling, tracing, cost-per-token. Your researchers stop being on-call.

We arrive with a working stack: vLLM and TensorRT-LLM for serving, Triton for multi-model endpoints, NVIDIA H100 / A100 as baseline, Ray and Kubeflow for distributed fine-tuning. For each workload we wire up continuous batching, kv-cache offload to NVMe, and tensor parallelism sized to the model.

We bring hardware via our supplier contracts for H100/H200/B200, or we operate on top of your cloud account. Supply window for a dedicated GPU pool: 7-14 days depending on region and card model.

> stack we operate

AI / LLM subset. The platform layer is the same for every customer profile (ICP).

AI / LLM: vLLM, Triton, TensorRT-LLM, NVIDIA H100 / A100, Ray, Kubeflow

Platform: Kubernetes, Terraform, Ansible, Prometheus, Grafana, Loki, OpenTelemetry, PagerDuty

> what we deploy

Concrete deliverables for AI / LLM teams. Each ships end-to-end with repo, IaC and runbooks.

[ vLLM inference cluster ] →

H100 / A100 with continuous batching, paged attention, autoscaling on queue depth and p95 TTFT, request tracing via OpenTelemetry.
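
A minimal sketch of how autoscaling on queue depth and p95 TTFT can be reduced to one decision function. The thresholds (8 queued requests per replica, a 300 ms TTFT target, 1 to 16 replicas) are illustrative assumptions, not values from a real deployment, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(
    current: int,
    queue_depth: int,
    p95_ttft_ms: float,
    target_queue_per_replica: int = 8,   # assumed target, tune per model
    ttft_slo_ms: float = 300.0,          # assumed TTFT target
    min_replicas: int = 1,
    max_replicas: int = 16,
) -> int:
    """Pick a replica count from queue depth and p95 time-to-first-token."""
    # Replicas needed to keep the per-replica queue at or below the target.
    by_queue = math.ceil(queue_depth / target_queue_per_replica)
    # Scale up proportionally when p95 TTFT is over the target;
    # latency alone never drives a scale-down.
    by_latency = 0
    if p95_ttft_ms > ttft_slo_ms:
        by_latency = math.ceil(current * p95_ttft_ms / ttft_slo_ms)
    wanted = max(by_queue, by_latency, min_replicas)
    return min(wanted, max_replicas)
```

Taking the larger of the two signals means a deep queue or a slow first token can each trigger a scale-up on its own. A production controller would also add a cooldown so the fleet does not flap.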

[ Multi-model serving on Triton ] →

Several models behind one endpoint: dynamic batching, model-mesh for cold start <5s, A/B routing by header.
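
A/B routing by header can be as small as the sketch below. The header name `x-model-variant` and the variant names are hypothetical; in practice the router sits in front of Triton and maps the chosen variant to a model name.

```python
def pick_model(
    headers: dict[str, str],
    default: str = "model-a",
    variants: frozenset[str] = frozenset({"model-a", "model-b"}),
) -> str:
    """Choose a model variant from a request header, falling back to the default."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    normalized = {key.lower(): value for key, value in headers.items()}
    requested = normalized.get("x-model-variant", "").strip().lower()
    # Unknown or missing variants never fail the request; they get the default.
    return requested if requested in variants else default
```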

[ TensorRT-LLM build pipeline ] →

Compile per-GPU (FP8, FP16, AWQ-quant), per-case benchmarks for latency and throughput, artifact cache in S3.

[ Distributed fine-tuning on Ray + Kubeflow ] →

Multi-node DDP / FSDP / DeepSpeed-ZeRO, checkpoint store, auto-retry on preempt, GPU-utilization dashboards.
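
A minimal sketch of auto-retry on preempt, assuming the training loop can resume from the last saved step. `Preempted`, `run_with_retry`, and the `load_step` / `save_step` callbacks are hypothetical names for illustration; they are not Ray or Kubeflow APIs.

```python
class Preempted(Exception):
    """Raised by the (hypothetical) training step when the node is reclaimed."""


def run_with_retry(train_step, total_steps, load_step, save_step,
                   checkpoint_every=100, max_retries=3):
    """Run train_step(step) up to total_steps, resuming from the checkpoint on preempt."""
    retries = 0
    step = load_step()  # last checkpointed step, 0 if there is none
    while step < total_steps:
        try:
            train_step(step)
            step += 1
            if step % checkpoint_every == 0:
                save_step(step)
        except Preempted:
            retries += 1
            if retries > max_retries:
                raise  # give up and let the pager fire
            # Work done since the last checkpoint is lost and replayed.
            step = load_step()
    save_step(step)  # final checkpoint
    return step
```

The checkpoint interval bounds how much work a preemption can cost: at most `checkpoint_every` steps are replayed per retry.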

[ Cost-per-token dashboard ] →

Per-model, per-tenant, per-region breakdown. Alerts on budget burn, savings recommendations across the spot/on-demand mix.
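
The arithmetic behind a cost-per-token figure is simple; the sketch below shows one way to compute it and to split a bill across tenants by token share. Function names and the example prices are assumptions for illustration, not quotes.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, gpu_count: int,
                            tokens_per_second: float) -> float:
    """Cost in USD per 1M generated tokens for a pool at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * gpu_count / tokens_per_hour * 1_000_000


def split_cost_by_tenant(total_cost_usd: float,
                         tokens_by_tenant: dict[str, int]) -> dict[str, float]:
    """Allocate a total cost to tenants in proportion to the tokens they used."""
    total_tokens = sum(tokens_by_tenant.values())
    if total_tokens == 0:
        raise ValueError("no tokens recorded")
    return {tenant: total_cost_usd * tokens / total_tokens
            for tenant, tokens in tokens_by_tenant.items()}


# Example with made-up numbers: 8 GPUs at $2.50/h sustaining 20,000 tokens/s.
example = cost_per_million_tokens(2.50, 8, 20_000)
```

Because throughput sits in the denominator, anything that raises sustained tokens per second (batching, quantization) lowers cost per token at the same hardware price.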

> what we operate 24/7

After handoff the pager lives with us. Coverage tuned for LLM workloads:

GPU health watchdog: ECC errors, thermal throttling, driver xid signals trigger preempt and load migration.
Auto-recovery from OOM: batch-size shrink, kv-cache eviction, model-version rollback playbook.
p95 / p99 latency SLO per endpoint: alert on drift >15% off baseline over 5 min.
Cost-per-token alerts: if actual cost climbs >10% over 24h, the on-call engineer finds the cause before the ticket is closed.
Versioned runbooks: model rollback, region traffic shift, deflake a flapping endpoint.
Monthly perf review: fresh benchmarks, updated spot/on-demand mix, batching recommendations.
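
The two drift thresholds above (latency >15% off baseline, cost >10% over 24h) can be sketched as one relative-increase check. This is an illustration, not the alerting rules as deployed: it assumes only upward drift should alert, and the function names are hypothetical.

```python
from statistics import quantiles


def p95(samples_ms: list[float]) -> float:
    """95th percentile of a window of latency samples (needs at least 2 samples)."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples_ms, n=100)[94]


def drifted(baseline: float, current: float, threshold: float) -> bool:
    """True when current exceeds baseline by more than the relative threshold."""
    return (current - baseline) / baseline > threshold


def latency_alert(baseline_p95_ms: float, window_ms: list[float]) -> bool:
    # window_ms is assumed to hold the samples from the last 5 minutes.
    return drifted(baseline_p95_ms, p95(window_ms), 0.15)


def cost_alert(baseline_cost: float, cost_last_24h: float) -> bool:
    return drifted(baseline_cost, cost_last_24h, 0.10)
```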

> migration scenarios

What we move without downtime for production inference.

spot-fleet to dedicated H100

Inference shifted off spot onto dedicated H100/H200: typical 60% reduction in cost per token, p99 latency stabilizes.

OpenAI proxy to in-house inference

Offload traffic from a managed API onto your cluster: shadow mode, gradual per-tenant cutover, fallback to the proxy on incident.
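
A minimal sketch of gradual per-tenant cutover with fallback, assuming each tenant is hashed into a stable bucket from 0 to 99. The labels `"cluster"` and `"proxy"` and the function names are hypothetical; shadow mode (mirroring traffic without serving from it) is not shown.

```python
import hashlib


def tenant_bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(tenant_id: str, cutover_percent: int, cluster_healthy: bool) -> str:
    """Send a tenant to the in-house cluster or keep it on the managed API proxy."""
    if not cluster_healthy:
        return "proxy"  # fallback on incident
    # Tenants whose bucket is below the dial are cut over; raising the dial
    # only ever adds tenants, it never moves one back.
    return "cluster" if tenant_bucket(tenant_id) < cutover_percent else "proxy"
```

Hashing the tenant ID rather than picking at random keeps every request from one tenant on the same backend at a given dial setting.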

cloud to bare-metal

GPU fleet moved from AWS p4d/p5 onto bare metal at Latitude.sh / DataPacket: 40% cost reduction, controlled artifact sync.

FP16 to FP8 / quantization

Model rebuild to FP8 or AWQ: roughly 2x less VRAM for weights at FP8, quality benchmark (perplexity, harness metrics) at each step.
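
The VRAM saving follows from bytes per parameter. The sketch below estimates weight memory only; it ignores KV cache, activations, and the small overhead of quantization scales, and it assumes 4-bit weights for AWQ. The 70B model size is an example, not a customer figure.

```python
# Bytes per parameter for each weight format.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes) for a model."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


# Example: a 70B-parameter model.
fp16_gb = weight_memory_gb(70, "fp16")  # 140 GB
fp8_gb = weight_memory_gb(70, "fp8")    # 70 GB
int4_gb = weight_memory_gb(70, "int4")  # 35 GB
```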

single-region to multi-region

Inference split across 3+ regions for latency and failover: geo-routing, model replication, tenant-sticky sessions.

engine swap (TGI to vLLM)

Parallel shadow inference, output-matching quality control, gradual traffic cutover by cohort.
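
Output matching during shadow inference can be sketched as a similarity check over paired responses. This is an illustration that assumes deterministic decoding (for example temperature 0) on both engines; the 0.98 similarity bar and the name `match_rate` are hypothetical.

```python
from difflib import SequenceMatcher


def match_rate(pairs: list[tuple[str, str]], min_ratio: float = 0.98) -> float:
    """Fraction of (old_engine, new_engine) output pairs that are near-identical."""
    if not pairs:
        raise ValueError("no output pairs to compare")
    matched = sum(
        1 for old, new in pairs
        if SequenceMatcher(None, old, new).ratio() >= min_ratio
    )
    return matched / len(pairs)
```

A cutover gate can then require the match rate for a cohort to stay above an agreed bar before that cohort's traffic moves.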

> cases

Anonymized. NDAs cover names; the numbers are real.

LLM startup · 4 mo · vLLM cluster across 3 regions · cost / token: -60% · p95 TTFT: 180 ms

Voice-AI product · 8 mo · 24 H100, multi-model Triton · 99.96% uptime · 10x autoscale at peak

B2B copilot · 6 mo · fine-tune pipeline + serving · time-to-experiment from 3 days to 4 hours

Research lab · 3 mo · 64 A100 spot fleet · 0 lost checkpoints over the quarter

> SLA tiers

Three coverage levels. For production inference with user traffic we recommend Silver or higher: an OOM at 3 AM won't wait until morning.

Tier	Response p95 (Sev-1)	Coverage	Incident report	Engineer hours / mo
Bronze	30 min	Business hours, 5×8	Within 48h	40
Silver	15 min	24/7 on-call rotation	Within 24h	80
Gold	5 min	24/7 with dedicated engineer	Within 12h	160+

> FAQ

We already bought H100s. What do you do?

We come in on top of your hardware as a DevOps team. We bring up Kubernetes on GPU, MIG partitions, scheduling, observability. In 2-3 weeks you have an inference cluster; in 4 weeks you have a signed SLA with 24/7 coverage. Provider billing stays with you.

Can you help source GPUs? Demand is brutal.

Yes. We have standing relationships with Latitude.sh, DataPacket, OpenMetal, and regional bare-metal operators. Supply window for H100 / H200: 7-14 days. Spot access to A100s usually within 72h. Send the spec; we reply with a concrete window in 24h.

Which models do you support in serving?

Any open-weight transformer via vLLM or TensorRT-LLM: Llama, Qwen, Mistral, Mixtral, DeepSeek, Phi, Gemma. Custom architectures via Triton Python backend. Audio / vision models via Triton with an ensemble config.

What about fine-tuning? Do you train models for us?

We run the infrastructure for fine-tuning, not the fine-tuning itself. That means: distributed cluster on Ray / Kubeflow, checkpoint store, retry mechanics, GPU-utilization dashboards. The actual ML work (LoRA / SFT / DPO recipes) stays on your side. If you need an ML expert, we can bring in a partner.

What cost-per-token guarantees do you offer?

We don't sign a fixed number (depends on model, context length, batching). We do sign: a per-tenant cost dashboard, a monthly perf review with concrete optimizations, alerts on >10% drift. Typical economics: 40-60% cost reduction in the first 2 months via batching, quantization, and a spot/on-demand mix.