
24/7 SRE on-call for Web3 and AI infrastructure

We own the pager: PagerDuty rotations, a signed SLA, and a post-mortem after every Sev-1. By default, Sev-1 alerts are acknowledged in under 15 minutes at p95. Alerts go to engineers, not voicemail.

> what's included

The operate phase starts after handoff from deploy, or when a retainer picks up existing infrastructure.

| Scope | Owned by us | Owned by you |
|---|---|---|
| 24/7 on-call, PagerDuty rotations, escalations | ✓ Owned | - |
| Incident triage and runbook-driven response | ✓ Owned | - |
| Runbook versioning and updates | ✓ Owned | - |
| OS patching, security updates, certificate rotation | ✓ Owned | - |
| Sev-1 post-mortems, trend reports | ✓ Owned | - |
| Application / contract release decisions | - (we ship on your cue) | ✓ Owned |
| Architectural change decisions | - (we recommend) | ✓ Owned |
| Provider and stakeholder billing | - | ✓ Owned |

> how we respond to incidents

Every alert hits a severity classification before it pages anyone. Sev-1 means money or keys are at risk: validator missing blocks, double-sign risk, GPU OOM in prod, a proof job stalled near a deadline. Sev-2 is SLO degradation without immediate loss. Sev-3 is a bug that can wait for business hours.
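The classification step above can be sketched in a few lines. This is a minimal illustration, not our production router: the `Alert` fields, the `classify` and `route` names, and the rule that Sev-1 and Sev-2 page while Sev-3 becomes a ticket are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # money or keys are at risk
    SEV2 = 2  # SLO degradation without immediate loss
    SEV3 = 3  # a bug that can wait for business hours


@dataclass(frozen=True)
class Alert:
    name: str
    funds_or_keys_at_risk: bool  # hypothetical label attached by the alert rule
    slo_degraded: bool           # hypothetical label attached by the alert rule


def classify(alert: Alert) -> Severity:
    """Assign a severity before anything is paged."""
    if alert.funds_or_keys_at_risk:
        return Severity.SEV1
    if alert.slo_degraded:
        return Severity.SEV2
    return Severity.SEV3


def route(alert: Alert) -> str:
    """Assumed routing: Sev-1 and Sev-2 page on-call, Sev-3 opens a ticket."""
    return "page" if classify(alert) <= Severity.SEV2 else "ticket"
```

For example, an alert such as `Alert("validator_missing_blocks", True, True)` classifies as Sev-1 and pages, while one with both flags false lands in the backlog as a ticket.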

Each Sev-1 has a runbook with recovery steps, escalation owners and side-effects to watch for. Runbooks live in Git next to IaC: every change is reviewed, nothing happens "from memory". The library currently holds 147 runbooks across the four ideal customer profiles (ICPs) we serve: Web3, AI / LLM, ZK and DePIN.

Time to acknowledge on Sev-1: <15 min at p95. Time to restore depends on the failure mode, but both numbers land in monthly SLA reports. After every Sev-1: blameless post-mortem within 24h with action items and owners.
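A p95 target is stricter than an average, because one slow acknowledgement can break it even when the mean looks healthy. The sketch below shows one way the figure could be computed for a monthly report. It assumes the nearest-rank percentile method and hypothetical function names; the actual reporting pipeline may use a different percentile definition.

```python
from datetime import timedelta

ACK_TARGET = timedelta(minutes=15)  # default Sev-1 target from the text above


def p95(durations):
    """Nearest-rank 95th percentile of a list of timedeltas."""
    if not durations:
        raise ValueError("no incidents in the reporting window")
    ordered = sorted(durations)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding
    rank = -((-95 * len(ordered)) // 100)
    return ordered[rank - 1]


def meets_ack_target(durations, target=ACK_TARGET):
    """True when the p95 acknowledgement time is strictly under the target."""
    return p95(durations) < target
```

With ten acknowledgements of 2, 3, 3, 4, 5, 6, 7, 8, 9 and 40 minutes, the mean is 8.7 minutes but the nearest-rank p95 is 40 minutes, so the month would miss a 15-minute target.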

Default observability stack: Prometheus + Grafana + Loki + OpenTelemetry. PagerDuty for on-call. If you already run Datadog, Honeycomb, or a custom setup, we work on top.

> stack we ship with

Web3: Cosmos SDK, Geth, Reth, OP Stack, Arbitrum Orbit, Polygon CDK, EigenDA, Celestia
AI / LLM: vLLM, Triton, TensorRT-LLM, NVIDIA H100 / A100, Ray, Kubeflow
ZK: SP1, RISC Zero, Boundless, Brevis, Jolt, Halo2
DePIN: Filecoin, Akash, Render, io.net, Gensyn
Platform: Kubernetes, Terraform, Ansible, Prometheus, Grafana, Loki, OpenTelemetry, PagerDuty

> engagement models

Operate is a long-running engagement by nature. We match the model to infra maturity and criticality.

> severity matrix

What counts as which Sev and what targets apply by default.

| Severity | Examples | Ack target (p95) | Post-mortem |
|---|---|---|---|
| Sev-1 | Validator missing blocks, double-sign risk, inference outage, prover offline against a deadline | <15 min, 24/7 | Within 24h |
| Sev-2 | SLO degradation without loss, single-node failure with redundancy in place, p95 latency drift | <1h business hours, <2h overnight | Within 5 business days |
| Sev-3 | Backlog bug, pending upgrade, scheduled window | Next business day | - |
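To make the default targets concrete, here is a small sketch that turns a page timestamp into an acknowledgement deadline. Several details are assumptions for illustration only: business hours are taken as Monday to Friday, 09:00 to 18:00 in the timestamp's own timezone, everything outside that window is treated as "overnight", and "next business day" is read as the end of the next working day.

```python
from datetime import datetime, time, timedelta

BUSINESS_START = time(9, 0)   # assumption, not a contractual definition
BUSINESS_END = time(18, 0)    # assumption, not a contractual definition


def is_business_hours(ts: datetime) -> bool:
    """Monday to Friday within the assumed working window."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END


def next_business_day_end(ts: datetime) -> datetime:
    """End of the next weekday after ts, skipping Saturday and Sunday."""
    day = ts.date() + timedelta(days=1)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, BUSINESS_END)


def ack_deadline(severity: str, paged_at: datetime) -> datetime:
    """Default acknowledgement deadline per the severity matrix."""
    if severity == "Sev-1":
        return paged_at + timedelta(minutes=15)  # same target around the clock
    if severity == "Sev-2":
        hours = 1 if is_business_hours(paged_at) else 2
        return paged_at + timedelta(hours=hours)
    if severity == "Sev-3":
        return next_business_day_end(paged_at)
    raise ValueError(f"unknown severity: {severity}")
```

Public holidays are ignored here; a real schedule would need a holiday calendar and explicit timezone handling.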

> what we'd build for you

Real coverage patterns across the four ICPs.

Web3 / Validators

24/7 validator monitoring with auto-failover and missed-block escalation. Alerts on double-sign signals, slashing watch, upgrade-window recommendations.

AI / LLM Inference

GPU health watch, OOM auto-recovery, model-rollback playbooks. p95 latency SLO, cost-per-token monitoring, capacity planning for peaks.

ZK / Prover Farms

Prover liveness checks tied to network deadlines. Proof-job queue monitoring, GPU utilization tracking, escalation when block deadlines are at risk.

DePIN / Distributed Networks

Per-node uptime SLA tracking with payout reconciliation. Auto-restarts, regional monitoring, weekly reward reconciliation against network dashboards.
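As an illustration of the weekly reward reconciliation mentioned above, the sketch below compares expected payouts per node with what was actually observed and flags the differences. The function name, the input shape (node ID mapped to a `Decimal` amount) and the 1% relative tolerance are assumptions for the example, not a description of any network's payout API.

```python
from decimal import Decimal


def reconcile(expected, observed, tolerance=Decimal("0.01")):
    """Return {node_id: (expected, observed)} for payouts outside tolerance.

    A node missing from either side is treated as a payout of zero, so
    unexpected payouts and missing payouts are both reported.
    """
    mismatches = {}
    for node_id in sorted(set(expected) | set(observed)):
        exp = expected.get(node_id, Decimal("0"))
        obs = observed.get(node_id, Decimal("0"))
        allowed = abs(exp) * tolerance  # relative tolerance on the expected amount
        if abs(exp - obs) > allowed:
            mismatches[node_id] = (exp, obs)
    return mismatches
```

Using `Decimal` rather than floats keeps token amounts exact, which matters when the differences being chased are small.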

> SLA tiers

Three coverage levels. Pick by criticality and budget.

| Tier | Response p95 (Sev-1) | Coverage | Incident report | Engineer hours / mo |
|---|---|---|---|---|
| Bronze | 30 min | Business hours, 5×8 | Within 48h | 40 |
| Silver | 15 min | 24/7 on-call rotation | Within 24h | 80 |
| Gold | 5 min | 24/7 with dedicated engineer | Within 12h | 160+ |
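One way to read the tiers is as a lookup: given the slowest Sev-1 response you can tolerate and whether you need round-the-clock coverage, pick the lowest tier that satisfies both. The sketch below encodes the table as data; the `Tier` fields and the `smallest_tier` helper are hypothetical, and budget is deliberately left out because pricing is not part of the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    response_p95_min: int   # Sev-1 response target, minutes
    covers_24x7: bool
    report_within_h: int    # incident report deadline, hours
    engineer_hours: int     # monthly hours; Gold is "160+", stored as its floor


# Ordered from least to most coverage, values taken from the SLA tiers table.
TIERS = [
    Tier("Bronze", 30, False, 48, 40),
    Tier("Silver", 15, True, 24, 80),
    Tier("Gold", 5, True, 12, 160),
]


def smallest_tier(max_response_min: int, needs_24x7: bool) -> Optional[Tier]:
    """Lowest tier meeting the response and coverage needs, or None."""
    for tier in TIERS:
        fast_enough = tier.response_p95_min <= max_response_min
        covered = tier.covers_24x7 or not needs_24x7
        if fast_enough and covered:
            return tier
    return None
```

A validator set that needs 24/7 coverage but can live with a 30-minute response would land on Silver, since Bronze only covers business hours.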

> FAQ

Yes, we can pick up existing infrastructure. We start with an operate-readiness audit (48h): we check monitoring, runbooks, SLOs and escalation paths. Any gaps get closed before the SLA is signed.

SLAs are tiered: Bronze (5×8, 30 min), Silver (24/7, 15 min), Gold (24/7 dedicated, 5 min). Targets cover response, post-mortem timing and monthly hours. Compensation clauses are negotiated case-by-case.

24/7 means 24/7. Holidays and weekends are covered in Silver and Gold. No "we'll try to reach someone": there's a PagerDuty rotation with an explicit escalation chain.

Runbook ownership is shared: the runbooks live in your Git, and we write and maintain them. When the engagement ends, they stay with you, with no lock-in to our tooling.

The retainer is re-quoted on meaningful scope changes (e.g. +50% nodes or a new ICP). Small changes inside a Silver/Gold tier are absorbed by the engineer hours included in the tier.

> ready to hand over the pager?

Tell us about the workload. We reply within 24 hours.