We own the pager. PagerDuty rotations, signed SLA, post-mortem after every Sev-1. Mean time to acknowledge: minutes. Alerts go to engineers, not voicemail.
The operate phase starts after handoff from deploy, or when a retainer picks up existing infrastructure.
| Scope | Owned by us | Owned by you |
|---|---|---|
| 24/7 on-call, PagerDuty rotations, escalations | ✓ Owned | - |
| Incident triage and runbook-driven response | ✓ Owned | - |
| Runbook versioning and updates | ✓ Owned | - |
| OS patching, security updates, certificate rotation | ✓ Owned | - |
| Sev-1 post-mortems, trend reports | ✓ Owned | - |
| Application / contract release decisions | - (we ship on your cue) | ✓ Owned |
| Architectural change decisions | - (we recommend) | ✓ Owned |
| Provider and stakeholder billing | - | ✓ Owned |
Every alert is classified by severity before it pages anyone. Sev-1 means money or keys are at risk: validator missing blocks, double-sign risk, GPU OOM in prod, a proof job stalled near a deadline. Sev-2 is SLO degradation without immediate loss. Sev-3 is a bug that can wait for business hours.
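A minimal sketch of that classification step, assuming alerts arrive with a signal label and an SLO-degradation flag. The signal names and the `classify` function are hypothetical and only illustrate the three levels described above.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # money or keys at risk
    SEV2 = 2  # SLO degradation without immediate loss
    SEV3 = 3  # can wait for business hours


# Hypothetical signal labels; a real deployment would use its own alert names.
SEV1_SIGNALS = {
    "validator_missing_blocks",
    "double_sign_risk",
    "gpu_oom_prod",
    "proof_job_stalled_near_deadline",
}


def classify(signal: str, slo_degraded: bool = False) -> Severity:
    """Assign a severity to an alert before it is routed to a human."""
    if signal in SEV1_SIGNALS:
        return Severity.SEV1
    if slo_degraded:
        return Severity.SEV2
    return Severity.SEV3


if __name__ == "__main__":
    print(classify("double_sign_risk").name)
    print(classify("p95_latency_drift", slo_degraded=True).name)
    print(classify("backlog_bug").name)
```

Keeping the Sev-1 signals in one explicit set means a change to what counts as Sev-1 goes through review like any other change.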
Each Sev-1 has a runbook with recovery steps, escalation owners and side effects to watch for. Runbooks live in Git next to IaC: every change is reviewed, nothing happens "from memory". The library is currently 147 runbooks across the four ICPs we serve.
Time to acknowledge on Sev-1: <15 min at p95. Mean time to restore depends on the failure mode, but both numbers land in monthly SLA reports. After every Sev-1: a blameless post-mortem within 24h, with action items and owners.
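As an illustration of how the p95 acknowledgement figure could be computed for a monthly report, here is a small sketch. It assumes each incident is a pair of page and acknowledgement timestamps and uses the nearest-rank percentile. The function names and the report shape are hypothetical.

```python
from datetime import datetime, timedelta

SEV1_ACK_TARGET_MIN = 15.0  # from the text: <15 min at p95


def ack_minutes(paged_at: datetime, acked_at: datetime) -> float:
    """Minutes between the page going out and a human acknowledging it."""
    return (acked_at - paged_at).total_seconds() / 60.0


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no incidents in the reporting window")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without floats
    return ordered[rank - 1]


def sev1_ack_report(incidents: list[tuple[datetime, datetime]]) -> dict:
    """Summarise Sev-1 acknowledgement times for one reporting window."""
    samples = [ack_minutes(paged, acked) for paged, acked in incidents]
    value = p95(samples)
    return {
        "incidents": len(samples),
        "ack_p95_min": value,
        "target_met": value < SEV1_ACK_TARGET_MIN,
    }


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0)
    window = [(base, base + timedelta(minutes=m)) for m in (3, 4, 6, 7, 9, 12, 2, 5, 8, 14)]
    print(sev1_ack_report(window))
```

With few incidents in a window, the nearest-rank p95 is close to the worst case, so a single slow acknowledgement shows up in the report.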
Default observability stack: Prometheus + Grafana + Loki + OpenTelemetry. PagerDuty for on-call. If you already run Datadog, Honeycomb, or a custom setup, we work on top.
Operate is a long-running engagement by nature. We match the model to infra maturity and criticality.
Operate readiness audit. 48h: we assess current infra, runbook gaps, SLO readiness.
For when you're not sure your infra is ready for a signed SLA.
Signed SLA + on-call. Bronze / Silver / Gold tiers. Monthly billing.
The default for operate. Most clients live here.
Burst on-call. For one-off events: mainnet launches, hard-forks, incentivized testnets with deadlines.
When you need to reinforce your team for 2-8 weeks.
What counts as which Sev and what targets apply by default.
| Severity | Examples | Ack target (p95) | Post-mortem |
|---|---|---|---|
| Sev-1 | Validator missing blocks, double-sign risk, inference outage, prover offline against a deadline | <15 min, 24/7 | Within 24h |
| Sev-2 | SLO degradation without loss, single-node failure with redundancy in place, p95 latency drift | <1h business hours, <2h overnight | Within 5 business days |
| Sev-3 | Backlog bug, pending upgrade, scheduled window | Next business day | - |
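
The default targets in the table can be expressed as a lookup. This is a sketch under stated assumptions: business hours are taken to be Monday to Friday, 09:00 to 18:00 in the timestamp's own timezone, and every other time is treated as "overnight". Neither detail is specified in the table. The targets are p95 figures, so the per-incident check below only flags outliers; it does not decide SLA compliance.

```python
from datetime import datetime, timedelta
from typing import Optional

BUSINESS_START_HOUR = 9   # assumption, not stated in the table
BUSINESS_END_HOUR = 18    # assumption, not stated in the table


def is_business_hours(ts: datetime) -> bool:
    """Monday to Friday within the assumed business window."""
    return ts.weekday() < 5 and BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR


def ack_target(severity: int, paged_at: datetime) -> Optional[timedelta]:
    """Default acknowledgement target; None means 'next business day'."""
    if severity == 1:
        return timedelta(minutes=15)  # applies 24/7
    if severity == 2:
        return timedelta(hours=1) if is_business_hours(paged_at) else timedelta(hours=2)
    if severity == 3:
        return None
    raise ValueError(f"unknown severity: {severity}")


def ack_is_outlier(severity: int, paged_at: datetime, acked_at: datetime) -> bool:
    """True when a single acknowledgement took at least the target duration."""
    target = ack_target(severity, paged_at)
    if target is None:
        return False
    return acked_at - paged_at >= target


if __name__ == "__main__":
    wednesday_morning = datetime(2024, 1, 3, 10, 0)
    print(ack_target(2, wednesday_morning))
    print(ack_is_outlier(1, wednesday_morning, wednesday_morning + timedelta(minutes=20)))
```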
Real coverage patterns across the four ICPs.
24/7 validator monitoring with auto-failover and missed-block escalation. Alerts on double-sign signals, slashing watch, upgrade-window recommendations.
GPU health watch, OOM auto-recovery, model-rollback playbooks. p95 latency SLO, cost-per-token monitoring, capacity planning for peaks.
Prover liveness checks tied to network deadlines. Proof-job queue monitoring, GPU utilization tracking, escalation when block deadlines are at risk.
Per-node uptime SLA tracking with payout reconciliation. Auto-restarts, regional monitoring, weekly reward reconciliation against network dashboards.
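To make the prover pattern above concrete, here is one way a "deadline at risk" check could look. It is a sketch, not a description of production tooling: the `ProofJob` fields, the linear extrapolation of remaining time, and the 1.2 safety margin are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProofJob:
    job_id: str
    progress: float       # fraction complete, 0.0 to 1.0
    elapsed_s: float      # seconds spent on the job so far
    deadline_in_s: float  # seconds left until the network deadline


def projected_remaining_s(job: ProofJob) -> float:
    """Linear extrapolation: assume the rest proceeds at the observed rate."""
    if job.progress <= 0.0:
        return float("inf")  # no progress yet, so no estimate is possible
    return job.elapsed_s * (1.0 - job.progress) / job.progress


def deadline_at_risk(job: ProofJob, safety_margin: float = 1.2) -> bool:
    """Escalate when the padded estimate no longer fits before the deadline."""
    return projected_remaining_s(job) * safety_margin > job.deadline_in_s


if __name__ == "__main__":
    jobs = [
        ProofJob("on-track", progress=0.5, elapsed_s=100, deadline_in_s=200),
        ProofJob("slow", progress=0.25, elapsed_s=300, deadline_in_s=600),
        ProofJob("stalled", progress=0.0, elapsed_s=50, deadline_in_s=1000),
    ]
    for job in jobs:
        print(job.job_id, deadline_at_risk(job))
```

A job with no measurable progress is treated as at risk by construction, which matches the intent of a liveness check: silence is a signal.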
Three coverage levels. Pick by criticality and budget.
| Tier | Response p95 (Sev-1) | Coverage | Incident report | Engineer hours / mo |
|---|---|---|---|---|
| Bronze | 30 min | Business hours, 5×8 | Within 48h | 40 |
| Silver | 15 min | 24/7 on-call rotation | Within 24h | 80 |
| Gold | 5 min | 24/7 with dedicated engineer | Within 12h | 160+ |
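
The tier table can be read as data. The sketch below picks the smallest tier that satisfies a required Sev-1 response time and a 24/7 requirement. The `Tier` type and `smallest_tier` helper are hypothetical, Gold's "160+" hours are stored as the floor value 160, and budget is left out because the table gives no prices.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    sev1_response_min: int  # Sev-1 response target, p95
    around_the_clock: bool  # True for 24/7 coverage, False for 5x8
    report_within_h: int    # incident report deadline
    engineer_hours: int     # included hours per month (Gold: floor of "160+")


# Ordered from least to most coverage, values taken from the table.
TIERS = [
    Tier("Bronze", 30, False, 48, 40),
    Tier("Silver", 15, True, 24, 80),
    Tier("Gold", 5, True, 12, 160),
]


def smallest_tier(max_response_min: int, needs_24x7: bool) -> Optional[Tier]:
    """Return the first tier meeting both requirements, or None if none does."""
    for tier in TIERS:
        fast_enough = tier.sev1_response_min <= max_response_min
        coverage_ok = tier.around_the_clock or not needs_24x7
        if fast_enough and coverage_ok:
            return tier
    return None


if __name__ == "__main__":
    choice = smallest_tier(max_response_min=30, needs_24x7=True)
    print(choice.name if choice else "no tier fits")
```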
Yes. We start with an operate-readiness audit (48h): we check monitoring, runbooks, SLOs, escalation paths. Any gaps get closed before the SLA is signed.
Tiered: Bronze (5×8, 30 min), Silver (24/7, 15 min), Gold (24/7 dedicated, 5 min). Targets cover response, post-mortem timing and monthly hours. Compensation clauses are negotiated case-by-case.
24/7 means 24/7. Holidays and weekends are covered in Silver and Gold. No "we'll try to reach someone": there's a PagerDuty rotation with an explicit escalation chain.
Shared. They live in your Git; we write and maintain them. When the engagement ends, they stay with you, with no lock-in to our tooling.
The retainer is re-quoted on meaningful scope changes (e.g. +50% nodes or a new ICP). Small changes inside a Silver/Gold tier are absorbed by the engineer hours included in the tier.
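The re-quote rule can be sketched as a simple check. The text gives +50% nodes and a new ICP only as examples of a meaningful scope change, so the fixed 0.5 threshold, the `needs_requote` function and the ICP labels used here are assumptions for illustration, not contractual terms.

```python
def needs_requote(
    baseline_nodes: int,
    current_nodes: int,
    baseline_icps: set[str],
    current_icps: set[str],
    growth_threshold: float = 0.5,  # assumption based on the "+50% nodes" example
) -> bool:
    """True when node growth reaches the threshold or a new ICP appears."""
    if baseline_nodes <= 0:
        raise ValueError("baseline_nodes must be positive")
    growth = (current_nodes - baseline_nodes) / baseline_nodes
    new_icp_added = bool(current_icps - baseline_icps)
    return growth >= growth_threshold or new_icp_added


if __name__ == "__main__":
    # Hypothetical ICP labels.
    print(needs_requote(20, 30, {"validators"}, {"validators"}))
    print(needs_requote(20, 24, {"validators"}, {"validators"}))
    print(needs_requote(20, 20, {"validators"}, {"validators", "provers"}))
```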