[ Validator set across 3 regions ] →
Cosmos SDK / Geth / Reth, key isolation on HSM/KMS, distributed lock against double-sign, slashing alerts, missed-block dashboards, failover playbook.
> solutions / web3
Testnet next quarter. Hiring an SRE who actually knows Cosmos SDK is a 4-month process, plus equity. We get your validators live in 3 regions in 5 days, with slashing alerts and a signed uptime SLA.
We arrive with a working stack: Cosmos SDK, Geth, Reth, OP Stack, Arbitrum Orbit, Polygon CDK, EigenDA, Celestia. For each client protocol we wire up signing observability, distributed locks against double-sign, HSM/KMS workflows, and runbooks for Sev-1 events (slashing triggers, peer drops, fork-choice mismatches).
Whitespace we sit on: post-deploy operations for L2s. Conduit, Caldera and Altlayer own sequencer ramp-up; ops after launch (monitoring, reboots, migrations, hard-fork cutovers) is still an open position. We take it.
Web3 subset. The platform layer is identical across ICPs.
Concrete deliverables for Web3 teams. Each one ships end-to-end with repo, IaC and runbooks.
Cosmos SDK / Geth / Reth, key isolation on HSM/KMS, distributed lock against double-sign, slashing alerts, missed-block dashboards, failover playbook.
Geth / Reth read-replicas with per-method rate-limits, hot-path caching, p95 latency SLO, geo-routing for global traffic.
Sequencer + batcher + proposer as separate processes, L1 finality monitoring, switchover playbook, hot-standby in another region.
Burst delivery for incentives programs: bare-metal sourcing, auto-onboarding, equal-load region spread, leaderboard-position dashboard.
Light nodes with signed uptime, retrieval latency tracking, missed-header playbook, sync with the consensus layer.
After handoff the pager lives with us. Coverage tuned for validators and rollups:
What we move without downtime and without exposing key material.
Validator-set cutover to mainnet with a key ceremony, state sync, checkpointing, and a rollback plan.
Validators moved off AWS/GCP onto Latitude.sh or OpenMetal: typical 40% reduction in per-node costs with no latency penalty.
Coordinated client upgrade for a known fork height: pre-flight checks, canary node, rolling restart by region.
L2 sequencer moved to another jurisdiction or provider without dropping blocks: hot-standby promote + DNS cutover.
RPC split per-region as traffic grows: anycast / geo-DNS, cache warm-up, per-region rate-limits.
Parallel sync, per-block checksum verification, smooth switch with no missed slots.
Anonymized. NDAs cover names; the numbers are real.
Three coverage levels. For validators and sequencers we recommend Silver or higher: slashing risk does not tolerate 5x8.
| Tier | Response p95 (Sev-1) | Coverage | Incident report | Engineer hours / mo |
|---|---|---|---|---|
| Bronze | 30 min | Business hours, 5×8 | Within 48h | 40 |
| Silver | 15 min | 24/7 on-call rotation | Within 24h | 80 |
| Gold | 5 min | 24/7 with dedicated engineer | Within 12h | 160+ |
You do. HSM/KMS workflow where keys never leave your control. We sign via a signer daemon with a distributed lock; we don't custody material. Optional MPC setup (CGGMP-21 / FROST) where the protocol supports it.
Architecturally we rule out double-sign through a distributed lock: the signing key flips to read-only if consensus with the other instance can't be reached. Financial responsibility depends on tier: Gold includes a slashing-insurance discussion, Bronze/Silver use a shared-risk model. Across 3 years of ops in the current team: 0 slashing incidents.
Supply window: 72h from signed contract to first live node. Regionally distributed rollout closes within 5-7 days. Send the protocol spec plus target regions; we reply with a concrete window in 24h.
Yes, it's one of our primary stacks. That covers custom modules, IBC relayers, governance voting, and upgrade-handler migrations across major versions. CometBFT, CosmWasm, IBC v2: all in scope.
Yes. Onboarding: 1 week to inventory existing infra, import IaC (or regenerate via Terraform), move keys via a ceremony, take the pager. If anything is critically broken before handoff we fix it first, then sign the SLA.