by XIMTRX team

Case study: ZK rollup, 6 months of validator ops, slashing: 0

Anonymized case study: validator ops + prover farm for a ZK rollup. 6 months in production, zero slashing, and a 48-hour incident where proof ETA drifted.

#case-study #zk #validators #prover-farm

This post: an anonymized walk-through of six months of validator operations and prover farm work for a ZK rollup team that hit mainnet in Q1 2025. No names, no protocol identifier, no topology details that would out the client. What stays in: the numbers, the timings, and one incident we caught and closed in 48 hours during the fifth month of operations.

Headline outcome: 6 months in production, zero slashing events, p95 proof time 22 percent below their internal baseline pre-onboarding. One Sev-2 along the way, in which proof ETA drifted 15-40 percent above normal, and which turned out not to live where we first looked.

Setup

The team: mid-stage, roughly 12 engineers post Series A, building a ZK rollup on an SP1-based zkVM with a custom proof aggregator on top. Sequencer is their own, an OP Stack fork. Before us they spent nine months on incentivized testnet and approached mainnet confident in the code and unsure who would carry the infrastructure 24x7 without burning out the core team.

The scope they handed us: validator/sequencer operations plus a 24-GPU prover farm (12x H100 for heavy aggregation circuits and 12x RTX 4090 for lighter per-batch witness jobs), spread across two regions. HSM key ceremony, runbook coverage, on-call rotation, follow-the-sun coverage across EU and APAC windows. Their engineers stayed owners of the protocol logic and the application layer. We took everything below the zkVM code: kernel, filesystems, GPU drivers, observability, incident response.

The first months

Onboarding ran along our standard playbook. HSM key ceremony in the first week, two geographies, two independent witnesses on the client side (the same people who later signed the genesis ceremony). By the end of week two the validator and sequencer were on mainnet, and the prover farm came up in waves: the first 8 GPUs in shadow mode alongside their existing setup, the next 8 five days later, and the last 8 a week after that with full cutover. Their own prover infrastructure went into hot standby and was decommissioned 30 days later.

The runbook grew from a 140-page baseline to about 320 pages over the first month, mostly the SP1 specifics for this deployment: per-circuit memory profile, the behavior of their proof aggregator under peak load, sequencer behavior during L1 reorgs. From day one of mainnet we had 24x7 monitoring with an MTTA target of under 15 minutes.

Five incidents in the first five months, all Sev-2 or lower. Once the CUDA driver on one of the H100 boxes started reporting ECC errors, and we moved the workload onto a hot spare in 22 minutes. Once an L1 reorg of 3 blocks forced the sequencer to resubmit a batch. Twice a queue depth alert fired before the provers were actually falling behind (false positive, we retuned the threshold). Once a kernel update hit a dependency edge we had not prepared for and slowed one of the RTX 4090 nodes; we rolled back within the hour. Average MTTA across those five: 11.4 minutes. Not one of them showed up on L1 as a missed batch or a slashing risk.

That is the normal steady state for validator ops we aim for: boring, by the runbook, nothing loud.

Month 5: proof ETA started drifting

Somewhere in the middle of the fifth month an alert fired on per-circuit ETA. For the aggregation circuit on H100 it consistently exceeded baseline by 15 percent, and for the heaviest witness circuits on RTX 4090 by 35-40 percent. Aggregate throughput still looked nearly normal because queue depth had not yet had time to build up.

MTTA on the alert: 8 minutes. Initial triage by the on-call engineer: 30 minutes, and within that half hour it was clear this was not a known runbook pattern. No comparable signal anywhere in the previous five months. Escalation to second-tier on-call, and from there the incident went onto the full cycle: shared client channel, hourly status updates, dedicated bridge.

Hours 0-8: first hypotheses

First thing we checked: had a fresh SP1 release shipped within the last two weeks that might have changed the load profile. One had, a minor release a week earlier. We rolled back the runtime to the previous build on two GPUs as a control group, no difference. Hypothesis ruled out.

Second: GPU thermal. nvidia-smi dmon across the pool showed H100 in the 68-74 C range, RTX 4090 in the 71-77 C range, zero throttling events over a 72-hour window. Ruled out.

Third: proof aggregator backpressure. Queue depth dashboards showed the aggregator was consuming witness data without delay, no input-side queue growth. Ruled out.

By the end of the EU shift three hypotheses were eliminated and we handed off to APAC on-call with a detailed list of what was NOT the problem and what to check next.

Hours 8-24: widening the search

The APAC engineer worked through the next four layers during their shift. CUDA driver: checked versions on all 24 GPUs, identical. L2 fork timing: the sequencer had observed nothing anomalous over the last 200 batches. Client app submission latency: looked at p95 and p99 across all input points, latency unchanged. Full-pool SP1 rollback: rolling rollback executed, ETA did not return.

By the end of the APAC shift, 24 hours after the first alert, we had a long list of what was NOT the root cause: SP1, CUDA, thermal, aggregator backpressure, fork timing, submission latency. Six hypotheses checked, none confirmed. That was already informative: the problem was not in the code, not in the drivers, not in the network, not in the workload itself.

Handoff back to EU. On the morning bridge with the client we said it plainly: we know where it is not, we do not yet know where it is.

Hour 28: the right tool

First hour of the EU shift, one of the engineers remembered that SP1 writes per-circuit witness data to a scratch volume, and that we had a metric for aggregate disk throughput but not for per-file latency. They ran filefrag on several recent witness files on one of the H100 nodes.

Result: 1400+ extents on a single 320MB file. We walked the rest of the pool: on one of the RTX 4090 boxes we found a 180MB file with 1800+ extents. Filesystem: ext4 on an NVMe scratch volume, mount options stock, nothing tuned for a write-heavy workload.
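Extent counts like the ones above come from the summary line filefrag prints for each file. A minimal sketch of collecting them programmatically, assuming e2fsprogs is installed on the node; the function names and the example path are hypothetical, and this is one way to script the check, not necessarily how it was run during the incident:

```python
import re
import subprocess

# filefrag ends its output with a summary line of the form
# "<path>: <N> extents found" (or "1 extent found" for an unfragmented file).
_EXTENTS_RE = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found\s*$")


def parse_extent_count(summary_line):
    """Extract the extent count from a filefrag summary line."""
    match = _EXTENTS_RE.match(summary_line)
    if match is None:
        raise ValueError(f"not a filefrag summary line: {summary_line!r}")
    return int(match.group("count"))


def extent_count(path):
    """Run filefrag on one file and return its extent count.

    Assumes the filefrag binary is on PATH; it may need elevated
    privileges depending on the filesystem and file ownership.
    """
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    # The summary is the last non-empty line of the output.
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    return parse_extent_count(lines[-1])
```

Calling extent_count("/scratch/witness-0001.bin") (a made-up path) on a prover node would return a single integer that can be shipped to whatever metrics pipeline is already in place.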

After five months of continuous writes (proof job creates a witness, reads it in random order to generate the proof, deletes it; per-GPU write volume was in the 200-400 GB per day range) ext4 on those volumes had accumulated 18-22 percent fragmentation. iostat -xz 1 looked completely fine because aggregate sequential throughput had not dropped. What had dropped was per-circuit random-read latency on witness reads. A metric we did not have.

That was the root cause. Not SP1, not CUDA, not the GPU. The boring filesystem layer, quietly degrading for five months under a workload it had not been tuned for at initial setup.

Hours 28-44: the fix

We staged the rollout in two phases to avoid concurrent risk across the whole pool.

Phase 1, hours 28-34: a hot-spare NVMe already sitting in each machine was formatted with a fresh ext4 and mounted with mount -o noatime,nodiratime,data=ordered, plus large-extent preallocation. (noatime already implies nodiratime, and data=ordered is the ext4 default, so the effective changes were noatime, preallocation, and a clean filesystem.) We drained the first 4 GPUs (2x H100 + 2x RTX 4090), migrated the scratch volume, and did a rolling restart of the proof workers. One hour after restart we collected metrics: per-circuit ETA on those four GPUs was back in the -3 to +1 percent band relative to baseline. Hypothesis confirmed.

Phase 2, hours 34-44: same procedure on the remaining 20 GPUs, four at a time during drain, no full prover farm halt. By hour 44 all 24 GPUs were running on fresh scratch volumes with the right mount options.
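The rolling drain can be sketched as splitting the pool into fixed-size groups and handling one group at a time, so most of the farm keeps proving throughout. This is an illustrative sketch only: the GPU identifiers and the function name are hypothetical, and real scheduling would also account for region and GPU model.

```python
def plan_drain_batches(gpu_ids, batch_size=4):
    """Split a GPU pool into consecutive drain groups of at most batch_size.

    The first group plays the role of the phase 1 canary; the remaining
    groups are drained one after another in phase 2.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(gpu_ids)
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# Hypothetical identifiers for a 24-GPU pool.
pool = [f"gpu-{i:02d}" for i in range(24)]
batches = plan_drain_batches(pool, batch_size=4)
canary, rest = batches[0], batches[1:]
```

With four GPUs per group, at most one sixth of a 24-GPU pool is out of rotation at any moment.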

Hour 48: ETA back at baseline

Four hours after phase 2 completed, at the 48-hour mark from the first alert, per-circuit ETA across the entire pool was 23 percent below the peak drift and inside SLA. Aggregate proof throughput came in 18 percent higher than it had been 24 hours before the incident: random-read latency on witness reads had been eating performance not just during the peak window but for several weeks before, just below the alert threshold.

Total incident window: ~48 hours from first alert to back in SLA. Severity: Sev-2 at peak, zero missed batches on L1, zero slashing risk at any point. The client postmortem went out 22 hours after the incident closed: 14 pages, no blame, with a concrete list of what we changed in the runbook.

What went into the runbook

Four changes shipped into our internal runbook (and the client-facing copy) from this incident:

  • ext4 fragmentation check at week 12 for any GPU workload writing >100 GB/day. filefrag over a sample of 50 random files on the scratch volume, alert if median extents per file exceeds 200 or the 95th percentile exceeds 1000.
  • filefrag output ships in the proof-job sidecar. Every proof job now logs the fragmentation of its witness file into a structured log. That gives us per-circuit trend rather than a one-shot snapshot, and lets us catch degradation before it hits ETA.
  • Per-circuit p95 ETA alert independent of aggregate throughput. The main lesson on metrics. The aggregate number had been hiding a per-circuit regression for weeks. The p95 ETA alert is now computed independently per circuit class.
  • Hot-spare NVMe pre-formatted with explicit options. Every prover machine now carries a second NVMe pre-formatted with ext4 + noatime,nodiratime,data=ordered, ready for hot swap. That turned phase 1 of the fix from "find a disk and format it" into "switch the mount point", which cut the GPU drain window from an estimated 90 minutes to 22.
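The fragmentation rule from the first two items can be sketched as follows. The thresholds (median above 200, 95th percentile above 1000) and the sample size of 50 come from the runbook change above; the function names, the nearest-rank percentile method, and the log field names are assumptions made for illustration.

```python
import json
import math
import statistics

MEDIAN_EXTENT_LIMIT = 200   # from the runbook rule above
P95_EXTENT_LIMIT = 1000     # from the runbook rule above


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method; others are possible)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def fragmentation_alert(extent_counts):
    """Evaluate the rule over a sample of per-file extent counts."""
    median = statistics.median(extent_counts)
    p95 = percentile(extent_counts, 95)
    return {
        "median_extents": median,
        "p95_extents": p95,
        "alert": median > MEDIAN_EXTENT_LIMIT or p95 > P95_EXTENT_LIMIT,
    }


def sidecar_record(job_id, circuit_class, witness_extents):
    """Build one structured log line for a proof job (hypothetical fields)."""
    return json.dumps(
        {
            "job_id": job_id,
            "circuit_class": circuit_class,
            "witness_extents": witness_extents,
        },
        sort_keys=True,
    )
```

Feeding fragmentation_alert the extent counts of 50 randomly sampled scratch files gives the week-12 check; emitting sidecar_record per job gives the per-circuit trend.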

The lesson

The boring filesystem layer is where regressions hide after months of running. No code is broken, no driver is at fault, no release broke anything. The workload simply accumulated debt on a layer that had not been profiled for that write volume. Aggregate metrics keep looking healthy because they are measuring something other than what is degrading. When your monitoring is built only on aggregates it will not see the problem until it surfaces in SLA.
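To make the aggregate-versus-per-class point concrete, here is a minimal sketch of computing p95 ETA drift separately for each circuit class. The class names, the 10 percent threshold, and the nearest-rank percentile are all assumptions for illustration, not the production alert definition.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def per_class_p95_drift(samples, baselines):
    """Return fractional p95 drift per circuit class.

    samples:   iterable of (circuit_class, eta_seconds) pairs
    baselines: dict mapping circuit_class to its baseline p95 in seconds
    A value of 0.15 means p95 ETA is 15 percent above baseline.
    """
    by_class = defaultdict(list)
    for circuit_class, eta in samples:
        by_class[circuit_class].append(eta)
    return {
        circuit_class: percentile(etas, 95) / baselines[circuit_class] - 1.0
        for circuit_class, etas in by_class.items()
    }


def drifting_classes(drift, threshold=0.10):
    """Classes whose drift exceeds the (hypothetical) alert threshold."""
    return sorted(c for c, d in drift.items() if d > threshold)
```

Because each class is compared against its own baseline, a slow heavy circuit cannot be averaged away by a large volume of healthy light jobs.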

The 48-hour incident budget held, and not because we got lucky. It held because we had APAC and EU follow-the-sun rotation that did not lose context across handoffs, a hot-spare NVMe already in every machine, and a postmortem format rehearsed on previous Sev-2 incidents. A little over a day to diagnose somebody else's filesystem debt, 16 hours for a two-phase rollout, zero slashing for the entire window. That is the managed DevOps we sell: not "we never have incidents", but "incidents happen and close inside the budget we promised".

The XIMTRX team
