Case study: DePIN sub-operator, 200 nodes across 8 regions, 99.94% uptime

This post walks through one of our ongoing engagements: a DePIN sub-operator whose fleet we have been running for more than half a year. By agreement with the client we are not naming the protocol or the operator, but we can describe the architecture, the working regime, and one incident that turned out to be instructive. The short version: 200 nodes across 8 regions, 99.94% rolling 30-day uptime, and a 28-hour investigation into a DNS incident that was quietly eating about 1.8% of reward on the EU slice of the fleet.

Setup

The client is a DePIN sub-operator running a storage / edge-compute workload (the workload class is comparable to Filecoin or io.net; the specific protocol stays unnamed). At onboarding they had roughly 130 nodes, deployed by hand by a single engineer who burned out and left. By the time we took over the fleet, the count was 200 nodes across 8 regions: 3 in EU, 2 in NA, 2 in APAC, 1 in LATAM. Provider mix: 4 colocation facilities and 1 bare-metal-on-demand vendor for capacity spikes.

The network's reward token settles on-chain once per 6-hour epoch. Nodes submit proofs of storage to the network's reward server; the server aggregates them and writes on-chain at epoch close. Internally we run our own expected-reward calculator: it observes submissions, applies protocol rules, and predicts what reward should arrive. Reconciliation then compares the on-chain receipt against the prediction. Drift >0.5% in a single epoch fires Sev-2. Drift >0.2% sustained across several epochs fires Sev-3.
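The reconciliation rule above can be sketched in a few lines. This is a minimal illustration, not our production calculator: the function names are hypothetical, drift is assumed to be the absolute gap between prediction and receipt as a percentage of the prediction, and "several epochs" is assumed to mean three.

```python
from typing import List, Optional

SEV2_PER_EPOCH = 0.5   # percent drift in a single epoch (from the text)
SEV3_SUSTAINED = 0.2   # percent drift sustained across epochs (from the text)
SUSTAINED_EPOCHS = 3   # ASSUMPTION: "several epochs" is taken to mean 3


def drift_pct(expected: float, received: float) -> float:
    """Gap between predicted and on-chain reward, as a percent of predicted."""
    if expected <= 0:
        raise ValueError("expected reward must be positive")
    return abs(expected - received) / expected * 100.0


def classify(history: List[float]) -> Optional[str]:
    """Classify per-epoch drift percentages (oldest first, latest last)."""
    if not history:
        return None
    # A single bad epoch is enough for Sev-2.
    if history[-1] > SEV2_PER_EPOCH:
        return "Sev-2"
    # Sev-3 needs every one of the last N epochs above the sustained threshold.
    recent = history[-SUSTAINED_EPOCHS:]
    if len(recent) == SUSTAINED_EPOCHS and all(d > SEV3_SUSTAINED for d in recent):
        return "Sev-3"
    return None
```

Checking Sev-2 first means one sharp jump pages immediately, while slow drift has to persist before it alerts.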

Our scope: day-2 ops, regional rebalancing (when the protocol shifts weights between regions, we move capacity), reward reconciliation, and incident response with our standard MTTA <15 min and root cause inside a 12-24h window.

The first few months

Boring. Which is, broadly, what good operational engagements should look like. The fleet was stable, and reconciliation drift floated in the 0.18-0.4% band against expected. The reconciliation Sev-3 alert never fired in the first four months. A few routine events: one Singapore colo announced a maintenance window, so we moved 7 nodes onto temporary capacity at the bare-metal vendor and brought them back 36 hours later. A disk in Amsterdam threw a SMART warning and was swapped with no downtime for the workload. One protocol upgrade, rolled out canary-style (3 nodes, then 25, then the whole fleet) over 18 hours.

If the engagement had ended there, there would be nothing to write about. The interesting part started in month five.

Month 5, Thursday 14:32 UTC: drift Sev-2

Reconciliation alert: drift jumped from 0.32% (previous epoch) to 1.84% inside a single 6-hour epoch. Pager fired. On-call ack in 6 minutes, inside SLO.

Initial triage showed the drift was structural, not random: 17 nodes in the EU regions were "ghosting". They were submitting proofs of storage and our local metrics saw the submissions, but the reward server was not registering them. The reward server treated those nodes as offline for the epoch, and that ate roughly 1.6 percentage points of expected reward. Not catastrophic, but not noise either: clearly something had broken, and broken in the same way across a cohort.
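The triage step amounts to a set difference followed by a group-by. A minimal sketch, assuming we have the set of node IDs that submitted according to local metrics and the set the reward server registered for the epoch; the function names and the node-to-region mapping are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def find_ghosting(submitted_locally: Set[str], registered: Set[str]) -> Set[str]:
    """Nodes that believe they submitted but that the reward server never saw."""
    return submitted_locally - registered


def count_by_region(nodes: Set[str], region_of: Dict[str, str]) -> Counter:
    """Count affected nodes per region to see whether the cohort clusters."""
    return Counter(region_of[n] for n in nodes)
```

If the counts pile up in one region, the problem is more likely a shared dependency than a per-node fault.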

Hours 0-6: first hypotheses

Worked through the obvious layers:

Node health: processes alive, disks healthy, CPU and memory within normal range, local logs showing successful submissions.
Upstream network from each node to the edge: packet loss within normal, RTT normal, nothing dropping.
TLS to the reward-server endpoint: certificate valid, SNI correct, handshake completing.
On-chain settlement: no reorgs in the relevant window, blocks arriving, no missed blocks.
Per-region NTP: spread between nodes <10 ms, well inside protocol tolerance.

By hour 6 all the obvious layers looked clean. Nodes thought they had submitted. Nodes saw the endpoint reply 200. And the reward server simultaneously was not seeing submissions from those 17 nodes. Somewhere along the path, submissions were vanishing.

Hours 6-12: deeper network observability

Ran mtr from each of the "ghosting" nodes to the reward endpoint and compared against routes from 5 nodes in the same region that were healthy. All 22 routes looked clean. No AS-hop anomalies, no peering degradation.

But in the comparison we noticed something: 12 of the 17 "ghosting" nodes were resolving rewards-eu.<network>.io to a different IP than the 5 "healthy" nodes in the same region. And the reward server, naturally, was registering submissions sent to one of those IPs but not the other. Submissions were going out, but into the void: hitting an old IP that technically answered but was no longer the logically active reward server for the region.
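The comparison that surfaced the split is easy to automate. A minimal sketch, assuming each node's resolved IP has already been collected into a dict; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_resolved_ip(resolved: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each resolved IP to the sorted list of nodes that got that answer."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node in sorted(resolved):
        groups[resolved[node]].append(node)
    return dict(groups)


def divergent_nodes(resolved: Dict[str, str], healthy: Iterable[str]) -> List[str]:
    """Nodes whose answer matches none of the IPs seen by known-healthy nodes."""
    good_ips = {resolved[n] for n in healthy}
    return sorted(n for n, ip in resolved.items() if ip not in good_ips)
```

More than one key in the grouped result for a single hostname within one region is the signal worth chasing.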

The question became: whose DNS was lying?

Hour 14: the right resolver tool

This is where it got interesting. On a "ghosting" node we ran resolvectl statistics. The resolver's cache hit rate was near 100%, which is in principle normal when a node queries the same name over and over. Then resolvectl query rewards-eu.<network>.io --cache=no: it returned a different IP than the cached one. So the systemd-resolved cache was serving a stale record.

We checked the age of the cached record: more than 50 minutes. The reward server publishes an A record with TTL 60 seconds. The node should have re-resolved at least 50 times in that window. It had not done so at all.
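The arithmetic behind that observation, as a small sketch. The helper names are hypothetical; the only inputs are the record age and the published TTL.

```python
def expected_refreshes(age_seconds: int, ttl_seconds: int) -> int:
    """How many times a TTL-respecting cache should have re-resolved by now."""
    if ttl_seconds <= 0:
        raise ValueError("TTL must be positive")
    return age_seconds // ttl_seconds


def is_stale(age_seconds: int, ttl_seconds: int) -> bool:
    """A cached record older than its TTL should no longer be served."""
    return age_seconds > ttl_seconds
```

For a 50-minute-old record with a 60-second TTL, `expected_refreshes(50 * 60, 60)` gives 50, matching the figure above.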

Cause, as far as we could reconstruct it: systemd-resolved has an internal maximum cache time (the default at the time of the incident was around 2 hours), and on these nodes that limit was silently taking precedence over the TTL on incoming responses. In effect, the 60s response TTL was applied as a lower bound, not as an upper bound. Once a record was in cache, it stayed there for as long as the resolver decided, not for as long as the origin permitted.

The backstory, which we pieced together later from the client and public network documentation: the reward server uses DNS-based regional failover with a 60s TTL. They run normal blue-green deploys, 4-5 times per week in the EU region, flipping the active IP. In the prior months our nodes had also been hitting old IPs for up to 30 minutes after each cutover. Most of the time the old IP was still "alive" enough to accept the connection and forward submissions to the new backend. That week they rolled the cutover onto a fresh host behind a stricter ACL: the old IP answered TCP, answered 200 on health check, but silently dropped submissions from an unauthorized source.

That explained everything. Submissions were going out, the endpoint replied 200, the reward server never saw those submissions.

Hours 14-28: fix by Ansible

The fix split into three steps.

Hour 14-16, targeted fix. Wrote an Ansible role replacing systemd-resolved with unbound on the 17 EU nodes. Minimal unbound config with cache-max-ttl: 0 for the reward-server zone and normal TTL for everything else. Rolled out across the 17 nodes, verified on each one that dig +trace was returning the current IP and that the reward server had started seeing submissions.

Hour 16-22, fleet rollout. Same role in batches of 25 nodes to the remaining 183 nodes across 8 regions. 30-minute observation window between batches. We discussed dnsmasq as an alternative and rejected it: unbound is already running on several bare-metal nodes for another project, and we did not want to operate two resolvers on different slices of the fleet.
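The rollout schedule is simple arithmetic. A minimal sketch with hypothetical helper names, using placeholder node IDs:

```python
from typing import List


def plan_batches(nodes: List[str], batch_size: int) -> List[List[str]]:
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def observation_minutes(num_batches: int, window_min: int) -> int:
    """Total time spent in observation windows between consecutive batches."""
    return max(num_batches - 1, 0) * window_min


remaining = [f"node-{i:03d}" for i in range(183)]  # placeholder IDs
batches = plan_batches(remaining, 25)
```

For 183 nodes this gives 8 batches (seven of 25 and one of 8) and 7 observation windows, so 210 minutes of pure waiting inside the 6-hour slot.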

Hour 22-28, synthetic check and runbook. Added a synthetic probe: every node once per epoch runs dig +trace against its regional reward endpoint and reports the resolved IP into our central inventory. If any single node lags more than one epoch behind the active IP seen by the rest of the fleet, Sev-3 fires. In parallel we finalized the unbound config: cache-max-ttl: 0 for the reward-server zone, default values for everything else. No extra outbound traffic anywhere it does not matter.
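The alerting side of the probe could look roughly like this. It is a sketch under stated assumptions, not our inventory code: the names are hypothetical, reports are assumed to arrive as one dict per epoch for a single region, and "the active IP seen by the rest of the fleet" is approximated by a simple majority vote.

```python
from collections import Counter
from typing import Dict, List

Reports = Dict[str, str]  # node id -> resolved IP reported for one epoch


def active_ip(reports: Reports) -> str:
    """ASSUMPTION: the most commonly reported IP is the active one."""
    if not reports:
        raise ValueError("no reports for this epoch")
    return Counter(reports.values()).most_common(1)[0][0]


def lagging_nodes(epochs: List[Reports], max_lag_epochs: int = 1) -> List[str]:
    """Nodes that disagreed with the majority for more than max_lag_epochs.

    epochs is ordered oldest first. A node is flagged only if it diverged in
    every one of the last max_lag_epochs + 1 epochs.
    """
    window = epochs[-(max_lag_epochs + 1):]
    if len(window) <= max_lag_epochs:
        return []  # not enough history to say anything yet
    suspects = set(window[0])
    for reports in window:
        majority = active_ip(reports)
        suspects &= {n for n, ip in reports.items() if ip != majority}
    return sorted(suspects)
```

Requiring divergence across consecutive epochs keeps a node that is briefly behind during a cutover from paging anyone.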

Hour 28: drift back to baseline

Reconciliation drift in the epoch right after the rollout dropped by 1.6 percentage points. By the end of the next 24 hours it was back inside the normal 0.18-0.4% band. The client received a 4-page postmortem within 48 hours of incident close, and we calculated reward compensation and applied it in the next billing cycle.

Total reward damage: about 1.8% of expected over the epochs the incident spanned (roughly 28 hours, about five 6-hour epochs), in dollars a four-digit USDC number. Not catastrophic on an annual basis, but enough to justify the entire reconciliation calculator and the Sev-2 alert that caught it.

What went into the runbook

Four changes to the provisioning and observability regime:

Any DePIN workload with DNS-based failover on upstream dependencies is provisioned with an explicit caching policy on the resolver. Default systemd-resolved is forbidden for that class of nodes.
Resolver choice (systemd-resolved vs unbound vs dnsmasq) is now an explicit item on the provisioning checklist, not an implicit default.
Reward-drift Sev-2 triggers a DNS-cache audit ahead of other checks. If within the first 30 minutes of investigation the cohort of drifting nodes correlates geographically, the on-call's first command is resolvectl statistics and dig +trace on a handful of affected nodes.
A synthetic DNS-query check on critical endpoints is now standard observability for any DePIN fleet with DNS-based failover. One dig +trace per node per epoch, reported to inventory, alert on divergence.

The lesson

The fanciest layer in the stack (on-chain economics, settlement, consensus) behaved exactly as documented. The boring layer (libc, systemd-resolved, default cache policy) was quietly eating about 1.8% of reward, and only the reconciliation Sev-2 made it visible.

The signal had been visible in resolvectl statistics since day one of operations: cache hit rate near 100% on an endpoint that was supposed to invalidate once a minute. Nobody looked, because the node uptime view was green, and because our own reconciliation alerting was too loose (0.5% per epoch as Sev-2, with nothing that reliably caught sustained 0.3-0.5% drift). After the incident we set the Sev-3 threshold to 0.25% sustained over three consecutive epochs.

There is no magic lesson here. All of the components that broke are well-known and well-documented. The lesson is that in DePIN operations, the layer where breaks cost the most is usually not the layer most dashboards are looking at. Reconciliation between expected and actual reward is not a nice-to-have. It is the only way to notice that the protocol is paying you less than it should, and to notice within hours rather than weeks.

The XIMTRX team