by XIMTRX team

Case study: incentivized testnet, 50-node burst in 72h, top-5 operator

Anonymized case study of a burst engagement for an incentivized testnet: 50 nodes inside a 72-hour window, a top-5 finish by uptime, and the hidden AWS quota that nearly derailed the stand-up on day two.

#case-study #testnet #burst #multi-region

This case study shares as much as our NDA allows; everything else is anonymized. The client is a mid-stage protocol: a Cosmos SDK-based L1 with a custom IBC-style cross-chain settlement layer, preparing for a six-week incentivized testnet. Their internal team thought they had enough infra. Then the early-access wave hit, traffic grew 8x in a week, and they needed 50 production-grade nodes spread across 5+ regions inside 72 hours.

The contract ran eight weeks total: a 72-hour emergency stand-up, steady-state ops through the testnet, then a clean off-boarding. Headline outcome: 50 nodes live in 70 hours, and a top-5 finish by validated uptime among about 120 participating operators.

Below is the timeline. Hours are counted from Friday 18:00 UTC, when the call came in.
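For readers following the T+ offsets, the conversion to wall-clock UTC is plain arithmetic. A minimal sketch: the calendar date below is a placeholder (any Friday works), since the real date is not given here, and `wall_clock` is an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone

# Placeholder date: an arbitrary Friday, NOT the real engagement date.
T0 = datetime(2024, 1, 5, 18, 0, tzinfo=timezone.utc)

def wall_clock(hours: float) -> str:
    """Convert a T+ offset in hours to 'Weekday HH:MM UTC'."""
    return (T0 + timedelta(hours=hours)).strftime("%A %H:%M UTC")

for offset in (6, 24, 72):
    print(f"T+{offset}h = {wall_clock(offset)}")
```

With this anchor, T+6h falls on Saturday 00:00 UTC and T+24h on Saturday 18:00 UTC, matching the timestamps used in the sections that follow.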

Setup

Friday call, 18:00 UTC. Referral from another operator we had worked alongside on a previous testnet. The ask: 50 nodes across 5+ regions by Monday EOD. Spec per node: 16 vCPU, 64GB RAM, 2TB NVMe, decent bandwidth (at least 1 Gbit/s and good p95 latency to both east and west). Cosmos SDK node in a standard Docker wrapper, plus a metrics sidecar and a custom relayer for their IBC variant. Nothing exotic about the workload per node; the exotic part was the count and the deadline.

We had about 14 nodes of free capacity at the moment of the call, sitting in existing colos: slots in Frankfurt, Amsterdam, and Singapore where the iron was already racked and waiting for a workload. The other 36 had to come from somewhere.

T+2h: the plan

Sat down to do the supply-window math by 20:00 UTC. Came out to three delivery streams.

Stream one: 14 nodes at existing colo partners (Frankfurt 6, Amsterdam 4, Singapore 4). Physical delivery is on the provider side, with a typical ordering SLA at these facilities of around 24 hours from confirmation. We keep pre-baked images and pre-agreed specs with them, so our own layer (Terraform apply, validator client, monitoring, runbook rollout) adds 30-45 minutes per node once the host is physically ready, and it parallelizes across however many engineers are on shift.

Stream two: 8 bare-metal-on-demand. Latitude.sh gives us 6 machines across their LA and Dallas POPs with a 90-minute provision guarantee (we keep pre-baked images with them), and OpenMetal gives us 2 in their Amsterdam cluster. This is the most expensive layer per unit, but also the most predictable one: with bare-metal-on-demand you are essentially paying for priority.

Stream three: 28 cloud burst. AWS us-east-1 + us-west-2 (15 nodes), Hetzner Cloud FRA (6), OVH GRA (4), GCP asia-southeast1 (3). Cloud burst exists for exactly these situations. The unit economics are bad, but it gives you three things that matter more than cost in this window: speed, regional flexibility, and a painless way to scale back down after the testnet ends.

Plan signed off with the client at 20:00 UTC. Terraform modules for all three streams were already in place; we just had to parameterize them for the client's Cosmos image and pre-pull the custom relayer binaries.
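The three streams can be written down as data and sanity-checked before anything is ordered. This is an illustrative sketch, not the tooling used in the engagement: the tuple layout and the `nodes_per_stream` helper are assumptions, while the counts are the ones listed in the streams above.

```python
from collections import Counter

# (stream, provider, location, nodes) as listed in the three streams above.
# AWS stays one row because the split between its two regions is not stated.
PLAN = [
    ("colo", "colo partner", "Frankfurt", 6),
    ("colo", "colo partner", "Amsterdam", 4),
    ("colo", "colo partner", "Singapore", 4),
    ("bare-metal", "Latitude.sh", "LA + Dallas", 6),
    ("bare-metal", "OpenMetal", "Amsterdam", 2),
    ("cloud", "AWS", "us-east-1 + us-west-2", 15),
    ("cloud", "Hetzner Cloud", "FRA", 6),
    ("cloud", "OVH", "GRA", 4),
    ("cloud", "GCP", "asia-southeast1", 3),
]

def nodes_per_stream(plan):
    """Sum node counts per delivery stream."""
    totals = Counter()
    for stream, _provider, _location, nodes in plan:
        totals[stream] += nodes
    return totals

totals = nodes_per_stream(PLAN)
print(dict(totals), "total:", sum(totals.values()))
```

The per-stream sums come out to 14, 8 and 28, which add up to the 50 nodes the client asked for.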

T+6 to T+30: bringing it up

From midnight UTC onward we started rolling Terraform applies in parallel across three engineers on shift. The client's Cosmos node is a composite image: a base node-image from their public registry, two sidecar containers (metrics and relayer), and a set of custom configs delivered configmap-style. Pre-pulled to every target region at T+3h so we would not hit registry rate-limits at peak.

Key ceremony with the client ran on our standard playbook: HSM-backed, we hold the infrastructure, the client signs locally, validator keys never leave their perimeter. For 50 nodes that is 50 separate ceremonies with routing verification at each endpoint, which takes about three hours in total and runs in parallel with the stand-up itself.

Monitoring and alerts went up as part of the apply flow: Prometheus stack, blackbox probes on peer ports, alerts going to their Discord and our PagerDuty.

By Saturday 18:00 UTC (T+24h) the first 22 nodes were in production and syncing with the testnet. Existing colo plus bare-metal-on-demand had gone exactly to plan.

T+28h: us-east-1 stalled

Next we moved into AWS us-east-1 to bring up cloud-stream nodes 35-50. The first 40 nodes across all clouds had provisioned cleanly. Node 41: EC2 console returned InsufficientInstanceCapacity. Node 42: same. Only for our specific instance type (we use m6id for the NVMe + memory balance).

First hypothesis: classic regional capacity issue. AWS periodically drains popular types in us-east-1, usually clears with an AZ swap or a 30-60 minute wait. Switched AZ, no help. Waited 90 minutes, no help.
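The AZ swap described above is easy to automate as a fallback loop. A minimal sketch under stated assumptions: `CapacityError` and the injected `launch` callable are hypothetical stand-ins so the example runs without AWS access. With boto3 the equivalent signal would be a `botocore.exceptions.ClientError` whose error code is `InsufficientInstanceCapacity`.

```python
class CapacityError(Exception):
    """Stand-in for an EC2 InsufficientInstanceCapacity error (hypothetical type)."""

def launch_with_az_fallback(launch, zones):
    """Try each availability zone in order. Return (zone, result) for the first
    zone where launch(zone) succeeds; raise if every zone is out of capacity."""
    exhausted = []
    for zone in zones:
        try:
            return zone, launch(zone)
        except CapacityError:
            exhausted.append(zone)  # no capacity here, try the next zone
    raise CapacityError(f"no capacity in any of: {', '.join(exhausted)}")

# Usage with a fake launcher: only the third zone has capacity.
def fake_launch(zone):
    if zone != "us-east-1c":
        raise CapacityError(zone)
    return f"instance-in-{zone}"

print(launch_with_az_fallback(fake_launch, ["us-east-1a", "us-east-1b", "us-east-1c"]))
```

When the loop exhausts every zone, as it effectively did here, the failure is unlikely to be an ordinary per-AZ shortage, and that is itself a useful signal.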

And that is when the second symptom started. Existing nodes in the same AWS account (the 13 we had brought up in us-east-1 earlier) began losing connectivity to about 30 percent of testnet peer IPs. TCP connects were timing out, while the same peer IPs answered cleanly from other regions and other AWS accounts.

At this moment the two symptoms (capacity and peer-connectivity) did not yet look related. A capacity blip plus a coincidental testnet-side network glitch is a plausible hypothesis, and that is what we were testing first.

T+30 to T+33: what we ruled out

Layered triage, fast:

  • Peer IPs are not blocking us. Spun up a controlled probe from Hetzner FRA: the same peer IPs that were timing out from us-east-1 answer in 40-60ms. Symptom is local to our us-east-1 account.
  • Security group and NACL. Untouched since the stand-up 24 hours earlier, working before. Manual diff: rules unchanged.
  • VPC NAT capacity. Single-AZ NAT gateway, traffic well within the gateway's limits according to CloudWatch metrics.
  • Subnet IP exhaustion. /22 subnet, about 30 percent of addresses in use.
  • AWS service health dashboard. All green.
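The cross-vantage check from the first triage item comes down to timing plain TCP connects against the same peer list from two places and comparing failure ratios. A minimal sketch using only the standard library; the function names and the 3-second timeout are illustrative assumptions, not the probe we actually ran.

```python
import socket
import time

def tcp_probe(host, port, timeout=3.0):
    """Attempt one TCP connect; return (ok, elapsed_ms)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # covers timeouts, refusals and unreachable networks
        ok = False
    return ok, (time.monotonic() - start) * 1000.0

def failure_ratio(peers, timeout=3.0):
    """Fraction of (host, port) peers that did not accept a TCP connect."""
    if not peers:
        return 0.0
    failed = sum(1 for host, port in peers if not tcp_probe(host, port, timeout)[0])
    return failed / len(peers)
```

Run `failure_ratio(peers)` with the same peer list from a host in the affected account and from a host elsewhere. A ratio near 0.3 on one side and 0.0 on the other is the pattern described above: the peers are fine and the problem is local to one vantage point.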

By the third hour it was clear the symptom did not match any of the standard suspects. And that was the fork in the road. We could keep digging hypotheses (each taking about an hour), or accept the symptom was strange and escalate.

T+33h: opening a P1 support ticket

Opened a P1 with AWS Support at T+33h, described both symptoms (capacity on new instances and connectivity loss on existing ones, all in one account, one region). The AWS TAM picked the ticket up within an hour and called back at T+34h with a hypothesis.

The hypothesis turned out to be the diagnosis. Two weeks earlier he had worked with another customer on a similar symptom. Some enterprise accounts at AWS get configured by default with a hidden per-account quota on egress to specific peering networks. This is not the public Data Transfer Out quota. It is a separate internal quota config that does not surface in aws service-quotas list-service-quotas and is not visible through Trusted Advisor. The quota silently rate-limits new TCP connects to specific AS numbers once cumulative traffic to them crosses a 24-hour threshold.
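For contrast, this is roughly what the public side looks like when queried programmatically. A hedged sketch assuming boto3's Service Quotas client and its `list_service_quotas` paginator; `find_quotas` and `main` are illustrative names, and by the account above a quota of the kind the TAM described would simply not appear in this output.

```python
def find_quotas(client, service_code, needle):
    """Return (QuotaName, Value) pairs from the public Service Quotas API
    whose name contains `needle` (case-insensitive)."""
    paginator = client.get_paginator("list_service_quotas")
    matches = []
    for page in paginator.paginate(ServiceCode=service_code):
        for quota in page.get("Quotas", []):
            if needle.lower() in quota["QuotaName"].lower():
                matches.append((quota["QuotaName"], quota["Value"]))
    return matches

def main():
    # Requires boto3 plus AWS credentials and a default region; not run here.
    import boto3
    client = boto3.client("service-quotas")
    for name, value in find_quotas(client, "ec2", "on-demand"):
        print(f"{value:>10.0f}  {name}")
```

An empty or unremarkable result from a listing like this is therefore not proof that no limit applies to the account, which is why the escalation path mattered.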

The testnet our client was running had a noticeable concentration of peers in one of those AS networks. Our aggregate traffic crossed the threshold around T+27h. From that point AWS silently dropped new connects to that AS and, as a second side effect of the same quota, flagged our account as "high risk" for capacity allocation, which is what produced the InsufficientInstanceCapacity errors.
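In hindsight, one way to spot this pattern earlier is to group connect failures by the origin AS of each peer and look for a cluster. The sketch below is an illustration, not something we ran during the incident: the IP-to-AS mapping is assumed to come from an external lookup, and the sample data uses documentation IP ranges and private-use AS numbers.

```python
from collections import defaultdict

def failure_rate_by_asn(results, asn_of):
    """results: {peer_ip: connect_ok}; asn_of: {peer_ip: AS number}.
    Returns {asn: (failed, total)} so clusters of failures stand out."""
    stats = defaultdict(lambda: [0, 0])
    for ip, ok in results.items():
        bucket = stats[asn_of.get(ip, "unknown")]
        bucket[1] += 1          # total probes for this AS
        if not ok:
            bucket[0] += 1      # failed probes for this AS
    return {asn: tuple(counts) for asn, counts in stats.items()}

# Hypothetical data: documentation IP ranges, private-use AS numbers.
results = {"192.0.2.10": False, "192.0.2.11": False,
           "198.51.100.7": True, "203.0.113.5": True}
asn_of = {"192.0.2.10": 64512, "192.0.2.11": 64512,
          "198.51.100.7": 64513, "203.0.113.5": 64514}

for asn, (failed, total) in sorted(failure_rate_by_asn(results, asn_of).items()):
    print(f"AS{asn}: {failed}/{total} connects failed")
```

If every failure lands in a single AS while peers elsewhere connect cleanly, the cause is more likely a policy tied to that network than a random glitch.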

The TAM verified the account and lifted the quota, and at T+34h 40m both symptoms cleared: new EC2 instances started provisioning, and existing nodes recovered peer connectivity within five minutes.

T+30 to T+38: parallel actions

We did not just sit and wait on AWS. Starting at T+30h, in parallel with the triage and then the support ticket, we opened the second and third contingency lanes and re-scoped the workload so the outcome no longer depended on the AWS escalation.

  • 8 of the remaining 15 us-east-1 nodes moved to GCP asia-southeast1 + Hetzner Cloud FRA. Terraform plan was ready in 18 minutes (the modules were already there, we only had to swap the provider block and region parameter), apply ran in 47.
  • 7 us-east-1 nodes stayed in the plan. The bet was that support would either confirm the quota within the next 3-4 hours or rule it out, and either answer would tell us whether to migrate the rest.
  • In parallel, we prepped an evacuation plan for the 13 existing us-east-1 nodes in case AWS refused the quota lift: a cutover script to re-sign through standby infra in GCP, drain-and-reload without losing signing state. Did not need to run it.

Once the TAM lifted the quota at T+34h 40m, the remaining 7 us-east-1 nodes provisioned within an hour. All 50 nodes were in production by T+70h, Monday 16:00 UTC, 8 hours before the EOD deadline.

What went into the runbook

Five entries got added to our internal runbook from this one incident.

One: any burst of more than 30 nodes into a single cloud account inside a window of less than 48 hours gets a proactive support ticket BEFORE we hit the limit, not after. It is cheaper to open a ticket that says "planning a burst of X nodes, please verify quotas on this account" and have confirmation, than to triage a hidden quota at T+30h.

Two: regional diversity targets. No single cloud account holds more than 15 nodes for any burst engagement. At 50 nodes that automatically means at least four providers, which by construction hedges both against capacity issues and against account-level faults like the one we caught.
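The arithmetic behind rule two, and a pre-flight check that enforces it, fits in a few lines. A sketch with assumed names (`min_accounts`, `accounts_over_cap`) and made-up account labels; only the 15-node cap comes from the rule itself.

```python
import math

MAX_NODES_PER_ACCOUNT = 15  # rule two

def min_accounts(total_nodes, cap=MAX_NODES_PER_ACCOUNT):
    """Smallest number of cloud accounts that can hold total_nodes under the cap."""
    return math.ceil(total_nodes / cap)

def accounts_over_cap(nodes_by_account, cap=MAX_NODES_PER_ACCOUNT):
    """Names of accounts whose planned node count breaks the cap."""
    return sorted(name for name, nodes in nodes_by_account.items() if nodes > cap)

print(min_accounts(50))  # 4, since three accounts hold at most 45 nodes

# Hypothetical plan: one account is overloaded and should fail the check.
print(accounts_over_cap({"aws-main": 28, "hetzner": 6, "ovh": 4, "gcp": 3}))
```

Running the check against a draft plan before sign-off turns the rule from a convention into a gate.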

Three: AWS hidden egress quotas are now a known unknown. Every new client that uses AWS in production gets a quota-inspection support ticket filed in week one of setup. The ticket explicitly asks the TAM to enumerate all hidden quotas on the account, including the ones that do not show in public APIs. That will not catch everything (AWS is not obligated to surface everything), but it removes the most common surprises.

Four: pre-pull custom Docker images to every target region before the key ceremony. We were already doing this, but it used to be an implicit step in the lead engineer's checklist. It is now an explicit gate: ceremony does not start until pre-pull is confirmed in every region.

Five: emergency move-and-restart playbook for migrating a node mid-testnet without losing rewards. We had a generic migration runbook before this case, but it had no tight timing budget for the testnet scenario, where downtime is critical. It is formalized now: drain, snapshot signing state, restore on the target infra, verify peer handshake, all inside a rolling window where missed blocks stay below the slashing threshold.
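The timing constraint in that playbook can be estimated up front. A rough sketch, assuming the chain uses Cosmos SDK-style downtime parameters (a signed-blocks window and a minimum signed fraction); the parameter values below are hypothetical, not the client's, and the rounding is deliberately conservative.

```python
import math
from fractions import Fraction

def downtime_budget(signed_blocks_window, min_signed_per_window,
                    block_time_s, already_missed=0):
    """Return (blocks, seconds) a validator can still miss inside the window.
    min_signed_per_window is a decimal string such as "0.5" to avoid float error."""
    min_signed = math.ceil(signed_blocks_window * Fraction(min_signed_per_window))
    max_missed = signed_blocks_window - min_signed
    remaining = max(0, max_missed - already_missed)
    return remaining, remaining * block_time_s

# Hypothetical chain parameters: 10,000-block window, 50% minimum signed, 6 s blocks.
blocks, seconds = downtime_budget(10_000, "0.5", 6, already_missed=1_200)
print(f"{blocks} blocks, about {seconds / 60:.0f} minutes")
```

The migration steps (drain, snapshot, restore, handshake) then need to fit inside that remaining budget with margin, since blocks missed earlier in the window still count against it.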

The lessons

One: cloud "capacity" is a marketing term. Real capacity is whatever your specific account is allowed to use today, and that number can differ from yesterday and from the public docs. Hidden quotas exist, and some of them you only see through a P1 ticket.

Two: geographic spread is not just an uptime hedge. It is an account-level fault-isolation hedge. If all 50 nodes had lived in one AWS account, no amount of effort would have got us inside 72 hours. Four cloud providers, two bare-metal vendors and three colos gave us the option to reroute mid-incident without waiting on AWS to make a decision.

Three: a P1 in the first 30 minutes of a weird symptom is almost always the right move, even when you have no hypothesis yet. The TAM who helped us recognized the symptom on the spot because he had seen it two weeks earlier. We could have spent four hours rediscovering his hypothesis on our own; we got it in one hour because we escalated before we had finished thinking.

50 nodes in 70 hours happened not because the first lane of the plan worked. It happened because the playbook kept three contingency lanes ready.

The XIMTRX team
