by XIMTRX team

Failover that can't double-sign

The scariest thing in validator ops is not a node going down; it is two nodes that both believe they are the active signer. We cover why ordinary failover is dangerous and how we build failover that cannot double-sign.

#validators #failover #slashing #operations #web3

When people are sold "high availability for a validator," they picture a node going down and a second server taking over in seconds. The picture is pleasant and dangerous. For a validator the worst case is not "the node fell over" but "two nodes both believe they are the active signer." The first signs a block, the second signs the same slot with the same validator key, and the protocol sees two conflicting signatures from one validator and slashes the stake. That is a double-sign, and ordinary failover does not just fail to protect against it, it leads straight to it.

We run validator infrastructure for clients, and this is the problem on every contract. Below we cover why naive switching gets slashed, and what a failover that cannot sign twice looks like.

Split-brain: why naive failover kills the stake

The classic HA setup looks like this: the primary node signs, a standby sits next to it and monitors the primary, and when the heartbeat disappears the standby takes the key and starts signing itself. It works right up to the first network break between them.

The link drops but the primary is alive: it keeps signing, because from its own side everything is fine. The standby sees no heartbeat, decides the primary is dead, and starts signing too. Now two nodes sign the same slots with different states. That is split-brain, and for a validator it is not a service degradation but a guaranteed slashing incident.

The root of the problem is that the standby decides based on what it cannot see. "I am not getting a heartbeat" and "the primary is dead" are not the same thing. A broken link between two nodes looks identical to each of them: the other side went quiet. Until the standby can tell "the primary died" apart from "I lost the link to a live primary," any automatic promote is a roulette wheel with the client's stake on the table.
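The ambiguity fits in a few lines of Python. This is a toy model, not production logic: the function names are hypothetical, and it assumes the standby's only input is whether a heartbeat arrives.

```python
from itertools import product


def heartbeat_seen(primary_alive: bool, link_up: bool) -> bool:
    """The standby gets a heartbeat only if the primary is alive AND the link works."""
    return primary_alive and link_up


def naive_promote(seen: bool) -> bool:
    """Naive rule: no heartbeat means promote."""
    return not seen


def active_signers(primary_alive: bool, link_up: bool) -> int:
    """Count the nodes that end up signing under the naive rule."""
    promoted = naive_promote(heartbeat_seen(primary_alive, link_up))
    return int(primary_alive) + int(promoted)


if __name__ == "__main__":
    # Walk all four combinations of primary state and link state.
    for alive, link in product([True, False], repeat=2):
        print(f"primary_alive={alive!s:5} link_up={link!s:5} "
              f"signers={active_signers(alive, link)}")
```

A dead primary and a live primary behind a broken link both produce the same input for the standby, which is no heartbeat. In the second case the naive rule leaves two active signers.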

STONITH: until the old one is dead, the new one does not sign

The only reliable way to break that uncertainty is to stop guessing the primary's state and forcibly take it down before promoting. The discipline is called STONITH: Shoot The Other Node In The Head. Before the standby takes the key, the primary must be guaranteed dead, not presumed dead.

"Guaranteed" means an external action that does not depend on how the primary feels about itself:

  • Power fencing. Cut the primary's power through a managed PDU or IPMI, wait for confirmation that the power is actually off, and only then promote the standby.
  • Network fencing. Close the primary's port on the switch, physically cutting it off the signing network, and get confirmation from the switch.

The key word is confirmation. The standby does not promote "5 seconds after the heartbeat disappears." It promotes after the fencing path returns "primary powered off" or "primary port closed." No confirmation, no promote: better to miss a few attestation slots and lose pennies to inactivity than to sign twice and lose the stake.
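A minimal sketch of the "no confirmation, no promote" gate. All names here (`FenceResult`, `try_promote`, the two callbacks) are hypothetical. In a real setup `fence_primary` would wrap whatever your PDU, IPMI, or switch tooling exposes.

```python
from enum import Enum
from typing import Callable


class FenceResult(Enum):
    CONFIRMED = "confirmed"      # device reported the primary is off / port closed
    FAILED = "failed"            # device answered, but the action did not succeed
    NO_RESPONSE = "no_response"  # device did not answer at all


def try_promote(fence_primary: Callable[[], FenceResult],
                start_signing: Callable[[], None]) -> bool:
    """Promote the standby only if fencing of the primary is confirmed.

    Returns True if the standby started signing, False otherwise.
    """
    try:
        result = fence_primary()
    except Exception:
        # An error on the fencing path is treated the same as no confirmation.
        result = FenceResult.NO_RESPONSE

    if result is not FenceResult.CONFIRMED:
        # No confirmation, no promote: missed slots are cheaper than a double-sign.
        return False

    start_signing()
    return True
```

Note that there is no timeout branch that promotes anyway. Every outcome other than an explicit confirmation leaves the standby passive.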

VLAN fencing and a witness instead of guessing

Power fencing is great when you have managed access to power. In the cloud and in some colocation facilities you do not, and then the main primitive becomes network fencing at the VLAN level plus an external witness.

The witness is a third point that does not take part in signing and sits in a separate network. The promote decision is made not by the standby alone but by a quorum: standby plus witness. If the standby has lost contact with the primary but the witness still sees the primary, there is no promote, because the primary is alive. A promote happens only when both standby and witness agree the primary is unreachable and the switch has confirmed the fencing. The third vote removes the exact situation where one node judges another's life by silence on a link.
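The quorum rule can be written as a single predicate. This is a sketch under assumptions: `Observation` and `may_promote` are hypothetical names, and treating an unreachable witness as "no quorum" is our reading of the rule above, not a detail the text spells out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    standby_sees_primary: bool
    # None means the standby could not reach the witness at all.
    witness_sees_primary: Optional[bool]
    fencing_confirmed: bool


def may_promote(obs: Observation) -> bool:
    """Allow a promote only if standby AND witness both report the primary
    unreachable, and the switch has confirmed the fencing."""
    if obs.standby_sees_primary:
        return False  # primary is visibly alive
    if obs.witness_sees_primary is not False:
        return False  # witness sees the primary, or there is no witness vote
    return obs.fencing_confirmed
```

For example, `may_promote(Observation(False, True, True))` is `False`: the standby lost the link, but the witness still sees a live primary, so nothing is promoted.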

This is more expensive and slower than naive heartbeat failover. That is the price of switching that does not cost the client its stake.

What we do not do

Half the discipline is refusing the convenient but dangerous patterns:

  • Automatic promote on a timer. No "heartbeat gone for 10 seconds, promote." Promote only after confirmed fencing.
  • The same key on two hot nodes without fencing. A hot standby with the same key and promote automation is a loaded double-sign waiting for a network break.
  • Failover logic in the same network as signing. The promote-decision path and the witness live in a separate network, otherwise a single network incident takes down both signing and arbitration at once.

These three rules are boring, and they save the stake more often than any latency optimization.

What it looks like for a client

On our own test bed we drill the fencing scenarios before they go onto a client's validator: we cut links, kill the primary mid-signing, and check that the standby does not promote without confirmation. On a client contract this turns into a runbook: which PDU or switch we pull, where the witness sits, how many slots we are willing to miss before we escalate to the on-call engineer.

If you run a slashing-sensitive validator and want failover that would rather miss a few slots than sign twice, that is part of what we run in web3 operations and cover through operate. Want us to take your current HA setup apart for split-brain risk? Get in touch.
