by XIMTRX team

Slashing is not about uptime

99.95% uptime is the metric everyone quotes and the wrong one for slashing risk. Uptime and slashing risk are nearly orthogonal. We cover where the danger actually lives and what we monitor instead of an uptime percentage.

#validators #slashing #observability #operations #web3

"We have 99.95 percent uptime" is the first thing most operators say, and for slashing risk it is a nearly useless number. Uptime and slashing risk sit on different planes: you can have excellent uptime and get slashed, or miss hours and lose almost nothing. We run validators for clients and count differently. Below is where the danger actually lives.

Two different slashings

First, separate what the word "slashing" lumps together. Protocols punish two different things, and confusing them is expensive.

Downtime penalty (inactivity). The validator failed to do its job: it missed an attestation or did not propose a block. Usually this means a small penalty or an inactivity leak that accrues slowly. Annoying, but it costs some coins rather than causing a catastrophe. This is the only category that uptime tells you anything about.

Equivocation (double-sign). The validator signed two conflicting messages for one slot. That is no longer "underperformed" but "signed a contradiction," and the protocol hits it hard: a large slash plus ejection from the set. Uptime has nothing to do with this. You can have 100 percent uptime and get slashed for equivocation in one second of a bad failover.

All the serious slashing exposure lives in the second category, and uptime does not measure it at all.
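To make the second category concrete, here is a minimal sketch of the check that separates a harmless re-sign from an equivocation. The `SignGuard` class, the slot numbers, and the root strings are all hypothetical, and real slashing protection covers more cases and persists its record to disk. The sketch only shows the core rule: one slot, one message.

```python
from dataclasses import dataclass, field


@dataclass
class SignGuard:
    """Hypothetical in-memory record of the message root signed per slot."""
    signed: dict = field(default_factory=dict)

    def may_sign(self, slot: int, root: str) -> bool:
        """Allow a signature only if it cannot conflict with an earlier one."""
        previous = self.signed.get(slot)
        if previous is None:
            self.signed[slot] = root  # first signature for this slot: record it
            return True
        # Re-signing the identical message is harmless;
        # a different root for the same slot would be equivocation.
        return previous == root


guard = SignGuard()
print(guard.may_sign(100, "0xaaa"))  # True: first message for slot 100
print(guard.may_sign(100, "0xaaa"))  # True: same message again
print(guard.may_sign(100, "0xbbb"))  # False: conflicting message, refuse
```

Note that a guard like this only helps if every instance holding the key consults the same record. Two machines with separate records will each see a "first" signature, which is why the risk concentrates in failover rather than in normal operation.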

Why uptime hides the tail

Even where uptime is relevant, the percentage hides the distribution. The same 99.95 on two nodes can mean completely different things. Over a 30-day month, 99.95 percent allows roughly 22 minutes of downtime. One node spent that budget in a single outage and came back: visible, understood, fixable. The other caught micro-jitter two hundred times in the month and missed an attestation here and there, accumulating the same total downtime: invisible in the percentage, but often the symptom of a creeping problem that tomorrow spills into equivocation on a nervous failover.

Slashing happens on the tail of the distribution, not at the mean. So we watch p99 attestation latency and the shape of the miss distribution, not a single averaged figure that smooths them away.
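A small illustration of why the mean is the wrong lens. The latency numbers below are invented for the example, and `percentile` is a plain nearest-rank helper written for this sketch, not a function from any monitoring library. Both hypothetical nodes have the same average latency, but only one has a tail.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p% of n
    return ordered[rank - 1]


# Hypothetical attestation latencies in milliseconds, 100 samples per node.
steady = [300] * 100
jittery = [200] * 95 + [2200] * 5  # usually faster, with a slow tail

for name, samples in (("steady", steady), ("jittery", jittery)):
    mean = sum(samples) / len(samples)
    print(name, "mean:", mean, "p99:", percentile(samples, 99))
# steady mean: 300.0 p99: 300
# jittery mean: 300.0 p99: 2200
```

An alert on the average would treat these two nodes as identical. An alert on p99 separates them immediately.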

Correlated failures

The real catastrophe is not one node going down but many nodes going down for one reason at once. If a client's whole fleet sits with one provider, on one consensus-client version, behind one upstream, then a single bug, a single coordinated outage, a single bad client release takes everything down together. Last month's uptime says nothing about this: it measures each node's past separately, not the shared blast radius.

Worse, it is the correlated failure that creates the conditions for equivocation. A region drops, automation starts bringing up the reserve elsewhere, and the first server is actually alive and will return in a minute: that is where the double-sign is born. So a correlated failure is not only a downtime risk, it is the trigger for the most expensive slash.
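The failover trap can be reduced to one distinction: "unreachable" is not the same as "unable to sign". The sketch below is a simplified decision rule under that assumption. The state names and the `failover_action` function are hypothetical, and how fencing is actually confirmed (revoking key access, powering off the host, and so on) depends on the setup.

```python
from enum import Enum


class PrimaryState(Enum):
    HEALTHY = "healthy"
    UNREACHABLE = "unreachable"  # we cannot see it, but it may still be signing
    FENCED = "fenced"            # confirmed unable to sign


def failover_action(primary: PrimaryState) -> str:
    """Decide what automation may do with the standby."""
    if primary is PrimaryState.HEALTHY:
        return "none"
    if primary is PrimaryState.UNREACHABLE:
        # Promoting now risks two live signers once the primary returns.
        return "fence-first"
    return "promote"


for state in PrimaryState:
    print(state.value, "->", failover_action(state))
# healthy -> none
# unreachable -> fence-first
# fenced -> promote
```

The design choice is to accept extra downtime while fencing is confirmed. Missed attestations cost a small penalty, while a standby promoted next to a still-living primary can produce the double-sign.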

What we alert on instead of uptime

We do have uptime on the dashboard, but it is a derived metric, not a primary signal. What we actually alert on is different:

  • p99 attestation latency, not the average. The tail is the early symptom.
  • Any sign of a double-sign or key conflict as a Sev-1 with immediate escalation. This is the category half our runbooks exist for.
  • Fleet diversity: how many unique providers, ASNs, consensus-client versions. Falling diversity is rising correlated risk, and you need to see it before the incident.
  • Failover behavior: did fencing fire, was a standby promoted without confirmation. There is a separate breakdown in the post on failover that can't double-sign.
  • Inactivity penalties as a trend, not a one-off miss: a slowly growing leak is a sign that something is degrading systematically.
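
Of the signals above, fleet diversity is the easiest to compute from an inventory. The sketch below assumes a hypothetical list of node records with `provider`, `asn`, and `client` fields. The names and values are made up, and the ASNs come from the range reserved for documentation. For each dimension it reports how many distinct values exist and what share of the fleet sits on the most common one.

```python
from collections import Counter

# Hypothetical inventory: four nodes, all fields invented for illustration.
fleet = [
    {"provider": "prov-a", "asn": 64500, "client": "client-x"},
    {"provider": "prov-a", "asn": 64500, "client": "client-x"},
    {"provider": "prov-a", "asn": 64500, "client": "client-x"},
    {"provider": "prov-b", "asn": 64501, "client": "client-x"},
]


def concentration(nodes, key):
    """Return (distinct values, most common value, its share of the fleet)."""
    counts = Counter(node[key] for node in nodes)
    value, count = counts.most_common(1)[0]
    return len(counts), value, count / len(nodes)


for key in ("provider", "asn", "client"):
    print(key, concentration(fleet, key))
# provider (2, 'prov-a', 0.75)
# asn (2, 64500, 0.75)
# client (1, 'client-x', 1.0)
```

In this made-up fleet a single bad release of `client-x` would hit every node at once, whatever last month's uptime was. Tracking the top share over time is one way to see diversity falling before an incident does.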

None of these signals folds into a pretty percentage for a landing page. That is why they rarely get shown, and they are exactly where slashing happens.

What it looks like for a client

On our test range we deliberately drive nodes into correlated and tail scenarios to see how monitoring and failover behave before the same thing happens to a client. Under a contract this becomes a set of alerts and runbooks tuned for equivocation and correlated risk, not for "the process is alive."

If you want an operator that measures slashing risk where it actually lives rather than reporting uptime, that is part of what we run in web3 and cover through operate. Want us to look at your real slashing exposure? Get in touch.
