Validator observability: what we alert on, and what we don't

Typical validator monitoring looks like this: an alert for "the process died," an alert for CPU, an alert for disk, and an on-call engineer drowning in notifications, half of which mean nothing. Meanwhile, what actually loses money slips by: the node shows green while rewards or stake leak away. We run observability for clients' validators, and the main work here is not adding more metrics but choosing what NOT to alert on. Below is that discipline.

Alert on the symptom, not the cause

The classic mistake is alerting on causes: high CPU, disk filling up, a process restart. The list of possible causes is endless, most are harmless on their own, and each such alert is a false alarm more often than a real one. The on-call quickly learns to ignore them, and at that point they are useless.

We alert on symptoms, on what directly hurts the client: the validator misses an attestation, tail latency on responses grows, the share of passed checks drops. A symptom does not lie: if it fired, the client is already in trouble or about to be. High CPU then lives on a dashboard for diagnosis, but does not wake anyone at night on its own. The on-call works from the symptom to the cause, not the other way around.
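A minimal sketch of that split, assuming each signal is tagged by hand as a symptom or a cause. The Signal type, the route function, and the signal names are hypothetical and only illustrate the rule; this is not our production routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    kind: str  # "symptom" (client-visible harm) or "cause" (diagnostic only)
    firing: bool


def route(signals):
    """Return (pages, dashboard_only) for the signals that are currently firing."""
    pages = [s.name for s in signals if s.firing and s.kind == "symptom"]
    dashboard_only = [s.name for s in signals if s.firing and s.kind == "cause"]
    return pages, dashboard_only


# Hypothetical signal names, chosen only for the example.
signals = [
    Signal("missed_attestations", "symptom", firing=True),
    Signal("high_cpu", "cause", firing=True),
    Signal("disk_filling", "cause", firing=False),
]
pages, dashboard_only = route(signals)
print("page:", pages)
print("dashboard only:", dashboard_only)
```

Only the symptom reaches the pager; the firing cause stays visible for whoever is diagnosing the page.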

Which SLIs actually matter

The set of signals on a validator is short, and it is not about hardware:

Attestation rate and timeliness. The baseline measure of useful work: is the validator doing what it is paid for, and on time.
p99 latency on attestations and on responses, not the average. Why the tail specifically matters is covered in the post on how slashing is not about uptime.
Signs of equivocation and key conflict. Top priority, immediate escalation: this is the category that costs the stake.
Diversity and blast radius. Variety of providers, ASNs, client versions: correlated risk has to be seen before the incident, not after.
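The first, second, and fourth signals can be sketched in a few lines. This is an illustration under stated assumptions, not our implementation: the function names, the sample data, and the choice of a nearest-rank percentile are ours for the example. In practice these numbers would come from the metrics store rather than from in-memory lists.

```python
import math
from collections import Counter
from statistics import fmean


def attestation_rate(included, expected):
    """Share of expected attestations that were actually included."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return included / expected


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def largest_share(labels):
    """Fraction of validators sharing the most common label (provider, ASN, client version)."""
    if not labels:
        raise ValueError("no labels")
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical data: 97 fast responses and 3 slow ones, in seconds.
latencies = [0.2] * 97 + [4.0] * 3
print("mean:", round(fmean(latencies), 3))  # the average hides the slow responses
print("p99:", percentile(latencies, 99))    # the tail shows them
print("rate:", attestation_rate(297, 300))
print("largest provider share:", largest_share(["a", "a", "b", "c"]))
```

With this sample the mean stays near 0.3 s while p99 is 4.0 s, which is the reason the average is not on the list.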

"Uptime percentage" is not a primary signal here but a derivative. It is in the report, but it is not what wakes the on-call.

An alert without a runbook is not an alert

An alert the on-call does not know what to do with is just panic at 3 a.m. So we have a rule: every alert that wakes a human links to a runbook that says what to check and what steps to take. If no runbook can be written for a symptom, either the symptom was chosen badly or we do not understand the system, and both need fixing before the alert ships to production.

This is what keeps the number of waking alerts small: to add a new one you must write its runbook, and that is a natural filter against "just in case" alerts.
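One way to enforce the rule is a check that runs before alert rules ship. The sketch below assumes the rules are already loaded as dictionaries with the usual alert, labels, and annotations fields, that paging alerts carry a severity label equal to "page", and that the runbook link lives in a runbook_url annotation. Those label and annotation names are conventions we assume for the example, not something the monitoring stack requires, and the rule names are hypothetical.

```python
def missing_runbooks(rules, paging_severity="page"):
    """Return names of paging alerts that have no runbook link."""
    offenders = []
    for rule in rules:
        labels = rule.get("labels", {})
        annotations = rule.get("annotations", {})
        if labels.get("severity") != paging_severity:
            continue  # non-paging alerts are outside the rule
        if not annotations.get("runbook_url", "").strip():
            offenders.append(rule["alert"])
    return offenders


# Hypothetical rules, as they might look after parsing a rules file.
rules = [
    {
        "alert": "MissedAttestations",
        "labels": {"severity": "page"},
        "annotations": {"runbook_url": "https://runbooks.example/missed-attestations"},
    },
    {"alert": "KeyConflictSuspected", "labels": {"severity": "page"}, "annotations": {}},
    {"alert": "HighCPU", "labels": {"severity": "info"}, "annotations": {}},
]

offenders = missing_runbooks(rules)
if offenders:
    print("paging alerts without a runbook:", offenders)
```

Run in CI, a non-empty result would block the change, which makes "write the runbook first" a mechanical gate rather than a habit.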

Noise kills the on-call

Alert fatigue is not about comfort, it is about safety. An on-call pulled out of bed ten times overnight for trivia reacts slowly to the eleventh, real one, or waves it off. So noise is not a cosmetic problem but a direct risk of missing an equivocation.

We keep the signal-to-noise ratio high on purpose: every false night alert is a bug in the monitoring to be fixed, not tolerated. Ten precise alerts a month, each of which the on-call takes seriously, beat a hundred they have stopped looking at.
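Treating false pages as bugs implies counting them. A minimal sketch, assuming the on-call marks each page after the fact as actionable or not; the record format, function names, and alert names are hypothetical.

```python
from collections import Counter


def page_precision(pages):
    """Share of pages that were actionable; None if there were no pages."""
    if not pages:
        return None
    return sum(1 for _, actionable in pages if actionable) / len(pages)


def noisy_alerts(pages):
    """Alerts ranked by how many non-actionable pages they produced."""
    return Counter(name for name, actionable in pages if not actionable).most_common()


# Hypothetical month of pages: (alert name, was it actionable?)
pages = [
    ("MissedAttestations", True),
    ("MissedAttestations", True),
    ("LatencyP99High", False),
    ("LatencyP99High", False),
    ("LatencyP99High", True),
    ("KeyConflictSuspected", False),
]
print("precision:", page_precision(pages))
print("fix first:", noisy_alerts(pages))
```

The ranked list is the monitoring bug backlog: the alert at the top is the one whose threshold, window, or very existence gets reviewed first.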

What it looks like for a client

On our testbed we run a stack built on Prometheus, Grafana, and Alertmanager in which every waking alert is symptom-based and has a runbook, and we deliberately run scenarios to check that the important alerts fire while the noise stays quiet. On a client contract this becomes their own set of SLIs, their alerts with runbooks, and their diagnostic dashboards, tuned for the specific network.

If you need monitoring that wakes you for a reason and does not drown you in noise, that is part of what we cover through operate. If you want us to review your current alerting for noise and blind spots, get in touch.