From Alert Fatigue to Optimized Notifications with System Datasets
Alert fatigue rarely begins as a single mistake. It builds as systems scale, teams expand, and “just in case” monitoring becomes the default. A few extra alerts, another threshold, and soon the on-call channel is overwhelmed. Engineers get interrupted for noise or stop trusting pages; either way, real signals get missed. Reliability drops, and productivity quietly declines.
Most teams respond tactically: tune thresholds, change notifications, suppress noise. That helps, but it misses the core problem: alerting is treated as a configuration problem, not as a system with outcomes you can measure.
To fix alert fatigue, you need two things: visibility into the alerting activity and governance that treats alerting like any other subsystem.
Coralogix’s alerts.history System Dataset makes that possible. It turns alert behavior into queryable data so teams can measure patterns, apply consistent standards, and connect alerts to operational impact. The goal is simple: interruptions should be intentional, justified, and tied to real risk.
Alert governance begins with the right dataset
The system/alerts.history dataset captures alert evaluations and their outcomes. It answers questions that are usually anecdotal: which alert definitions generate the most events, which ones trigger repeatedly, and which ones are actually high-impact. It grounds everyone in a shared source of truth, which is critical if you want governance to scale across teams.
What makes system/alerts.history especially useful for governance is that it includes both sides of the alerting story in every event:
- The alert definition (alertDef)
- The alert instance (alert)
Alert definition fields like alertDef.id, alertDef.name, alertDef.type, and alertDef.priority are the governance layer. They let you treat an “alert” as a managed asset with an identity, intent, and severity posture, so you can answer questions like “which definitions generate most of our interruptions?” and “are we reserving P1 for truly urgent conditions?”
Alert instance fields like alert.id and alert.status form the behavioral layer. They describe what happened at runtime: when alerts fired, how often they fired, and whether they oscillated between Triggered and Resolved. Together, these fields expose the data you need to monitor, understand, and govern alerting across teams. Definitions describe intent, and instances show operational reality.
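As a quick illustration of the governance layer, a priority-mix query shows whether high-severity pages dominate the stream. This is a sketch; it assumes alertDef.priority carries your severity values (e.g. P1–P4):

```
source system/alerts.history
| groupby alertDef.priority as priority
  aggregate count() as alert_events
| sortby alert_events desc
```

If P1 accounts for most of the events, either the system is genuinely unstable or severities are being over-assigned; either answer is actionable.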
Step one: quantify attention grabs
Governance starts with measurement, and measurement begins with volume. The first useful question isn’t “are we fatigued?” but “what is generating most of the alert stream?” The goal is to see where attention is being spent.
Start by finding which alert definitions (by alertDef.id / alertDef.name) generate the most alert events:
source system/alerts.history
| top 50 alertDef.id as alert_def_id, alertDef.name as alert_name by count()
These counts replace guesswork and tribal knowledge with trackable data. Once you see which alert definitions dominate the stream, you can quickly focus on whether the noise reflects real instability, a mis-tuned threshold, or an alert that belongs in passive visibility instead of paging.
This matters because fatigue is often misclassified. Not every “interesting” metric deserves an interrupt. Governance means being explicit about what should page and what should stay on dashboards.
Step two: find unstable, noisy alerts
Fatigue isn’t only about quantity; it’s also about volatility. Alerts that repeatedly trigger and resolve create constant interruption and confusion. Even if each event is “correct,” responders get exhausted, and the org either stops reacting or overreacts.
Because system/alerts.history includes alert.status, you can measure trigger/resolution behavior. The exact meaning of “Resolved” depends on how your alerts are configured, but even a crude churn indicator is valuable. It highlights alerts that repeatedly interrupt without producing stable incident narratives.
To detect instability, measure state changes (Triggered ↔ Resolved) by alert definition and look for high churn:
source system/alerts.history
| groupby alertDef.id as alert_def_id, alertDef.name as alert_name
aggregate
count_if(alert.status == 'Triggered') as triggered,
count_if(alert.status == 'Resolved') as resolved
| top 50 alert_def_id, alert_name, triggered, resolved by triggered / if(resolved == 0, 1, resolved)
“Churny” alerts are prime candidates for hysteresis, longer windows, dependency checks, or redesigning the signal. Reducing churn is often the fastest way to make on-call feel manageable again.
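To shortlist those candidates, a variant of the churn query can filter for definitions that both trigger and resolve frequently. This is a sketch; the thresholds are illustrative, not prescriptive:

```
source system/alerts.history
| groupby alertDef.id as alert_def_id, alertDef.name as alert_name
  aggregate
    count_if(alert.status == 'Triggered') as triggered,
    count_if(alert.status == 'Resolved') as resolved
| filter triggered > 20 && resolved > 20
| sortby triggered desc
```

Definitions that surface here with near-equal trigger and resolve counts are the flapping alerts worth redesigning first.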
Step three: make alert load visible over time
Averages hide pain. A system can look “fine” per day but still create brutal peaks. Productivity loss often happens when interruptions cluster too tightly. Time bucketing lets you see those peaks. Because the dataset stores timestamps as epoch seconds, you can parse and round them into hourly buckets:
source system/alerts.history
| groupby roundTime(parseTimestamp(alert.timestamp:string, 'timestamp_second'), 1h) as hour_bucket
aggregate count() as alert_events
| sortby alert_events desc
| limit 50
| sortby hour_bucket asc
This supports governance because you can define explicit expectations, like “interruption budgets” per hour during business hours. You can then check whether alerting stays within what a team can realistically handle. When spikes exceed capacity, you have evidence to justify redesign or reliability work.
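For example, if a team sets an interruption budget of 30 alert events per hour (an illustrative figure; choose one that matches your on-call capacity), the hours that blow the budget can be listed directly:

```
source system/alerts.history
| groupby roundTime(parseTimestamp(alert.timestamp:string, 'timestamp_second'), 1h) as hour_bucket
  aggregate count() as alert_events
| filter alert_events > 30
| sortby hour_bucket asc
```

Each row is an hour where the team was asked to absorb more interruptions than the budget allows, which is concrete evidence for tuning or reliability work.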
It also turns alert tuning into planning: if a team gets a steady stream of real, high-impact alerts, they need time to fix root causes. Alert load data helps make that case clearly.
Governance as an engineering system
The key is feedback. Alerting becomes governable when you measure behavior continuously, not only when post-incident frustration boils over. High-functioning orgs treat alert governance as a recurring engineering loop: measure, review, improve, and enforce basic standards. This doesn’t require heavy processes, just shared visibility, consistent metrics, and the willingness to retire or redesign alerts that don’t earn their interruption cost.
A practical governance rhythm looks like this:
- Review the loudest alert definitions
- Evaluate priority mix and churn
- Identify the top three redesign candidates
- Implement changes that reduce interruption without hiding risk
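That rhythm can start from a single review query built from the fields above. This is a sketch combining volume, priority, and trigger counts per definition:

```
source system/alerts.history
| groupby alertDef.id as alert_def_id, alertDef.name as alert_name, alertDef.priority as priority
  aggregate
    count() as total_events,
    count_if(alert.status == 'Triggered') as triggered
| sortby total_events desc
| limit 20
```

Running it at the start of each review gives every team the same starting point for choosing redesign candidates.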
Over time, you build a healthier alert portfolio: pages mean something, responders trust signals, and awareness comes from dashboards and SLOs, without constant noise. System Datasets make this scalable because governance is not based on opinion, but on data, and the same approach can work across many teams.