Callen receives a text alert warning her that the company page can’t connect to the database. Prometheus reveals that the database API call failed. Callen quickly switches the system to the backup database. The company’s webpage is restored and the day is saved!
When a system experiences an issue and isn’t able to fix itself, an alert is triggered. An alert is a notification about a change of state, usually a problem. Alerts can be delivered in a variety of ways:
- Email alerts
- Slack alerts
- SMS/phone calls
Imagine receiving alerts across all of these channels throughout the day. We may develop alert fatigue, where we start ignoring alerts or turning them off. To avoid alert fatigue:
- Only alert when immediate human intervention is required
- Alert based on customer-facing issues
- Set clear ways to indicate urgency
- Ensure no alert duplicates another
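The guidelines above can be sketched as a small filter. This is only an illustration, not a real alerting tool: the `Alert` fields, the `Urgency` levels, and the alert names are hypothetical, and treating "customer-facing" as a hard requirement is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Alert:
    name: str              # hypothetical identifier, e.g. "db_connection_failed"
    customer_facing: bool  # does the issue affect customers?
    needs_human: bool      # is immediate human intervention required?
    urgency: Urgency       # explicit urgency marker


def filter_alerts(alerts):
    """Return the alerts worth sending, most urgent first."""
    seen = set()
    kept = []
    for alert in alerts:
        if not alert.needs_human:
            continue  # only alert when a human must step in
        if not alert.customer_facing:
            continue  # assumption: skip issues customers never see
        if alert.name in seen:
            continue  # drop copies of an alert already kept
        seen.add(alert.name)
        kept.append(alert)
    # Stable sort: HIGH urgency alerts move to the front.
    kept.sort(key=lambda a: a.urgency is not Urgency.HIGH)
    return kept
```

Given a routine cleanup notice, a slow checkout page, and two identical database failures, this filter would keep one database alert first, followed by the checkout alert.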
With proper restraint, alerting is a critical component of system monitoring. Alerts provide context to help teams solve an issue before it becomes a crisis.
Bloop Co is attempting to organize its alerts into the following categories:
1) Non-actionable monitorable events
2) Issues needing eventual resolution
3) Immediate response required
The auditing team sees an alert that triggers when the homepage loading time increases to multiple seconds. Which category would this alert fall under?
Answer: Immediate response required! No user wants to browse a page that takes many seconds to load. It would be an unpleasant experience for them.
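As a rough sketch, the same categorization could be expressed in code. The thresholds `warn_at` and `page_at` are assumptions chosen for illustration; Bloop Co's actual limits are not given here.

```python
from enum import Enum


class Category(Enum):
    NON_ACTIONABLE = "Non-actionable monitorable event"
    EVENTUAL = "Issue needing eventual resolution"
    IMMEDIATE = "Immediate response required"


def categorize_load_time(seconds, warn_at=1.0, page_at=3.0):
    """Map a homepage load time (in seconds) to an alert category.

    warn_at and page_at are hypothetical thresholds, not real policy.
    """
    if seconds >= page_at:
        return Category.IMMEDIATE       # customers are clearly affected
    if seconds >= warn_at:
        return Category.EVENTUAL        # worth fixing, but not urgent
    return Category.NON_ACTIONABLE      # record it, notify nobody


print(categorize_load_time(5.0).value)
```

Under these assumed thresholds, a five-second load time falls into the immediate response category, matching the answer above.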
When an alert comes in, we want it to directly lead to solving a system problem. This is the realm of observability, discussed next.