You can look at the slides for a more in-depth treatment. From this axiom come a few design principles for producing alerts or designing an alert system. One of them: alert on symptoms, not causes.
It is more valuable to be alerted on user-centric symptoms (latency, inability to access a page or module, form submissions erroring out, and so on) than on the underlying cause, such as a server failing to restart or a Kafka queue being backed up.
Let’s take a concrete example. At my previous company, RelateIQ, we wrote a crawler for our customers’ GMail accounts. As the system scaled, the Kafka queues it used under the hood would occasionally get backed up. A cause-based alert would fire when the queue backlog rose above a threshold. That is useful, but it does not capture the symptom: users seeing emails show up in GMail, but not in RelateIQ. A symptom-based alert would involve a blackbox process that reported on emails not showing up in RelateIQ within a time threshold.
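To make the contrast concrete, here is a minimal Python sketch of the two kinds of checks. It is not the code we ran at RelateIQ: the `ProbeResult` type, the function names, and the thresholds are all hypothetical, and it assumes some blackbox prober delivers a test email to a mailbox and records when (or whether) that email showed up in the product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """One blackbox probe (hypothetical shape, for illustration only)."""
    sent_at: float                   # epoch seconds when the test email landed in the mailbox
    seen_in_app_at: Optional[float]  # epoch seconds when it appeared in the product, or None


def cause_alert(queue_backlog: int, max_backlog: int) -> bool:
    """Cause-based: fire when the queue backlog exceeds a threshold."""
    return queue_backlog > max_backlog


def symptom_alert(probe: ProbeResult, now: float, max_delay_s: float) -> bool:
    """Symptom-based: fire when the email took (or is taking) too long to appear."""
    if probe.seen_in_app_at is not None:
        # The email did arrive; alert only if it arrived later than allowed.
        return probe.seen_in_app_at - probe.sent_at > max_delay_s
    # The email has not arrived yet; alert once it is overdue.
    return now - probe.sent_at > max_delay_s
```

Note that the symptom check fires regardless of why the email is missing: a backed-up queue, a crashed crawler, or something nobody thought to monitor.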
Symptom-based alerting is more valuable because it captures what is most important to your users.
Moreover, compare getting the alert “User emails are showing up in GMail but not in RelateIQ” with getting “Kafka queue backlog above threshold”. The former conveys far more insight and urgency than the latter.
This thought is fleshed out in detail by Rob Ewaschuk, a former SRE at Google, in this article.
As with all rules, there are exceptions. Cause-based alerting is useful when catching the symptom would be too late. In that case, alert on causes so you know that things are about to get bad.
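One way to picture such an exception is a cause-based early warning that fires while there is still time to act. The sketch below is an illustration under assumptions, not a recipe from the talk: it extrapolates linearly from two backlog samples, and the function names, the `limit` at which users would start to notice, and the `lead_time_s` parameter are all hypothetical.

```python
from typing import Optional


def seconds_until_limit(backlog_then: float, backlog_now: float,
                        interval_s: float, limit: float) -> Optional[float]:
    """Estimate seconds until the backlog reaches `limit`, assuming linear growth.

    Returns 0.0 if the limit is already reached, and None if the backlog
    is flat or shrinking (no projected breach).
    """
    if backlog_now >= limit:
        return 0.0
    growth_per_s = (backlog_now - backlog_then) / interval_s
    if growth_per_s <= 0:
        return None
    return (limit - backlog_now) / growth_per_s


def early_warning(backlog_then: float, backlog_now: float, interval_s: float,
                  limit: float, lead_time_s: float) -> bool:
    """Fire if the backlog is projected to hit the limit within `lead_time_s`."""
    eta = seconds_until_limit(backlog_then, backlog_now, interval_s, limit)
    return eta is not None and eta <= lead_time_s
```

The point of the lead time is that the alert reaches a human before the symptom reaches a user.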
When creating your next alerting rule, remember that alerting is for humans, so alert on symptoms, not causes.
We'll be writing more on the topic 'Alerting is for Humans' in future posts.