Last week, we hosted Grier Johnson, a Platform Engineer at Square and formerly LinkedIn and Yahoo!, for a Tech Talk on Metrics and Monitoring Infrastructure. Grier brings a lot of experience and care to this topic. I will highlight a key theme in his talk that is applicable to every engineer: Alerting is for humans.
You can look at the slides to get an in-depth look. From this axiom, come a few design principles for producing alerts or for designing an alert system: alert on symptoms, not causes.
It is more valuable to be alerted on user centric symptoms, e.g. latency, inability to access a page or module, form submissions erroring out etc. than to be alerted on the underlying cause: server failing to restart or Kafka queue being backed up etc.
Let’s take a concrete example: At my previous company, RelateIQ, we wrote a crawler for our customers’ GMail accounts. As the system scaled, it occasionally ran into situations where the Kafka queues being used under the hood would get backed up. A cause based alert would alert on the queue being backed up above a threshold. It is useful, but it does not capture the symptom: users seeing emails show up in GMail, but not in RelateIQ. A symptom based alert would involve a blackbox process that reported on emails not showing up in RelateIQ within a threshold.
Symptoms based alerting is more valuable because it captures what is most important to your users.
Moreover, imagine getting an alert: “User emails are showing up in GMail but not in RelateIQ” vs. “Kafka queue backlog above threshold”. The former gives far more insight and urgency than the latter.
This thought is flushed out in detail by Rob Ewaschuk, a former SRE at Google, in this article.
As with all rules, there are exceptions to this rule. Cause based alerting is useful when catching the symptom would have been too late! In that case, alert on causes so you can know that things are about to get bad.
When creating your next alerting rule, remember that alerting is for humans, so alert on symptoms not causes.
We’ll be posting more on this topic ‘Alerting is for Humans’ in future posts.
Stay up to date
Don’t miss out! Subscribe to our digest to get the latest about feature flags, continuous delivery, experimentation, and more.
At Split, we “dogfood” our own product in so many ways. Our engineering and product teams are using Split nearly every day. It’s how we make Split better.
A/B testing is a powerful tool for learning about your users, understanding your features’ impact, and making informed business decisions. To ensure you make the best decisions and are extracting the most insights from your experiments, some experimental design guidelines are essential. These guidelines can be cumbersome or confusing at…
Feature flags provide so much for software organizations: they allow teams to separate code deployment from feature release, test in production, run experiments, and more. However, some rules apply to the feature flagging process that are easy for teams to overlook. I’ve gathered the best practices of feature flags from…