Last week, we hosted Grier Johnson, a Platform Engineer at Square and formerly LinkedIn and Yahoo!, for a Tech Talk on Metrics and Monitoring Infrastructure. Grier brings a lot of experience and care to this topic. I will highlight a key theme in his talk that is applicable to every engineer: Alerting is for humans.
The slides give a more in-depth look. From this axiom follow a few design principles for producing alerts or designing an alert system. The one I will focus on here: alert on symptoms, not causes.
It is more valuable to be alerted on user-centric symptoms (e.g. latency, inability to access a page or module, or form submissions erroring out) than on the underlying cause, such as a server failing to restart or a Kafka queue being backed up.
Let’s take a concrete example. At my previous company, RelateIQ, we wrote a crawler for our customers’ Gmail accounts. As the system scaled, the Kafka queues used under the hood would occasionally get backed up. A cause-based alert would fire when the queue backlog rose above a threshold. That is useful, but it does not capture the symptom: users seeing emails show up in Gmail but not in RelateIQ. A symptom-based alert would involve a black-box process that reports when emails fail to show up in RelateIQ within a time threshold.
Symptom-based alerting is more valuable because it captures what is most important to your users.
Moreover, compare two alerts: “User emails are showing up in Gmail but not in RelateIQ” and “Kafka queue backlog above threshold”. The former conveys far more insight and urgency than the latter.
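To make the contrast concrete, here is a minimal Python sketch of the two kinds of check. It is only an illustration under stated assumptions, not how RelateIQ's monitoring was actually built: the ProbeEmail record, the cause_alert and symptom_alert functions, and all thresholds are hypothetical names invented for this example. The idea is that a black-box process sends test emails to a monitored mailbox and records when each one shows up downstream.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeEmail:
    """Hypothetical record of a test email sent by a black-box probe."""
    probe_id: str
    received_at: float           # epoch seconds when it arrived in the mailbox
    synced_at: Optional[float]   # epoch seconds when it appeared downstream, or None


def cause_alert(queue_backlog: int, max_backlog: int) -> Optional[str]:
    """Cause-based check: fires on an internal metric crossing a threshold."""
    if queue_backlog > max_backlog:
        return f"Kafka queue backlog above threshold ({queue_backlog} > {max_backlog})"
    return None


def symptom_alert(probes: List[ProbeEmail], now: float, max_delay_s: float) -> Optional[str]:
    """Symptom-based check: fires when probe emails are still missing downstream
    after max_delay_s seconds, regardless of what caused the delay."""
    missing = [
        p for p in probes
        if p.synced_at is None and now - p.received_at > max_delay_s
    ]
    if missing:
        return (
            f"{len(missing)} of {len(probes)} probe emails are in the mailbox "
            f"but not in the product after {max_delay_s:.0f}s"
        )
    return None


if __name__ == "__main__":
    probes = [
        ProbeEmail("a", received_at=0.0, synced_at=30.0),
        ProbeEmail("b", received_at=100.0, synced_at=None),
    ]
    # The backlog looks healthy, yet one probe email never arrived downstream.
    print(cause_alert(queue_backlog=10, max_backlog=100))
    print(symptom_alert(probes, now=500.0, max_delay_s=300.0))
```

In this sketch the symptom check fires whatever the underlying cause is, including causes nobody thought to put a threshold on, and its message describes what the user is experiencing.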
This thought is fleshed out in detail by Rob Ewaschuk, a former SRE at Google, in this article.
As with all rules, this one has exceptions. Cause-based alerting is useful when catching the symptom would be too late. In that case, alert on causes so you know that things are about to get bad.
When creating your next alerting rule, remember that alerting is for humans, so alert on symptoms, not causes.
We’ll be writing more on the ‘Alerting is for Humans’ theme in future posts.