Last week, we hosted Grier Johnson, a Platform Engineer at Square and formerly LinkedIn and Yahoo!, for a Tech Talk on Metrics and Monitoring Infrastructure. Grier brings a lot of experience and care to this topic. I will highlight a key theme in his talk that is applicable to every engineer: Alerting is for humans.
You can look at the slides to get an in-depth look. From this axiom, come a few design principles for producing alerts or for designing an alert system: alert on symptoms, not causes.
It is more valuable to be alerted on user centric symptoms, e.g. latency, inability to access a page or module, form submissions erroring out etc. than to be alerted on the underlying cause: server failing to restart or Kafka queue being backed up etc.
Let’s take a concrete example: At my previous company, RelateIQ, we wrote a crawler for our customers’ GMail accounts. As the system scaled, it occasionally ran into situations where the Kafka queues being used under the hood would get backed up. A cause based alert would alert on the queue being backed up above a threshold. It is useful, but it does not capture the symptom: users seeing emails show up in GMail, but not in RelateIQ. A symptom based alert would involve a blackbox process that reported on emails not showing up in RelateIQ within a threshold.
Symptoms based alerting is more valuable because it captures what is most important to your users.
Moreover, imagine getting an alert: “User emails are showing up in GMail but not in RelateIQ” vs. “Kafka queue backlog above threshold”. The former gives far more insight and urgency than the latter.
This thought is flushed out in detail by Rob Ewaschuk, a former SRE at Google, in this article.
As with all rules, there are exceptions to this rule. Cause based alerting is useful when catching the symptom would have been too late! In that case, alert on causes so you can know that things are about to get bad.
When creating your next alerting rule, remember that alerting is for humans, so alert on symptoms not causes.
We’ll be posting more on this topic ‘Alerting is for Humans’ in future posts.
Stay up to date
Don’t miss out! Subscribe to our digest to get the latest about feature flags, continuous delivery, experimentation, and more.
Why build or buy a robust feature flags as a service platform when you could just implement a simple toggling mechanism as a one-off and be done with it?
Blind spots have a way of coming back to bite you down the road. When building software, the size of that “bite” grows exponentially with time.
Split hits 1T feature flags served per month, and analyzes how COVID-19 has accelerated trends in digital transformation and feature experimentation.