Alerting is for Humans

April 19, 2016

Last week, we hosted Grier Johnson, a Platform Engineer at Square and formerly of LinkedIn and Yahoo!, for a Tech Talk on Metrics and Monitoring Infrastructure. Grier brings a lot of experience and care to this topic. I will highlight a key theme from his talk that applies to every engineer: alerting is for humans.

The slides cover the talk in depth. From this axiom come a few design principles for writing alerts or designing an alerting system. The one I want to focus on: alert on symptoms, not causes.

It is more valuable to be alerted on user-centric symptoms (e.g. latency, inability to access a page or module, form submissions erroring out) than on the underlying cause (e.g. a server failing to restart or a Kafka queue being backed up).

Let’s take a concrete example. At my previous company, RelateIQ, we wrote a crawler for our customers’ GMail accounts. As the system scaled, the Kafka queues used under the hood would occasionally get backed up. A cause-based alert would fire when the queue backlog rose above a threshold. That is useful, but it does not capture the symptom: users seeing emails show up in GMail, but not in RelateIQ. A symptom-based alert would involve a black-box process that reports when emails fail to show up in RelateIQ within a time threshold.
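To make the contrast concrete, here is a minimal sketch of the two kinds of check. It is not how RelateIQ implemented either alert. The `ProbeEmail` record, the function names, and the thresholds are all hypothetical, and the sketch assumes some black-box probe has already recorded when each test email became visible in GMail and in the product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ProbeEmail:
    """One test email tracked by a black-box probe (hypothetical record)."""
    message_id: str
    seen_in_gmail_at: datetime
    seen_in_product_at: Optional[datetime] = None  # None: not visible yet


def cause_based_alert(queue_backlog: int, max_backlog: int) -> Optional[str]:
    """Fire on an internal metric: the size of the queue backlog."""
    if queue_backlog > max_backlog:
        return "Kafka queue backlog above threshold"
    return None


def is_late(probe: ProbeEmail, now: datetime, max_delay: timedelta) -> bool:
    """True if the email took, or has so far taken, longer than max_delay."""
    if probe.seen_in_product_at is None:
        elapsed = now - probe.seen_in_gmail_at
    else:
        elapsed = probe.seen_in_product_at - probe.seen_in_gmail_at
    return elapsed > max_delay


def symptom_based_alert(
    probes: List[ProbeEmail], now: datetime, max_delay: timedelta
) -> Optional[str]:
    """Fire on what the user experiences: emails missing from the product."""
    late = [p for p in probes if is_late(p, now, max_delay)]
    if late:
        return (
            "User emails are showing up in GMail but not in RelateIQ "
            f"({len(late)} of {len(probes)} probe emails late)"
        )
    return None


if __name__ == "__main__":
    now = datetime(2016, 4, 19, 12, 0)
    probes = [
        # Arrived in the product two minutes after GMail: fine.
        ProbeEmail("a", now - timedelta(minutes=30), now - timedelta(minutes=28)),
        # Still missing from the product after twenty minutes: late.
        ProbeEmail("b", now - timedelta(minutes=20)),
    ]
    print(cause_based_alert(queue_backlog=500, max_backlog=100))
    print(symptom_based_alert(probes, now, max_delay=timedelta(minutes=5)))
```

Note that the symptom check never looks at the queue at all. It would fire just as well if emails went missing for a reason nobody had anticipated.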

Symptom-based alerting is more valuable because it captures what is most important to your users.

Moreover, imagine getting the alert “User emails are showing up in GMail but not in RelateIQ” versus “Kafka queue backlog above threshold”. The former conveys far more insight and urgency than the latter.

This thought is fleshed out in detail by Rob Ewaschuk, a former SRE at Google, in this article.

As with all rules, there are exceptions. Cause-based alerting is useful when catching the symptom would be too late! In that case, alert on causes so you know that things are about to get bad.
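One way to picture such an early-warning, cause-based alert is to extrapolate a cause metric forward and fire while there is still time to react. The sketch below does this for a queue backlog using a simple linear projection from two samples. The function names, the backlog limit, and the lead time are hypothetical, and a real system would likely smooth over more than two data points.

```python
from datetime import timedelta
from typing import Optional


def time_until_limit(
    backlog_then: int, backlog_now: int, interval: timedelta, limit: int
) -> Optional[timedelta]:
    """Linearly extrapolate when the backlog will reach `limit`.

    Returns None if the backlog is not growing, and zero if it is
    already at or above the limit.
    """
    if backlog_now >= limit:
        return timedelta(0)
    growth = backlog_now - backlog_then
    if growth <= 0:
        return None
    rate_per_second = growth / interval.total_seconds()
    return timedelta(seconds=(limit - backlog_now) / rate_per_second)


def early_warning(
    backlog_then: int,
    backlog_now: int,
    interval: timedelta,
    limit: int,
    lead_time: timedelta,
) -> bool:
    """Fire if the backlog is projected to hit `limit` within `lead_time`."""
    eta = time_until_limit(backlog_then, backlog_now, interval, limit)
    return eta is not None and eta <= lead_time


if __name__ == "__main__":
    # Backlog grew from 1,000 to 4,000 messages over five minutes.
    fire = early_warning(
        backlog_then=1000,
        backlog_now=4000,
        interval=timedelta(minutes=5),
        limit=10000,
        lead_time=timedelta(minutes=15),
    )
    print("page someone" if fire else "all quiet")
```

The limit here stands in for the point at which users would start to feel the symptom, so the alert is still anchored to user impact even though it watches a cause.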

When creating your next alerting rule, remember that alerting is for humans, so alert on symptoms, not causes.

We’ll be writing more on the topic of ‘Alerting is for Humans’ in future posts.
