Observability in Software Development

When you can’t predict every way your software might fail, you want to be able to figure out what’s going on just by looking at its outputs: you want observability.

Observability is defined as the degree to which the internal states of a system can be determined from its external outputs.

“Observability” is the hot new tech buzzword. But is this actually a new concept, separate from monitoring? Or is it just a fancy new term? Today, we’ll be explaining observability: what it is, how it differs from monitoring and alerting, and why you should care.

One of the benefits of working with older technologies was the limited set of defined failure modes. Yes, things broke, but you would pretty much know what broke at any given time, or you could find out quickly, because a lot of older systems failed in pretty much the same three ways over and over again.

As systems became more complex, the possible failures became more abundant. To try to fix this problem, we created monitoring tools to help us figure out what was going on in the guts of our software. We kept track of our application performance with monitoring data collection and time series analytics. This process was manageable for a while, but it quickly got out of hand.

Modern systems — with everything turning into open-source cloud-native microservices running on Kubernetes clusters — are extraordinarily complex. Further, they’re now being developed faster than ever. Between CI/CD, DevOps, agile development, and progressive delivery, the software delivery train is speeding up.

With these complex, distributed systems being developed at the speed of light, the possible failure modes are multiplying. When something fails, it’s no longer obvious what caused it. We cannot keep up with this by simply building better applications. Nothing is perfect, everything fails at some point, and the best thing we can do as developers is to make sure that when our software fails, it’s as easy as possible for us to fix it.

Unfortunately, many modern developers don’t know what their software’s failure modes are. In many cases, there are just too many. Further, sometimes we don’t even know that we don’t know. And this is dangerous. Unknown unknowns mean you won’t put effort into fixing the problem, because you don’t know it exists.

Standard monitoring — the kind that happens after release — cannot fix this problem. It can only track known unknowns. Tracking KPIs is only as useful as the KPIs themselves are relevant to the failure they’re trying to detect. Reporting performance is only as useful as that reporting accurately represents the internal state of your system. Your monitoring is only as useful as your system is monitor-able.

This concept of monitor-able-ness has a name: observability.

Implementing Observability

The key tools for implementing observability are metrics, logs, and tracing.

Metrics are a central part of any monitoring process, but even when you have the right ones, you’re necessarily limited by the constraints of linear time. People decide on metrics based on failures they’ve already found and fixed in the past. But there may be unknown unknowns: failures you haven’t seen before, and therefore can’t anticipate. Preemptively checking your metrics to find patterns is an option, but this isn’t a replacement for being able to come back quickly from a failure. In short, metrics are necessary, but not sufficient.
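
To make the idea concrete, here is a minimal sketch of in-process metrics, assuming a hand-rolled registry rather than any particular monitoring product. The `Metrics` class, the `handle_request` handler, and the metric names are all hypothetical. Note that the sketch only counts the one failure its author thought of in advance, which is exactly the limitation described above.

```python
import time
from collections import defaultdict


class Metrics:
    """Tiny in-process metrics registry (hypothetical, for illustration only)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        # Monotonic counter, e.g. number of requests seen.
        self.counters[name] += amount

    def observe(self, name, seconds):
        # Raw duration samples; a real system would aggregate these.
        self.timings[name].append(seconds)

    def error_rate(self, total, errors):
        requests = self.counters[total]
        return self.counters[errors] / requests if requests else 0.0


metrics = Metrics()


def handle_request(payload):
    """Hypothetical handler that records a count, a latency, and a known failure."""
    start = time.perf_counter()
    metrics.incr("requests_total")
    try:
        if "user_id" not in payload:
            raise ValueError("missing user_id")
        return {"status": "ok"}
    except ValueError:
        # Only the failure we anticipated is counted here.
        metrics.incr("requests_failed")
        return {"status": "error"}
    finally:
        metrics.observe("request_seconds", time.perf_counter() - start)


for payload in [{"user_id": 1}, {}, {"user_id": 2}, {"user_id": 3}]:
    handle_request(payload)

print(metrics.error_rate("requests_total", "requests_failed"))
```

A rising error rate tells you that something is wrong, but not what. That is where the logs come in.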

While metrics should be constantly tracked, you only look at logs when your metrics are showing something strange you’d like to investigate. They’re more specific and detailed than metrics, and they exist to show you what happened in each event. Having understandable, queryable, comprehensive logs is a significant component of what separates the observable from the non-observable system.
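
One way to make logs queryable is to emit each event as a structured record instead of free text. The sketch below uses Python’s standard `logging` and `json` modules. The `JsonFormatter` class, the `checkout` logger name, the `context` key passed through `extra`, and the sample fields are assumptions made for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical attribute supplied via extra={...}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment authorized", extra={"context": {"order_id": "A1", "ms": 120}})
logger.error("payment declined", extra={"context": {"order_id": "B2", "ms": 950}})

# Because every line is structured, investigating becomes a query.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow_errors = [e for e in events if e["level"] == "ERROR" and e["ms"] > 500]
print(slow_errors)
```

In practice the records would go to a log store rather than an in-memory buffer, but the principle is the same: fields you can filter on beat strings you have to grep.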

Tracing is really just a type of logging that’s designed to record the flow of a program’s execution. Typically, tracing is more granular than standard logging: while logs may say that a program installation failed, a trace will show you the specific exception that was thrown and when during the runtime it happened. Tracing is frequently used to detect latency issues or find out which of many microservices is not working. It’s especially useful for error detection in distributed systems, to such an extent that this use case has its own name: distributed tracing.
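
As a rough illustration, the sketch below hand-rolls a tiny tracer: each unit of work is a span that records its parent, its duration, and any exception raised inside it. The `span` helper, the record fields, and the `checkout` and `charge_card` functions are hypothetical. It is single-threaded and in-process only, so it is not a substitute for a real distributed tracing library.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # completed spans, in the order they finished
_stack = []          # currently open spans (single-threaded sketch)


@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    record = {
        # Child spans inherit the trace id of their parent.
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "error": None,
    }
    _stack.append(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Record the specific exception, then let it propagate.
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        record["seconds"] = time.perf_counter() - start
        _stack.pop()
        finished_spans.append(record)


def charge_card(amount):
    with span("charge_card"):
        if amount <= 0:
            raise ValueError("amount must be positive")


def checkout(amount):
    with span("checkout"):
        with span("load_cart"):
            pass
        charge_card(amount)


try:
    checkout(0)
except ValueError:
    pass

for s in finished_spans:
    print(s["name"], s["parent_id"], s["error"])
```

A plain log line might only say that checkout failed. The spans show that `load_cart` succeeded and that the exception originated in `charge_card`, along with how long each step took.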

The biggest problem with all logging, including tracing, is that the volume of data you need to store becomes prohibitive, fast. Sampling is a possibility, as was implemented in Google’s Dapper project, but it’s not a perfect solution. For one thing, sampling is not easy or simple: different logs may need to be sampled in different ways and the sampling strategy will need to change over time. For another, sampling is too rigid for some use cases. So while it may be tempting to be like Google, copying Google’s development processes only makes sense for companies operating at a similar scale. If you’re smaller, you have much more flexibility.
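
To show why sampling needs tuning, here is a minimal sketch of deterministic, per-category sampling keyed on a trace ID. It is not Dapper’s actual scheme. The `SAMPLE_RATES` table, the category names, and the `keep` function are assumptions chosen for illustration.

```python
import hashlib

# Hypothetical per-category rates: keep every error, very few health checks.
SAMPLE_RATES = {"error": 1.0, "checkout": 0.5, "health_check": 0.01}
DEFAULT_RATE = 0.1


def keep(trace_id, category):
    """Decide whether to store a trace, deterministically per trace id."""
    rate = SAMPLE_RATES.get(category, DEFAULT_RATE)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the id onto the range [0, 2**64) and keep the lowest `rate` fraction.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept_health = sum(keep(t, "health_check") for t in trace_ids)
kept_errors = sum(keep(t, "error") for t in trace_ids)
print(kept_health, kept_errors)
```

Hashing the trace ID, rather than drawing a random number per log line, means every service makes the same decision for a given trace, so a sampled trace is kept whole. The rate table is the part that has to change over time, which is the maintenance burden described above.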

Different companies implement observability differently. Some track dozens of metrics and some track only a few; some keep all their logs and some down-sample them aggressively. Which solution works for you depends heavily on your company, your system, and your resources. But one thing is clear: observability is a real thing, it’s important, and systems that implement it from the get-go will be uniquely situated to spring back quickly from failure when it happens.

Switch It On With Split

The Split Feature Data Platform™ gives you the confidence to move fast without breaking things. Set up feature flags and safely deploy to production, controlling who sees which features and when. Connect every flag to contextual data, so you can know if your features are making things better or worse and act without hesitation. Effortlessly conduct feature experiments like A/B tests without slowing down. Whether you’re looking to increase your releases, to decrease your MTTR, or to ignite your dev team without burning them out–Split is both a feature management platform and partnership to revolutionize the way the work gets done. Schedule a demo or explore our feature flag solution at your own pace to learn more.
