Observability is a measure of how well the internal states of a system can be inferred from its external outputs. Because our software can fail in ways we have never seen and cannot anticipate, we want to be able to figure out what’s going on just by looking at the outputs: we want observability.
“Observability” is the hot new tech buzzword. But is this actually a new concept, separate from monitoring? Or is it just a fancy new term? Today, we’ll be explaining observability: what it is, how it differs from monitoring and alerting, and why you should care.
One of the benefits of working with older technologies was the limited set of defined failure modes. Yes, things broke, but you would usually know what broke at any given time, or you could find out quickly, because a lot of older systems failed in more or less the same three ways over and over again.
As systems became more complex, the number of possible failures grew. To try to fix this problem, we created monitoring tools to help us figure out what was going on in the guts of our software. We kept track of our application performance by collecting monitoring data and analyzing it as time series. This process was manageable for a while, but it quickly got out of hand.
Modern systems – with everything turning into open-source cloud-native microservices running on Kubernetes clusters – are extraordinarily complex. Further, they’re now being developed faster than ever. Between CI/CD, DevOps, agile development, and progressive delivery, the software delivery train is speeding up.
With these complex, distributed systems being developed at the speed of light, the possible failure modes are multiplying. When something fails, it’s no longer obvious what caused it. We cannot keep up with this by simply building better applications. Nothing is perfect, everything fails at some point, and the best thing we can do as developers is to make sure that when our software fails, it’s as easy as possible for us to fix it.
Unfortunately, many modern developers don’t know what their software’s failure modes are. In many cases, there are just too many. Further, sometimes we don’t even know that we don’t know. And this is dangerous. Unknown unknowns mean you won’t put effort into fixing the problem, because you don’t know it exists.
Standard monitoring – the kind that happens after release – cannot fix this problem. It can only track known unknowns. Tracking KPIs is only as useful as the KPIs themselves are relevant to the failure they’re trying to detect. Reporting performance is only as useful as that reporting accurately represents the internal state of your system. Your monitoring is only as useful as your system is monitor-able.
This concept of monitor-able-ness has a name: observability.
The key tools for implementing observability are metrics, logs, and tracing.
Metrics are a central part of any monitoring process, but even when you have the right ones, you’re necessarily limited by hindsight: people decide on metrics based on failures they’ve already found and fixed in the past. But there may be unknown unknowns: failures you haven’t seen before, and therefore can’t anticipate. Preemptively checking your metrics to find patterns is an option, but this isn’t a replacement for being able to recover quickly from a failure. In short, metrics are necessary, but not sufficient.
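To make the idea of a metric concrete, here is a minimal sketch of an in-memory counter registry. The `MetricsRegistry` class, the `handle_request` function, and the metric names are all hypothetical, chosen only for illustration; a real system would more likely use a dedicated metrics library and a time series store.

```python
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-memory metrics store that only tracks counters."""

    def __init__(self):
        self.counters = defaultdict(int)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def error_rate(self, errors, total):
        """Fraction of recorded requests that failed; 0.0 if none were recorded."""
        requests = self.counters[total]
        return self.counters[errors] / requests if requests else 0.0


def handle_request(registry, should_fail):
    # Stand-in for real request handling: count every request and every failure.
    registry.increment("requests_total")
    if should_fail:
        registry.increment("requests_failed")


registry = MetricsRegistry()
for outcome in [False, False, True, False]:
    handle_request(registry, outcome)

print(registry.error_rate("requests_failed", "requests_total"))
```

Note that the error rate only exists because someone decided in advance that failed requests were worth counting. A failure that doesn’t touch either counter stays invisible, which is the limitation described above.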
While metrics should be constantly tracked, you only look at logs when your metrics are showing something strange you’d like to investigate. They’re more specific and detailed than metrics, and they exist to show you what happened in each event. Having understandable, queryable, comprehensive logs is a significant component of what separates the observable from the non-observable system.
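One common way to make logs queryable is to emit each event as structured data rather than free text. The sketch below uses Python’s standard `logging` module with a small custom formatter; the `JsonFormatter` class, the `fields` key, the `checkout` logger name, and the sample events are assumptions made for illustration, not a prescribed schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each log record as one JSON object."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Merge any structured fields passed via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment accepted", extra={"fields": {"order_id": 17, "latency_ms": 42}})
logger.error("payment declined", extra={"fields": {"order_id": 18, "latency_ms": 910}})

# Because every line is structured, "which orders were slow?" is a simple filter.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [event["order_id"] for event in events if event["latency_ms"] > 500]
print(slow)
```

The point of the design is that the question being asked (here, slow orders) does not have to be known when the log line is written, as long as the relevant fields were recorded.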
Tracing is really just a type of logging that’s designed to record the flow of a program’s execution. Typically, tracing is more granular than standard logging: while logs may say that a program installation failed, a trace will show you the specific exception that was thrown and at what point during execution it was thrown. Tracing is frequently used to detect latency issues or find out which of many microservices is not working. It’s especially useful for error detection in distributed systems, to such an extent that this use case has its own name: distributed tracing.
The biggest problem with all logging, including tracing, is that the volume of data to store becomes prohibitive, fast. Sampling is a possibility, as was implemented in Google’s Dapper project, but it’s not a perfect solution. For one thing, sampling is not simple: different logs may need to be sampled in different ways and the sampling strategy will need to change over time. For another, sampling is too rigid for some use cases. So while it may be tempting to be like Google, using Google’s development processes is only reasonable for companies operating at a scale on the same order of magnitude as Google – if you’re smaller, you have much more flexibility.
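As a rough illustration of what sampling can look like, the sketch below keeps a fixed fraction of traces by hashing the trace id. This is one simple approach and is not a description of how Dapper works; the `keep_trace` function, the id format, and the 10% rate are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Deterministically keep roughly `sample_rate` of all trace ids."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.1)]
print(len(kept))  # roughly 1,000 of the 10,000 traces
```

Hashing the id rather than drawing a random number means every service makes the same keep-or-drop decision for a given trace, so sampled traces stay complete. It also shows the rigidity mentioned above: a single global rate drops rare, interesting events just as readily as routine ones.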
Different companies implement observability differently. Some track dozens of metrics and some track only a few; some keep all their logs and some downsample them aggressively. Which solution works for you depends heavily on your company, your system, and your resources. But one thing is clear: observability is a real thing, it’s important, and systems that implement it from the get-go will be uniquely situated to spring back quickly from failure when it happens.