Site Reliability and Experimentation: They Go Hand in Hand

The first SREs, or site reliability engineers, were hired by Google to (as you might imagine) address site reliability challenges. Since then, the SRE role has evolved to encompass availability, performance, capacity planning, system design, monitoring, and alerting. The ultimate goal of an SRE team is to create an environment that is predictable and debuggable. That goal aligns well with a robust feature delivery and experimentation program in which every new feature gets tested and every release happens behind a feature flag.

Feature Flags and Site Reliability

We encourage development teams to separate deployment from release so the SREs can manage the deployment of new code first, then monitor site reliability as users are gradually exposed to the new code/feature. If there is a problem, the dev or SRE team can kill the feature (Split has a built-in kill switch to make this simple and fast). This methodology replaces the “release and roll back” pattern that has burnt out many an SRE.
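To make the pattern concrete, here is a minimal sketch of a percentage rollout with a kill switch. It is an illustration only, not Split's SDK: the `FeatureFlag` class, its fields, and the flag name `new_checkout` are hypothetical, and a real platform would keep flag state in a shared service rather than in memory.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    """Hypothetical in-memory flag, for illustration only (not Split's SDK)."""
    name: str
    rollout_percent: int = 0  # share of users who see the new code, 0-100
    killed: bool = False      # kill switch: overrides the rollout entirely

    def _bucket(self, user_key: str) -> int:
        # Hash flag name + user key so each user lands in a stable bucket 0-99.
        digest = hashlib.sha256(f"{self.name}:{user_key}".encode()).hexdigest()
        return int(digest, 16) % 100

    def treatment(self, user_key: str) -> str:
        # A killed flag serves "off" to everyone, whatever the rollout says.
        if self.killed:
            return "off"
        return "on" if self._bucket(user_key) < self.rollout_percent else "off"

    def kill(self) -> None:
        self.killed = True


# The code is already deployed; only 10% of users are released to it.
flag = FeatureFlag("new_checkout", rollout_percent=10)
users = [f"user-{i}" for i in range(1000)]
exposed = sum(flag.treatment(u) == "on" for u in users)
print(f"{exposed} of {len(users)} users exposed")

# Something looks wrong: flip the kill switch instead of rolling back a deploy.
flag.kill()
print(sum(flag.treatment(u) == "on" for u in users))  # 0 after the kill
```

Hashing the user key keeps each user in the same bucket across requests, so raising `rollout_percent` only adds users to the exposed group and never reshuffles them.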

Dev and PM teams are focused on the release at hand and the goals of that release, so the metrics they watch most closely will relate to feature performance and business impact. This is great and precisely what they should be doing. SREs, by contrast, are the first line of defense in your organization and will define and track metrics that serve as guardrails for application health. Degradations and downtime are expensive, and a robust feature delivery platform gives SREs a finer degree of control for risk mitigation. If a feature or system is not behaving as expected, an SRE can turn off a flag, which is faster and less risky than the rollback it replaces.

That same feature delivery platform (a platform like Split!) can help SRE teams understand how the code behind new features affects core site metrics like site responsiveness or error rates. Split ties changes in these metrics back to specific features within a particular release. Split can also alert SRE and dev teams to significant changes in a metric post-release.

Experimentation and Site Reliability

It’s likely starting to become clear how site reliability and experimentation are inextricably linked. In a nutshell, experimentation can give SRE teams clearer causality between a code change and site reliability metrics. Because each code change is an experiment in which only a known percentage of users is exposed to the new code, tracking down changes that impact site reliability becomes much more manageable.

Experimentation also allows SRE teams to detect impacts on the metrics they’re tracking earlier in the rollout process, without generating noisy alerts. Split detects significant changes on all metrics between users with the feature and users without it, something your standard APM system cannot do alone.
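As a rough illustration of what "significant change" means when comparing users with and without a feature, the sketch below runs a standard two-proportion z-test on error rates. This is a textbook test chosen for the example, not a description of how Split computes significance, and the function name and sample counts are made up.

```python
import math


def two_proportion_z_test(errors_a, total_a, errors_b, total_b):
    """Return (z, two-sided p-value) for the difference in error rates.

    Group A is the control (feature off); group B is the treatment (feature on).
    """
    rate_a = errors_a / total_a
    rate_b = errors_b / total_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (errors_a + errors_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # no errors (or only errors) in both groups
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 120 errors in 10,000 control requests vs. 180 in 10,000 treatment.
z, p = two_proportion_z_test(120, 10_000, 180, 10_000)
print(f"z = {z:.2f}, p = {p:.5f}")
if p < 0.05 and z > 0:
    print("Error rate is significantly higher with the feature on")
```

Because both groups see the same traffic at the same time, a difference like this points at the feature itself rather than at a traffic spike or an unrelated deploy.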

An experimentation tool like Split will also give your SRE teams the ability to ingest and track crashes and exceptions when rolling out new features. All of these abilities make it easier to diagnose issues in real time and provide comprehensive post-event reporting.

SREs Need Data to be Proactive

We can all agree that it’s better if our engineering teams are proactive rather than reactive in nearly every case. With a strong feature delivery and experimentation platform in place, your SREs will have alerts at their fingertips and will know immediately when a metric degrades as a result of a feature flag change.
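One simple way to picture such an alert is a guardrail check that compares a metric for users with the flag on against users with it off. The sketch below is a simplified threshold rule under assumed names (`Guardrail`, `evaluate`, the `p95_latency_ms` metric, and the 10% limit are all hypothetical); it is not Split's alerting logic.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """Hypothetical guardrail definition for one metric."""
    metric: str
    max_relative_increase: float  # 0.10 means alert on an increase above 10%
    min_samples: int              # do not judge on too little data


def evaluate(guardrail, control_value, treatment_value, samples):
    """Return "wait", "ok", or "alert" for one metric reading."""
    if samples < guardrail.min_samples or control_value <= 0:
        return "wait"  # not enough data, or no usable baseline yet
    relative_change = (treatment_value - control_value) / control_value
    if relative_change > guardrail.max_relative_increase:
        return "alert"
    return "ok"


latency = Guardrail("p95_latency_ms", max_relative_increase=0.10, min_samples=1000)

print(evaluate(latency, 400, 410, samples=200))   # wait: too few samples
print(evaluate(latency, 400, 410, samples=5000))  # ok: 2.5% slower
print(evaluate(latency, 400, 460, samples=5000))  # alert: 15% slower
```

In practice the "alert" result would page the on-call SRE or trigger the flag's kill switch, and the minimum sample size keeps early, noisy readings from firing false alarms.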

Split integrates natively with Google Analytics and Sentry. We can also ingest data from most systems your SRE team will care about, including APM, logging, and observability tools. Alongside our integrations, we offer a public API for ingesting any additional data that needs to be linked to a user, and the ability to create alerts for any metric your teams care about.

Learn More about Site Reliability, Feature Flags, and Experimentation

As you’ve now seen, Split’s experimentation platform can continuously monitor key metrics such as load times and responsiveness, and drive performance improvements across them, with every feature rollout. If you’d like to learn more about site reliability and performance or how Split supports other engineering teams, we encourage you to explore Split’s other resources on these topics.
