To Feature Flag is to Experiment

When we started Split, one of the most common questions asked by investors, prospects, and friends was, “Wait, can you build a company on feature flags?” As entrepreneurs, we made that question, and the derision it implied, the flame that powered us through the highs and lows of building Split.

Why Feature Flags Have Taken Off

Over the years, this question has gone away. This is partly due to the hard work of Split and LaunchDarkly – two leading vendors in this space – and partly due to the natural adoption curve of newly productized categories. In other words, the market is now ready:

  • Homegrown feature flags have been implemented by many engineering teams
  • Better automated testing, monitoring and observability have opened the door to testing in production
  • Software apps are better instrumented than ever before, thanks to products such as Segment, Heap, and Amplitude

The new question I get asked regularly is far more interesting: “is feature flagging the complete solution or is feature flagging a feature of a larger solution?”

I believe that feature flagging is the foundation of a larger category of tools needed by modern-day development teams.

Let me explain. The most common interpretation of a feature flag is a basic switch that is either on or off for everyone. If you are really adventurous, you may add the ability to target specific users individually. All in all, this is a rather basic interpretation of a flag.

Using Feature Flags to Change Your Release Paradigm

The more interesting variant of feature flags is “controlled rollouts”: the idea of turning a feature on for a random percentage of users. More advanced versions of controlled rollouts involve customer dimensions. For instance:

// Turn on the feature for 50% of users in California.
// For everyone else, turn it on for 1% of users.
if user.state = 'ca' then split 50%:on, 50%:off
else split 1%:on, 99%:off
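
The rule above is pseudocode. As a rough illustration only, here is a minimal Python sketch of how such a rule could be evaluated. It assumes each user is assigned to a stable bucket by hashing the user ID together with the flag name; the names `bucket`, `treatment`, and `new_checkout` are hypothetical, and this is not a description of how Split's SDKs work.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag.

    Hashing the flag name together with the user ID keeps a user's
    bucket stable across requests while decorrelating different flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def treatment(user_id: str, state: str, flag_name: str = "new_checkout") -> str:
    """Return "on" or "off" for a user under the rule from the text."""
    # 50% of California users get the feature; everyone else gets 1%.
    rollout_percent = 50 if state.lower() == "ca" else 1
    return "on" if bucket(user_id, flag_name) < rollout_percent else "off"


if __name__ == "__main__":
    print(treatment("user-42", "CA"))
    print(treatment("user-42", "NY"))
```

Because the bucket depends only on the user ID and the flag name, the same user keeps the same treatment on every request, and raising the percentage later only adds users to the "on" group.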

Moving forward, I will use the terms controlled rollouts and feature flags interchangeably. 

Controlled rollouts are powerful. They allow product engineering teams to reduce the blast radius of mistakes, test in production, and generally bake their features and infrastructure without risking the user experience of every customer.

Controlled rollouts are also the most common variant of feature flags across Split’s customers. 

However, they also introduce an opportunity product engineering teams have never had before –  the ability to quantify the impact of a feature on engineering & product metrics without releasing it to 100% of users.

Let’s work through this a bit.

In a controlled rollout, a feature is first released to 1%-5% of users to check whether it introduces any obvious bugs, exceptions, or latency changes. However, the mean or 95th percentile latency across the site will barely move at such small exposure, because the exposed users are only a tiny share of all traffic. In essence, APMs and exception tracking systems are not built to pick up the signal produced by a feature flag.
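
To make the dilution concrete, here is a small sketch with made-up numbers: a hypothetical baseline mean latency of 200 ms and a feature that raises latency to 300 ms for exposed users. It covers only the mean, not the 95th percentile, and the function names are illustrative.

```python
def blended_mean(baseline_ms: float, treated_ms: float, exposure: float) -> float:
    """Site-wide mean latency when a fraction `exposure` of traffic sees the feature."""
    return (1 - exposure) * baseline_ms + exposure * treated_ms


def relative_shift(baseline_ms: float, treated_ms: float, exposure: float) -> float:
    """Relative change in the site-wide mean caused by the rollout."""
    return (blended_mean(baseline_ms, treated_ms, exposure) - baseline_ms) / baseline_ms


if __name__ == "__main__":
    baseline_ms = 200.0  # hypothetical mean latency without the feature
    treated_ms = 300.0   # hypothetical mean latency with the feature (50% worse)
    for exposure in (0.01, 0.05, 0.50):
        shift = relative_shift(baseline_ms, treated_ms, exposure)
        print(f"exposure {exposure:.0%}: site-wide mean shifts by {shift:.1%}")
```

Under these assumptions, a feature that is 50% slower for the users who see it moves the site-wide mean from 200 ms to 201 ms at 1% exposure, a 0.5% shift that is easy to lose in normal day-to-day variation.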

If there is no degradation in engineering operational metrics, the feature can be released to 25%-50% of users. As with APMs, product analytics systems will not pick up the impact of the feature on user behavior metrics at that exposure level.

In other words, feature flags break both types of measurement: engineering operational metrics and product user behavior metrics.

Feature Flags + Controlled Rollouts + Measurement = Experimentation

So, what is the solution? The solution is to tie measurement to feature flags in a single integrated system. At their full maturity, such systems are capable of managing feature flags and running experiments. Pick any successful tech company and at the heart of their product engineering teams is a feature delivery platform.

Such platforms allow engineers to release a feature to 1% of users and to automatically detect, alert on, and kill the flag if page latencies or exception rates increase. They allow product managers to continue the release to 50% of users and learn whether the feature had the desired impact on user behavior, or at least did not cause a degradation.
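
As a hedged sketch of what "detect and kill" could look like, the example below compares the exception rate of users who have the feature on against those who have it off, using a standard two-proportion z-test. The function names, the threshold of 3.0, and the sample counts are assumptions chosen for illustration, not a description of any vendor's implementation.

```python
import math


def exception_rate_z(on_errors: int, on_requests: int,
                     off_errors: int, off_requests: int) -> float:
    """Two-proportion z-score for the exception rate of "on" versus "off" traffic."""
    p_on = on_errors / on_requests
    p_off = off_errors / off_requests
    pooled = (on_errors + off_errors) / (on_requests + off_requests)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / on_requests + 1 / off_requests))
    return (p_on - p_off) / std_err if std_err > 0 else 0.0


def should_kill(on_errors: int, on_requests: int,
                off_errors: int, off_requests: int,
                z_threshold: float = 3.0) -> bool:
    """Recommend killing the flag when "on" traffic errors significantly more often.

    z_threshold is an assumed, illustrative cutoff; a real system would tune it.
    """
    z = exception_rate_z(on_errors, on_requests, off_errors, off_requests)
    return z > z_threshold


if __name__ == "__main__":
    # Hypothetical 1% rollout: 3% exception rate with the feature, 1% without.
    print(should_kill(on_errors=30, on_requests=1_000,
                      off_errors=990, off_requests=99_000))
```

The key design choice is that the comparison is made between the two treatment groups of the same flag rather than against a site-wide aggregate, which is what lets a regression show up even at 1% exposure.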

That is the real category feature flags create: one where you rapidly release, measure, and learn from your users through a unified solution for feature delivery.

In summary, feature flags are incomplete without controlled rollouts and measurement. By combining these three concepts, we create a powerful world where every feature is safe behind a flag, purposefully released to users, and quantified through metrics.