
Track Unlimited Metrics with the Same Statistical Rigor



A/B testing is a powerful tool for learning about your users, understanding your features’ impact, and making informed business decisions. To make the best decisions and extract the most insight from your experiments, you need to follow some experimental design guidelines. These guidelines can be cumbersome or confusing, which can lead to re-tests that take even more time, or even to the wrong decisions. Luckily, statistical techniques can take care of some of these issues and protect you from common A/B testing pitfalls.

Error Control

The beauty of frequentist statistics, such as the t-test which underpins Split’s statistical analyses, is that they provide strong and clearly defined guarantees on error rates. By setting the desired significance threshold (e.g., 0.05, or 95% confidence), the experimenter controls the chance of seeing a statistically significant result when there was no real impact (i.e., seeing a false positive). This lets you choose the balance between confidence and time to significance that is right for you. The lower the error rate, or equivalently the higher the confidence you require, the longer your metrics will take to reach significance. On the other hand, if getting results fast is important and you can tolerate a lower level of confidence, choosing a less strict significance threshold (e.g., 0.1, or 90% confidence) will decrease your metrics’ time to significance and allow you to detect smaller impacts.
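The trade-off between confidence and time to significance can be illustrated with a textbook sample-size approximation. The sketch below is not Split’s calculation: it assumes a two-sided test under a normal approximation, equal group sizes, and 80% power, and the function name and parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(alpha, effect, std_dev, power=0.8):
    """Approximate users needed per group to detect `effect` (a difference
    in means) with a two-sided test at significance threshold `alpha`.
    Illustrative normal approximation only."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_power = standard_normal.inv_cdf(power)          # critical value for the power
    return ceil(2 * ((z_alpha + z_power) * std_dev / effect) ** 2)


# A stricter threshold (smaller alpha) needs more users, hence more time.
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}: {samples_per_group(alpha, effect=0.1, std_dev=1.0)} per group")
```

With traffic held constant, the required sample size translates directly into how long a metric takes to reach significance.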

It’s essential to keep in mind that this error rate applies to each metric. The more metrics you have, the higher your chances of seeing a false positive result. In some cases, this can make decision making difficult, and it’s not always clear how testing multiple metrics impacts the chances of seeing a false positive.
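To see how quickly this adds up, here is a minimal sketch that computes the chance of at least one false positive across several metrics. It assumes the metrics are statistically independent and that the feature truly has no impact on any of them; real metrics are often correlated, so treat the numbers as a rough guide. The function name is hypothetical.

```python
def chance_of_any_false_positive(alpha, num_metrics):
    """Probability that at least one of `num_metrics` independent metrics
    appears significant at threshold `alpha` when nothing truly changed."""
    return 1 - (1 - alpha) ** num_metrics


for num_metrics in (1, 5, 20, 100):
    chance = chance_of_any_false_positive(0.05, num_metrics)
    print(f"{num_metrics:>3} metrics: {chance:.1%}")
```

Under these assumptions, a 5% per-metric error rate grows to roughly a 64% chance of at least one false positive once 20 metrics are tested.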

A standard recommendation is to select one metric as the key metric for your experiment (e.g., a primary key performance indicator, or primary KPI) and to judge the success of your test primarily on that metric. If other metrics appear significantly impacted, a cautious approach is to re-run the experiment with those metrics acting as the key metric. This ensures the experimenter has no more than a fixed chance of making an incorrect decision, because the error rate is tightly controlled when concluding on a single metric. (At Split, this chance is 5% by default.) While this approach is academically sound, in practice it can be time-consuming and overly conservative.

Experimenters more commonly limit the overall number of metrics for analysis or consider a subset of metrics to be more relevant to the tested change. The experimenter must then exercise their judgment on whether a highly unexpected impact may be a false positive and simply due to randomness in the data.

At Split, we intentionally calculate the results for every single one of your metrics for every test, so that you do not need to choose a subset or manually attach metrics to experiments. This is because features can have unexpected impacts, perhaps in a different part of the funnel or on performance metrics (such as page load times). This broad coverage can be hugely valuable, and experimenters shouldn’t need to limit the number of metrics they’re testing.

Multiple Comparison Corrections

Multiple comparison corrections are a set of statistical techniques that automatically adjust your results to control the overall chances of error, regardless of how many metrics you are testing. This means there’s one less thing to worry about — you don’t need to limit the number of metrics you’re testing or to make any subjective judgment calls about whether or not an unexpected change may be a false positive.

At Split, we support a type of multiple comparison correction which controls the false discovery rate. The false positive rate is the chance that a metric with no true underlying impact is deemed statistically significant. The false discovery rate, by contrast, is the chance that a metric which has been deemed statistically significant is in fact a false positive. If you consider each statistically significant metric a ‘discovery,’ then false discovery rate control limits the proportion of all of your discoveries which are false.

We chose to implement this type of correction because we believe it is more tightly coupled to how experimenters make decisions. It is only the statistically significant metrics which are reported as meaningful detected impacts of your test, hence it is most valuable to know that you can be confident in each of those reported discoveries.

When the multiple comparison correction is applied with the default significance threshold of 5%, you can be confident that, on average, at least 95% of the metrics shown as statistically significant reflect real impacts. In other words, no more than 5% of your discoveries are expected to be false positives. This guarantee applies regardless of how many metrics you have. If you need more confidence or faster results, you can adjust the significance threshold setting to higher or lower confidence, respectively.
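A widely used way to control the false discovery rate is the Benjamini-Hochberg procedure. This post does not say which procedure Split uses, so the sketch below is an assumption chosen for illustration, and the function name and p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which p-values remain significant
    after Benjamini-Hochberg false discovery rate control at level `alpha`.
    Illustrative only; not necessarily the procedure Split applies."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most k * alpha / m.
    largest_passing_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * alpha / m:
            largest_passing_rank = rank
    # Every p-value ranked at or below that rank counts as a discovery.
    significant = [False] * m
    for index in order[:largest_passing_rank]:
        significant[index] = True
    return significant


# Hypothetical p-values for ten metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
uncorrected = [p <= 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print("significant without correction:", sum(uncorrected))
print("significant with correction:   ", sum(corrected))
```

In this made-up example, five metrics fall below the 5% threshold on their own, but only the two with the strongest evidence remain significant once the correction is applied.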

Providing these strong guarantees on the error rate means some metrics may take longer to reach significance. For this reason, we treat your ‘key’ metrics and your ‘organizational’ metrics separately to give you additional flexibility. If you care about a small, specific set of metrics, you can set these as key metrics for your experiment. They will not be penalized for the large number of other metrics calculated for your organization, and they will reach significance faster than they otherwise would. If you choose a single metric to be a key metric, the correction will not affect its time to significance.
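One way to picture why key metrics benefit is to correct them as their own small group rather than pooling them with every organizational metric. The sketch below assumes a Benjamini-Hochberg correction applied separately to each group; this is an illustration of the idea, not a description of Split’s internals, and all names and p-values are hypothetical.

```python
def bh_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg: return the set of indices declared significant."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(ranked, start=1):
        if p_values[index] <= rank * alpha / m:
            cutoff_rank = rank
    return set(ranked[:cutoff_rank])


key_p_values = [0.03]  # a single key metric
org_p_values = [0.20, 0.35, 0.41, 0.48, 0.55, 0.62, 0.70, 0.81, 0.93]

# Corrected on its own, a single metric faces the plain threshold alpha.
key_alone = bh_significant(key_p_values)
# Pooled with nine other metrics, the same p-value must clear alpha / 10.
pooled = bh_significant(key_p_values + org_p_values)

print("key metric significant when corrected separately:", 0 in key_alone)
print("key metric significant when pooled with all metrics:", 0 in pooled)
```

With a single key metric the correction reduces to the ordinary threshold, which is consistent with the statement that its time to significance is unaffected.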

Calibrating Statistical Rigor to Business Needs

One of the most valuable aspects of the experimentation statistics we use at Split is the ability to control the likelihood of a false positive in your results. There isn’t a one-size-fits-all answer. Different businesses may find they need different confidence levels depending on factors like risk tolerance, typical traffic sizes, and acceptable lengths of time to wait for results.

However, as online experimentation advances and with our customers measuring more and more metrics every day, there is a need to support solutions more tailored to making business decisions and to remove any concern over measuring too many things.

With multiple comparison corrections, you can add as many metrics as you can come up with, and know that you can still report any statistically significant changes with the same fixed level of confidence.
