
Glossary

Power Analysis

Power analysis is the process of estimating how many users you will need in order to detect an effect of a given size, or how small an effect you can detect.

Statistical power is the probability of detecting an effect, given that the effect is really there. Power has a strong relationship with the number of observations you are able to make: the more users you send through an experiment, the smaller the effect you can reliably detect.

Why Do a Power Analysis

A power analysis can help you in several ways. It can tell you how many users are needed to detect an effect of a given size. It can also tell you how much power you have, given the effect size and the number of users available. For example, let’s say you have 100 people available, and you would like to know if that is enough to make a solid business case for a study. If the study would be underpowered, there’s no point in conducting it.


Basic Inputs to a Power Calculation

How do you calculate power? The basic inputs are: the baseline value of the metric (i.e., its value for the status quo, or control, experience); either the smallest effect you want to detect or the sample size you expect (you supply one and solve for the other); and the rollout percentage, such as a 50/50 or 10/90 split. If you’re measuring a means metric rather than a proportion, you’ll also need the standard deviation (or variance) of that metric.
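As a sketch of how these inputs combine, here is a standard sample-size formula for a conversion-rate metric, using a two-sided normal approximation. The function name and every number below (baseline, lift, thresholds) are illustrative assumptions, not Split’s implementation:

```python
from statistics import NormalDist
from math import ceil

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Users needed in each group to detect an absolute lift `mde` over a
    `baseline` conversion rate with a two-sided test (normal approximation)."""
    z = NormalDist().inv_cdf
    p1, p2 = baseline, baseline + mde
    # Sum of the binomial variances of the two groups
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = variance * (z(1 - alpha / 2) + z(power)) ** 2 / mde ** 2
    return ceil(n)

# e.g. 10% baseline conversion, aiming to detect a 1-point absolute lift
print(sample_size_per_group(0.10, 0.01))  # roughly 14,750 users per group
```

Note how quickly the requirement grows as the target lift shrinks: the sample size scales with one over the square of the effect.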

You’ll either start by knowing the detectable difference you want to achieve and ask, “How many users will that take?” Or you might say, “If I run this experiment for two weeks and I’ll have x number of users, then what’s the minimum detectable effect that I can pick up?”
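The second question inverts the calculation: given the traffic you expect, what is the smallest effect you can reliably detect? A sketch under the same two-sided normal approximation, using the baseline rate to approximate both groups’ variance (all numbers illustrative):

```python
from statistics import NormalDist
from math import sqrt

def mde_for_sample(baseline, n_per_group, alpha=0.05, power=0.80):
    """Smallest absolute lift over a `baseline` conversion rate detectable
    with `n_per_group` users in each group (two-sided normal approximation)."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    # Approximate both groups' variance with the baseline rate
    return z_total * sqrt(2 * baseline * (1 - baseline) / n_per_group)

# e.g. two weeks of traffic gives 20,000 users per group at a 10% baseline
print(round(mde_for_sample(0.10, 20_000), 4))  # -> 0.0084 (about a 0.84-point lift)
```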

More users make it easier to detect subtle differences. If you expect a massive change, you may not need many users. However, if you’re optimizing something that’s been refined for a while, the remaining gains are likely small, and you may need a lot of users.

Duration is another thing to consider: not only when you will achieve power, but also when you will have collected a representative set of users. Traffic typically differs between weekdays and weekends, so you’ll usually want to run for at least a full week. A lot of people use two weeks, because that gets you through two sets of weekdays and two sets of weekends.
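As a rough sketch, you can turn a required sample size into a run time by dividing by daily traffic, accounting for the rollout split and a two-week floor. The function and all traffic numbers are illustrative assumptions:

```python
from math import ceil

def days_to_run(n_per_group, daily_users, rollout=0.5, min_days=14):
    """Days needed for the smaller group to reach `n_per_group` users.

    `rollout` is the fraction of traffic sent to the smaller variant;
    `min_days` enforces a floor (e.g. two weeks, to cover weekday and
    weekend traffic twice)."""
    smaller_share = min(rollout, 1 - rollout)
    days = ceil(n_per_group / (daily_users * smaller_share))
    return max(days, min_days)

# e.g. 15,000 users per group needed, 5,000 users/day, 50/50 split:
print(days_to_run(15_000, 5_000))  # -> 14 (6 days of traffic, raised to the 2-week floor)
```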

Default Significance Threshold

If the p-value is smaller than the predefined significance threshold, you reject the null hypothesis (that both treatments have the same impact on the metric). A commonly used significance threshold is 0.05 (5%), which means each experiment has a 5% chance of showing a statistically significant impact even when there is no real difference between treatments.
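A quick way to see what that 5% means in practice is to simulate A/A tests, where both variants truly convert at the same rate, and count how often the p-value still dips below 0.05. This sketch uses a pooled two-proportion z-test and illustrative traffic numbers:

```python
import random
from statistics import NormalDist

def ab_pvalue(conversions_a, n_a, conversions_b, n_b):
    """Two-sided p-value for a difference in conversion rates
    (pooled two-proportion z-test, normal approximation)."""
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = abs(conversions_a / n_a - conversions_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(0)
n, p, sims = 500, 0.10, 1_000  # illustrative: both variants convert at 10%
false_positives = sum(
    ab_pvalue(sum(random.random() < p for _ in range(n)), n,
              sum(random.random() < p for _ in range(n)), n) < 0.05
    for _ in range(sims)
)
print(false_positives / sims)  # hovers around 0.05
```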

There are statistical settings within the Split platform for this, and they can be modified. How and why would you modify them? We have two statistical settings. The first one is the confidence level, which is one minus alpha. This refers to how sure you need to be before we say something is statistically significant. The default is 95% confidence, because it’s the industry standard. If you increase that to 99% confidence, you’ll need to see a bigger difference before a result counts as significant; if you decrease it to 90% confidence, it’s easier to reach significance.

The reason you may want to change that usually depends on the cost of a false positive. If you’re about to invest loads of time and money into iterating on something, then you want to be really sure that you see an impact, and that it’s not just noise. So, you might want to increase that up to 99 percent confidence. If you’re just making a quick change, and you just want a best guess, and a mistake won’t be too costly, then you might want to move that down to, say, 90 percent or even 80 percent. This gives you more power to detect a difference with the same traffic.

The other statistical setting we have is the power threshold. Power is the probability of detecting a true effect; it equals one minus beta, where beta is the false negative (Type II error) rate. The power threshold is used when working out the sample size needed for a given minimum detectable effect, or how likely we are to detect that effect with the users we have.

Default Power Threshold

It is desirable to have a high statistical power so that the test has a high chance of identifying a real difference in conversion rates and yields fewer false negatives. A commonly used value for statistical power is 80%, which means the experiment has an 80% chance of detecting a difference if one exists. A higher power, all else being equal, requires more users to achieve a conclusive impact.
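To make the 80% figure concrete, this sketch computes the approximate power of a two-proportion test for a given per-group sample size (two-sided normal approximation; the function name and all numbers are illustrative):

```python
from statistics import NormalDist

def achieved_power(baseline, mde, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test to detect
    an absolute lift `mde` over a `baseline` conversion rate."""
    nd = NormalDist()
    p1, p2 = baseline, baseline + mde
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    return nd.cdf(mde / se - nd.inv_cdf(1 - alpha / 2))

# With ~14,750 users per group, a 10% -> 11% lift is detected ~80% of the time
print(round(achieved_power(0.10, 0.01, 14_750), 2))  # -> 0.8
# Doubling the traffic raises the power, i.e. fewer false negatives
print(round(achieved_power(0.10, 0.01, 29_500), 2))  # -> 0.98
```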

Limitations of a Power Analysis

There are some limitations to power analyses. One is that they don’t generalize well. The methodology matters, and if you change the way you collect data, or you change the procedure for analyzing the data, you will have to redo the power analysis.

In another case, a power analysis may suggest a number of people that is too small for the statistical procedure itself. For example, a power analysis might say you need only 40 people for your logistic regression, but procedures based on maximum likelihood generally require larger samples to be reliable. The most limiting aspect is that a power analysis only gives you a best estimate of the number of people needed to detect the effect, and in many cases that estimate rests on assumptions or educated guesses. If any of these are wrong, then you have less power to detect the effect.

And lastly, with power analysis, you often get a range of people needed, not an exact number. Let’s say you are not sure about the standard deviation of your measured outcome. You guess a value, run the power analysis, and get a required number of subjects. Then you guess a value that’s slightly higher, run the power analysis again, and get a slightly higher number of needed people. Repeating this over the range of plausible values for the standard deviation gives you a range for the number of people you need in your experiment.
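For a means metric, that sweep looks like the following sketch (two-sided normal approximation; the standard deviation guesses and target difference are illustrative assumptions):

```python
from statistics import NormalDist
from math import ceil

def n_for_means(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference `delta` in a mean
    whose outcome standard deviation is `sigma` (normal approximation)."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    return ceil(2 * (sigma * z_total / delta) ** 2)

# Unsure of the true standard deviation? Sweep a plausible range of guesses
# to get a range of required sample sizes rather than a single number.
for sigma in (8, 10, 12):
    print(sigma, n_for_means(sigma, delta=1.0))
```

Because the required sample size scales with the variance, even a modest error in the standard deviation guess shifts the answer substantially.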

Want to Dive Deeper?

We have a lot to explore that can help you understand feature flags. Learn more about benefits, use cases, and real-world applications that you can try.

Create Impact With Everything You Build

We’re excited to accompany you on your journey as you build faster, release safer, and launch impactful products.

Want to see how Split can measure impact and reduce release risk? 

Book a demo