False Discovery Rate Definition
In technical terms, the false discovery rate (FDR) is the expected proportion of all “discoveries” which are false.
When running a classical statistical test, any time a null hypothesis is rejected it can be considered a “discovery”. For example, any statistically significant metric is considered a discovery, since the measured difference is unlikely to be due to random noise alone, which suggests that the treatment is influencing the metric. On the other hand, metrics which did not reach significance are statistically inconclusive: they are not discoveries, as it wasn’t possible to reject the null hypothesis.
In the context of online experimentation and A/B testing, the false discovery rate is the proportion of statistically significant results which are false positives. Or to write it algebraically:
FDR = N_falsely_significant / N_significant
Here N_falsely_significant is the number of statistically significant metrics which were not truly impacted (false positives), and
N_significant is the total number of metrics which were deemed statistically significant.
For example, if you see 10 statistically significant metrics in your experiment and you happen to know that 1 of those 10 significant metrics was a false positive and wasn’t really impacted, that gives you a false discovery rate of 10% (1 out of 10). In this way the FDR depends only on the statistically significant metrics. It doesn’t matter whether there were 1 or 1000 other statistically inconclusive metrics in the example above; the FDR would still be 10%.
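The arithmetic in the example above can be sketched in a few lines of Python. The function name false_discovery_rate is illustrative only, and returning 0 when nothing is significant is an assumed convention rather than something stated in this article. In a real experiment the number of false positives is not known, so this is a way to check the definition, not a way to measure the FDR.

```python
def false_discovery_rate(n_falsely_significant: int, n_significant: int) -> float:
    """Proportion of statistically significant metrics that are false positives."""
    if n_significant == 0:
        # Assumed convention: with no discoveries there are no false discoveries.
        return 0.0
    return n_falsely_significant / n_significant


# 1 false positive among 10 significant metrics.
# Inconclusive metrics do not enter the calculation at all.
print(false_discovery_rate(1, 10))  # 0.1
```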
Other Measures of Accuracy
Another common measure of the accuracy of a test is the False Positive Rate (FPR). This is the probability that a null hypothesis will be rejected when it was in fact true. In other words, it is the chance that a given metric, which is not impacted at all by your experiment, will show a statistically significant impact.
The important distinction between the false positive rate and the false discovery rate is that the false positive rate applies to each metric individually (for example, each non-impacted metric may have a 5% chance of showing a false positive result), whereas the false discovery rate looks at all the hypotheses being tested together as one set.
The Family-Wise Error Rate (FWER) is another measure of accuracy for tests with multiple hypotheses. It can be thought of as a way of extending the false positive rate to situations where multiple things are being tested. The FWER is defined as the probability of seeing at least one false positive out of all the hypotheses you are testing. The FWER can increase dramatically as you begin to test more metrics. For example, if you test 100 independent metrics and each has a false positive rate of 5% (as would be the case if you use a typical 95% confidence or 0.05 p-value threshold), the chance that at least one of those metrics would be statistically significant is 1 - 0.95^100, which is over 99%, even if there was no true impact whatsoever on any of the metrics.
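The growth of the FWER with the number of metrics can be checked directly. The sketch below assumes that the metrics are independent and that none of them is truly impacted, in which case the FWER is 1 - (1 - alpha)^m for m metrics. The function name family_wise_error_rate is illustrative.

```python
def family_wise_error_rate(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among n_metrics independent,
    non-impacted metrics, each tested at significance threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_metrics


# Roughly 0.05 for 1 metric, 0.40 for 10 metrics, 0.994 for 100 metrics.
for m in (1, 10, 100):
    print(m, family_wise_error_rate(m))
```

With correlated metrics the exact value differs, but the FWER still rises as more metrics are added.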
When you are testing only one hypothesis (e.g. a test with only one metric), these three measures will all be equivalent to each other. They differ only when multiple hypotheses are being tested. In these situations the false discovery rate can be a very useful measure of accuracy, as it takes into account the number of hypotheses being tested yet is far less conservative than the FWER.
The false discovery rate is a popular way of measuring accuracy because it reflects how experimenters make decisions. It is (usually) only the significant results, the discoveries, which are acted upon. Hence it can be very valuable to know the confidence with which you can report on those discoveries. For example, a false discovery rate of 5% means that, on average, only 5% of statistically significant metrics were not truly impacted. If your false discovery rate is 5%, you can expect that, on average, 95% of the statistically significant metrics you see reflect a true underlying impact.
Controlling the False Discovery Rate
As well as simply measuring the accuracy of a test, there are ways to control the error rate and keep it within a desired level through the experimental design. The false positive rate can easily be controlled by adjusting the threshold that is used to determine statistical significance. Controlling the false discovery rate is more complex, as it depends upon the results, which cannot be known in advance. However, there are statistical techniques, such as the Benjamini-Hochberg correction, which can be applied to your results to ensure that the false discovery rate, on average, is no larger than your desired level.
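As an illustration, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. It sorts the p-values, finds the largest rank k whose p-value is at most (k / m) * q, and rejects the k hypotheses with the smallest p-values. The function name benjamini_hochberg, the parameter fdr_level and the example p-values are all made up for illustration and do not come from a real experiment or library. The guarantee also assumes the p-values are independent or positively dependent.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr_level: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether its hypothesis is rejected
    when targeting a false discovery rate of fdr_level."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k (1-based) whose p-value is <= (k / m) * fdr_level.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr_level:
            max_rank = rank

    # Reject every hypothesis ranked at or below max_rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for 10 metrics in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, fdr_level=0.05)
print(sum(significant))  # 2 metrics remain significant after correction
```

In this made-up example, five metrics fall below an uncorrected 0.05 threshold, but only the two smallest p-values survive the correction. In practice an established implementation, such as the one in statsmodels, is usually preferable to hand-rolled code.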