Holdout Tests: Your Strategy for Accurate Metrics

March 24, 2022

Experiment results put together are a lot more than my actual growth

Suppose you have been running experiments for a while now. You may have noticed something strange: put together, the impact of all the experiments (2% + 1.5% + 0.8% + etc.) adds up to a much larger improvement in conversion than what your analytics report over the same period. What gives?

This discrepancy is fairly normal and expected. To limit it, we recommend reporting the impact of an experiment as an interval rather than as a single number (the point estimate often labeled the average treatment effect, or ATE). With a month-long test, the exact value is not certain, and it might change over time as lift accumulates. For example, consider an email marketing campaign where you send emails with different messaging to improve the retention of your users. When you compare all the messaging experiments with the baseline, the different strategies might have additive, incremental, and interaction effects over time.

Because each measurement is uncertain, some experiments have a real positive impact but are unlucky: they appear non-significant. In that case, we recommend you re-run a similar idea with an improved implementation. It also means that a moderately good, but not great, idea can be lucky and show a significant result. Selecting for significant results catches both good ideas and lucky ones whose uplift was randomly higher.

If you remeasure the impact of both, many of the lucky ones won’t be as lucky and will often show less of an impact. Not every test result will come up short the second time around, but many (often most) will. This is known as regression toward the mean, or shrinkage. It is the reason we recommend being mindful when sharing experiment results. In particular, we recommend against naively adding up average effects.
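To make the effect concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that every experiment has the same modest true lift (1%) and the same measurement noise; the function name and all parameter values are hypothetical. Each experiment is measured twice, and only those that looked significant the first time are kept.

```python
import random
import statistics


def simulate_winners_curse(n_experiments=2000, true_lift=0.01,
                           std_error=0.01, z_crit=1.96, seed=42):
    """Measure each simulated experiment twice with independent noise.

    Only experiments whose first measurement is significantly positive
    are kept, which mimics selecting "winners" for roll-out.
    """
    rng = random.Random(seed)
    first_pass, second_pass = [], []
    for _ in range(n_experiments):
        first = rng.gauss(true_lift, std_error)
        if first / std_error > z_crit:  # looked significant the first time
            first_pass.append(first)
            # Re-measure the same idea with fresh, independent noise.
            second_pass.append(rng.gauss(true_lift, std_error))
    return statistics.mean(first_pass), statistics.mean(second_pass), len(first_pass)


first_mean, second_mean, n_winners = simulate_winners_curse()
print(f"{n_winners} winners: first measurement {first_mean:.4f}, "
      f"re-measurement {second_mean:.4f}")
```

Under these assumptions, the selected experiments look much better on the first measurement than on the second, even though nothing about the underlying ideas has changed.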

Measuring Twice: Once Per Experiment, Once Overall

Is it worth running the same experiment twice, then? Well, not really. Our test set-up tries to minimize the cost of running one experiment and maximize the likelihood that the variants selected are the most impactful ones. But you can run a second experiment in parallel to all the others.

At the start of each period (six months in this example), you can keep a small group of users, say 5% or 10%, out of all experiments as a holdout group, or control group. This is done with an all-encompassing asymmetric split. From January 1st to June 30th, they will see your website as it was at the end of the previous year. It may be frustrating for those users and may affect their experience, but the learnings will be valuable in the long run.

With the other 90% or 95% of the audience, the test group, you proceed as usual: run experiments, make decisions, roll out changes, and re-run experiments. On July 1st, you should have a 5% holdout control segment of users that has not seen any change in six months, and the remaining 95% that has experienced a product that gradually improved over six months.
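One way to picture the split is with deterministic hashing of user IDs. The sketch below is not how Split implements it; the function names, the salt string, and the 50/50 experiment split are assumptions made for illustration.

```python
import hashlib

HOLDOUT_PERCENT = 5  # share of users kept out of all experiments


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, 100) by hashing the salt and ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_in_holdout(user_id: str, salt: str = "holdout-h1") -> bool:
    """The same user always lands in the same group for a given salt."""
    return bucket(user_id, salt) < HOLDOUT_PERCENT


def experiment_variant(user_id: str, experiment_name: str) -> str:
    """Holdout users are excluded from every experiment; others are split 50/50."""
    if is_in_holdout(user_id):
        return "holdout"
    # Using the experiment name as salt keeps experiments independent of each other.
    return "treatment" if bucket(user_id, experiment_name) < 50 else "control"


users = [f"user-{i}" for i in range(20_000)]
holdout_share = sum(is_in_holdout(u) for u in users) / len(users)
print(f"Share of users in holdout: {holdout_share:.3f}")
```

Changing the salt at the start of the next period would reshuffle who is held out, so the same users are not left behind twice in a row.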

You can now run an experiment between those two groups to measure the overall impact of all the changes that you rolled out during the first half of the year. This is known as the victory lap. This test can last for several weeks, and hopefully, you will rapidly measure a significantly positive change. The idea is to compare that number with the combined results of the individual changes you have made so far, as a validation.
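As a rough sketch of that comparison, assuming the metric is a conversion rate, a standard two-proportion z-test works with unequal group sizes. The counts below are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift of B over A, z-score, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: A is the 5% holdout, B is the 95% with all released changes.
lift, z, p_value = two_proportion_z_test(
    conv_a=1_000, n_a=10_000, conv_b=20_900, n_b=190_000
)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}")
```

Because the split is asymmetric, the small holdout group dominates the standard error. That is one reason the victory lap may need several weeks of traffic.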

What Can I Do With Two Different Numbers?

So, around late July-early August, you should have multiple individual experiments with positive test results (we ignore the changes that were not rolled out) and one overall holdout test. The numbers likely won’t match, at least not perfectly. What can we do?

The holdout should give a rough second measure of the actual impact of all the tests put together. Once you have it, compute the ratio of that estimate over the total of the individual experiment results. That ratio estimates how much shrinkage your current set-up produces:

If it’s about half or a third, it can seem dramatic, but you are actually in a reasonable situation. Some results might be less impactful than estimated, but most are likely good ideas and should be in the right direction.
On the other hand, if the holdout shows a tenth of the expected impact or less, you might want to rethink your experiment power, false-positive rate, and overall methodology, because you might be letting through not-so-great ideas. Your tests might not have the impact that you think: the significant differences might have been false positives, and many changes might not even be going in the right direction.
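The ratio itself is simple arithmetic. In the sketch below it is taken as the holdout estimate divided by the sum of the individual results, so that "half" means the holdout found half of what the individual experiments claimed. The individual lifts reuse the 2%, 1.5%, and 0.8% from the introduction; the holdout lift is a hypothetical value.

```python
def shrinkage_ratio(individual_lifts, holdout_lift):
    """Holdout estimate divided by the naive sum of individual results."""
    total_claimed = sum(individual_lifts)
    if total_claimed == 0:
        raise ValueError("The individual lifts sum to zero; the ratio is undefined.")
    return holdout_lift / total_claimed


individual_lifts = [0.02, 0.015, 0.008]  # 2% + 1.5% + 0.8%
ratio = shrinkage_ratio(individual_lifts, holdout_lift=0.02)  # hypothetical holdout result
print(f"Shrinkage ratio: {ratio:.2f}")
```

With these numbers the ratio is a little under one half, which falls in the "reasonable" range described above.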

Can I Run Historical Comparisons?

You can also compare your historical metrics with the current ones, alongside the holdout test. However, keep in mind: if some metrics are significantly worse than six months ago, that can be either because of your recent changes or because of changing circumstances.

A common example of changing circumstances when launching a new product is gradually expanding your circle of interest: the first people who register for your product or service, test your video game, or subscribe to your magazine will be a core of very engaged users. As you grow your audience, you might reach less motivated users. Those will be more expensive to reach, harder to keep, and more opportunistic. If that’s the case, you might release many changes that are consistently A/B tested as improving all your metrics, at the same time as those metrics go down.

A holdout can help you tell the difference. If your metrics are not significantly worse for the majority of the traffic than for the holdout control, you can likely rule out harmful releases. If the holdout control’s metrics are themselves significantly worse than they were six months ago, you know that they were affected by external changes.

What About Using Many Different Metrics?

If you run experiments on several different metrics (and you should), then you want to compute the estimated impact of all the experiments on each of those metrics. How should you do that?

Include every measured impact. That means including the non-significant impacts: “non-significant” doesn’t mean that the impact is null, just that we cannot exclude the possibility that it is. Absence of evidence isn’t evidence of absence. That also means including the impact of experiments that had a different objective. If your change was meant to improve conversion on the homepage but also increased customer satisfaction, that increase is part of the impact of your test too.

Should I Add Up Numbers Straight Up?

As I suggested at the start of this post, the naive approach is to pick the single estimate coming out of each experiment, the average treatment effect (ATE), and add those up. If you do it like that, you will have surprising, likely overestimated results. Instead, I would recommend adding up the effects together with their uncertainty, and accounting for selection bias.

The easiest way to do that is by running Monte Carlo simulations based on the confidence intervals of the results. For each metric, for each experiment, pick a random value around the measured impact, drawn according to its confidence interval. For each metric, add up the impact of all released tests. If you do that enough times, maybe a thousand, you will have a distribution of the likely overall impact of all your rolled-out changes.
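The procedure above can be sketched as follows. This is a minimal version that assumes each result comes with a symmetric 95% confidence interval and that the measurement error is roughly normal; the metric names and numbers are invented for the example.

```python
import random
import statistics

Z_95 = 1.96  # half-width of a 95% interval, in standard errors


def simulate_overall_impact(released, n_draws=1000, seed=0):
    """released maps each metric to a list of (measured, ci_low, ci_high)."""
    rng = random.Random(seed)
    summary = {}
    for metric, results in released.items():
        totals = []
        for _ in range(n_draws):
            total = 0.0
            for measured, ci_low, ci_high in results:
                # Recover the standard error from the width of the interval.
                std_error = (ci_high - ci_low) / (2 * Z_95)
                total += rng.gauss(measured, std_error)
            totals.append(total)
        totals.sort()
        summary[metric] = {
            "median": statistics.median(totals),
            "low": totals[int(0.025 * n_draws)],
            "high": totals[int(0.975 * n_draws) - 1],
        }
    return summary


# Hypothetical released experiments: (measured lift, CI low, CI high).
released = {
    "conversion": [(0.02, 0.004, 0.036), (0.015, 0.001, 0.029), (0.008, -0.002, 0.018)],
    "satisfaction": [(0.5, -0.1, 1.1), (0.2, -0.3, 0.7)],
}
summary = simulate_overall_impact(released)
for metric, stats in summary.items():
    print(metric, {name: round(value, 4) for name, value in stats.items()})
```

This sketch treats the experiments as independent and their effects as additive. It captures the uncertainty of each measurement, but not yet the selection effect.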

You can also include a selection effect in your simulation. That will make the overall model harder to represent, but improve the realism.
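One possible way to include selection, sketched below, is to shrink each measured impact toward zero before drawing from it. This assumes a normal prior centered on zero for the true effects. The prior spread (`prior_sd`) is a hypothetical parameter that you would have to estimate from your own experiment history, and this is only one of several ways to model selection.

```python
import random
import statistics


def shrunk_posterior(measured, std_error, prior_sd):
    """Posterior mean and standard deviation under a zero-centered normal prior."""
    weight = prior_sd**2 / (prior_sd**2 + std_error**2)
    return weight * measured, (weight * std_error**2) ** 0.5


def simulate_with_selection(results, prior_sd, n_draws=1000, seed=0):
    """results is a list of (measured, std_error) for each released test."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_draws):
        total = 0.0
        for measured, std_error in results:
            mean, sd = shrunk_posterior(measured, std_error, prior_sd)
            total += rng.gauss(mean, sd)
        totals.append(total)
    return totals


# Hypothetical released tests: (measured lift, standard error).
results = [(0.02, 0.008), (0.015, 0.007), (0.008, 0.005)]
naive_total = sum(measured for measured, _ in results)
totals = simulate_with_selection(results, prior_sd=0.01)
print(f"naive sum: {naive_total:.4f}, "
      f"shrunk mean of simulated totals: {statistics.mean(totals):.4f}")
```

The noisier a measurement is relative to the prior, the more it is pulled toward zero, which is the behavior you would expect from results that were selected for being significant.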

Costs and Risks

The main issue with a holdout is that you have to carry technical debt until the end of the six-month test period, because the old experience must keep working for the holdout group. That could be a source of concern if you have meaningful changes that you want to introduce. You can always exclude certain changes from the holdout if they are deemed important for other reasons, like information security. Just expose the holdout control to the new experience.

Split uses category-based feature-flags to control your segments. That architecture allows you to redefine the target and apply those changes after starting the experiment.

Learn More About the Holdout Pattern

When you run multiple experiments, their overall effect is not just the sum of all the average estimated impacts. Changes might interact with each other: profit and conversion rate multiply, acquisition channels compete. The effect that we measure can be overestimated and shrink when we measure it again.

To understand how all the changes work together, and to control for shrinkage, we recommend that you exclude a minority of users (say 5%) from all changes for three or six months as a long-term control, known as a holdout. This is done with an all-encompassing split and A/B test. You can run all individual feature tests on the majority of users (say 95%) that are not held out. Roll out the successful tests to that 95% of users until the end of the holdout period.

At the end of the holdout period, you can compare the 5% holdout control with the 95% segment with all the released changes for a few weeks—the victory lap. This will re-measure the actual impact of all your changes. Remember to switch the holdout to the new experience when your experiment is over. After more than six months of not seeing any change, they should be relieved and excited to see an improved product!
