Shift Right with Feature Flags: Best Practices for Testing in Production

Continuous integration emphasizes test execution early and often, providing rapid feedback to development teams for each new code commit. The shift-left testing mantra is ‘test early, test often’.

Shift-right extends application testing into the production environment. While monitoring and alerting systems aim to detect application issues before customers experience a change in service, shift-right testing focuses on prevention by conducting functional and performance tests with real-life traffic, systems, and data.

Testing in production (TiP) sounds inherently risky, but the use of feature flags and controlled rollouts has made shift-right testing achievable and increasingly common as software development teams strive for rapid iteration and continuous delivery.

We typically see two approaches to TiP with feature flags, depending on the type of business. B2B companies tend to whitelist only the QA team, with no customer exposure initially. A B2C company with a large user base may go one step further and expose new features to a tiny (1%-5%) segment of customers.
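As a rough illustration of these two targeting styles, the sketch below combines an allowlist (the QA team) with a small percentage rollout. It is not Split's implementation: the function names, the SHA-256 bucketing scheme, and the example IDs are all assumptions made for the example.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag (hypothetical scheme)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, allowlist: set, rollout_percent: int) -> bool:
    """Return True if the user should see the feature."""
    # Allowlisted users (e.g. the QA team) always get the feature.
    if user_id in allowlist:
        return True
    # Everyone else is included only if their bucket falls under the rollout percentage.
    return bucket(flag_name, user_id) < rollout_percent


# B2B style: QA team only, no customer exposure.
qa_team = {"qa-alice", "qa-bob"}
print(is_enabled("new_checkout", "qa-alice", qa_team, 0))
print(is_enabled("new_checkout", "customer-42", qa_team, 0))

# B2C style: QA team plus roughly 5% of customers.
print(is_enabled("new_checkout", "customer-42", qa_team, 5))
```

Because the bucket is derived from the flag name and user ID rather than a random draw, a given user gets the same answer on every request, and raising the percentage only adds users.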

Benefits of using feature flags

Split customers that regularly use feature flags to test in production have described several key benefits. First, some things are difficult to test in a pre-production environment, such as network latency or the response time of a specific endpoint. Plus, IT organizations can avoid the expense of creating a duplicate A/B testing environment behind the firewall. In a controlled rollout, the shift-right mantra is ‘fail faster and at minimal cost’.

Staging environments are notoriously difficult to keep accurately configured, and test teams run the risk of executing tests on incorrect configurations. Security concerns related to sampling and copying of production data are also eliminated, and engineers are no longer tasked with the management and maintenance of the staging environment.

When using feature flags, test teams don’t have to wait for an environment to become available, removing an all-too-common wait state from the value stream. Expediting the handoff from dev to test efficiently moves binaries down the continuous delivery pipeline.

The feature branching strategy can also be improved with TiP. When engineers don’t have to wait for tests to pass in staging, they can merge their feature branch into the main branch more quickly by putting new features behind feature flags. This results in fewer code conflicts because the feature branch doesn’t fall far behind the main/trunk. Once new code has passed unit tests, engineers can immediately push the feature into production, which removes a major time blocker and lowers the bar for production readiness.

Best practices for testing in production

Split’s feature flags are used across a broad range of industries, establishing the foundation for continuous delivery. In our work with both B2B and B2C customers, we’ve outlined these best practices for using feature flags in support of TiP:

Activate feature flags in pre-production first

No, we are not advocating throwing your existing CI best practices out the door. Our recommendation is to always turn a feature flag on in pre-production environments instead of jumping directly into production. Dev-level testing such as unit testing and testing of classes for base functionality should remain in a pre-production environment. The Split SDK provides an Off-the-Grid mode to support local unit testing.

Put all features behind a feature flag

For many continuous delivery organizations, managing all their code with feature flags is now standard practice. Product changes are no longer “launched” or “released”. Feature flagging separates the concept of code deploy from feature release.

At the very least, every new API endpoint should be behind a feature flag, especially external APIs. If testing your web service with production traffic reveals that a new API change has slowed response times or taxed infrastructure, you can easily hide that endpoint until the problem is resolved. A fundamentally new piece of functionality should always be behind a feature flag.

Plus, having an on/off switch speeds rollbacks for resolving issues that occur during testing or after the phased rollout begins. Anything that causes a problem for customers or increases support calls can be turned off immediately. Delivery teams can be confident that they can quickly revert to previously successful application behavior. With feature flagging, MTTR = just one click.
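To make the on/off switch concrete, here is a minimal sketch of an endpoint hidden behind a flag. The in-memory `FlagStore`, the flag name, and the URL paths are hypothetical stand-ins for whatever flag service and routing layer you actually use.

```python
class FlagStore:
    """Toy in-memory flag store; a real system would query a flag service."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_on(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until enabled.
        return self._flags.get(name, False)


def handle_request(flags: FlagStore, path: str):
    """Return a (status, body) tuple for a request path."""
    if path == "/v2/recommendations":
        # The new endpoint is invisible unless its flag is on.
        if not flags.is_on("recommendations_v2_endpoint"):
            return 404, "Not Found"
        return 200, "recommendations from the new service"
    if path == "/v1/recommendations":
        return 200, "recommendations from the existing service"
    return 404, "Not Found"


flags = FlagStore()
flags.set("recommendations_v2_endpoint", True)    # expose the endpoint
print(handle_request(flags, "/v2/recommendations"))
flags.set("recommendations_v2_endpoint", False)   # "one click" rollback
print(handle_request(flags, "/v2/recommendations"))
```

Turning the flag off hides the endpoint without a redeploy, while the existing `/v1` path keeps serving traffic throughout.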

Individual feature testing

Avoid the urge to test every combination of the feature flags that are in flight. Simply test the treatments of a single feature in isolation unless two features occur on the same page. Testing the combinatorial explosion of multiple feature flags sounds good in theory, but in practice we find it is not needed. Instead, it is best to test each feature flag in isolation for all states of the feature (i.e. on or off).
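The difference in test volume is easy to see with a small sketch. The helper names and flag names below are made up for illustration, and the sketch assumes every flag has just two treatments, with "off" as the baseline for the flags not under test.

```python
from itertools import product


def isolated_cases(flags, treatments=("on", "off"), baseline="off"):
    """One configuration per (flag, treatment); all other flags stay at baseline."""
    cases = []
    for flag in flags:
        for treatment in treatments:
            config = {name: baseline for name in flags}
            config[flag] = treatment
            cases.append(config)
    return cases


def all_combinations(flags, treatments=("on", "off")):
    """Every possible combination of treatments across all flags."""
    return [dict(zip(flags, combo)) for combo in product(treatments, repeat=len(flags))]


flags_in_flight = [f"feature_{i}" for i in range(10)]
print(len(isolated_cases(flags_in_flight)))    # 20 runs: 2 per flag
print(len(all_combinations(flags_in_flight)))  # 1024 runs: 2 ** 10
```

With ten flags in flight, isolation needs 20 test runs where exhaustive combination needs 1,024, and each additional flag doubles the exhaustive count.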

If you do want to test the multiple states of a feature flag inside the unit test suite, Split provides a programmatic testing mode. Instead of passing multiple config files, we prefer a single file to test a variety of feature options for consistency. We use Java internally so have already done the heavy lifting, and you may want to replicate this for your own language as well. Details on this can be found here.

Phased feature release

Once TiP has passed test criteria, choosing the percentages and timing of the phases of your feature rollout will be a balance of quality, risk, and desired time to market. For some of our customers, a jump to 25% exposure, then 50%, then 100% is acceptable for higher risk functionality. For changes considered very low risk, they will ramp even faster and rapidly clean up the split. We like to say that TiP reduces the ‘blast radius’ of any mistakes.
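One way to make a ramp like 25%, 50%, 100% explicit is to encode the phases and a gate for advancing between them. The sketch below is only an illustration: the `next_exposure` helper, the error-rate metric, and the 1% threshold are assumptions, and real criteria will depend on your own quality and risk targets.

```python
PHASES = [25, 50, 100]  # exposure percentages from the example ramp above


def next_exposure(current: int, error_rate: float,
                  max_error_rate: float = 0.01, phases=PHASES) -> int:
    """Decide the next exposure percentage for a flag.

    Returns 0 (feature off) if the observed error rate exceeds the threshold,
    otherwise the next phase above the current exposure, or the current
    exposure if the ramp is already complete.
    """
    if error_rate > max_error_rate:
        return 0  # shrink the blast radius: turn the feature off
    for phase in phases:
        if phase > current:
            return phase
    return current


exposure = 0
for observed_error_rate in [0.001, 0.002, 0.001]:
    exposure = next_exposure(exposure, observed_error_rate)
    print(exposure)  # prints 25, then 50, then 100
```

A lower-risk change could use a shorter phase list, while a higher-risk one might add smaller early steps or a stricter threshold.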

Customer benefits of testing in production

When talking to our customers about a potential shift to TiP with feature flags, we often cite this example from one of our e-commerce customers, thredUP:

Before deploying Split, the launch of a massive new piece of their platform (e.g. a new shopping cart procedure or a change in the way product reviews are submitted) required setting up a war room where engineers and product managers monitored the release. With TiP and controlled rollouts, this is a thing of the past.

Now they first push new functionality into production for internal testing. Once confidence is established, they increase exposure to a small percentage of customers for a few weeks. This gives them time to ‘work out the kinks’ (or turn feature flags off if something goes awry). They no longer have the nail-biter experience of ‘did we configure this correctly for production?’ By giving themselves the time to test in production, they have changed the entire team’s mentality towards releasing major functionality.

Discover how you can securely release, target, and iterate with confidence with Split!