Feature Experimentation: Choosing the Best Route

By: David Martin

Imagine you’re a trip planner for the Tour du Mont Blanc. The Tour du Mont Blanc (TMB) is a famous network of trails that wind through the alpine mountains and valleys of France, Switzerland, and Italy. Unlike most long-distance trails in the United States, the TMB is dotted with ‘refuges’: hostels that provide dinner, breakfast, and a warm bed for the night. You can plan a short day on the trail by choosing a refuge close by or give yourself a workout by picking one far away.

The trail is rough in most places, but the payoff is spectacular: mind-blowing vistas of snowy mountains and pleasant pastures. My recent trip to the TMB was a workout, but also one of the best experiences of my life. My photo above is of a truly pink church in Trient, Switzerland.

Trip Planner

Imagine you have the job of planning TMB visits for hikers of various ages and abilities from around the world. If you do your job well, you’ll steadily get more business through word of mouth from contented hikers. You already know your approval ratings increase when you successfully match the route with hiker needs, so you carefully screen each client to be sure they are given an itinerary that suits their ability level.

How are you going to create new itineraries? You don’t have the time to build a custom itinerary for every client, not if you want your business to scale. How can you be intelligent — scientific even — about how you evaluate the new options to increase your approval ratings further?

The answer is feature experimentation.

Here’s how it works. You have a theory that if you give families a layover day, it will increase overall satisfaction with their trip. While you could start by just putting the layover day into all your family itineraries, you can also run an actual scientific experiment. Why don’t you offer half the families the layover day itinerary, and half the original (making them the control)? When they finish the hike, how are the approval ratings different?

This much is scientific in its approach, but not rigorous enough to be sound science. For example, if you get a 4.3 out of 5 rating from some families with the layover day and a 4.0 rating from some in the control group, you still can’t say whether the difference was caused by your change or by chance. For this reason, you must take the full series of ratings from all the families and perform something like a t-test, which accounts for the size of each sample (too few families means noisy averages), the gap between the average ratings (too small a difference is hard to distinguish from noise), the spread of ratings within each group, and more. This type of test estimates how likely it is that a difference this large would appear by chance alone if the layover day had no effect. When that probability is low, you have good evidence that the families offered the layover day really did rate their trips differently from the ones that were not. That’s sound science.

Armed with a statistically significant t-test result, you can now say with a high level of confidence that the layover day was a success.
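
To make the layover-day comparison concrete, here is a minimal sketch of a two-sided Welch's t-test in plain Python. Welch's variant is my choice for illustration because it does not assume both groups have the same spread; the article does not say which variant to use. The ratings are invented so that the group averages come out to 4.3 and 4.0, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value for a t statistic with df degrees of freedom.

    Integrates the Student t density over [0, |t|] with Simpson's rule.
    """
    if t == 0:
        return 1.0
    # Log of the density's normalizing constant; lgamma avoids overflow.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


def welch_t_test(a, b):
    """Return (t, df, p) comparing the means of samples a and b."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df, two_sided_p(t, df)


# Invented ratings (out of 5) from ten families per group.
layover = [4.5, 4.0, 4.5, 5.0, 4.0, 4.5, 4.0, 4.5, 4.0, 4.0]  # mean 4.3
control = [4.0, 3.5, 4.0, 4.5, 3.5, 4.0, 4.0, 4.5, 4.0, 4.0]  # mean 4.0

t, df, p = welch_t_test(layover, control)
print(f"t = {t:.3f}, df = {df:.1f}, p = {p:.3f}")
```

The number to watch is the p-value rather than the raw 0.3-point gap: with only ten families per group, the same gap can easily land on either side of the conventional 0.05 threshold, which is exactly why the full series of ratings matters.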

What if you have more ideas?

  • Will groups desiring a strenuous itinerary respond the best to 20, 30, or 40 km days? That’s a multi-variant (A/B/n) experiment, a juicy subcategory of feature experimentation.
  • Will a stay at a different refuge produce a statistically significant change in satisfaction?
  • Should groups take a bus to bypass boring sections of trail, allowing them to hike more of a scenic area?

With the right feature experimentation approach, you can run tests for all of these questions. One consequence of this addictive approach to decision-making is its catalytic impact on innovation. If you’re much less afraid of making a wrong decision and have faith in your results, you are empowered to think fearlessly about the future.

Leaving the Tour du Mont Blanc now, consider mobile, web, and really any software application. Determining the correct feature set is a Borgesian garden of forking paths. Unless you’re Steve Jobs, every authority advises against acting on instinct. How do you make better decisions when you know very well that you don’t know what your customer thinks, even with surveys and visits?

Answers have been emerging over the last decade. For example, you may say, “I already have feature flagging!” That’s fantastic!

But in our Tour du Mont Blanc example, using feature flags is analogous to simply giving different groups different itineraries. On its own, it doesn’t imply any analytical approach to observing how the hiking groups behave, so it does little to improve your decision-making.

In essence, you’re like the train controller who can open and close sections of track. After that, you’re probably just counting trains; the basics are in place, but the rich promise of the approach is unrealized.

“Still!” you say, “I could do the math myself.” True, implementing a t-test isn’t hard. But were you really random about how you selected hiking groups? Did you distribute hiking groups evenly between the experiment and the control? If so, are you sure the assignment was sticky (the same hiking group didn’t get a different feature experience at each fork)? Did you allow for segmenting the hiking groups (e.g., a beta group)? Did you ensure that a refuge that could only appear on one route was only available for a stay when an earlier feature flag opened the trail to it? Is it lightning fast to check if you’re in the experiment or in the control? And does your solution degrade cleanly, guaranteeing the application will never break even if the experiment conductor is temporarily unavailable?
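
To give a feel for what those questions involve, here is a rough sketch of one common way to handle them: hashing the group ID so that assignment is evenly spread, sticky, and computed locally without a network call. This is an illustration under my own assumptions, not a description of how Split works. The function names, the configuration layout, and the experiment names (`bus_bypass`, `scenic_refuge`, `layover_day`) are all hypothetical.

```python
import hashlib


def bucket(experiment, group_id):
    """Map a hiking group to a stable bucket in [0, 100).

    Hashing experiment + group ID gives the same answer on every call
    (sticky) and spreads groups roughly evenly across buckets.
    """
    digest = hashlib.sha256(f"{experiment}:{group_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def get_treatment(experiment, group_id, config, default="control"):
    """Return a group's treatment, falling back to `default` on any problem."""
    try:
        rule = config[experiment]
        # Dependency: skip this experiment unless the prerequisite flag is on.
        prereq = rule.get("requires")
        if prereq and get_treatment(prereq, group_id, config, default="off") != "on":
            return default
        # Segment: hand-picked groups (e.g. a beta group) get a fixed treatment.
        allowlist = rule.get("allowlist", {})
        if group_id in allowlist:
            return allowlist[group_id]
        # Percentage split: walk the cumulative ranges until the bucket fits.
        b = bucket(experiment, group_id)
        upper = 0
        for treatment, percent in rule["split"]:
            upper += percent
            if b < upper:
                return treatment
        return default
    except Exception:
        # Degrade cleanly: missing or malformed config never breaks the caller.
        return default


# Hypothetical configuration for the TMB example.
CONFIG = {
    "bus_bypass": {"split": [("on", 100)]},
    "scenic_refuge": {
        "requires": "bus_bypass",  # refuge is reachable only via the bypass
        "allowlist": {"beta-hikers": "on"},
        "split": [("on", 50), ("control", 50)],
    },
    "layover_day": {"split": [("layover", 50), ("control", 50)]},
}

print(get_treatment("layover_day", "family-42", CONFIG))
print(get_treatment("scenic_refuge", "beta-hikers", CONFIG))
print(get_treatment("no_such_experiment", "family-42", CONFIG))
```

Even this toy version leaves out a lot: recording which group saw which treatment, updating the configuration safely while hikers are on the trail, and tying the assignments back to the ratings for analysis.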

Take a look at the demo below of the TMB experiment in Split to see how easy it is to implement feature experimentation.

If you’re making decisions and doing rollouts of complex software — frontend, backend, and everything in between — then you can benefit from using Split feature experimentation. Split covers you through feature flagging and all the way out to the rich mathematics that can and should be informing your next big decision.