How to A/B Test Every Feature Release

With continuous delivery practices often comes an acceleration of the feature release cadence. Feature flags provide additional control, separating feature release from code deployment. However, with all these additional releases, how are you measuring the impact of this accelerated engineering work?

In this webinar, hear from Trevor Stuart, Head of Product at Split Software, about why you need to think about experimentation if you are moving toward continuous delivery. Kevin Li, Solutions Architect at Split Software, will provide a detailed technical overview and demonstration of the Split Feature Experimentation Platform.

Watch this webinar to learn:

  • Why it is important to have comprehensive measurement in your CI/CD process
  • How other solutions fall short in providing insights on every feature release
  • Technical information on how Split works with a live demo

Video Transcript

Hello everybody, and welcome to today’s presentation, How to A/B Test Every Feature Release, brought to you by Split Software. Before we get started, I’d like to take care of a couple of housekeeping items. First off, I’d like to make sure you can all hear me OK, so if you would, send me a quick message in the chat window to let me know you can hear me. Great. If you have any questions for our presenters today, please post those into the questions panel so that we can get to them after the presentation. If you experience any technical difficulties, like the audio dropping out or the screen sharing not displaying properly, please post a message in the chat panel to me, the host, and I will work with you to resolve those issues. Lastly on the housekeeping list, we are recording today’s presentation and we’ll be making it available shortly after today’s event concludes. Now that we’re through the housekeeping items, I’d like to introduce you to our presenters. Jason Miyasato is the Senior Director of Demand and Growth Marketing at Split Software, Trevor Stuart is the Head of Product at Split Software, and Kevin Li is a Solutions Architect at Split Software. So Jason, Trevor, and Kevin, if you all could say hello to the audience, and audience, if you could let me know if you can hear our presenters by sending us a message in the chat window. Hello audience, how’s it going? This is Trevor here. And this is Kevin. OK, great. Well, without further ado, I’m going to turn things over to Jason to begin the presentation. Jason? Thanks, Ryan. In

[00:01:48] this webinar we’re going to highlight the evolution of software development and why it’s important to have comprehensive measurement in your CI/CD process. We’ll also talk about how many solutions fall short in providing insights on every feature release. And lastly we have the technical information on how Split works, with a live demo by Kevin. With that I’ll turn it over to Trevor Stuart. How’s it going? I’m one of the cofounders here at Split, focused on leading our product development efforts. As Jason just mentioned, we’re going to talk through a few different things. One is the evolution of the industry. The other is understanding the tools we’re using today and how they impact experimentation. And then we’ll look at two case studies: one that is more popular and more notorious, and one you’ve never heard of, a case that we recently ran on our own user base. So first we’re going to quickly talk about some of the industry trends that we’re all cognizant of. The first one is that development velocity is ever increasing. In the 1990s Microsoft was shipping software every few years. In the 2000s we saw Salesforce come to market and start shipping software twice a year. Today, peak performers like Netflix and LinkedIn are delivering hundreds if not thousands of times a day. With that, we’ve seen a change in the way product teams operate. Those of us who have been in the industry longer are familiar with the concept of waterfall releases, but today those have given way to MVP-based development.

[00:03:17] We used to spend hours writing product requirements docs and sending them over the wall. Today the feedback cycle is based on rapid iteration, with a lot of data coming in to show how customers are using our product, such that product development has become a continuous set of joint experiments between product and engineering teams. In this new process, measurement is paramount. Microsoft today runs thousands of experiments per year, and at Microsoft it is not uncommon to see an experiment that can impact annual revenue by millions of dollars. Sometimes that impact is actually measured in the tens of millions of dollars. But this measurement extends far beyond the first order. It’s not just about revenue; it’s the first, second, and third order metrics that also have a big impact. If you take a step back and look at Microsoft, they run a scenario every year where they do a fail day: they go through and slow down their servers and machines to understand the impact that can have on their customer base, and therefore on the customer experience and revenue. In one of those experiments they were able to work out that a Bing engineer who improves server performance by 10 milliseconds, that’s 1/30th of the speed at which our eyes blink, can more than pay for their fully loaded annual cost. This really helps to tie back how our engineering and product efforts, even something as small as performance, can have such a dramatic impact on a business’s top line. Now, today everything can be measured.

[00:04:51] So for those of us on this call who have used many different tools: you may be using tools like New Relic, you may be using tools like Pendo for some of your customer analytics, or maybe you’re using Amplitude for product analytics. Maybe you’re putting billing information into Zuora or using Zendesk for customer support. So today everything can be measured. There are a lot of tools on the market that allow us to get that information and that product telemetry. But understanding the causal impact requires tools that focus on experimentation. This is our dashboard, which Kevin will walk through later. But I want to quickly talk through a few of these tools and how, combined with experimentation, they can help you understand the impact when you segment and roll out features across your customer base. If you have a tool like Zuora, you can start to understand how trial-to-paid conversion rates fluctuate for users that are exposed to a particular treatment versus another variant, or another treatment if you will. With Pendo you can run surveys, you can look at NPS scores, or you can just use their feature tracking to really understand how customers engaged with the piece of functionality that you released to maybe 20, 30, 40, or 50 percent of your customers.

[00:06:02] If you’re using New Relic, you can use their Insights tool to understand some of those real user monitoring metrics and browser monitoring metrics around page load time, and really understand what Microsoft looked to understand, which is: if you slow down the website, how does that impact the higher order metrics? You can start to see that initial page load time can have a significant impact, and you can measure that across customers who are exposed to different pieces of functionality. You can use Amplitude, which is similar to Pendo but more focused on product analytics, those behavioral analytics, to understand how users are creating or taking actions within your product. And the last one here is Zendesk. You can tie back those support tickets, which is something we like to do here at Split. When we roll out a piece of functionality to a subset of our customers, we like to understand whether they are filing support tickets, whether we’ve introduced a layer of confusion, whether there are bugs, whether there’s anything that could be a drawback on that customer experience. So I want to take a second and just dive into two quick examples. The first obviously is a little more popular and more famous: this was Twitter’s move to 280 characters. Here’s a quick quote from the announcement that they put out on their blog. They said: in September we launched a test that expanded the 140 character limit so that every person around the world could express themselves easily in a tweet. Their goal was to make this possible while also ensuring that they keep the speed and brevity that makes Twitter Twitter. So the primary goals set out here were: how do we make it easy to tweet, how do we make that possible, and how do we stay focused on speed and brevity?

[00:07:40] So they looked at a number of different metrics, but the first one here was focused on how to make tweeting easier. They found that only 1 percent of tweeters actually hit the limit once it was increased, whereas before it was normally 9 percent. So 9 percent of people with 140 characters were hitting the limit, whereas when they increased it only 1 percent of tweeters were. They felt like they were making it easier for people to tweet and really removing some of those barriers. But what they had hoped they wouldn’t see was people just writing really long messages. They found that only 5 percent actually used the enhanced functionality, meaning they tweeted more than 140 characters, and only 2 percent tweeted more than 190 characters. So while they made tweeting easier, they still managed to keep the brevity. And then the last one: they continued to drive engagement. What they found was that when people had more room to tweet, it actually led to more likes, more retweets, and more mentions. After letting that experiment run for about a month or so, they then rolled that out to 100 percent of their user base. This was something that was all over the press, and you can read about it in their blog. It was a big decision. It was at the core of Twitter, at the core of their founding principles. Now if you look at one that we currently have running here at Split, it’s a very small visual change, but it drove dramatically different outcomes across our customer base.

[00:09:02] So in this example we have two treatments. We have a small treatment on the left and a large treatment on the right. And just looking in here, it’s very, very simple: we’re just changing the star. We’re taking a small star on the left and we’re making it a bigger star on the right. What we wanted to see when we released this functionality to our customer base was discoverability as well as engagement. These are two different starring paradigms from two different software development tools that I’m sure many of you are using, one with a small star and one with a large star, and we really wanted to understand how customers engage with it. What we found was that the small star, as you could imagine, was actually harder to discover. Fewer users discovered or clicked the star when it was smaller. The percent of users using stars was 13 percent lower with the small star, and the number of users who had starred was about 18 percent lower. But what we found was that they actually engaged with it a lot more. Those users that had found the small star drove up their engagement with that star and had a lot more stars: 54 percent more stars, they clicked the star 10 percent more times, they had on average 65 percent more stars per user, and the number of stars per user with stars increased over 89 percent. So this really helped us understand the discoverability and the engagement around something super simple, a basic visualization. And we were able to collect that data.

[00:10:26] If we go back a few slides, we were able to take the data from the tools that we use today. We were able to take the data from a tool like Pendo, which we use here, and mesh that data into our experimentation analytics product to segment our customers across the treatments. Now in the background we also have additional metrics being calculated that were less relevant for that experiment, but we kept those metrics running, such as page load, time on site, or number of actions per user. We have all of those metrics running so that we could understand the impact and make sure that we weren’t moving them in a negative way. So all of those metrics continue to run behind Split, while the metrics that we cared about for this particular release were analyzed and helped inform our decision to roll that out to 100 percent of our customers. With that, I’m going to turn the floor over to Kevin, who’s going to give you a quick demo and go a little bit more into the Split console and how you would use Split for experimentation.

[00:11:24] Perfect. Thanks, Trevor. So that’s my face. As mentioned earlier, I’m a Solutions Architect here, so I work with a ton of our customers from a customer success perspective, walking them through best practices on setting up our platform. And as was mentioned earlier, I’ll be walking through a demonstration of our product. Now, before we dive in, I find it’s generally helpful to set the stage really quickly and talk a little bit about how our platform works at a high level and how it works with your product.

[00:11:50] So within this diagram you can see you’ve got the Split platform on the right side there, and within that cloud you can see the management console called out. Basically, our platform works with a web UI that allows any member of your team to sign in and set up what we call rollout plans. So for a given feature, I want to be able to separate the concept of code deploy from my feature release: turn that feature on for none of my customers, turn it on for only 10 percent of my customers, maybe have it set so that only my customers in New York are seeing the on version, and maybe I’m running an A/B test on my customers in California where half of them are seeing on and half of them are seeing off. All that granularity, all that targeting, is set directly in our web UI. Now, the way that actually works to control your customer experience is that as you make those changes directly within our platform, all those rollout plans, all that information, is automatically saved and downloaded to our SDKs, which sit at the application layer of your code base. They are simple packages that you install into your projects, and what they do for you when you start using those SDK clients in your project is download all those rollout plans and store them in memory.

[00:13:05] What that allows for, then, is full in-memory execution at runtime. What you’re able to do is wrap a feature in a simple if/else statement, which is the very simple concept of a feature flag, and say: I have an on version and an off version, and I’m going to simply ask the Split SDK at runtime to decide for me which version I should serve up. It will be able to do that in memory on your side because it has an in-memory store of your rollout plan, and it will be able to dynamically decide for you, for a given user, what version of that feature they should see. In this manner you’re able to separate that concept of code deploy from your feature release and have only one production branch of code that is dynamically set to serve different versions of your features to different subsets of your customer base. Now, what we’re doing on the flip side of that, as we serve up those versions of your features, is capturing that information for you about who’s seeing the on and who’s seeing the off version of your feature. We call those impressions. Those data points simply get passed back up into our platform, so it’s very easy to understand within our web UI how many people have seen a given version of my feature, or how many times I have served up the on version of that feature. We’re also able to ingest any generic event data that you might care about.
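
The flag-wrapping pattern Kevin describes can be sketched roughly as follows. This is a minimal illustration, not the Split SDK: `InMemoryFlagClient`, its `plans` layout, and `render_onboarding` are hypothetical names invented for the example, and a real SDK would download and refresh the rollout plans itself.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryFlagClient:
    """Hypothetical stand-in for an SDK client that keeps rollout plans in memory."""

    # feature name -> {user key -> treatment}; assumed to be downloaded ahead of time
    plans: dict[str, dict[str, str]]
    # (user key, feature, treatment) records, i.e. the "impressions"
    impressions: list[tuple[str, str, str]] = field(default_factory=list)

    def get_treatment(self, key: str, feature: str) -> str:
        # Pure in-memory lookup: no network call when the decision is made.
        treatment = self.plans.get(feature, {}).get(key, "off")
        # Record who was served which version so it can be reported later.
        self.impressions.append((key, feature, treatment))
        return treatment


def render_onboarding(client: InMemoryFlagClient, user_key: str) -> str:
    # The feature flag itself: one deployed code base, two code paths.
    if client.get_treatment(user_key, "new_onboarding_flow") == "on":
        return "new onboarding flow"
    else:
        return "old onboarding flow"


client = InMemoryFlagClient(plans={"new_onboarding_flow": {"user-1": "on"}})
print(render_onboarding(client, "user-1"))  # user-1 is in the plan as "on"
print(render_onboarding(client, "user-2"))  # unknown users fall back to "off"
print(client.impressions)
```

Because both versions ship in the same deployment, changing who sees which one only means changing the plan data, not redeploying code.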

[00:14:21] As Trevor was talking about, the clicks, the engagement, and production KPIs as well, in terms of the number of 500 errors or the load time of my application, can all be simply pumped into our platform. Within our back end we automatically join that generic event data directly with that impression data, so that we can turn every single feature that you pump into the platform into an experiment, where we’re automatically going to tell you if a specific feature is impacting any of the given event streams that have come in. So for example, if you’re tracking clicks on your home page, you can understand within our platform whether the on population for a given feature is seeing more clicks than the off population, and whether that difference is statistically significant. Essentially, is there a causal impact that can be attributed to the specific feature that you’ve rolled out? Now that we’ve gone through the high level architecture, what I’ll do here, guys, is dive into the platform itself. What I’m bringing up is the web UI, and what we’re currently looking at is the page for a given split within the platform. A split is that object that represents a code change that you’re managing, the rollout plan associated with it, and then how that code change is impacting metrics that you care about. The split that we’re looking at is one called new onboarding flow. In this example we’re rolling out the new onboarding flow for all of our users. It’s a fairly simple use case and one that we’re likely all familiar with. Now, for a given feature that you’re rolling out, or a given split within the platform, there are two main components of that split: definition and results. The definition is simply the definition of that rollout plan, and the results are how that feature impacts your metrics. So starting here on the definition, let’s take a quick look at how we enable all that targeting for you and your customer base.
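
The join between impressions and events described above can be sketched as follows. The data and the function name `mean_events_per_user` are made up for illustration, and this only computes the raw difference between populations, not whether it is statistically significant.

```python
from collections import defaultdict

# Illustrative data only: which treatment each user was served, and what they did.
impressions = [("u1", "on"), ("u2", "on"), ("u3", "off"), ("u4", "off")]
events = [("u1", "click"), ("u1", "click"), ("u2", "click"), ("u3", "click")]


def mean_events_per_user(impressions, events, event_type):
    """Join events to impressions by user key and average the counts per treatment."""
    treatment_of = dict(impressions)
    counts = defaultdict(int)
    for key, etype in events:
        # Only count events from users we know were exposed to the feature.
        if etype == event_type and key in treatment_of:
            counts[key] += 1
    per_treatment = defaultdict(list)
    for key, treatment in treatment_of.items():
        # Users with no events still count, with a value of zero.
        per_treatment[treatment].append(counts[key])
    return {t: sum(v) / len(v) for t, v in per_treatment.items()}


means = mean_events_per_user(impressions, events, "click")
lift = (means["on"] - means["off"]) / means["off"] * 100
print(means)            # {'on': 1.5, 'off': 0.5}
print(f"{lift:+.0f}%")  # +200%
```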

[00:16:26] And so as we take a look here, the first section you’re going to see at the top is a treatment section. That treatment section is simply where you’re able to define the different versions of your feature that the customer might see. In this case, we’ve defined that feature to have an on and an off version, but you can add as many treatments as you want. So if you have different versions or variants of your feature, you can define all of those here. Now, below that you have a number of ways in which to control who can see a given version of your feature. You can whitelist individual customers simply by putting their ID into a free form text editor here, and that will automatically tell our SDKs to put those customers into the on version. Or we could say I have a high risk customer, and we’ll then put them into the off version as well. Down below you also have the ability, via our targeting rules, to get much more dynamic with that targeting. You can build out if/else statements using really any matching criteria that’s available to your code at runtime to slice and dice your customer base into different treatment plans. As an example here, we’re targeting by location, and we’re saying only for our customers in New York I want to do an A/B test where half of them will see on and half of them will see off. Our platform takes care of that randomization for you. Now, beyond just targeting by one attribute, I can click this plus icon here and add any other dimension I might want to target by as well.
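
One common way to implement this kind of rollout plan is to check the whitelist first, then the targeting rules, and to randomize by hashing the user key into a stable bucket. The sketch below assumes that approach; the plan layout, `bucket`, and `evaluate` are hypothetical and are not Split’s actual data model or algorithm.

```python
import hashlib


def bucket(key: str, seed: str) -> int:
    """Map a user key to a stable bucket in 0..99 so the same user always gets the same answer."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
    return int(digest, 16) % 100


def evaluate(key: str, attributes: dict, plan: dict) -> str:
    # 1. Individually whitelisted keys win over everything else.
    for treatment, keys in plan["whitelist"].items():
        if key in keys:
            return treatment
    # 2. Otherwise use the first targeting rule whose attribute matches.
    for rule in plan["rules"]:
        if attributes.get(rule["attribute"]) == rule["equals"]:
            position = bucket(key, plan["seed"])
            cumulative = 0
            for treatment, percent in rule["allocation"]:
                cumulative += percent
                if position < cumulative:
                    return treatment
    # 3. Nobody targeted this user: serve the default treatment.
    return plan["default"]


plan = {
    "seed": "new_onboarding_flow",
    "whitelist": {"on": {"vip-1"}, "off": {"risky-1"}},
    "rules": [
        {"attribute": "location", "equals": "New York",
         "allocation": [("on", 50), ("off", 50)]},
    ],
    "default": "off",
}

print(evaluate("vip-1", {"location": "California"}, plan))    # whitelisted into "on"
print(evaluate("user-42", {"location": "New York"}, plan))    # "on" or "off", stable per user
print(evaluate("user-42", {"location": "California"}, plan))  # no rule matches: default
```

Note that the attributes are passed in by the calling code at evaluation time, which is the same idea as the key-value matching Kevin describes in the Q&A.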

[00:17:56] So I could say, hey, maybe I want to look at last login as well, and I can now use a date-time matcher and say I want to know if the last login was on or after, say, yesterday, and save. Now I’m slicing and dicing down to a further subset of the customer base. The last piece to call out here is that I can of course target by any other attributes that I might want, using an else if here to target those different subsets of my customer base with different treatment plans. I’ll hop out of this at this point and take a look at how, once you’ve set up a feature to be rolled out, we’re helping you to measure those different components of that feature. Clicking over to the results tab of a given split, the first thing you are going to see is a visualized view of those impressions we were talking about earlier. These impressions are, again, that information about who’s seeing the on and who’s seeing the off version of your feature. At the very top here you’re going to see a graph where that information is all visualized, and then below you can see the aggregated information around the number of times a given version has been served up, as well as the unique number of customers that have seen a given version. And then finally, at the very bottom, you have that full raw table available that will show you all the times a given customer has seen a given version of your feature.

[00:19:27] This is the basis, then, for measurement, right? We now have that bucketing information available to know who’s seen what version of a feature, and we can use this information to really start measuring the effect of a given feature on the metrics that you care about. Now, in terms of the measurement component, the way that works is that our platform also has the ability to ingest any events that are occurring within your application directly into our platform. What I can click to now is a view of all event types that have been pumped into this specific instance of Split. Here we can see that full listing. We have a very simple REST API, as well as a track method built directly into our SDK, where you can pump these events directly into our platform, and once they show up you can view all of them here. So for example, let’s take a look at one, maybe this estore checkout. Suppose we’re rolling out a feature and we also want to measure whether there are any specific events coming through for people checking out. We can see here that there was already an event pumped in. I just did that earlier this morning. You can see I pumped it in at 9:58, and there’s just a sample ID, and then that event occurred. What I can do within our platform now is very simply define what we call a new metric, and what that’s telling Split is how our platform should aggregate and measure this specific event stream for you. In this case I might care about, for example, the number of checkouts, so I can call it count of checkouts per user. I can then describe it.

[00:21:02] I can then tag it, and within the definition here I can simply tell the platform how we want to do that aggregation. Do I want it to increase or decrease? I want more checkouts, so increase. Is it measured across or per user? I’ll select per. And then obviously I want to select count for this one. We can see we also have the ability to compute the ratio between two different event streams, the sum of values, average values, and the percent of unique users that triggered a given event. In this case I’ll select count, and then also select that estore checkout event that I pumped in, and hit Create. As soon as I’ve done that, this metric has been created and will start to be measured automatically by our platform. And the beauty of it is that I only need to do this once. When we go back to our new onboarding flow split, we’re going to see that this metric has automatically been added, because the way we approach it is that every single event stream is going to be measured across every single split that you set up. So now, coming back over here to that new onboarding flow split, what I can do is click over to metrics, and we’re going to see the result of our experiment in terms of how this specific feature is impacting all of the metrics that we care about. What you’re going to see now is a full metrics dashboard in terms of a comparison, where we’re showing you how the on population has compared against that baseline of the off population.
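
To make the aggregation choices in that metric definition concrete, here is a small sketch of count, sum, and average per user, plus the percent of unique users who triggered an event. The function names, the event tuples, and the checkout values are all invented for the example; they are not Split’s metric definitions.

```python
from statistics import mean


def per_user_metric(events, aggregation):
    """Aggregate (user key, value) events for one event type, per user."""
    by_user = {}
    for key, value in events:
        by_user.setdefault(key, []).append(value)
    if aggregation == "count":
        return {k: len(v) for k, v in by_user.items()}
    if aggregation == "sum":
        return {k: sum(v) for k, v in by_user.items()}
    if aggregation == "average":
        return {k: mean(v) for k, v in by_user.items()}
    raise ValueError(f"unknown aggregation: {aggregation}")


def percent_unique_users(events, exposed_users):
    """Share of exposed users who triggered the event at least once."""
    triggered = {key for key, _ in events} & set(exposed_users)
    return len(triggered) / len(exposed_users) * 100


# Hypothetical checkout events: (user key, order value)
checkouts = [("u1", 20.0), ("u1", 30.0), ("u2", 10.0)]

print(per_user_metric(checkouts, "count"))    # {'u1': 2, 'u2': 1}
print(per_user_metric(checkouts, "sum"))      # {'u1': 50.0, 'u2': 10.0}
print(per_user_metric(checkouts, "average"))  # {'u1': 25.0, 'u2': 10.0}
print(percent_unique_users(checkouts, ["u1", "u2", "u3", "u4"]))  # 50.0
```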

[00:22:30] And within each of these metric cards you’re just seeing that information where we joined that impression stream up against the actual event stream for the thing happening within your application, such as tasks being created. What we’re doing is, we know who in our on population created tasks and who in our off population created tasks, and we’re simply showing you the percent difference between those two populations. Now, if we scroll to the bottom here, we can see that count of checkouts per user has actually already been added as a new metric, as I was talking through. Obviously there’s no data available, just because I pumped it in with that dummy ID, so there’s nothing to map it to. But this does demonstrate that every metric you create gets measured across every single feature. Now, in terms of looking at the other metrics that actually have some data in them, we can see that there’s some color coding. Rather than just doing the pure comparison of those populations, what we’re doing on top of that is also a statistical computation to determine if the difference is truly statistically significant. Essentially, can you determine if this given feature is causing a change in that metric? Anything highlighted here in green or red is where the test we’re running for you is coming back statistically significant, green meaning that there is a positive change for your business.

[00:23:45] Red means that there was a negative change for your business. In both cases these results were statistically significant, so we can determine that this feature was actually causing the specific lift or dip in the metrics that we care about. Anything highlighted in black here is where our platform is still showing you the percentage change in that given metric, but perhaps there was a ton of noise in the data. You can see here the error margin is plus or minus 70.5 percent. So even though on average there might have been a change in that given metric, if the error margin is wide and there’s a ton of noise in the data, we’re not going to be able to determine that there is true causation coming from a given feature. And of course we’ll also tell you if we need more data to be able to determine whether the result is statistically significant. This is really giving you that full 360 degree dashboard, where you are able to get to that true causal impact and truly understand how this new onboarding flow is moving the needle, or really not moving the needle in some cases, on the metrics that you care about. So you can make a truly educated decision about how to think about rolling that feature out, whether you need to kill it for now or need to revamp it, and also how this can inform your future product roadmap as we move towards a world where you’re doing MVP development. You might have thrown out this first iteration of that flow and you’ve seen some results. How do we need to revamp it, go back to the drawing board, and make changes to how we think about scoping the roadmap for that onboarding flow for our customers, now that we have this true data about the causal impact of this feature?
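
The talk does not say which statistical test the platform runs, so the following is only a generic sketch of the idea of an error margin: a two-sided test on the difference in means using a normal approximation (which is rough for samples this small). The function name `compare` and the per-user counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare(on, off, alpha=0.05):
    """Difference in means with an error margin, using a normal approximation."""
    diff = mean(on) - mean(off)
    # Standard error of the difference between two independent sample means.
    se = sqrt(variance(on) / len(on) + variance(off) / len(off))
    if se == 0:
        raise ValueError("no variation in either sample; cannot estimate a margin")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * se
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    # Significant when the interval diff +/- margin excludes zero.
    return {"diff": diff, "margin": margin, "p_value": p_value,
            "significant": abs(diff) > margin}


# Made-up per-user counts for the on and off populations.
clear_on, clear_off = [5, 6, 7, 6, 5, 7, 6, 6], [3, 4, 3, 4, 3, 4, 3, 4]
noisy_on, noisy_off = [1, 9, 2, 8, 5, 5, 0, 10], [5, 4, 6, 5, 4, 6, 5, 5]

print(compare(clear_on, clear_off))  # small margin relative to the difference
print(compare(noisy_on, noisy_off))  # wide margin: no conclusion can be drawn
```

A result like the second one corresponds to the black cards in the dashboard: a percentage change may be displayed, but the margin is too wide to attribute it to the feature.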

[00:25:17] So as I mentioned earlier, this is just one example of a given feature. If we were to go to any other split within the platform, we would see the same metrics dashboard coming up. What that means, and really why the webinar is titled How to A/B Test Every Feature Release, is that every single feature that you roll out through our platform automatically becomes an experiment where you’re able to measure the impact of that given feature on all the metrics you care about. So that’s where I’ll generally pause for that demonstration of our platform. At this point we’d be happy to go through some questions. Thank you for the time today. If this has been interesting to you or your team, we have a 14 day free trial, so feel free to go ahead and sign up. It’s just Split.io/signup, and obviously we look forward to chatting with you. Great, thanks Kevin. We do have a few questions from the audience. The first one that came in is: are metrics usually UI driven or path through app driven? So honestly, I think the metrics can be seen really all across. The whole point here is that this is the full dashboard, and I’d call out that it’s not just anything that’s on the UI. It can be anything that’s full stack, as well as those production KPIs that you care about, like load time or the number of 500 errors.

[00:26:50] So I would say the metrics can 100 percent be driven via the UI, but the whole beauty of our platform is that you’re going to be able to measure every single metric that you care about on one experiment, and then you’re going to see the impact across all the different metrics that you might care about. So it doesn’t just have to be limited to the UI. I’d say yes to both, and you could be doing both with our platform. Great. Another question: how does the software know about the user’s location to deliver A/B tests per user location? Does it use information from the data center? Yes. The way we’ve architected our platform is that within the UI, all that’s happening is you’re defining a key-value pair that you want our SDK to match on for you at runtime. As I mentioned, that SDK actually sits at the application layer of your code base, and you’re simply asking it at runtime: hey, given that user, what version of the feature should they see? When you say given that user, you’re just passing that information to us. This would be their ID, this might be their location, and the SDK will just use that information and match it against the key-value pair you defined in the UI to make that computation and make that decision for you directly at runtime. The reason why we architected our platform in this manner is that it allows you to slice and dice by any dimension of your customer base that matters to you, such as location, or their last login, or how much they’re spending with you. But you never actually have to share that information with us, because all that information is just being used directly at runtime within your application. Next question: how is the integration with API gateways like AWS?

[00:28:39] So as I mentioned, our SDK and our platform generally sit at the application layer of the code base. We sit higher up in the stack, and from a CI/CD perspective we generally like to think of ourselves as sitting post that whole deployment process. That’s why we talk about separating the concept of code deploy from feature release. When you’re leveraging our platform, of course you want to use Split in terms of testing out different versions of the features in a pre-prod environment. But once you’ve deployed everything into production, that’s where Split really starts to shine, because you’ve deployed different versions of your feature but you’re able to dynamically control them through a UI, just because that information is automatically being used by our SDKs. So generally speaking, we wouldn’t integrate directly with your AWS environment. You wouldn’t be, say, deploying different servers or different services with Split. It’s more that the application layer is where you’d be leveraging us to serve up different pieces of functionality, or different versions of your functionality, within one branch and within one deployment of your application.

[00:29:47] Thanks Kevin. Another one: how focused should the metrics be for testing? Yeah, I’ll take this one. I think the metrics really extend beyond a single set of metrics, and there are some great papers here by Microsoft and their experimentation team, which I’d refer the group to, around how to categorize metrics. We wrote a book on experimentation where we talked about some of these frameworks. There’s one framework called HEART, which is happiness, engagement, adoption, retention, and task success, and it’s about starting to understand those metrics that are particular to the experiment.

[00:30:22] So in the case we talked about earlier, starring, we were looking at particular metrics to understand engagement for that particular experiment, or that particular split. But there were a series of metrics underneath the hood that were not necessarily relevant for that particular experiment, but that we as a company consider to be guardrail metrics. Those metrics are things like page load times or, in our case, how often are splits being edited? Are people being invited to the product? Those are metrics that we look at across all of our experiments to ensure that they are moving within bounds that we’re comfortable with. Then there are metrics that are particular to the experiment that you’re releasing, and some of those may in fact be engagement with that particular feature, or retention. But there is that layer that spans across your entire product.

[00:31:10] And the last question is a pretty easy one, I think, for you guys: what’s the best way for me to get started with Split? I’d say definitely sign up for that trial, and as soon as you do that our team will be reaching out. If you ever just want to reach out to us directly, hello@split.io is a great email alias that a number of our teammates are monitoring as well. I think we’ve answered most of the questions. Ryan, I will turn it back over to you. All right, great. Well, I’d like to thank Jason, Trevor, and Kevin for a great presentation today. I’d also like to thank today’s sponsor, Split Software, for providing the audience with a great webinar.

[00:31:54] And lastly, thank you to everyone who joined us for today’s presentation. We hope you learned something new today that will help you in your developer career. Have a great day, and we’ll see you next time.