Samsung recently pulled the plug on its Galaxy Note 7 phones after failing to fix the random combustion problem plaguing them. On a recent flight, an announcement asked us to power down these devices for the safety of the flight. This week it’s even more official: it’s now a crime to bring one on an airplane, even turned off and in checked luggage. For Samsung, this is a painful blow to revenue and customer trust.
Delivering great hardware has always been difficult. Once a unit has shipped, it cannot be fixed remotely. As software engineers at cloud companies, we have it a bit easier: we can improve or fix our products whenever we want, usually without any user interaction. Even so, there are some lessons we can draw from the Samsung example to help us improve software delivery.
#1 Use Controlled Rollouts to test features and quickly remediate issues.
The delay between the first reports of trouble and the device manufacturer’s response encouraged confusion and rumor, making it difficult to take stock of the true impact. It also meant that phones continued to be sold, exposing more customers to the issue. The lesson: getting in front of problems before more customers experience them is incredibly important.
Similarly, it’s better to discover a failure in new code early, by exposing it to a small, targeted percentage of users rather than risking them all. This concept of a gradually ramping release is called ‘Controlled Rollouts’ (CR).
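One common way to get a gradually ramping release is to hash each user into a stable bucket and compare that bucket with the current rollout percentage. The sketch below is only an illustration of that idea; the function names and the `new-checkout` feature name are made up for this example and are not taken from any particular product.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for one feature."""
    # Hashing the feature name together with the user id gives each
    # feature its own independent split of the user base.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return bucket(user_id, feature) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 10, 50, 100):
        exposed = sum(is_enabled(u, "new-checkout", percent) for u in users)
        print(f"{percent:>3}% target -> {exposed} of {len(users)} users exposed")
```

Because a user's bucket never changes, raising the percentage only adds users: everyone who saw the feature at 10% still sees it at 50%.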
There are two popular ways of achieving CR: canaries and CR platforms.
A canary is one or more dedicated machines in your production cluster on which you deploy new code. Using a load balancer, you can serve a percentage of production traffic with the new code and the remainder with the old. Canaries are easy to set up with most cloud providers, but they have no notion of targeting particular users with a feature: the load balancer simply delivers new code to a portion of traffic, regardless of what type of user is behind each request.
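To make the canary idea concrete, here is a toy simulation of a weighted load balancer. It is a sketch under simple assumptions: real load balancers are configured rather than coded like this, and `pick_pool` and `simulate` are hypothetical names used only for illustration.

```python
import random
from collections import Counter


def pick_pool(canary_weight: float, rng: random.Random) -> str:
    """Choose a server pool for one request, ignoring who sent it."""
    return "canary" if rng.random() < canary_weight else "stable"


def simulate(requests: int, canary_weight: float, seed: int = 0) -> Counter:
    """Count how many requests each pool would serve."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return Counter(pick_pool(canary_weight, rng) for _ in range(requests))


if __name__ == "__main__":
    # Send roughly 5% of traffic to the machines running new code.
    print(simulate(requests=10_000, canary_weight=0.05))
```

Notice that `pick_pool` never looks at the user. That is the limitation described above: the split is by traffic, not by customer.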
CR platforms, like Split, let you deploy new code to an entire production cluster and be much more granular in targeting customers with the change: 20% of ‘trial’ customers based in Los Angeles, for instance. Anyone on the team, from PMs to SREs, can target a whitelist, segment, or percentage of customers when releasing new features. Another distinction: CR platforms live within the app, after deployment, and unlike canaries do not rely on load balancers.
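A targeting rule such as "20% of trial customers based in Los Angeles, plus a whitelist" could be evaluated inside the app roughly as follows. This is a minimal sketch, not Split's actual API: `Customer`, `TargetingRule`, and the feature name are hypothetical and exist only to show the shape of the decision.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(frozen=True)
class Customer:
    customer_id: str
    plan: str
    city: str


@dataclass
class TargetingRule:
    feature: str
    percent: int = 0                 # share of the matching segment to expose
    plan: Optional[str] = None       # None means "any plan"
    city: Optional[str] = None       # None means "any city"
    whitelist: Set[str] = field(default_factory=set)

    def treatment(self, customer: Customer) -> str:
        # Whitelisted customers always get the feature.
        if customer.customer_id in self.whitelist:
            return "on"
        # Customers outside the segment never get it.
        if self.plan is not None and customer.plan != self.plan:
            return "off"
        if self.city is not None and customer.city != self.city:
            return "off"
        # Inside the segment, expose a stable percentage of customers.
        key = f"{self.feature}:{customer.customer_id}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
        return "on" if bucket < self.percent else "off"


if __name__ == "__main__":
    rule = TargetingRule("new-dashboard", percent=20, plan="trial",
                         city="Los Angeles", whitelist={"qa-1"})
    print(rule.treatment(Customer("qa-1", "enterprise", "Boston")))
    print(rule.treatment(Customer("c-42", "trial", "Los Angeles")))
```

The decision is made per customer at request time, which is why no load balancer is involved.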
Samsung could never use CR to fix a hardware problem (though if the cause of the battery fires turns out to be software-induced, that might be another story). Cloud software companies, however, should never roll out software to all of their customers at once.
#2 Kill features, not products.
Emergency fixes, like the ones Samsung attempted, are high-stress situations in which engineers can’t take stock of the bigger picture. The result is a fix that works most of the time but occasionally compounds problems, snowballing until an entire product has to be killed, just like the Note 7.
In software, good monitoring can help you pinpoint the exact change causing a failure. A good CR platform adds visibility into which features were launched and how users experienced them, so you can make these correlations. Instead of attempting an emergency fix, stabilize the system by rolling back (that is, killing) that change. In CR, this is equivalent to dialing the code down to 0% of production traffic. Without CR, you can achieve the same result with a code rollback.
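Killing a feature by dialing it down to 0% can be as small as one configuration change. The sketch below assumes the rollout percentage lives in a store the running app consults on every request; `RolloutConfig` is a hypothetical in-memory stand-in for that store, and `render_checkout` is an invented example of guarded code.

```python
class RolloutConfig:
    """In-memory stand-in for wherever rollout percentages are stored."""

    def __init__(self) -> None:
        self._percent = {}

    def set_percent(self, feature: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent[feature] = percent

    def kill(self, feature: str) -> None:
        """Dial the feature down to 0% without redeploying anything."""
        self.set_percent(feature, 0)

    def percent(self, feature: str) -> int:
        # Unknown features default to 0%, so new code is off until ramped.
        return self._percent.get(feature, 0)


def render_checkout(user_bucket: int, config: RolloutConfig) -> str:
    """Serve the new code path only to buckets inside the rollout."""
    if user_bucket < config.percent("new-checkout"):
        return "new checkout"
    return "old checkout"


if __name__ == "__main__":
    config = RolloutConfig()
    config.set_percent("new-checkout", 50)
    print(render_checkout(10, config))  # bucket 10 is inside a 50% rollout
    config.kill("new-checkout")
    print(render_checkout(10, config))  # after the kill, everyone gets old code
```

The old code path stays deployed next to the new one, so turning the feature off does not require a build or a rollback.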
#3 Identifying who has a problem is just as important as knowing why.
Failures happen; that’s a fact of software. But knowing who experienced the failure can be an early insight into why the failure happened.
For something like a phone, that can be very hard to figure out based on over-the-counter hardware sales. In software development, we don’t have to suffer from that problem, though many of us do. Logging who saw a new feature isn’t always treated as a priority when you’re rushing it out the door. Often, many new features are also bundled together before they ship, making it difficult to quickly pinpoint which one might be the problem.
The ability to release discrete features to targeted groups of users means their experience, good or bad, can be tied directly to a unique treatment. Using a CR platform can also help you uniquely log a feature impression for each user experiencing the new feature, so you can correlate experiences with problems at the user level, using the analytics products of your choice.
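As a rough illustration of that correlation, suppose every feature decision is logged as an impression and you separately know which users hit an error. The record layout and the `error_rate_by_treatment` helper below are assumptions made for this example; in practice the join would usually happen in your analytics tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Impression:
    user_id: str
    feature: str
    treatment: str  # for example "on" or "off"


def error_rate_by_treatment(
    impressions: Iterable[Impression],
    failed_users: Set[str],
    feature: str,
) -> Dict[str, float]:
    """Share of users per treatment who also appear in the failure set."""
    users_by_treatment = defaultdict(set)
    for imp in impressions:
        if imp.feature == feature:
            users_by_treatment[imp.treatment].add(imp.user_id)
    return {
        treatment: len(users & failed_users) / len(users)
        for treatment, users in users_by_treatment.items()
    }


if __name__ == "__main__":
    log = [
        Impression("u1", "new-checkout", "on"),
        Impression("u2", "new-checkout", "on"),
        Impression("u3", "new-checkout", "off"),
        Impression("u4", "new-checkout", "off"),
    ]
    print(error_rate_by_treatment(log, failed_users={"u1", "u2"},
                                  feature="new-checkout"))
```

If failures cluster in one treatment, as they do in this made-up data, you know who was affected and have a strong hint about which change to kill.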
Take advantage of the safety measures available to software developers.
Degrading a user’s experience can leave an unforgettable blemish on your product, driving away prospects and customers and generating bad press and word-of-mouth for your brand. Adopting a CR approach to feature release makes it much easier (and faster) to prevent these problems, or to solve them when they do arise. This understanding is exactly why we built Split as a platform for controlled rollouts. If you’d like to try it yourself, you can do so for free, and if you have any questions just drop us a line at email@example.com.