Software development in 2020 is a rapidly changing environment. It’s becoming clearer by the day that if our organizations want to survive this period of political and economic uncertainty, they must be able to move with speed and adaptability. But how do you control for safety at speed? If your goal is to move from a software delivery cycle built around lots of guardrails and infrequent deploys to a more modern methodology, one that uses frequent releases to flow value to your customers as quickly and safely as possible, then who you put in charge of change management decisions matters a lot.
Traditionally in many organizations, these decisions are made centrally by a change manager who conducts periodic change advisory board (CAB) meetings. The CAB itself is a collection of representatives from various functions inside and outside of corporate IT, chartered with reviewing proposed changes and assisting the change manager in change assessment, prioritization and scheduling. By design, this is a group that is not directly involved in designing or implementing the proposed change, which is why the process they conduct is known as an “external” review.
But, who should be in charge? In order to best support both speed and safety AS WELL AS business impact, you want your development teams and your customers driving your decision making. From a technical perspective, your developers have a better idea of what’s in a release, whether interdependencies exist, and whether it’s been tested, than any change advisory board. From a business perspective, it’s time we realized that the customer is the final arbiter of what will and won’t work, not some senior manager, and most certainly not your change advisory board.
A Change Advisory Board May be Slower, but at Least it’s Safer, Right?
The change advisory board (and much of what you’ll find in ITIL) might seem like a reasonable way to reduce risk and add rigor, especially in slow-moving, complex environments. If you only have one shot every 30 or 90 days to change production, keeping things out of releases with quality gates and management approvals looks like a good way to manage risk. Nobody would debate that it’s a slower process (which can add weeks or months to the time between a developer writing code and it going live), but it’s safer, right?
Actually, no. It’s not.
A multi-year study across a wide variety of industries and environments found that organizations with a CAB or similar “external approval” process performed worse than those with no approval process at all. Here’s what the lead author, Dr. Nicole Forsgren, had to say about the results:
We found that external approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.
What’s Better than No Change Management at All?
Dr. Forsgren and her team aren’t actually suggesting you have no approval process, but given how badly CABs perform in the real world, we should rethink what sort of process will actually keep you safe.
Here is another quote from the Accelerate study:
Our recommendation based on these results is to use a lightweight change approval process based on peer review, such as pair programming or intrateam code review, combined with a deployment pipeline to detect and reject bad changes. This process can be used for all kinds of changes, including code, infrastructure, and database changes.
Peer Code Reviews Combined with What Now?
When I read that recommendation in March of 2018, I had no problem visualizing the first half: peer code reviews and weekly team demos were already the norms at BlazeMeter. The second half? Not so much! We were a “modern” cloud-native SaaS app, but I had never heard of “a deployment pipeline [that could] detect and reject bad changes.”
To manage risk during releases back then, we used a combination of blue/green deployments and canary releases. The platform did the “pushing” but not the “detecting” or “rejecting”; that was up to our best and brightest SREs, who would check the health of dozens of things manually before we ramped up to 100% of users. Reading “detect and reject bad changes” only brought up a picture of a black box on a whiteboard labeled “magic happens here.” I had flashbacks to film studies in college, where I learned what “deus ex machina” meant.
Detect and Reject: Examples Are Out There If You Look
A lot can change in two years. Since I joined Split as a CD Evangelist, I’ve watched Lukas Vermeer describe how Booking.com’s deployment pipeline can detect and revert a bad change inside of a single second. I’ve listened to Sonali Sheel of Walmart Labs explain how they use their Expo platform to stop a rollout mid-way before it does damage to their key metrics, something they call Test to Launch.
The Split Feature Delivery Platform was created by people with similar visions of modern change management based on lived experiences at LinkedIn, Salesforce.com and Yahoo. The founders got advice from people who had tackled the complexity, speed, and scale of similar systems at Amazon and Microsoft.
When I heard about Split and discovered the examples they were commercializing, I jumped at the chance to join them. Building such a platform in-house was something only a few giants had the time and resources to tackle. If the same sort of fine-grained exposure and automated impact detection were available as SaaS, then impact-driven software delivery could be unlocked for every team, not just the unicorn startups.
That’s Amazing, but It Won’t Work Here
A funny thing happened over and over as I traveled to tech conferences, giving talks about the early pioneers that had built these platforms in-house. Audiences would respond first with amazement and then with resignation. “That’s pretty cool, but we aren’t Booking.com, LinkedIn, or Facebook. We can’t do that in our environment.”
In my efforts to be “vendor-neutral” by avoiding the specifics of Split’s implementation, I had practically re-drawn that fuzzy picture of a black box marked “magic happens here.”
What If It Wasn’t So Hard?
If you’ve never seen a platform that provides fine-grained control of exposure and a rigorous automated mechanism for detecting and rejecting bad changes in your environment, you can’t be blamed for thinking it’s science fiction.
It turns out there are just four main problems to solve here:
- Decouple deploy from release, so code can be pushed all the way to production but prevented from execution until ready. This facilitates true continuous integration, small batch sizes, incremental feature development, and branching by abstraction, which are all critical to pulling off continuous delivery where flow is the norm.
- Selectively expose the new code, starting with small internal audiences and working outward. This facilitates testing in production, dogfooding, early access programs, and batching of changes for change-averse customers.
- Ingest system and user behavior data and align it with the exposure data indicating who is in and out of each user cohort. The goal is to make the attribution process (aligning “Who got what?”, with “Then this happened”) automatic and continuous.
- Compare the patterns of metrics between those included and excluded from each cohort to identify (and optionally alert on) significant differences. You may have seen this pattern before in the context of “A/B testing” which typically tracks the impact of changes on a conversion metric, but here we are talking about broadly tracking the impact of all engineering work and having an ever-vigilant watch for impacts on all organizationally valued metrics, whether impact on those metrics is expected or not.
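To make the first two items less abstract, here is a minimal sketch of a feature flag that decouples deploy from release and exposes new code selectively. Everything in it (`FeatureFlag`, `bucket`, `checkout`, the flag name, the user ids) is hypothetical and invented for illustration; it is not Split’s SDK or any vendor’s API, just one common way the idea can be implemented.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for one flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


class FeatureFlag:
    """Hypothetical flag: an allow list plus a percentage rollout."""

    def __init__(self, name, rollout_percent=0, allow_list=()):
        self.name = name
        self.rollout_percent = rollout_percent
        self.allow_list = set(allow_list)

    def treatment(self, user_id: str) -> str:
        # Internal audiences (dogfooding, early access) are always "on".
        if user_id in self.allow_list:
            return "on"
        # Everyone else is "on" only if their bucket falls under the dial.
        if bucket(self.name, user_id) < self.rollout_percent:
            return "on"
        return "off"


def checkout(user_id: str, flag: FeatureFlag) -> str:
    # Both code paths are deployed; the flag decides which one executes.
    if flag.treatment(user_id) == "on":
        return "new checkout flow"
    return "old checkout flow"


# Deployed to production at 0%: only the allow-listed employee sees it.
flag = FeatureFlag("new-checkout", rollout_percent=0, allow_list=["employee-1"])
print(checkout("employee-1", flag))   # new checkout flow
print(checkout("customer-42", flag))  # old checkout flow
```

Because the bucket is derived from a hash of the flag name and user id, a user who is “on” at 10% stays “on” as the dial moves to 50% or 100%, which keeps cohorts stable while a rollout ramps.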
Progressive Delivery: An Easily Established Foundation
Decoupling deploy from release and selectively exposing new code are becoming known as progressive delivery, a term coined by RedMonk analyst James Governor. Multiple commercial feature flag vendors provide these capabilities out of the box, and consensus is emerging that feature flags have joined the list of developer tools that make more sense to buy than to build and maintain in-house.
Feature flags are an essential foundation for achieving flow, but by themselves they do not speed up the detection of impact. Most feature flag implementations make it easier to flow, but do nothing to indicate whether all is well or if you are achieving meaningful outcomes.
Heads Up: Data Scientists Don’t Scale
Ingesting system and user behavior and automatically aligning it with the exposed cohorts is rare amongst all but the most sophisticated in-house systems. Most teams attempting this practice are doing a lot of manual and ad-hoc data science work. Since they are constrained by human resources rather than computing capacity, they are forced to pick and choose when and where to pay attention.
Cognitive load is not your friend when aiming for flow, so Split’s design doesn’t even require teams to choose which events to associate with each feature flag rollout; all ingested events, once tied to organizational metrics, are continuously attributed to the on and off cohorts of every rollout, without any intervention. Split also eases the work of identifying and ingesting event data through integrations with Segment, Sentry, mParticle, and Google Analytics.
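As a rough illustration of what automatic attribution means, the sketch below joins a stream of events to an exposure log so that every metric value lands in the “on” or “off” cohort of a rollout. The data shapes, the `attribute` function, and the sample values are assumptions made up for this example; they do not describe how Split or any other platform stores or processes data.

```python
from collections import defaultdict


def attribute(exposures, events):
    """Group metric values by the treatment each user was exposed to.

    exposures: {user_id: (treatment, exposed_at)}
    events:    iterable of (user_id, metric, value, occurred_at)
    """
    by_metric = defaultdict(lambda: defaultdict(list))
    for user_id, metric, value, occurred_at in events:
        exposure = exposures.get(user_id)
        if exposure is None:
            continue  # user never evaluated the flag, so nothing to attribute
        treatment, exposed_at = exposure
        if occurred_at < exposed_at:
            continue  # event happened before the user saw either treatment
        by_metric[metric][treatment].append(value)
    return by_metric


# "Who got what?" (timestamps are plain integers for simplicity)
exposures = {
    "u1": ("on", 100),
    "u2": ("off", 100),
    "u3": ("on", 200),
}

# "Then this happened"
events = [
    ("u1", "page_load_ms", 480, 150),
    ("u2", "page_load_ms", 310, 160),
    ("u3", "page_load_ms", 300, 120),  # before u3 was exposed: skipped
    ("u3", "errors", 1, 250),
    ("u9", "errors", 1, 300),          # never exposed: skipped
]

cohorts = attribute(exposures, events)
print(dict(cohorts["page_load_ms"]))  # {'on': [480], 'off': [310]}
print(dict(cohorts["errors"]))        # {'on': [1]}
```

Note that nothing in the event stream mentions the flag. The join on user id and time is what ties behavior to exposure, which is why no one has to decide up front which events belong to which rollout.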
Semper Fi for Continuous Delivery Pipelines
Comparing patterns of metrics between those included and not included in each cohort in a rigorous way to automatically determine significant differences is even more rare in the wild than attribution. This is exactly the problem that Split’s Monitor and Experimentation modules solve. Monitor focuses on identifying and alerting on impacts to metrics as a rollout is underway (also known as “limiting the blast radius of incidents”), while Experimentation, like A/B testing, seeks to provide a continuous source of unbiased data, not constrained by the availability of an analyst, to indicate whether each feature achieved a desired impact or not.
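To show the general shape of “detect and reject,” here is a deliberately simplified sketch that compares one metric between the on and off cohorts and decides whether a rollout should be stopped. It uses a large-sample two-sided z-test on the difference in means, which is my own choice for illustration and not a description of the statistics behind Split’s Monitor or Experimentation modules. The function names, the threshold, and the sample latencies are all hypothetical, and a production system would also have to deal with issues this sketch ignores, such as small samples, repeated checks during a rollout, and many metrics at once.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare(on, off):
    """Return (difference in means, two-sided p-value) for two cohorts.

    Large-sample normal approximation; each cohort needs at least 2 values.
    """
    diff = mean(on) - mean(off)
    se = sqrt(variance(on) / len(on) + variance(off) / len(off))
    if se == 0:
        # No variation at all: any difference is exact, none is no evidence.
        return diff, (1.0 if diff == 0 else 0.0)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, p_value


def should_kill(on, off, higher_is_worse=True, alpha=0.05):
    """True when the 'on' cohort is significantly worse than 'off'."""
    diff, p_value = compare(on, off)
    moved_in_bad_direction = (diff > 0) == higher_is_worse and diff != 0
    return p_value < alpha and moved_in_bad_direction


# Hypothetical page-load times (ms) attributed to each cohort.
off_latency = [100, 110, 90, 105, 95] * 4
on_latency = [200, 210, 190, 205, 195] * 4

if should_kill(on_latency, off_latency, higher_is_worse=True):
    print("Degradation detected: set rollout to 0% and alert the team")
else:
    print("No significant degradation: continue ramping")
```

The important design point is the direction check: a significant difference alone is not a reason to roll back, since the change may have made the metric better.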
Better Together: Peer Review, Progressive Delivery and Automated Sense-Making
Why do we strive for flow? It’s not about output. It’s about outcomes. We strive for flow so that we can iterate the feedback loop of idea -> implementation -> observation with less friction in less time. Whether you call it “impact-driven development” or “customer-driven development” this approach to moving faster with greater (and faster) awareness of outcomes goes well beyond the “deployment pipeline to detect and reject bad changes” that the DORA team recommended that we combine with peer review practices. Yes, we can automatically detect and reject bad changes, but more importantly, we can build a repeatable process for triangulating towards meaningful business outcomes.
Learn More About Achieving Change Management and Flow, Together, in Continuous Delivery
- Watch a four-minute video on the definition of Continuous Delivery to see why small batch size is critical to achieving consistent flow.
- Pick up tips from multiple teams that ship to production daily in the O’Reilly e-book, Continuous Delivery in the Wild.
- Watch an in-depth video with Craig Sebenik (LinkedIn, Crunchbase, Matterport) on the benefits of moving to trunk-based development.