How often do you build a product that you end up using every day? At Split, we “dogfood” our own product in so many ways that our engineering and product teams are using Split nearly every day. It’s how we make Split better. Using your own product as a tool to build your product gives you a front-row experience of how valuable your product is to your customers, how well it solves specific use cases, where the pain points are, and so much more.
I believe every software company should deploy feature flags in their product. Why? Because feature flags provide a safety net that makes engineering teams more productive. They let engineers ship code faster, open up the possibility of testing in production, and enable dev and product teams to quickly kill any feature that causes product degradation, often in a matter of seconds.
Today, I’d like to walk you through a few of the ways we’re using feature flags at Split. Some of these will hopefully be familiar and obvious, but my hope is that others will give you ideas for new ways to drive efficiency, innovation, or simply product-market fit in your organization.
1. Testing in Production
We talk a lot about testing in production because it’s one of the most obvious, and most useful, reasons to deploy feature flags. When a feature is ready for delivery (or, at minimum, has passed all testing in your staging or pre-production environment), it can be deployed to production dark. This means that the binary containing the new feature is in production, but no user can access it because the flag is turned off.
At Split, we first toggle the new feature on for internal users to complete testing. Once it’s ready and the functionality has passed all testing criteria, we will ramp up the feature and expose it to 5%, 10%, 25%, 50%, and 100% of our users. For some feature releases, we’ll literally stop at each of those percentage rollouts to confirm everything is still working as intended before moving on. For others, we’ll use a subset of those steps. Only once we’ve reached 100% is the feature considered to be fully rolled out, at which point we remove the flag.
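The ramp described above can be sketched in a few lines of Python. This is a minimal illustration, not Split's actual bucketing algorithm or SDK: the `INTERNAL_USERS` allowlist, the `bucket` and `is_enabled` helpers, and the flag name are all hypothetical. The idea is that each user hashes to a stable bucket, so a user who sees the feature at 5% still sees it at 10%, 25%, and beyond.

```python
import hashlib

# Hypothetical allowlist of internal users who get the feature first.
INTERNAL_USERS = {"alice@example.com", "bob@example.com"}

# The ramp steps mentioned in the text.
RAMP_STEPS = [5, 10, 25, 50, 100]


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Internal users are always on; everyone else is on if their bucket is below the ramp."""
    if user_id in INTERNAL_USERS:
        return True
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and the user ID, raising `rollout_percent` only ever adds users; nobody flips back and forth between steps.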
We also use flags to gate functionality based on the product tier an account or user is in. This is a really common feature flag use case. For example, free customers only get access to email support, while for paid customers with premium support packages we use feature flags to enable chat support as well.
This can be automated so that when a customer upgrades to a new product tier, a feature flag is updated to add that customer to the list of customers allowed to access premium support functionality, like chat.
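As a rough sketch of that automation, assume a flag backed by an allowlist of account IDs and a billing hook that fires on every tier change. The `AllowlistFlag` class, the `on_tier_change` hook, and the tier names are hypothetical and stand in for whatever flag service and billing system you use.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class AllowlistFlag:
    """Hypothetical flag that is 'on' only for accounts on its allowlist."""
    name: str
    allowed_accounts: Set[str] = field(default_factory=set)

    def treatment(self, account_id: str) -> str:
        return "on" if account_id in self.allowed_accounts else "off"


# Assumption: only the "premium" tier includes chat support.
PREMIUM_TIERS = {"premium"}
chat_support = AllowlistFlag("chat_support")


def on_tier_change(account_id: str, new_tier: str) -> None:
    """Hypothetical billing hook that keeps the allowlist in sync with the tier."""
    if new_tier in PREMIUM_TIERS:
        chat_support.allowed_accounts.add(account_id)
    else:
        chat_support.allowed_accounts.discard(account_id)


def support_channels(account_id: str) -> List[str]:
    """Everyone gets email; chat is gated by the flag."""
    channels = ["email"]
    if chat_support.treatment(account_id) == "on":
        channels.append("chat")
    return channels
```

The application code only ever asks the flag for a treatment, so the tier logic lives in one place and a downgrade removes access the same way an upgrade grants it.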
In many SaaS companies, customer success and engineering teams require some degree of access to production and customer data in order to help customers with their support requests. This obviously comes with a variety of regulatory and compliance issues, depending on your industry and certifications.
A practice we’ve adopted at Split is to gate the access to customer data or impersonation through a feature flag. Only a limited set of employees who have passed a rigorous background and financial check can have access to customer data. Every time new access is required, a feature flag grant request is created, and a Split administrator can approve or reject the change. Upon approval, the employee can access the impersonation functionality through the feature flag grant. For this, we leverage our recently released feature, approval flows. This segregation of duties is a key part of the SOC 2 certification, and not having this practice in place can delay the certification approval process.
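The request-then-approve flow can be modeled with a small state machine. The sketch below is an illustration under stated assumptions, not Split's approval flows API: the `ChangeRequest` and `ImpersonationFlag` names are hypothetical, and the rule that a reviewer may not approve their own request is an assumption added here to make the segregation of duties explicit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRequest:
    """A pending request to add one employee to the flag's grant list."""
    flag_name: str
    employee: str
    requested_by: str
    status: str = "pending"
    reviewed_by: Optional[str] = None


class ImpersonationFlag:
    """Hypothetical flag whose grant list only changes through reviewed requests."""

    def __init__(self, name, admins):
        self.name = name
        self.admins = set(admins)
        self.granted = set()

    def request_access(self, employee, requested_by):
        return ChangeRequest(self.name, employee, requested_by)

    def review(self, request, admin, approve):
        if admin not in self.admins:
            raise PermissionError("only administrators can review change requests")
        # Assumption: the reviewer must be someone other than the requester or grantee.
        if admin in (request.requested_by, request.employee):
            raise PermissionError("reviewer must differ from requester and grantee")
        if request.status != "pending":
            raise ValueError("request has already been reviewed")
        request.reviewed_by = admin
        request.status = "approved" if approve else "rejected"
        if approve:
            self.granted.add(request.employee)

    def can_impersonate(self, employee):
        return employee in self.granted
```

Keeping the reviewer on the request record also gives you an audit trail of who approved each grant.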
4. Infrastructure Migration
Feature flags are commonly used to help with technology migrations and to migrate from monolith to microservices. At Split, we use flags whenever we migrate technologies, for example, while evaluating a migration from AWS Kinesis to Kafka. Stick with me on this one, since we’re going to dip a toe into the world of experimentation, and how it’s enabled by feature flags. In a typical scenario, you would place a flag to enable a dark-write (or double-write) operation into the new system to test live traffic and verify how it will perform in production. Then a second flag is created to enable dark reads, which, like the dark writes, verify read performance on the new system without affecting the user’s experience. Finally, a third flag is created to switch traffic over so that requests are sent to the new solution.
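The three-flag sequence can be sketched as a thin wrapper around the old and new systems. Everything here is hypothetical: `InMemoryStore` stands in for a real client, the flag names `dark_write`, `dark_read`, and `cutover` are placeholders, and the flags are a plain dictionary rather than calls to a flag service.

```python
class InMemoryStore:
    """Stand-in for a real queue or datastore client."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class MigratingStore:
    """Routes traffic between the old and new systems based on three flags."""

    def __init__(self, old, new, flags):
        self.old, self.new, self.flags = old, new, flags
        self.mismatches = []   # keys where a dark read disagreed with the old system
        self.dark_errors = []  # failures on the new system, never surfaced to users

    def write(self, key, value):
        if self.flags.get("cutover"):
            self.new.write(key, value)
            return
        self.old.write(key, value)
        if self.flags.get("dark_write"):
            try:
                self.new.write(key, value)
            except Exception as exc:  # the new system must not break user traffic
                self.dark_errors.append(exc)

    def read(self, key):
        if self.flags.get("cutover"):
            return self.new.read(key)
        value = self.old.read(key)
        if self.flags.get("dark_read"):
            try:
                if self.new.read(key) != value:
                    self.mismatches.append(key)
            except Exception as exc:
                self.dark_errors.append(exc)
        return value  # users always get the old system's answer until cutover
```

Until `cutover` is on, the new system can fail or return different data without users noticing; the recorded mismatches and errors are what tell you whether it is safe to flip the third flag.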
Throughout the life of Split, we have had a few opportunities to replace existing infrastructure, typically as part of a scaling conversation. Before we dig into the migration itself, we have to answer the question “Is the new system more expensive than the current one?” The quickest and lowest-risk way to answer that question is to place the new system next to the current one, send it dark traffic for a short period of time, and then extrapolate the cost. This is more resource-efficient, since an evaluation can run for one day with little to no side effects.
At Split, we used this technique to evaluate moving the queue that receives all incoming flag evaluation data from Kinesis Stream to SQS. SQS was placed behind a feature flag that allowed dark writes, so we could gather data for 24 hours and then extrapolate what it would cost to run permanently. We were surprised to find that it was both more economical and more performant, and in the end we prioritized resources to move to SQS.
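The extrapolation itself is simple arithmetic. The helper below is a sketch under two assumptions that are not from the original evaluation: cost scales linearly with time and traffic, and a month is roughly 730 hours. The function name and parameters are hypothetical, and any figures you pass in are your own measurements.

```python
def extrapolate_monthly_cost(
    observed_cost: float,
    observed_hours: float = 24.0,
    traffic_fraction: float = 1.0,
    hours_per_month: float = 730.0,
) -> float:
    """Scale the cost of a short dark-traffic run up to a full month.

    observed_cost: what the new system cost during the evaluation window.
    observed_hours: length of the evaluation window.
    traffic_fraction: share of production traffic that was dark-written (0 < f <= 1).
    """
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    if not 0 < traffic_fraction <= 1:
        raise ValueError("traffic_fraction must be in (0, 1]")
    hourly_cost_at_full_traffic = observed_cost / observed_hours / traffic_fraction
    return hourly_cost_at_full_traffic * hours_per_month
```

If the dark-write flag was only ramped to a fraction of traffic, `traffic_fraction` corrects for that. The linear assumption is worth checking against the provider's pricing tiers before relying on the result.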
5. Feature Flags as Circuit Breakers
Michael Nygard popularized the Circuit Breaker pattern to prevent a cascade of failures in a system. We use feature flags as a main disconnect for functionality that must behave within certain tolerances. If those tolerances are exceeded, a simple toggle can disconnect that functionality, or a percentage rollout can limit how much it is used. The end goal? Make sure that downstream systems stay stable and healthy.
At Split, we use this pattern for things like external API endpoints, data collection services, frequency of synchronization with external systems, etc.
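A flag used as a manual disconnect might look like the sketch below. Note that this is a hand-operated breaker, not the self-tripping state machine from Nygard's pattern: an operator (or automation) sets `allow_percent`, and the code only obeys it. The `FlagBreaker` class and `guarded_call` helper are hypothetical names.

```python
import hashlib


class FlagBreaker:
    """Manual disconnect: 100 lets every call through, 0 blocks them all."""

    def __init__(self, name, allow_percent=100):
        self.name = name
        self.allow_percent = allow_percent

    def allows(self, request_key):
        """Admit a stable subset of requests, sized by allow_percent."""
        digest = hashlib.sha256(f"{self.name}:{request_key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < self.allow_percent


def guarded_call(breaker, request_key, call, fallback):
    """Run the protected call only if the flag allows it; otherwise degrade gracefully."""
    if breaker.allows(request_key):
        return call()
    return fallback()
```

Setting `allow_percent` to 0 is the kill switch; an intermediate value throttles usage so the downstream system can recover without turning the functionality off for everyone.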
6. SRE Runbook Automation
Because we use feature flags as manual circuit breakers, it is relatively easy to automate remediations when certain conditions are met. For example, if we gate certain functionality like data ingestion from source A, and that pipeline is getting more load than the system can handle, we can enable (or disable) a flag to indicate that a certain amount of noncritical traffic should be dropped to preserve the integrity of the system.
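A single automated remediation step could be sketched like this. The flag name, the queue-depth metric, and the `remediate` and `ingest` functions are all hypothetical; in practice the flag change would go through your flag service's API and the metric would come from your monitoring system.

```python
# Hypothetical flag name used by both the remediation and the ingestion path.
DROP_FLAG = "drop_noncritical_ingestion"


def remediate(queue_depth, max_depth, flags):
    """Runbook step: shed noncritical load while the pipeline is over capacity."""
    flags[DROP_FLAG] = queue_depth > max_depth


def ingest(event, flags, sink):
    """Accept an event unless load shedding is on and the event is noncritical."""
    if flags.get(DROP_FLAG) and not event.get("critical", False):
        return False  # dropped to protect the system
    sink.append(event)
    return True
```

The same `remediate` call turns the flag back off once the queue drains, so the runbook both mitigates the incident and restores normal behavior.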
Currently, we are experimenting with Transposit to build automated runbooks so engineers can follow a pre-established process to mitigate an incident. These runbooks will involve disabling, enabling, or changing the exposure of a feature flag with the click of a button. As part of this work, we’ll be excited to release runbook templates for our customers to use. Stay tuned!
Using feature flags to turn on verbose logging can be controversial, since many logging frameworks let you enable debug or verbose mode natively. The advantage of wrapping a more verbose logging level in a feature flag is that you can target a specific customer or condition. Doing that at the logger level is coarser and tends to be binary: verbose on or off. With feature flags, you can target a verbose mode for network traffic for a given user, a set of users within a certain account, or a user agent, among others. Once the debugging session is done, the flag is turned back off.
We use this technique at Split when a support ticket is escalated to engineering for deeper analysis, and it has contributed to lower support request resolution times. One particular example is a flag that enables debugging for our SAML (single sign-on) functionality. Historically, it has been an area with recurrent support tickets given the number of third-party identity providers, each of which has its own nuances. Having this toggle to turn on verbose logging has helped our support organization reduce ticket resolution time.
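One way to get that kind of targeting with Python's standard `logging` module is a filter that lets debug records through only for targeted accounts or users. This is a sketch, not how Split implements it: the `FlagTargetedDebugFilter` class, the `account_id` and `user_id` record attributes, and the `acct-42` ID are hypothetical, and in practice the target sets would come from a flag evaluation rather than being hard-coded.

```python
import logging


class FlagTargetedDebugFilter(logging.Filter):
    """Pass DEBUG records only for targeted accounts or users; pass everything else."""

    def __init__(self, targeted_accounts=(), targeted_users=()):
        super().__init__()
        self.targeted_accounts = set(targeted_accounts)
        self.targeted_users = set(targeted_users)

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO and above are never suppressed by the flag
        return (
            getattr(record, "account_id", None) in self.targeted_accounts
            or getattr(record, "user_id", None) in self.targeted_users
        )


# Usage: the logger runs at DEBUG, and the filter decides who actually gets verbose output.
saml_logger = logging.getLogger("saml")
saml_logger.setLevel(logging.DEBUG)
saml_logger.addFilter(FlagTargetedDebugFilter(targeted_accounts={"acct-42"}))
saml_logger.debug("SAML response received", extra={"account_id": "acct-42"})
```

Call sites pass the account or user through `extra`, so turning verbose logging on for one customer is a change to the target set rather than a redeploy or a global log-level change.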
Learn More About How You Can Increase Productivity and Reduce Uncertainty with Feature Flags
I hope the use cases in this post can serve as a starting point for readers who are new to the concepts of experimentation and feature flags, and help those who are already using Split deepen their usage of the product.
If you’re ready to get started with Split you can sign up for our forever-free tier. If you’d like to learn more, check out these resources:
- The Dos and Don’ts of Feature Flags
- Testing a Feature-Flagged Change
- Build a CD Pipeline in Java with Travis
- A Simple Guide to A/B Testing
- How to Implement Testing in Production