Safely Moving Away from Monoliths: Chapter 2

Although incrementally migrating from a monolith to a services architecture isn’t easy (or cheap), there are many things Split’s feature flags can do to make the transition a little safer for all parties involved. But not only that: after going through the process ourselves, we learned how quickly and efficiently Split could help us recover from failures, minimizing customer impact and helping us achieve our goals. 

In our last chapter, we began our journey by identifying which domains we should decouple and a migration path forward. Now, let’s dive right into the move itself. Microservice architecture, here we come (safely!). 

Moving Incrementally Towards Our Target Architecture

Once we identified where we wanted to go, we started weighing our options for migrating the business logic. We wanted to minimize risk and reduce the required effort as much as possible. Our chosen first move was to create a microservice exposing a RESTful API with CRUD operations over segment entities. 

This would allow us to modify the data access pattern in the monolith. Instead of accessing the database directly, the monolith would use an HTTP client connecting to the new segment service.
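As a rough illustration of that access pattern, the sketch below builds the four CRUD requests with the JDK's built-in java.net.http API. The base URL, the /segments paths, and the SegmentRequests class name are assumptions made for this example; the post does not show the real service contract.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds the HTTP requests the monolith would send instead of querying the database.
// The base URL and the /segments paths are hypothetical.
class SegmentRequests {
    private final String baseUrl;

    SegmentRequests(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Create: POST /segments with a JSON body.
    HttpRequest create(String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/segments"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Read: GET /segments/{id}. Ids are assumed to be URL-safe.
    HttpRequest read(String id) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/segments/" + id))
                .GET()
                .build();
    }

    // Update: PUT /segments/{id} with a JSON body.
    HttpRequest update(String id, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/segments/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Delete: DELETE /segments/{id}.
    HttpRequest delete(String id) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/segments/" + id))
                .DELETE()
                .build();
    }
}
```

Each request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), with the response status code deciding whether the operation succeeded.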

Note that this is a transitional step towards the target architecture. However, it is a step that we could take safely, allowing us to keep moving forward.

This doesn’t come for free:

  • How do we respond if there are failures after switching over?
  • When talking about a persistence layer, how do we roll this out progressively?
  • Can we set this up so that customers are minimally impacted?
  • What is the rollback strategy if things go sideways?
  • What is the potential impact that this change has on application performance?

To answer these questions and achieve our goal, we implemented a strategy: Switching Classes Strategy with Feature Flags. The following section describes how it was put in place and how it helped us.

Isolating the Domain and Exposing a CRUD API

We started by moving from our initial architecture, where the database was accessed by multiple internal services, to a transitional state in which database access was isolated behind a single service.

Moving the data access code itself to a service was not a big deal. But moving the consumers to use the new service had to be done in a safe, incremental way. To achieve this, we created a DAO interface for segment objects. One implementation was the existing code in the monolith (accessing the database directly), and another accessed the new service. Consumers of segment-related data wouldn’t directly know what they were accessing. They would only interface with the SegmentDAO, which internally decided which data source to use.
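A minimal sketch of that arrangement might look like the following. The Segment fields, the find method, and the SegmentRegistrar consumer are assumptions for illustration, and both implementations use an in-memory map in place of the real SQL and HTTP calls so that the example stays runnable.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical domain object; the real Segment class is not shown in this post.
record Segment(String accountId, String name) {}

// The only type that consumers of segment data depend on.
interface SegmentDAO {
    boolean save(Segment segment);
    Optional<Segment> find(String accountId, String name);
}

// Stand-in for the existing monolith code path (direct database access).
class DatabaseSegmentDAO implements SegmentDAO {
    private final Map<String, Segment> rows = new ConcurrentHashMap<>();

    public boolean save(Segment segment) {
        rows.put(segment.accountId() + "/" + segment.name(), segment);
        return true;
    }

    public Optional<Segment> find(String accountId, String name) {
        return Optional.ofNullable(rows.get(accountId + "/" + name));
    }
}

// Stand-in for the new code path (calls to the segment service).
class ServiceSegmentDAO implements SegmentDAO {
    private final Map<String, Segment> remote = new ConcurrentHashMap<>();

    public boolean save(Segment segment) {
        remote.put(segment.accountId() + "/" + segment.name(), segment);
        return true;
    }

    public Optional<Segment> find(String accountId, String name) {
        return Optional.ofNullable(remote.get(accountId + "/" + name));
    }
}

// A consumer: it cannot tell which data source sits behind the interface.
class SegmentRegistrar {
    private final SegmentDAO dao;

    SegmentRegistrar(SegmentDAO dao) {
        this.dao = dao;
    }

    boolean register(String accountId, String name) {
        return dao.save(new Segment(accountId, name));
    }
}
```

Because SegmentRegistrar only sees SegmentDAO, swapping the implementation passed to its constructor requires no change to the consumer.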

We then added a third implementation to our DAO interface: SwitchSegmentDAO. This class implemented the same interface as the others, but it had an internal function that identified which DAO was active and should be used. This one was controlled by a Split! 

The activeDAO method

The activeDAO method in the SwitchSegmentDAO returned either a DatabaseSegmentDAO or a ServiceSegmentDAO instance. This was based on the Split evaluation (on or off) for a given account.

// Picks the DAO for this account based on the feature flag evaluation.
private SegmentDAO activeDAO(String account) {
    if (ON.equals(splitClient.getTreatment(account, "enable_segment_service"))) {
        return serviceDAO;
    }
    return databaseDAO;
}

We leveraged Split’s targeting rules to enable our service DAO for one account at a time.

It is a simple function, but a very powerful one. Our DAO exposed several operations on the domain, and it would have been cumbersome and error-prone to modify them one by one. With this approach, we could modify this code:

public boolean save(Segment segment) {
    return databaseDAO.save(segment);
}

To this:

public boolean save(Segment segment) {
    return activeDAO(segment.accountId()).save(segment);
}

We could enable or disable the rollout of our new architecture little by little. This was possible just by making a Split change, either via the UI or via Split’s APIs. No need for a code change, a new deployment or even an application restart!
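To show why no restart is needed, here is a self-contained sketch of the switching class. InMemoryFlagClient is a hypothetical stand-in for the real flag client: it only mimics a getTreatment(key, flagName) call that returns "on" for targeted accounts, and it ignores the flag name. The one-method SegmentDAO is likewise trimmed down for the example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical domain object, reduced to what the sketch needs.
record Segment(String accountId, String name) {}

interface SegmentDAO {
    boolean save(Segment segment);
}

// Hypothetical stand-in for the flag client: "on" only for targeted accounts.
class InMemoryFlagClient {
    private final Set<String> enabledAccounts = ConcurrentHashMap.newKeySet();

    void enableFor(String account) {
        enabledAccounts.add(account);
    }

    void disableFor(String account) {
        enabledAccounts.remove(account);
    }

    // The flag name is ignored here; a real client would look the flag up by name.
    String getTreatment(String account, String flagName) {
        return enabledAccounts.contains(account) ? "on" : "off";
    }
}

class SwitchSegmentDAO implements SegmentDAO {
    private static final String ON = "on";

    private final InMemoryFlagClient flags;
    private final SegmentDAO databaseDAO;
    private final SegmentDAO serviceDAO;

    SwitchSegmentDAO(InMemoryFlagClient flags, SegmentDAO databaseDAO, SegmentDAO serviceDAO) {
        this.flags = flags;
        this.databaseDAO = databaseDAO;
        this.serviceDAO = serviceDAO;
    }

    // Evaluated on every call, so a flag change applies to the very next request.
    private SegmentDAO activeDAO(String account) {
        return ON.equals(flags.getTreatment(account, "enable_segment_service"))
                ? serviceDAO
                : databaseDAO;
    }

    public boolean save(Segment segment) {
        return activeDAO(segment.accountId()).save(segment);
    }
}
```

In this sketch, calling flags.enableFor("acct-1") routes that account's next save to the service DAO while every other account keeps using the database DAO, and flags.disableFor("acct-1") reverses it just as quickly.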

With this approach, we were able to answer our questions from before:

  • How do we respond if there are failures after switching over?
    • Turn off the feature flag!
  • When talking about a persistence layer, how do we roll this out progressively?
    • Via feature flags!
  • Can we set this up so that customers are minimally impacted?
    • Absolutely! We can enable this just for testing, just for our own organization, and then roll out to customers one by one!
  • What is the rollback strategy if things go sideways?
    • Turn off the feature flag!

This leaves one open question: how to measure the impact of this change on our application’s performance. Let’s discuss that in the next section.

Measuring Success and Failure

Any feature released to production should have associated metrics. If it is a new capability added to the product, we can track specific metrics. We could measure how many users are engaged and how the conversion rate was affected by this feature. Technical metrics are equally important. Sometimes success is not guaranteed, despite all internal testing verifying that the new code works programmatically as expected.

We were moving from write and read operations communicating directly with the database to executing operations via a new service. Once a new service is introduced into the mix, it’s expected that latencies will be affected. For us, that’s a top-of-mind concern, as we have millions of API calls reaching Split every day. During this migration, we had to measure how much latencies were affected, and in which parts of the code.
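One simple way to see which part of the code adds latency is to time each data access path under its own label. The sketch below is an assumption about how that could be done with plain JDK classes; the LatencyRecorder name and the label strings are hypothetical, and a production system would more likely report to a dedicated metrics library.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

// Records how long each labeled operation takes, in nanoseconds.
class LatencyRecorder {
    private final Map<String, List<Long>> samples = new ConcurrentHashMap<>();

    // Runs the operation, stores its duration under the label, and returns its result.
    // The duration is recorded even when the operation throws.
    <T> T time(String label, Supplier<T> operation) {
        long start = System.nanoTime();
        try {
            return operation.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            samples.computeIfAbsent(label, k -> new CopyOnWriteArrayList<>()).add(elapsed);
        }
    }

    // Number of samples recorded for a label.
    int count(String label) {
        return samples.getOrDefault(label, List.of()).size();
    }

    // Average latency in milliseconds for a label, or 0 when nothing was recorded.
    double averageMillis(String label) {
        return samples.getOrDefault(label, List.of()).stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0.0) / 1_000_000.0;
    }
}
```

Wrapping each path, for example recorder.time("segment.save.service", () -> serviceDAO.save(segment)) next to a "segment.save.database" label, makes it possible to compare the two implementations side by side during a rollout.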

Our first internal controlled rollout showed latencies increasing on the order of tens of milliseconds. Our metrics showed that the added latency came from the communication between our services. We quickly realized that we were not properly leveraging our Kubernetes cluster’s service-to-service communication capabilities. Instead, we were going all the way to the load balancer and back. This was quick to fix, and brought the increase down to less than 10% over our previous latencies. 

This was still not good enough. It is a journey of continuous improvement and trade-off analysis. Introducing a new service doesn’t come free, but at the same time, it does provide benefits. In this case, the new service isolated data access to the Segment domain. We safely improved data access with a significantly smaller effort than if these improvements were attempted in the monolith. 

We put in place a series of caching layers in the new Segment service. These were managed via feature flags and tracked with metrics, and they ultimately decreased latencies across the board. The public Split APIs handling Segment data saw latency improvements of up to 50%. The caching also improved multiple operational metrics in our database (for example, DB CPU usage decreased by up to 20%). 
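As an illustration of a caching layer that can be switched on and off by a flag, consider the read-through cache below. It is a sketch under assumptions: the FlaggedReadThroughCache name is hypothetical, the loader stands for whatever slower lookup sits behind the cache, and the post does not describe the real caching design or its invalidation rules.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.function.Function;

// Read-through cache in front of a slower lookup, guarded by a flag.
// All names here are hypothetical.
class FlaggedReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;        // for example, a database query
    private final BooleanSupplier cacheEnabled; // for example, a feature flag evaluation

    FlaggedReadThroughCache(Function<K, V> loader, BooleanSupplier cacheEnabled) {
        this.loader = loader;
        this.cacheEnabled = cacheEnabled;
    }

    V get(K key) {
        if (!cacheEnabled.getAsBoolean()) {
            return loader.apply(key); // flag off: always go to the source
        }
        return cache.computeIfAbsent(key, loader); // flag on: load once, then reuse
    }

    // Call after a write so that readers do not see stale data.
    void invalidate(K key) {
        cache.remove(key);
    }
}
```

Checking the flag on every read means the cache can be bypassed immediately if it misbehaves, the same rollback lever used for the DAO switch.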

On top of latencies, we paid attention to our service connection metrics, response codes, latencies per response codes, and more. Each of these helped us better understand service behavior and continuously improve it. Over time we tuned our connection pooling and thread management.

As with any software delivery project, this wasn’t a journey without mistakes. At one point during the rollout, we set our targeting rules to enable the new service for only 10% of our traffic, while the remaining 90% continued accessing our database directly. This was managed via Split itself, and once changed in the Split application, it immediately took effect in our services. 

It also immediately affected the metrics we were observing. Once the traffic started going to the service, latencies increased much more than anticipated, and errors started occurring. After realizing this was having a negative customer impact, we changed our split again and recovered right away. In total, less than five minutes passed between activating the problematic change in production, detecting the errors, and reverting to a stable state.
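A percentage rollout like the 10% one above generally relies on deterministic bucketing: each key is hashed to a stable bucket, and the flag is on when the bucket falls below the configured percentage. The sketch below is a generic illustration of that idea using CRC32, not Split’s actual bucketing algorithm; the PercentageRollout class and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Deterministic percentage rollout: a given key always lands in the same bucket.
// Generic illustration only; this is not Split's actual bucketing algorithm.
class PercentageRollout {
    // Maps a key to a stable bucket in the range 0..99.
    static int bucket(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    // True when the key falls inside the first `percentage` buckets.
    static boolean isEnabled(String key, int percentage) {
        return bucket(key) < percentage;
    }
}
```

Because the bucket depends only on the key, raising the percentage from 10 to 50 keeps every previously enabled key enabled, and setting it back to 0 turns the change off for everyone at once.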

While production issues are definitely a problem, we design our software so that we can recover fast from problems. We want to foster a development culture unafraid to make a change.

Deploy more often, release without fear, and experiment.

What’s Next?

The effort described was only the beginning of a long road ahead. Having a North Star helps us keep moving forward, being thoughtful about our decisions, facing challenges and learning from them. Moving away from a monolith and into a microservices architecture isn’t cheap or easy. Even though the move was beneficial, we know there is still much to be done.

These are some of the improvement areas that we have been working on:

  • Finding better ways to share common functionalities that are needed across microservices
  • Implementing best practices for logging and tracing requests in a distributed architecture
  • Optimizing service-to-service communication on a case-by-case basis, considering different protocols and integration strategies (such as event-driven asynchronous communication)
  • Extracting other domains from the monolith and into their own service
  • Balancing all of the above while continuously delivering new capabilities and providing a top level of service for our customers

We know that uncertainties are a huge part of this journey and that there is no silver bullet solution. Each mistake helps us learn for the next iteration, as happened during the experience shared here.

At the end of the day, we are the main customers of our own product. The capabilities provided by Split allow us to release safely, removing the fear of mistakes. The shorter feedback cycle helps us gather information to course-correct, and gives us new ideas to experiment with.
