Migration of a Refactored Microservice with Zero Impact to Customers

Socure
The Socure Technology Blog
Nov 30, 2022

By Kalaimani Ranganathan, Staff SDET; Sathya Srinivasan, Senior Staff Software Engineer; Sumit Kumar, Senior Software Engineer; Swami Subbarathnam, Senior Director of Engineering

Background

A software architecture usually accumulates technical debt over time as new features and enhancements are continuously added. That debt leads to performance inefficiencies, makes code and design harder to maintain, and creates other bottlenecks. Software teams try to pay down the debt by refactoring their code periodically. Often, those changes are substantial, which makes them quite risky to release to production.

Problem Statement

One of the many microservices we run at Socure is part of an important product line. Any error or downtime in that microservice would disrupt availability and performance for many customers. We wanted to refactor its code. The service was not causing performance inefficiencies or any other problems affecting an SLA (Service Level Agreement); rather, its accumulated technical debt made the code difficult to maintain.

Challenges

The design and implementation of the refactoring posed several challenges, described below.

The redesign, coding, and basic testing were not particularly challenging. The first real challenge was ensuring full coverage while testing each API exposed by the microservice, so that the refactored code could be qualified for production deployment. Since some of the APIs and their use cases had been in use for a few years, it was difficult to achieve 100% coverage by testing against the legacy code.

The next challenge surfaced before moving the refactored code to production. Because this is a legacy service, it supports a large set of use cases, and many permutations and combinations of those use cases have been served over the years. While most of them were validated in the pre-production environment, there were also unknown use cases that had long been supported only in production. The challenge, then, was: what if those use cases break right after the refactored code is deployed to production? Rolling out the refactored code solely on the basis of pre-production testing would be quite risky, and the fact that this microservice implements the call flows for some of our largest customers adds further to that risk.

Therefore, we had to achieve two challenging goals:

  1. None of the new APIs would cause any functional errors.
  2. All of the new APIs would function exactly the same as their existing API counterparts for the same input request.

Additionally, qualifying the refactored code within a short span of time would be challenging but essential.

Solution

This is how we implemented the solution:

Evaluation of the new APIs in production

The APIs of this service are invoked by another service. We employed a shadow framework that executes both the live (current) path and the shadow (new) path and compares their results in real time. This was done for about 10% of all production traffic over a period of a few days (both the percentage and the duration were configurable). The results were logged and metrics were emitted. Although both versions of the APIs were invoked, only the result of the old API was returned to the customer, to avoid any regressions.

The results were compared in near real time, and the associated deviation metrics were analyzed offline. We observed that every call in the production environment yielded exactly the same result from the old code and from the new (refactored) code. This gave us high confidence that the new APIs were on par with the old ones.
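To make the mechanics concrete, here is a minimal sketch of how such a shadow comparison might look. The function names, the metrics interface, and the sample-rate constant are illustrative assumptions, not Socure's actual framework.

    import logging
    import random

    SHADOW_SAMPLE_RATE = 0.10  # configurable share of production traffic to shadow

    def handle_request(request, legacy_api, refactored_api, metrics):
        """Serve the response from the legacy path; shadow the refactored path for a sample."""
        legacy_response = legacy_api(request)

        # Shadow only a configurable slice of traffic to limit extra cost and latency.
        if random.random() < SHADOW_SAMPLE_RATE:
            try:
                shadow_response = refactored_api(request)
                if shadow_response == legacy_response:
                    metrics.increment("shadow.match")
                else:
                    metrics.increment("shadow.mismatch")
                    logging.warning("Shadow mismatch for request %s", request.get("id"))
            except Exception:
                metrics.increment("shadow.error")
                logging.exception("Refactored path failed in shadow mode")

        # During evaluation, customers always receive the legacy result.
        return legacy_response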

Why did we choose to experiment with only 10% of the production traffic?

By analyzing several metrics in production, including call patterns that show up only sporadically, we concluded that 10% of the traffic over this period would cover nearly all the use cases. The volume of calls falling within that sample was large enough to cover all the use cases actively being served (they are all “read” calls, no writes). Importantly, the shadow calls incurred extra cost and slightly increased response time (owing to the comparison), so we did not want to experiment with a larger share of the production traffic.

Phased rollout of the new APIs

Although we had high confidence that the new APIs were working well, we were still cautious about the rollout strategy. We planned to roll out the refactored code path gradually, so we built a fine-grained feature-flag tool that works across multiple dimensions to support a canary deployment. This let us control both the overall share of production traffic and the set of customer accounts that would take the refactored code path.
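As a rough sketch, the per-request routing decision could look like the following. The flag names and the hash-based bucketing are assumptions made for illustration, not necessarily the exact mechanism we used.

    import hashlib

    # Hypothetical flag state; in practice this comes from the feature-flag service.
    FLAGS = {
        "refactored_path.enabled": True,                              # global kill switch
        "refactored_path.accounts": {"acct-internal-1", "acct-internal-2"},
        "refactored_path.percent": 10,                                # traffic share for enabled accounts
    }

    def use_refactored_path(account_id, request_id):
        """Decide, per request, whether to route to the refactored code path."""
        if not FLAGS["refactored_path.enabled"]:
            return False  # flipping this flag rolls everyone back instantly
        if account_id not in FLAGS["refactored_path.accounts"]:
            return False
        # Deterministic bucketing keeps a given request on a consistent path.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        return bucket < FLAGS["refactored_path.percent"]

The same global flag doubles as the worst-case off switch described below.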

When the refactored code was enabled through the feature-flag framework, we first verified it using many internal accounts, diverting 100% of the traffic from those test accounts to the refactored code path. For those calls, no parallel (shadow) calls were made; only the new version of the APIs was invoked. Load tests and other tests were performed to verify functionality and stability.

Once the test results were positive, we opened the gates for customer accounts gradually: first 10% of traffic for a subset of customers, then all remaining customers with an increasing share of traffic (20%, 30%, and so on) until every customer was dialed up to 100% on the refactored code path.
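Expressed as data, the ramp-up can be captured in a simple schedule that the flag framework steps through. The stages below are illustrative and simplified, not the exact groups or percentages we used.

    # Illustrative ramp-up schedule (simplified); each stage updates the feature flags.
    ROLLOUT_STAGES = [
        {"accounts": "internal-test", "percent": 100},  # internal accounts only, no shadow calls
        {"accounts": "customer-subset", "percent": 10},
        {"accounts": "all-customers", "percent": 20},
        {"accounts": "all-customers", "percent": 30},
        # ...continue stepping up...
        {"accounts": "all-customers", "percent": 100},
    ]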

During all these steps, we gathered metrics and logs to detect any errors or spikes in latency caused by the new code. We did not experience any errors at any point. Still, as a safeguard against the worst case, we kept a global feature flag that could turn off the refactored code for all customers or for a particular set of customers. We never needed it; thanks to the evaluation phase, the gradual enablement went smoothly.

Overall, the phased deployment paid off. Even though it took some extra effort to build the phased-deployment framework, it was completely worth the time and effort to ship the refactoring of a critical service to production with zero impact to customers.

Summary of learnings

  • Plan
  • Profile the application with available data
  • Understand the different knobs needed to control and shape traffic
  • Test thoroughly (functional, load, etc.)
  • Determine the key metrics for a successful production cutover, beyond the controls
  • Determine the de-risking factors for a successful migration: customers, product, API, etc.
  • Use the tooling developed to monitor across multiple hours and days (to improve coverage)
  • Communicate internally and externally; avoid surprising upstream and downstream stakeholders
