Evolution of Technology Migrations: A Case Study of a Java 17 Upgrade

Gandharv Srivastava
Capillary Technologies
8 min read · Apr 12, 2024

— In collaboration with Anuj Gupta

Capillary’s platform handles hundreds of millions of customers’ data and transactions every day, with the expectation of an uninterrupted and seamless experience. Behind the scenes, numerous systems and services work together to orchestrate the product experience. At this scale, it is our responsibility to keep these backend systems consistently upgraded and optimized to meet and exceed customer and product expectations. Every significant upgrade brings the task of migration and rollout, all of which must be executed smoothly, without downtime or significant issues.

With this in mind, we decided to upgrade one of our major micro-services, responsible for handling Capillary’s core APIs, from Java 8 to Java 17. The process primarily entailed upgrading our Spring Boot version, which in turn required updating numerous dependencies and libraries.

Preview

In this blog, we will share our migration journey of upgrading a major micro-service from Java 8 to Java 17. We will delve into our approach of simulating production traffic to rigorously test all modifications before proceeding with incremental rollouts. We will also evaluate the advantages of this new methodology in contrast to our previous migration and rollout techniques. Join us as we narrate the story of this significant migration through the lens of a Java version upgrade.

Motivation

While migrations can be challenging, moving to Java 17 has proven to be a wise decision. The benefits are particularly noticeable when dealing with a substantial number of dependencies, as was the case for us. Java 17 brings improved performance, optimized code, and a host of other advantages outlined in the referenced article. Imagine delivering the same feature with less code that is also easier to read, thereby increasing developer productivity. This advantage is significant, especially considering that some dependencies no longer support older Java versions.

Another major motivation behind our upgrade was Coordinated Restore at Checkpoint (CRaC) support, available with Java 17 and Spring Boot 3.2 or above. It promises to reduce our service startup times to milliseconds, a massive improvement over the tens of seconds they used to take. Even though we have not fully achieved this in this project yet, we are actively working on it.

Approach

We aimed for a strategy that allowed us to upgrade the Java and Spring Boot versions confidently without disrupting existing API flows. Simultaneously, we wanted to identify and address major issues related to these upgrades before transitioning any traffic to the new instance. Our previous approaches involved manual testing of all APIs and reliance on automated test suites, but we found that they had significant blind spots. In a multi-tenant system like ours, each tenant uses the product differently, resulting in a wide range of use cases that cannot be fully covered by test suites alone. This diversity of use cases cannot be ignored during a migration of this scale.

To tackle these challenges, we relied on a few key approaches:

  • Canary Releases: A well-tested approach, discussed in detail here, where updates are rolled out to a small subset of users or servers before being deployed to the entire infrastructure. This allows issues to be detected and mitigated early, before wider deployment.
  • Simulation-based Migration: Production traffic is directed to the existing Java 8 instances while traffic is concurrently simulated on the new Java 17 instances.
  • Isolation of Processing: Despite the simulation, the response for each request is produced by a Java 8 instance, while the same request is processed asynchronously and in isolation by a Java 17 instance (a minimal sketch of this mirroring follows the flow descriptions below).
  • Comparative Analysis: Responses from both the Java 8 and Java 17 instances are logged and compared to detect any discrepancies between the two versions.
  • Incremental Redirects: After the identified issues are addressed and resolved, requests are incrementally redirected to the new Java 17 instances, ensuring a gradual transition with near-absolute confidence.
Old flow, where requests are processed by the Java 8 instance
Mirrored flow, where a request is processed by the Java 8 instance while a copy is relayed in parallel to the Java 17 instance and the two responses are compared for issues
Redirected flow, where all requests are directed to the new Java 17 instance
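To make the mirrored flow concrete, below is a minimal sketch of what such a relay could look like, assuming Java 17 and the built-in java.net.http client. The class and member names (MirroringRelay, IncomingRequest, java17BaseUrl) are illustrative stand-ins, not Capillary's actual implementation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Function;

/**
 * Hypothetical mirroring layer: the production response always comes from the
 * Java 8 handler, while a copy of the request is relayed asynchronously to the
 * Java 17 deployment purely for comparison.
 */
public class MirroringRelay {

    /** Minimal stand-in for an incoming production request. */
    public record IncomingRequest(String method, String path, String body) { }

    // Only idempotent methods are mirrored; POST/PATCH go through the
    // record-and-replay path described later in this post.
    private static final Set<String> MIRRORABLE = Set.of("GET", "HEAD", "PUT", "DELETE");

    private final HttpClient client = HttpClient.newHttpClient();
    private final String java17BaseUrl;                           // e.g. "http://java17-service:8080"
    private final Function<IncomingRequest, String> java8Handler; // existing production flow
    private final BiConsumer<String, String> comparator;          // logs mismatches (see Comparator section)

    public MirroringRelay(String java17BaseUrl,
                          Function<IncomingRequest, String> java8Handler,
                          BiConsumer<String, String> comparator) {
        this.java17BaseUrl = java17BaseUrl;
        this.java8Handler = java8Handler;
        this.comparator = comparator;
    }

    public String handle(IncomingRequest request) {
        // 1. Production traffic is always answered by the Java 8 instance.
        String java8Response = java8Handler.apply(request);

        // 2. In parallel, relay a copy to the Java 17 instance and compare the
        //    two responses off the request path.
        if (MIRRORABLE.contains(request.method())) {
            HttpRequest mirrored = HttpRequest.newBuilder()
                    .uri(URI.create(java17BaseUrl + request.path()))
                    .method(request.method(), request.body() == null
                            ? HttpRequest.BodyPublishers.noBody()
                            : HttpRequest.BodyPublishers.ofString(request.body()))
                    .build();
            client.sendAsync(mirrored, HttpResponse.BodyHandlers.ofString())
                  .thenAccept(response -> comparator.accept(java8Response, response.body()));
        }

        // 3. The caller only ever sees the Java 8 response.
        return java8Response;
    }
}
```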

While this approach worked well for idempotent method calls (GET, HEAD, PUT, DELETE), it ran into a problem for non-idempotent requests (POST, PATCH): the simulated flow could inadvertently modify actual production data, leading to possible data corruption and user dissatisfaction.

For non-idempotent calls, we implemented a detailed approach by recording all production requests intended for simulation from the old Java 8 instance. We established two simulation instances: one running Java 8 and the other Java 17. Each instance had mocked responses for external downstream service calls within their respective simulation environments. These simulation services operated with separate databases and infrastructure services such as queues, all kept isolated from production and each other. The databases were initialized from a snapshot of the production database to ensure that both simulation service databases remained synchronized.

After recording the requests, we proceeded to replay them simultaneously on both simulated services, starting from a point in time later than the database snapshot. This ensured that both databases were in sync, and ideally, in bug-free scenarios, they would produce identical behavior, including the identifiers generated by databases. By replaying all recorded requests, we could compare the responses from both simulated services to identify any discrepancies or issues resulting from the Java or library upgrades. This meticulous approach allowed us to catch major issues without modifying production data or making any update calls to downstream micro-services.

Simulation for non-idempotent calls
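As a rough illustration of the recording step, here is a minimal sketch under the assumption that requests are captured into some durable store; the RequestRecorder name, its fields, and the in-memory list are hypothetical stand-ins for the actual infrastructure.

```java
import java.time.Instant;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Hypothetical recorder for non-idempotent traffic: production requests are
 * captured (not mirrored) so they can later be replayed against the two
 * isolated simulation stacks.
 */
public class RequestRecorder {

    /** Immutable snapshot of everything needed to replay a request later. */
    public record RecordedRequest(Instant receivedAt,
                                  long tenantId,
                                  String method,
                                  String path,
                                  String body) { }

    // The real system would use a durable store (database/queue);
    // an in-memory list keeps this sketch self-contained.
    private final List<RecordedRequest> store = new CopyOnWriteArrayList<>();

    public void record(long tenantId, String method, String path, String body) {
        store.add(new RecordedRequest(Instant.now(), tenantId, method, path, body));
    }

    /** Requests recorded after the database snapshot time, in arrival order, ready for replay. */
    public List<RecordedRequest> recordedSince(Instant snapshotTime) {
        return store.stream()
                    .filter(r -> r.receivedAt().isAfter(snapshotTime))
                    .toList();
    }
}
```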

Execution

API Rules

To implement our approach effectively, we introduced API rules, each detailing the following (an illustrative model follows the sample JSON below):

  • API details: The request method and API signature, identifying the specific API.
  • Sampling Ratio: How many requests out of the total should be simulated or directed to the new instance. For example, a value of 10 means that 1 out of every 10 production requests will be randomly simulated or redirected.
  • Endpoint Switch: The state of the API, represented by -1, 0, or 1:
    - Old State (-1): Use the existing flow by routing requests to the Java 8 service.
    - Mirror State (0): Use the simulated flow for idempotent calls; record the request for non-idempotent calls.
    - Redirected State (1): Redirect the API to the new Java 17 service.
  • Fields to be excluded: Fields that the comparator should ignore when comparing responses in the simulation flow. For instance, the `autoUpdateTime` field is generated at response time and therefore differs between the two services, so we chose to ignore it during comparison.
Sample JSON explaining API rules with details explained above
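Since the sample JSON is shown as an image above, here is a hedged Java model of what such a rule could look like; the field names and the EndpointSwitch enum are illustrative, not the exact production schema.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Illustrative model of an API rule; field names are assumptions rather than
 * the exact schema from the sample JSON.
 */
public record ApiRule(String method,                  // e.g. "POST"
                      String apiSignature,            // e.g. "/v2/customers/{id}"
                      int samplingRatio,              // 10 => 1 in 10 requests simulated/redirected
                      EndpointSwitch endpointSwitch,  // -1, 0 or 1 (see below)
                      List<String> excludedFields) {  // e.g. ["autoUpdateTime"]

    /** The three states an API can be in during the migration. */
    public enum EndpointSwitch {
        OLD(-1),        // existing flow: route to the Java 8 service
        MIRROR(0),      // simulate idempotent calls; record non-idempotent ones
        REDIRECTED(1);  // route to the new Java 17 service

        private final int code;
        EndpointSwitch(int code) { this.code = code; }
        public int code() { return code; }
    }

    /** Sampling decision: true roughly once every samplingRatio requests. */
    public boolean shouldSample() {
        return samplingRatio > 0
                && ThreadLocalRandom.current().nextInt(samplingRatio) == 0;
    }
}
```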

Record and Replay

For update requests, after recording them according to the rules described earlier, we proceeded to replay them simultaneously on both dummy instances. The replay operation was configured using tasks that included the following details (an illustrative sketch follows the sample task below):

  • Replay-From: Specifies the time from which the requests should be replayed.
  • Replay-To: Indicates the time until which the requests should be replayed.
  • Tenant IDs: Specifies the tenants for which the requests should be replayed.
  • Requests: Describes the details of the request to be replayed within a specific task.
  • Task Details: Includes necessary information to ensure the replay operation is idempotent, meaning it can be safely executed multiple times without causing unexpected effects.
Sample JSON showing a replay task
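A hedged sketch of the replay-task shape described above; the record and its field names are assumptions rather than the exact schema from the sample JSON.

```java
import java.time.Instant;
import java.util.List;

/**
 * Illustrative shape of a replay task; names mirror the fields described above
 * but are not the exact production schema.
 */
public record ReplayTask(String taskId,                // stable ID so re-running the same task stays idempotent
                         Instant replayFrom,           // replay requests recorded at or after this time
                         Instant replayTo,             // ...and before this time
                         List<Long> tenantIds,         // tenants whose requests should be replayed
                         List<String> apiSignatures) { // which recorded requests this task covers

    /** A recorded request qualifies if it falls in the window and belongs to a selected tenant. */
    public boolean matches(Instant receivedAt, long tenantId) {
        return !receivedAt.isBefore(replayFrom)
                && receivedAt.isBefore(replayTo)
                && tenantIds.contains(tenantId);
    }
}
```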

Both dummy services utilized databases restored from point-in-time snapshots taken from production databases. These snapshots were retained for scenarios where actual issues or bugs arose with the new service. In the event of any issues, our process was as follows:

  1. Fix the actual issue.
  2. Remove the restored databases containing corrupted data.
  3. Restore the databases from snapshots again.
  4. Replay the requests, ensuring the issue is resolved.
  5. Proceed further once the replayed requests are successful without the previous issue.

This approach kept the process idempotent and avoided state-management issues between fixes. Moreover, it allowed us to replay multiple APIs concurrently, compressing requests that would typically span several days into a much shorter window, thus reducing machine costs.

Additionally, we added the ability to trigger follow-up APIs after each call to verify additional data points that were not included in the POST, PUT, or DELETE responses. This verification was executed without directly accessing the database.
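A minimal sketch of such follow-up verification, assuming plain HTTP GETs against hypothetical base URLs for the two simulation stacks; in the real setup, both responses are fed into the comparator described next rather than compared with simple string equality.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Hypothetical follow-up verification: after replaying an update on both
 * simulation stacks, fetch the affected resource from each and compare, so
 * state changes are checked through the API rather than the database.
 */
public class PostReplayVerifier {

    private final HttpClient client = HttpClient.newHttpClient();

    public boolean verify(String java8BaseUrl, String java17BaseUrl, String resourcePath)
            throws Exception {
        String fromJava8 = fetch(java8BaseUrl + resourcePath);
        String fromJava17 = fetch(java17BaseUrl + resourcePath);
        return fromJava8.equals(fromJava17); // simplified; the real comparison ignores excluded fields
    }

    private String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```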

Comparator

We designed the comparator to analyze responses from the two sources, regardless of the API type (fetch or update). The comparator was integrated with a smart dashboard that showed the number of unmatched responses and highlighted the specific unmatched fields within them, significantly reducing the effort spent on false positives during comparison. The dashboard also let us compare response times between the two instances, helping us catch and minimize response-time regressions.
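A simplified sketch of the comparison logic, assuming the two responses have already been parsed into flat maps; the real comparator handles nested structures and drives the dashboard, but the field-exclusion idea is the same.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/**
 * Field-by-field comparison of two parsed responses: fields listed in the API
 * rule (e.g. autoUpdateTime) are skipped, and the names of mismatched fields
 * are returned so the dashboard can highlight them.
 */
public class ResponseComparator {

    public Set<String> unmatchedFields(Map<String, Object> java8Response,
                                       Map<String, Object> java17Response,
                                       Set<String> excludedFields) {
        // Union of keys, so fields missing on either side are also reported.
        Set<String> allFields = new HashSet<>(java8Response.keySet());
        allFields.addAll(java17Response.keySet());

        Set<String> unmatched = new HashSet<>();
        for (String field : allFields) {
            if (excludedFields.contains(field)) {
                continue; // e.g. autoUpdateTime differs by construction
            }
            if (!Objects.equals(java8Response.get(field), java17Response.get(field))) {
                unmatched.add(field);
            }
        }
        return unmatched; // an empty set means the responses match
    }
}
```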

Dashboard of failure counts
Dashboard of the failure trend
Dashboard of response-time trends

Furthermore, we implemented alerting mechanisms to notify us in case of any mismatches. This proactive alerting system enabled us to address issues promptly rather than taking a reactive approach, thereby helping us effectively manage and mitigate potential issues before they escalated.

With this migration, we have already rolled out multiple APIs as part of our continuous, incremental rollout. While we did encounter various issues and bugs related to the upgrades, our simulation-based approach enabled us to catch them during the simulation phase itself. Importantly, no errors have directly impacted our customers so far, which has boosted our confidence in migrating all of the remaining APIs as well.

Key Learning

A technological upgrade in any existing system, especially one with a large user base, can introduce behavior changes due to unknown unknowns. Our simulation-based approach, followed by incremental redirection, played a crucial role in achieving this large-scale migration with minimal disruption. Drawing from past migrations where we relied heavily on automated test suites, we learned that replicating end-user behavior between the current and migrated states in simulation mode helps uncover these unknowns, allowing us to address them before fully transitioning to the new state.

We have benefited greatly from replaying requests on simulations, identifying major issues, addressing them, and then replaying from the start to ensure a smooth transition. This iterative process has enabled us to catch and resolve issues before they impacted production, resulting in a smoother migration experience.

The traditional approach of flipping a switch and redirecting traffic to a new instance with all the new features is no longer feasible. Instead, we have adopted an incremental approach for traffic redirection, with the ability to roll back redirection if significant issues arise. This shift in mindset and methodology has proven to be more reliable and effective in managing complex migrations while minimizing disruptions.
