Modernizing a Legacy Endpoint and Why It’s Worth It: A Step-by-Step Guide

TL;DR:
Modernizing a legacy endpoint led to 30% faster response times, 77% smaller payloads, and 1 million fewer daily database fetches. Read on to find out how.

Poorvi Sharma
Booking.com Engineering
11 min read · Dec 3, 2024


Booking.com seamlessly connects millions of travellers across the globe to enjoy memorable experiences. From a small Dutch start-up, we have grown into one of the world’s leading digital travel companies. But as we’ve grown, so has the complexity of our systems. It’s no surprise that we’ve accumulated a lot of technical debt over the years: a majority of our business logic still sits in a Perl monolith. To stay ahead of the curve, we’re continuously investing in modernization efforts across different dimensions to simplify the architecture and reduce friction points.

Last year, Booking’s app platform team took one such initiative to revisit the non-functional endpoints of our Perl monolith. We evaluated each endpoint, migrated the relevant ones to a new Java service, and shut down the rest, thus breaking free from the Perl monolith.

One of the most challenging yet interesting ones was the Update Management API, since it had become a service of its own due to excessive responsibility overloading. This blog post tells the story of how we broke this complex functionality into manageable pieces and migrated it incrementally to Java, all while resolving unexpected challenges, maintaining backward compatibility, and optimizing the system for better performance. We believe these steps serve as a template that can be applied anywhere to tackle technical debt of any scale.

Understanding the Starting Point: A Legacy Endpoint

The Update Management API is a 14-year-old endpoint, introduced when automatic app updates weren’t a feature in mobile apps. This endpoint ensured that users were prompted to update their apps when crucial bug fixes or features were released, aiding in version control.

By 2023, this endpoint had grown into a massive one serving 21 functionalities beyond update prompts, such as showing COVID banners, auth validation, and cache resets. Moreover, these features belonged to 7 different teams across the organization. The endpoint had become increasingly difficult to manage: highly volatile, with long latency and a codebase that resembled spaghetti. Migrating it to Java required more than a straightforward lift-and-shift approach.

Update Management API Evolution Over the Years

Planning the Migration: A Thoughtful, Phased Approach

Rather than converting the existing Perl code directly to Java, which would simply have re-plated the spaghetti, we took a strategic approach:

Steps in Effective Modernization Planning

Step 1: Create a Visualization

The first step was identifying the distinct pieces of functionality within the endpoint. While the service file contained 1,500 lines of code, this was merely the tip of the iceberg: each line potentially referenced numerous libraries, functions, or modules that either needed to be migrated to Java or restructured as separate services. The true scope of the migration extended far beyond the visible code, requiring deep analysis and decomposition of tightly coupled dependencies. To keep things simple, without getting too deep into how each piece of functionality was implemented, we drew a UML activity diagram focusing on what this endpoint served and when.

Step 2: Identify Internal and External Stakeholders

Although the legacy endpoint lacked documentation, git blame was enough to identify the authors of the individual blocks. We used this to identify the teams that were tenants on this endpoint, classifying Platform team members as internal stakeholders and other contributors as external.

Step 3: Divide Horizontally by Identifying Cohesion

Going through the code and drafting the visualization made one thing very apparent: the Update Management API endpoint exhibited temporal cohesion — it grouped unrelated functionality that was only needed at app startup. This was a major red flag. While not the worst kind of cohesion (like coincidental cohesion), it signalled that the endpoint could be split into parts, reducing its complexity. We identified two main functions:

  • one determining whether an update was required (the Update Management API),
  • and the other providing configuration settings for app startup (the Init Configuration API).

Functions that didn’t fall into either bucket would need a new home.
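
To make the split concrete, here is a minimal Java sketch of how the two responsibilities could be expressed as separate endpoints. The class names, routes, fields, and version policy are illustrative assumptions, not the actual service code; Spring Web is assumed purely for the example.

```java
// A minimal sketch only: class names, routes, and the version policy are hypothetical.
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Responsibility 1: decide whether the client needs to update.
@RestController
class UpdateManagementController {

    record UpdateCheckResponse(boolean updateRequired, String minimumSupportedVersion) {}

    @GetMapping("/update-management/check")
    UpdateCheckResponse check(@RequestParam String platform, @RequestParam String appVersion) {
        // Hypothetical policy: any major version below the minimum supported one must update.
        int minimumSupportedMajor = 30;
        int clientMajor = Integer.parseInt(appVersion.split("\\.")[0]);
        return new UpdateCheckResponse(clientMajor < minimumSupportedMajor, minimumSupportedMajor + ".0");
    }
}

// Responsibility 2: provide the configuration an app needs at startup.
@RestController
class InitConfigurationController {

    record InitConfiguration(boolean searchRedesignEnabled, int cacheTtlSeconds) {}

    @GetMapping("/init-configuration")
    InitConfiguration configuration(@RequestParam String platform) {
        // Hypothetical startup settings; in reality these would come from configuration storage.
        return new InitConfiguration(true, 3600);
    }
}
```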

Step 4: Divide Vertically by Modularization

Once we had a clear separation of concerns in terms of responsibility, we also decided to modularize the logic by platform (iOS vs. Android). Each platform had slightly different requirements and workflows. By isolating Android-specific logic from iOS-specific logic, we ensured that each platform’s functionality was handled independently. This reduced coupling between platform-specific code, making it easier to test and debug issues unique to one operating system without affecting the other.
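
A minimal sketch of what this platform split might look like in the Java service, with each platform’s rule kept behind a common interface. The interface, class names, and version thresholds are hypothetical, not the real implementation.

```java
// Illustrative sketch: platform-specific update rules isolated behind a common interface.
interface UpdatePolicy {
    boolean updateRequired(String appVersion);
}

class AndroidUpdatePolicy implements UpdatePolicy {
    @Override
    public boolean updateRequired(String appVersion) {
        // Android-specific rule, e.g. driven by Play Store rollout constraints (hypothetical threshold).
        return majorVersion(appVersion) < 30;
    }
    private int majorVersion(String v) { return Integer.parseInt(v.split("\\.")[0]); }
}

class IosUpdatePolicy implements UpdatePolicy {
    @Override
    public boolean updateRequired(String appVersion) {
        // iOS-specific rule, e.g. tied to minimum supported OS versions (hypothetical threshold).
        return majorVersion(appVersion) < 29;
    }
    private int majorVersion(String v) { return Integer.parseInt(v.split("\\.")[0]); }
}

class UpdatePolicies {
    // Each platform's logic can now be tested and changed independently.
    static UpdatePolicy forPlatform(String platform) {
        return "android".equalsIgnoreCase(platform) ? new AndroidUpdatePolicy() : new IosUpdatePolicy();
    }
}
```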

Step 5: Identify Obsolete Fields

Before migrating the endpoint, it’s crucial to identify obsolete fields to avoid unnecessary migration effort. We started with the internal stakeholders, sharing the UML diagram for review and asking them to help identify fields that were no longer in use. We focused on two criteria:
1. whether the field is used in the presentation layer or anywhere else,
2. whether null handling or a default value is provided in the app’s response-parsing code.
This yielded a set of deprecated fields ready to be cleaned up. But with apps, the cleanup process is never straightforward, since older app versions could still rely on certain fields. To ensure stability, we used A/B experimentation to validate these field removals and provided default backend values where necessary to avoid crashes. For the logic owned by external teams, we decided to carry out the discussion at a later stage, after doing some groundwork.
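
As a rough illustration of that validation step, the sketch below gates a candidate field behind an experiment and keeps serving a backend default to the control group. The `ExperimentClient` interface and the field names are hypothetical, not our actual experimentation framework.

```java
// Illustrative sketch of experiment-gated field removal with a safe backend default.
import java.util.LinkedHashMap;
import java.util.Map;

class StartupResponseBuilder {

    // Hypothetical stand-in for an A/B experimentation client.
    interface ExperimentClient {
        boolean isInVariant(String experimentName, String userId);
    }

    private final ExperimentClient experiments;

    StartupResponseBuilder(ExperimentClient experiments) {
        this.experiments = experiments;
    }

    Map<String, Object> build(String userId) {
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("update_required", false);

        // Candidate-for-removal field: only the control group keeps receiving it,
        // and it gets a harmless default value so older app versions don't crash.
        if (!experiments.isInVariant("remove_legacy_banner_field", userId)) {
            response.put("legacy_banner", "");
        }
        return response;
    }
}
```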

Step 6: Plan Iterative Migration

Using the insights gathered in the five steps above, we divided the overall migration into phases.

Phase 1: Migrate core update logic and clean up obsolete fields.

Phase 2: Migrate non-update logic and finalize external stakeholder integrations.

Phase 3: Update client logic to call the new endpoints.

Navigating the Challenges in Migration

As we worked through the migration, the complexity of the Update Management API endpoint became more apparent. It was like peeling an onion — every layer revealed new dependencies, deeper connections, and a few tears along the way.

Phase 1 Challenges

The Analytics Parameters Mystery

In the Phase 1 experiment, we discovered that blacking out some of the deprecated fields caused a significant drop in marketing booking metrics. Initially, it wasn’t clear which field was responsible, so we launched a multivariate experiment with eight variants, each blacking out one of the identified obsolete fields. This data-driven debugging approach helped us pinpoint the exact field causing the issue: analytics parameters.

  • The discovery was unexpected because this field was initially thought to be obsolete based on discussions with the stakeholders. According to them, analytics parameters were used for a third-party tracking tool that was decommissioned in 2020.
  • We assumed some older clients might still be using it and decided to validate this hypothesis by only sending `analytics parameters` for versions released before August 2023 (the start of our analysis) and blacking it out for newer versions. This failed too: marketing booking metrics remained negatively impacted, signalling the need for a deeper investigation.
  • Next, we shifted focus to the client-side code and uncovered that the `analytics parameters` name was misleading: the field was involved in deeplinking logic. We collaborated with the deeplinking owners, and by analyzing events for a range of scenarios in our event store, we identified a pattern revealing that the deeplinkId was being extended beyond its intended lifespan due to faulty handling of the `analytics parameters` response in the Update Management API. Removing these parameters resolved the issue and restored accurate tracking.
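
For illustration, the sketch below shows the shape of such a “one field per variant” setup, where a metric regression in a single variant points directly at the responsible field. The variant and field names are hypothetical.

```java
// Illustrative sketch of the multivariate "one field per variant" debugging setup.
import java.util.List;
import java.util.Map;

class FieldBlackoutExperiment {

    // Each variant blacks out exactly one suspected-obsolete field, so a metric
    // drop in one variant identifies the responsible field.
    static final Map<String, List<String>> BLACKOUT_BY_VARIANT = Map.of(
            "control", List.of(),
            "v1", List.of("analytics_parameters"),
            "v2", List.of("covid_banner"),
            "v3", List.of("legacy_cache_reset_flag")
            // ... one variant per remaining candidate field
    );

    static void applyBlackout(Map<String, Object> response, String variant) {
        BLACKOUT_BY_VARIANT.getOrDefault(variant, List.of()).forEach(response::remove);
    }
}
```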

Phase 2 Challenges

Coordinating with External Stakeholders

  • Phase 2 posed even greater complexity, since it involved migrating fields related to app configuration and other unrelated logic mostly owned by external stakeholders. These were features we knew little about, which required close collaboration with their owners to ensure the continuous operation of their services.
  • We initiated communication with external stakeholders via their teams’ channels, sharing our plan document, checking whether their logic was still relevant, and suggesting that unfit logic be moved from the Update Management API to separate endpoints. In most cases, stakeholders either no longer needed the logic or were already working on initiatives to move away from our endpoint, allowing us to clean it up immediately or wait for their migration before lifting the functionality.
  • For one stakeholder who couldn’t match our timelines, we mutually agreed to expose a temporary endpoint, called via our Java service, to maintain their operations while they worked on a long-term solution: serving this functionality at a later stage in the booker’s journey, thus moving it away from app startup.

Database Update Failure

A significant issue arose when we disrupted the logic for updating one of our tables via the Update Management API, revealing one of the endpoint’s hidden dependencies: a marketing campaign. When we noticed a drop in conversions for the campaign, we traced it back to our endpoint migration.

To fix this, we stopped the experiment. On the Java service, in addition to putting the fix in place, we added integration tests and enhanced unit tests to ensure that database inserts and updates were handled correctly going forward. We also added the missing alerting and monitoring around these DB operations to enable early detection. With the code fix and these safeguards in place, we reran the experiment and confirmed that the issue was resolved.
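
To give a flavour of those safeguards, here is a minimal JUnit 5 sketch of the kind of regression test we mean: it asserts that handling a request still produces the expected database write. The repository, handler, and campaign names are hypothetical stand-ins for the real persistence layer.

```java
// Illustrative JUnit 5 sketch; all names are hypothetical stand-ins.
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.HashMap;
import java.util.Map;
import org.junit.jupiter.api.Test;

class CampaignVisitUpdateTest {

    interface CampaignVisitRepository {
        void recordVisit(String campaignId);
        int visitCount(String campaignId);
    }

    static class InMemoryCampaignVisitRepository implements CampaignVisitRepository {
        private final Map<String, Integer> counts = new HashMap<>();
        public void recordVisit(String campaignId) { counts.merge(campaignId, 1, Integer::sum); }
        public int visitCount(String campaignId) { return counts.getOrDefault(campaignId, 0); }
    }

    // The piece of endpoint logic under test: it must keep writing the campaign row,
    // otherwise downstream campaign reporting silently breaks.
    static class UpdateRequestHandler {
        private final CampaignVisitRepository repository;
        UpdateRequestHandler(CampaignVisitRepository repository) { this.repository = repository; }
        void handle(String campaignId) { repository.recordVisit(campaignId); }
    }

    @Test
    void handlingAnUpdateRequestRecordsTheCampaignVisit() {
        CampaignVisitRepository repository = new InMemoryCampaignVisitRepository();
        new UpdateRequestHandler(repository).handle("summer_campaign");
        assertEquals(1, repository.visitCount("summer_campaign"));
    }
}
```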

Wrapping It Up

The entire process took over a year; it was a bumpy ride, but the results were remarkable. The improvements listed below are what we saw after finishing Phase 2, despite keeping the monolith as a proxy.

Key Wins

  • 30% Improvement in Latency: We achieved a 30% improvement in wall-clock time, reducing the average response time from 145 ms to 100 ms. This was reflected in the app’s home-screen Time to Interactive and startup time, enhancing the user experience during critical moments like app startup.
  • 77% Reduction in Response Size: By cleaning up deprecated fields and optimizing the response structure, we reduced the payload size by 77%, bringing the p50 response size down from 1 KB to 230 B. This optimization is especially impactful for users on slower networks or with limited mobile data plans, making the app much more responsive under such conditions.
  • Cleaner and Robust Codebase: For an endpoint with numerous dependencies and responsibilities, fragility and errors were inevitable. However, the migration provided an opportunity to invest in better exception handling, default value management, and creation of smaller, well-defined services that are much easier to test. With a deeper understanding of the endpoint’s behavior and by adding thorough unit and integration tests to the new Java service, we achieved a 75% reduction in warnings and an 80% decrease in exceptions, significantly enhancing the overall reliability of the endpoint.
  • Reduced Database Load: During the migration, we noticed inefficiencies such as duplicate database queries. By reorganizing the logic and reusing already-fetched result objects (see the sketch after this list), we cut roughly 1 million database fetches per day.
    Beyond this, lifting the deprecated use cases had a similar impact on DB load: we reduced the average number of database queries per request from 7 to 1, relieving pressure on other databases as well.
  • CPU Usage Reduction: Despite retaining some proxying overhead and some of Perl’s redundant request handling (e.g., context setting, tracking, etc.), CPU usage dropped by 50% (from 60 to 30). This is an immediate cost saving: with the same number of servers, we can serve more requests.
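
As referenced above, the duplicate-query cleanup boiled down to fetching a result once per request and reusing it. The sketch below shows one simple way to express that in Java; the cache class and the loader are hypothetical, not the actual implementation.

```java
// Illustrative sketch of reusing an already-fetched result within one request
// instead of re-querying the database.
import java.util.function.Supplier;

class RequestScopedCache<T> {
    private final Supplier<T> loader;
    private T cached;

    RequestScopedCache(Supplier<T> loader) {
        this.loader = loader;
    }

    // First call hits the database; later calls within the same request reuse the result.
    T get() {
        if (cached == null) {
            cached = loader.get();
        }
        return cached;
    }
}

// Usage within a single request (hypothetical loader):
// var settings = new RequestScopedCache<>(() -> database.fetchUserSettings(userId));
// settings.get();  // 1 database fetch
// settings.get();  // reused, no extra fetch
```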

Key Learnings

  1. Incremental Migration Works Best: Migrating such a complex endpoint requires a phased, iterative approach. Breaking down the functionality into smaller, manageable pieces ensured we could troubleshoot issues in isolation and make improvements gradually.
  2. Collaboration is Crucial: Close collaboration with stakeholders and other teams was essential, especially when dealing with older features or shared logic.
  3. Revisiting Legacy Code Pays Off: Over time, features and functionalities can bloat an endpoint. The incremental degradation might not be obvious day to day, but as the results show, it adds up to a significant impact. Revisiting and refactoring these parts of the codebase helps make systems more maintainable, efficient, and cost-effective.
  4. Expect the unexpected: When working with complex, tightly coupled legacy systems, it’s essential to acknowledge that not all dependencies or implications are apparent from the code. Staying flexible and addressing issues one at a time allowed us to navigate these surprises effectively.
  5. Long-Term Impact: Given that the migration of a single endpoint had a significant impact on CPU usage, database load, and operational cost, imagine the compounding benefits of applying similar optimizations across other high-traffic endpoints. This proactive approach leads to a more cost-effective and innovative future, not only by helping shut down old servers in our data centres but also by lowering computational costs when the code runs in the cloud. Moreover, with cleaner code and architecture, you also improve developer productivity and time to market for new features, because remember:

Code is read more often than it is written.

Paving the Way for Future Migrations

The migration of the Update Management API endpoint from Perl to Java was a long and complex process, but the results speak for themselves. Not only did we improve performance and stability, we also laid the foundation for future migrations across our platform. The lessons learned here can be applied to other, similar initiatives, making this a template for how to tackle technical debt and keep our platform scalable, maintainable, and efficient. If you’ve made it this far, I hope I’ve given you some practical takeaways and convinced you that refactoring and revisiting legacy systems isn’t just an option: it’s essential.

Looking back, the most time-consuming part was not being able to identify the obscure dependencies early on. If only there were an efficient tool that could go over the messy legacy code (which resembles a nested graph), explore every branch to its end, and expose all the dependencies at planning time, the whole journey could have been smoother and faster. With the rapid evolution of technology and the AI wave, how far are we from turning this into reality?
