So Long, Legacy! A Tale of 2 Peel-offs

Nicole Tempas
Published in SSENSE-TECH · Jan 7, 2022 · 8 min read

As much as the SSENSE Tech Team would love for our code to be a work of timeless beauty, we, like other rapidly growing companies, have shifting requirements and developers of differing opinions. This puts us in the predicament of working with a production environment that once solidly supported a small startup but now struggles to support a company whose technology underpins an industry innovator with impact across the globe.

Enter the architecture peel-off. Understanding when and how to peel off is an integral part of steering the evolution of a system’s architecture.

I’ll be examining 2 peel-offs currently in progress at SSENSE. Let’s start with the rationale behind each.

The Breaking Point

These days SSENSE tech handles a high volume of traffic without so much as batting an eye. But it was only a few seasons ago that the drop of a sought-after sneaker or the first day of markdown would result in an astronomical number of requests, hammering the legacy database to the breaking point and bringing the whole website down with it.

One particular flow was the culprit: fetching customer data. Migrating legacy data to a new customer data source had started but was far from complete. The new customer data source wasn’t yet the source of truth, so every call to fetch customer data also went to the legacy database to sync.

Modification Madness

In theory, adding a single email template for the launch of a new language would be a breeze, but the labyrinthine legacy returns system is a far cry from logical or predictable. Errors appear and disappear with such peculiarity that they perplex even the tech lead known as the ‘Elder Wizard’. Simple features take a full sprint, and as SSENSE expands returns perks to new regions, making changes slows to a snail’s pace while developers navigate a difficult-to-follow codebase.

Now that we’ve laid out the challenges we’re facing, these 2 problems may seem very different. At first glance, the solutions for an overloaded database crashing a website and for a difficult-to-modify system blocking straightforward development and bug fixing don’t sound like they’d have much in common, but the underlying approach, strategy, and areas of focus converge in unexpected ways.

The Peel-offs

Customer Peel-off

Let’s start with the customer peel-off, where the primary objective was to alleviate the load on an overburdened database.

When a problem is having a big impact, putting a temporary fix in place can help while developers and architects dedicate time to designing a better system. The team did this by adding a circuit breaker that would assume the new datastore was the source of truth whenever the system approached the breaking point. Although not without drawbacks, this ensured the website would stay up until a more comprehensive solution could be implemented.
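
To make the idea concrete, here is a minimal circuit-breaker sketch in the spirit of one common variant that opens on repeated sync failures. The helpers `fetchFromNewStore` and `syncWithLegacy` and the thresholds are hypothetical stand-ins, not SSENSE’s actual implementation.

```typescript
// Minimal circuit-breaker sketch: when legacy sync calls keep failing,
// treat the new customer datastore as the source of truth and skip the
// legacy sync until a cooldown passes.

type Customer = { id: string; email: string };

// Hypothetical data-access helpers standing in for the real clients.
declare function fetchFromNewStore(customerId: string): Promise<Customer>;
declare function syncWithLegacy(customerId: string): Promise<void>;

export class LegacySyncBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5, // failures before the breaker opens
    private readonly cooldownMs = 30_000,  // how long it stays open
  ) {}

  private isOpen(): boolean {
    if (this.failures < this.failureThreshold) return false;
    if (Date.now() - this.openedAt > this.cooldownMs) {
      this.failures = 0; // half-open: let traffic try the legacy sync again
      return false;
    }
    return true;
  }

  async getCustomer(customerId: string): Promise<Customer> {
    if (!this.isOpen()) {
      try {
        await syncWithLegacy(customerId);
      } catch {
        this.failures += 1;
        this.openedAt = Date.now();
      }
    }
    // Reads are always served from the new datastore.
    return fetchFromNewStore(customerId);
  }
}
```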

Once the circuit breaker fix was in place, the question became: is the high volume of calls spread out across the service or concentrated on a few endpoints? A single endpoint turned out to be the culprit. Since excessive calling of that endpoint was itself substantially contributing to crashing the website, the ideal scenario would be to make the new customer data source the source of truth for that specific subset of customer data, eliminating the need for the legacy sync call. Everything else could stay behind a circuit breaker as a precautionary measure, but in all likelihood peeling off this subset of customer data would have a monumental impact.

Now that their focus was clear, the team embarked on a meticulous code investigation to identify all write operations to the legacy database that were not reflected in the new data source, adding logs to gain visibility into the legacy stack. Once those places were identified, the team used AWS SNS/SQS to reconcile the data discrepancy. At each spot where a write call was made to the legacy database, they emitted a message to an SNS topic. To avoid additional load on their service, a worker processed these messages, calling a lambda that updated the new source of truth.
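
As a hedged sketch of that fan-out, where the topic name, message shape, and `updateNewCustomerStore` helper are invented for illustration, the publisher and the worker might look something like this:

```typescript
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";
import type { SQSEvent } from "aws-lambda";

const sns = new SNSClient({});
// Hypothetical topic collecting every legacy customer write.
const CUSTOMER_WRITES_TOPIC_ARN = process.env.CUSTOMER_WRITES_TOPIC_ARN ?? "";

// Called at each legacy write spot the investigation uncovered: publish the
// change so it can be replayed against the new customer source of truth.
export async function publishCustomerWrite(
  customerId: string,
  changes: Record<string, unknown>,
): Promise<void> {
  await sns.send(
    new PublishCommand({
      TopicArn: CUSTOMER_WRITES_TOPIC_ARN,
      Message: JSON.stringify({ customerId, changes, writtenAt: new Date().toISOString() }),
    }),
  );
}

// Hypothetical persistence helper for the new customer datastore.
declare function updateNewCustomerStore(
  customerId: string,
  changes: Record<string, unknown>,
): Promise<void>;

// Worker lambda: an SQS queue subscribed to the topic (with raw message
// delivery enabled) batches the messages, keeping the extra load off the
// customer service itself.
export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    const { customerId, changes } = JSON.parse(record.body);
    await updateNewCustomerStore(customerId, changes);
  }
}
```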

With markdown season fast approaching, it was time to put the solution to the test. SSENSE services undergo rigorous load testing before every markdown period, and combined with the use of Feature Flags in multiple environments, the team was able to emulate the high traffic they were expecting.
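
A minimal sketch of how such a flag could steer traffic during a load test, assuming a generic `isEnabled` feature-flag client and hypothetical read helpers rather than the actual SSENSE code:

```typescript
// Per-environment feature flag guarding the peeled-off read path so load
// tests can exercise it before real markdown traffic arrives.
declare function isEnabled(flag: string, environment: string): Promise<boolean>;
declare function fetchFromNewStore(customerId: string): Promise<unknown>;
declare function fetchWithLegacySync(customerId: string): Promise<unknown>;

export async function getCustomer(customerId: string, environment: string): Promise<unknown> {
  if (await isEnabled("customer-peel-off", environment)) {
    return fetchFromNewStore(customerId); // new path: no legacy sync call
  }
  return fetchWithLegacySync(customerId); // original path, still protected by the circuit breaker
}
```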

Returns Peel-off

Now let’s explore an entirely different challenge: creating a new system from scratch. This can be a daunting prospect, but the team was aligned on their first 2 goals:

1) Replicate all data in a well-designed system that would become the new source of truth.

2) Own the initiation of certain key processes that would need to be updated to support the expansion of SSENSE Fulfillment Centers.

After extensive digging into the depths of the legacy code, the team mapped out all existing flows in Lucidchart. What data was passed through which service was carefully documented, with notes on database tables and columns as well.

Now that the team had an understanding of the data they’d be working with and the feature parity to be achieved, the technical decision making began in earnest. A concise data model was sketched out. Access patterns were brainstormed and, in conjunction with the data model, used to select the type of data store for the new source of truth. Collaborating closely with the team’s architect, the choice was made to build the new system as a TypeScript microservice composed of a sequence of lambdas orchestrated by AWS Step Functions. It would be accessible to clients and call dependent services through API Gateway, and would draw on the strengths of AWS SNS/SQS to synchronize with legacy and communicate updates to dependent services.
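
To give that architecture a concrete shape, here is a rough AWS CDK sketch of the wiring described above. The construct names, the placeholder state, and the legacy topic ARN are illustrative assumptions, not the real returns service.

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as sns from "aws-cdk-lib/aws-sns";
import * as sqs from "aws-cdk-lib/aws-sqs";
import * as subs from "aws-cdk-lib/aws-sns-subscriptions";
import * as apigw from "aws-cdk-lib/aws-apigateway";

export class ReturnsServiceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Step Functions orchestrates the service's lambdas. A Pass state stands
    // in here for the real definition, which chains the returns lambdas.
    const stateMachine = new sfn.StateMachine(this, "ReturnsStateMachine", {
      stateMachineType: sfn.StateMachineType.EXPRESS, // required by the synchronous API Gateway integration below
      definition: new sfn.Pass(this, "PlaceholderStep"),
    });

    // Clients reach the service through API Gateway.
    new apigw.StepFunctionsRestApi(this, "ReturnsApi", { stateMachine });

    // The new service publishes its own events for dependent services…
    new sns.Topic(this, "ReturnsEvents");

    // …and a queue subscribed to the legacy topics feeds the data sync
    // (the legacy topic ARN here is purely a placeholder).
    const legacyTopic = sns.Topic.fromTopicArn(
      this,
      "LegacyReturnsTopic",
      "arn:aws:sns:us-east-1:123456789012:legacy-returns-events",
    );
    const returnSyncQueue = new sqs.Queue(this, "ReturnSyncQueue");
    legacyTopic.addSubscription(new subs.SqsSubscription(returnSyncQueue));
  }
}
```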

Phase 1 started with a clean slate, an exciting position to find yourself in. It’s not every day that developers have the chance to start fresh with no technical debt. The first step was cloning the skeleton repo and building the state machines, the backbone of the new system.

Data parity was achieved by subscribing to legacy SNS topics, with the state machines ingesting the messages and performing the corresponding CRUD operations. After surmounting numerous obstacles, including a particularly challenging ORM, the team succeeded in saving all data to both the legacy and new systems and importing the legacy data into the new source of truth to ensure backward compatibility.
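
A minimal sketch of that ingestion step, with the message shape, the operation names, and the `ReturnsRepository` interface assumed purely for illustration:

```typescript
// Sketch of the sync step: a legacy SNS message, delivered via the
// return-sync SQS queue, is mapped to a CRUD operation on the new
// returns source of truth.

interface LegacyReturnMessage {
  operation: "created" | "updated" | "cancelled"; // assumed event types
  returnId: string;
  payload: Record<string, unknown>;
}

// Hypothetical repository over the new returns datastore.
interface ReturnsRepository {
  create(id: string, data: Record<string, unknown>): Promise<void>;
  update(id: string, data: Record<string, unknown>): Promise<void>;
  delete(id: string): Promise<void>;
}

export async function syncFromLegacy(
  message: LegacyReturnMessage,
  repo: ReturnsRepository,
): Promise<void> {
  switch (message.operation) {
    case "created":
      await repo.create(message.returnId, message.payload);
      break;
    case "updated":
      await repo.update(message.returnId, message.payload);
      break;
    case "cancelled":
      await repo.delete(message.returnId);
      break;
  }
}
```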

Buoyed by the success of creating a new microservice from scratch, the team dived into Phase 2. As SSENSE continues to grow, evolving global markets and their corresponding business requirements demand flexible systems able to adapt to a cascade of changes across the entire architecture. The returns architecture was designed with this in mind. Taking advantage of their new system’s modifiability, the team refactored the state machines to add lambdas that initiated key processes previously triggered in other services, including the business logic to support fulfillment center expansion, and used Feature Flags to control the release process.
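
One way such a feature-flagged step could be bolted onto an existing chain is sketched below, again with illustrative names: a Choice state reads a flag from the execution input and only invokes the new initiation lambda when it is on, otherwise legacy keeps triggering the process.

```typescript
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

// Appends a feature-flagged initiation step to an existing state machine
// definition. The flag is expected in the execution input under
// "$.flags.ownInitiation" (an illustrative path, not the real contract).
export function withProcessInitiation(
  scope: Construct,
  existingChain: sfn.Chain,            // e.g. the validate → format → save → emit chain
  initiateProcessFn: lambda.IFunction, // new lambda that owns the key process
): sfn.Chain {
  const initiate = new tasks.LambdaInvoke(scope, "InitiateFulfillmentProcess", {
    lambdaFunction: initiateProcessFn,
  });
  const legacyStillTriggers = new sfn.Pass(scope, "LegacyStillTriggers");

  // The Choice state acts as the release gate for the newly owned process.
  const releaseGate = new sfn.Choice(scope, "NewInitiationEnabled?")
    .when(sfn.Condition.booleanEquals("$.flags.ownInitiation", true), initiate)
    .otherwise(legacyStillTriggers);

  return existingChain.next(releaseGate);
}
```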

What do these thoughtful approaches to solving both problems have in common?

  • Quality implementation is anchored in a thorough investigation.

Arguably, at no point is it more critical to understand your dependencies than when your changes cause something other systems rely on to no longer exist.

The customer team went through the painful process of combing file by file through the legacy repositories and adding logs to each database operation to determine which reads and writes to the legacy database were not reflected in the new customer source of truth. This thoroughness paid off: they could then transition with confidence to using only the new customer source of truth for that subset of data, making the legacy sync calls unnecessary.

It can be just as valuable to know what you do not want to do as what you do want to do.

The Returns team’s Lucidchart flows revealed an unprecedented amount of complexity, with frequent database reads and writes to multiple data sources. Although the bulk of the logic was in the legacy system, almost a dozen services were involved in a spiderweb that stretched from the website, to legacy, to the shipping services. The team resolved to simplify, consolidating into a single microservice whose initial state machine received data and performed 4 core operations: validate, format, save to the returns source of truth, and emit events. Each operation was then implemented as a lambda function with encapsulated code, invoked on demand and generic enough to reuse. In addition, the data model was stripped down to essential domain-relevant information.
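
As an illustrative sketch, with the construct IDs and lambda references as placeholders, those four core operations could be chained roughly like this:

```typescript
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

// Chains the four core operations of the initial returns state machine:
// validate → format → save to the returns source of truth → emit events.
// Each step invokes a small, reusable lambda.
export function buildReturnsDefinition(
  scope: Construct,
  fns: {
    validate: lambda.IFunction;
    format: lambda.IFunction;
    save: lambda.IFunction;
    emit: lambda.IFunction;
  },
): sfn.Chain {
  return new tasks.LambdaInvoke(scope, "Validate", { lambdaFunction: fns.validate })
    .next(new tasks.LambdaInvoke(scope, "Format", { lambdaFunction: fns.format }))
    .next(new tasks.LambdaInvoke(scope, "SaveToSourceOfTruth", { lambdaFunction: fns.save }))
    .next(new tasks.LambdaInvoke(scope, "EmitEvents", { lambdaFunction: fns.emit }));
}
```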

  • Event-based communication can be a valuable, loosely coupled way to synchronize disparate processes.

In the customer peel-off, instead of pointing each write call they found to both the legacy and new data sources, they emitted to an SNS topic and had a worker with a lambda process the messages and save to the new source of truth. This vastly reduced the changes they had to make in other systems and kept the majority of the work within their scope of control. There were drawbacks, since ideally all calls should point to the new data source so that it is truly a source of truth, but as an initial solution it fulfilled the requirements, allowing the team to move forward with the speed they needed to deliver before the approaching markdown period.

For the returns peel-off, the entire legacy data sync was powered by a return-sync SQS queue that subscribed to the legacy SNS topics, using their data to populate the payload before saving to the returns source of truth. This enabled returns to retain the most up-to-date information across a return’s lifecycle and achieve data synchronicity without cumbersome legacy changes.

The End Results

So how did everything work out? Did the website stay stable during the next markdown period? Could the developers build new features to support growth across the world?

Yes! The customer peel-off succeeded in alleviating pressure on the legacy database when the new data source no longer needed to sync back. During the next markdown period the website was much more resilient, able to stay up amidst the exponential increase in requests. As for returns, they’ve built the features to support the expansion of our fulfillment centers in record time.

Moving forward, both the customer and returns peel-offs have further phases planned to build on the work that’s been done, continuing to improve the resiliency of our systems and technology’s ability to serve as a catalyst for SSENSE growth and innovation.

Editorial reviews by Liela Touré and Mario Bittencourt. Want to work with us? Click here to see all open positions at SSENSE!
