Strangler Fig Pattern & Safe Language Migration in Action

Ahmet Hatipoglu
Published in Trendyol Tech · 10 min read · Apr 3, 2024

In this article, we will talk about applying the Strangler Fig Pattern and how to achieve a safe rewrite and rollout of a service in production when you don't, or cannot, trust your tests.

strangler fig tree

The post is structured as follows:
1- Introduction: A brief introduction to the team.
2- Concerns: Why we decided to rewrite the service in another language, and the implications of this decision.
3- Traffic Mirroring to Achieve Safety and Gain Speed: How to test the new service against live traffic to make sure your new code is doing exactly what it's supposed to do.
4- Strangler Fig Pattern with Traffic Shifting: How to roll out the new service alongside the legacy service, achieving safety and robust rollout/rollback mechanisms.
5- Conclusion: A wrap-up of the article.

Introduction

As the PDP Buybox team, we are responsible for delivering product data to all teams rendering a product card on Trendyol.com (except search pages), some of which are Storefront, Recommendation, Favorite List, and Checkout.
Downtime in our service means a Trendyol-wide outage and a heavy loss in total revenue. The service powering product data delivery was written years ago in Java over a weekend, and some of its business logic was left unmaintained and undocumented. Using Java also meant longer startup times, since pods take more time to warm up, and more memory usage and CPU cycles compared to Go. Some teams had already adopted Go as their primary language of choice, and our past experiences pushed us towards adopting Go as well; overall, the momentum was going in that direction.

Concerns

To address the above issues, we decided that a rewrite of the service was needed. Full-service rewrites are cool and shiny for tech-savvy people, but risky and time-consuming for product owners and the business in general. Still, we felt that leaving the problem unaddressed would be leaving a ticking time bomb in place.

When you attempt to rewrite a service some questions need to be asked:
1) Are there enough tests?
2) Are there enough tests?
3) Are there enough tests?
4) How do I roll out my cool new service?
5) Can I A/B test my new service on live traffic?
6) How do I roll back to legacy when needed?

The above is not a copy-paste error: "Are there enough tests?" is the most important question when it comes to service rewrites. Can you trust unit tests alone? Do I need integration tests? What about automated E2E tests? Does my test set cover all edge cases? What if I miss an edge case because it's, well, an edge case? Can I trust my testing pipeline?
Ideally, I would like to have mathematical proof that both services produce the same output so I wouldn't have to bother writing test cases at all. But in the meantime, I'm settling for tests instead.

A language migration means that you cannot reuse existing unit tests immediately, and even if you attempt to rewrite them, the rewrite itself is prone to error and not safe enough to depend on. We also have an automated E2E testing pipeline that prepares an isolated test environment, similar to testcontainers, where we run our business tests covering over 97% of the codebase. That's safe enough for us when we roll out a new feature, but still too scary to be deemed safe for this migration. The automated E2E pipeline is also a little slower than we'd like: it takes up to 15–20 minutes to get it up and running, and only then do we get feedback. That doesn't seem long at first, but it can easily add up to an hour or two if your commit ends up having bugs you have to fix.

So if unit tests and automated E2E tests aren't enough, what should we do?

Traffic Mirroring to Achieve Safety and Gain Speed

A common method to follow during service rewrites is to mirror (shadow) live traffic to your new service. In practice, you copy the request and send it to a second destination in addition to the original one. Clients are still served by your legacy (original) service; the new service is not serving any clients yet, it's just receiving a copy of the request.
After getting both responses, we check them for equality and log any mismatch, and voilà, you've just built a continuous test with instant feedback.
There are multiple ways to mirror traffic. At Trendyol we run on-prem and use the istio service mesh on top of our microservices. istio allows us to define traffic mirroring rules very easily at the mesh level. One caveat is that traffic in istio is mirrored in a fire-and-forget fashion, which means the responses are discarded.
To overcome this, one could build a third service that receives the mirrored traffic, redirects the request to both services, and then checks for differences in the response bodies.
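
To make this a bit more concrete, a mesh-level mirroring rule could look roughly like the sketch below. The service names and the namespace are hypothetical; the `mirror` and `mirrorPercentage` fields are the istio mechanism described above.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-legacy        # hypothetical name
  namespace: buybox           # hypothetical namespace
spec:
  hosts:
    - product-legacy          # clients keep calling the legacy service
  http:
    - route:
        - destination:
            host: product-legacy
          weight: 100         # live traffic is still fully served by legacy
      mirror:
        host: checker         # a copy of each request goes to the checker
      mirrorPercentage:
        value: 10.0           # start low, then ramp up towards 100
```

Mirrored requests are fire-and-forget from istio's point of view, so the checker never sits on the client's critical path, and ramping `mirrorPercentage.value` up later is just a config change.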

To put it in pixels:

traffic mirroring and alerting

Can you spot a potential problem in the above design?
Hint: paths 2 and 3.
Spoiler alert: if one isn't careful, requests sent via path 3 can be mirrored again, resulting in a mirrored request back to the checker via path 2, which results in another request via path 3, which results in another mirrored request via path 2, which results in… endless recursion!
This problem isn't visible at small mirroring percentages but could be catastrophic at high ones. Ideally, one should be able to tell whether a request should be mirrored (path #1) or not (path #3) via some header value. If your mirroring solution doesn't support that, you can create a separate copy endpoint for path #3 to send requests to, without activating mirroring on that copy endpoint.

In summary:
1. A request is sent to the legacy service.
2. The istio mesh duplicates the request and sends it to the checker; meanwhile, the legacy service serves clients normally.
3. Once the checker receives the request, it sends two requests, to the legacy and the new service, concurrently.
4. The checker waits for both responses and performs a field-by-field check.
5. In case of any mismatch, the checker writes an error log to stdout.
6. The ELK pipeline collects all logs and sends them to an Elasticsearch instance.
7. A monitoring trigger is defined on top of Elasticsearch logs that fires an alert to a slack channel for each mismatch.

Thanks to this method, we get instant feedback on every commit pushed to production without exposing clients to any bugs just yet. Testing against a fixed test suite with limited combinations is not comforting when you rewrite a service. This method, instead, can be thought of as automatic test generation against almost all possible scenarios.

Traffic mirroring can be configured with a percentage. At first, you might want to set it to a low percentage, since a single bug will likely be reproduced many times and generate lots of noise. Once we stopped getting error logs indicating bugs, we increased the percentage to allow more cases to be checked; ultimately, we wanted to reach a mirroring percentage of 100% without any bugs.
Traffic mirroring has other important benefits as well, like testing for memory leaks and resource usage in general. To measure that, we set the mirroring percentage to 100%. This let us check how the new service would behave under full load before taking it to production.

Sidebar: when defining alerts on Slack or any other internal messaging program, a common concern is that alerts will flood the channel you’re sending alerts to. We have our in-house tools to prevent such cases and reduce the noise-to-signal ratio of our alerts. Make sure you know what you’re doing when you attempt to do the same.

So far we have covered the safety side. Now let’s cover the Strangler Fig Pattern side of the story.

Strangler Fig Pattern with Traffic Shifting

At this point, our cool new service has no bugs and is ready to be rolled out. Let's imagine a scenario where the new service gets a new DNS name. We go ahead and contact our clients to inform them of the DNS change; all they need to do is juuust change the DNS in their config file and redeploy their service. Now imagine that you missed an edge case and your clients are unhappy because your DNS change caused said bug. It's true that we have tested our service via mirrored traffic, but still, stuff happens.
Now your clients have to roll back the DNS change, which, depending on their deployment pipeline, can take some time. If the bug happens to be critical, your users aren't happy, your clients aren't happy, and you've caused an incident that nobody likes, just because you wanted to rewrite your service.
One additional thing to consider: if your application has multiple endpoints and you are migrating only one endpoint at a time, telling your clients to use a different DNS name for each endpoint will be annoying enough to raise a few eyebrows.
In our case, numerous clients use our service, so it's not feasible to reach out to all of them and have them change their config. Also, there is always a chance that someone in your organization is using your service without you knowing about it. This whole thing is a lot of communication and management that we should be able to "software" out of the way.

The Strangler Fig Pattern comes to the rescue. The idea is that, for the functionalities you have migrated so far, you redirect traffic to the new service; otherwise, you continue to get responses from the legacy service. If your service has 4 endpoints, the strangulation could look like this:

Strangler Fig Pattern

In the above pattern, no clients are involved whatsoever. We just redirect traffic, using some form of load balancing, to the desired service whenever we need to, without having to wait for a client's deployment. This speeds up your deployments, makes everything smoother, and eliminates lots of unnecessary communication.

But there are more details to explain for each stage here. Each stage is one endpoint further down the line. This means that to go from one stage to the next, you need to deploy the new endpoint in your new service and then redirect the traffic to it. Let's zoom in here:

What I want is to let a small percentage of my requests coming for `/endpoint1` be fulfilled by the new service, and leave it at that for a while. Then, increase that percentage and sit and watch.
If at any point a bug is reported, I can configure traffic shifting to redirect 100% of the traffic back to my legacy service with no deployments at all, fix the bug, and start again with a low percentage.

In practice, we use istio traffic shifting to make this happen.

Traffic shifting for one endpoint
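
As a rough sketch (again with hypothetical service names), the shifting rule for a single endpoint is just two weighted destinations behind a URI match; moving from one stage to the next, or rolling back, is a change to these weights rather than a deployment.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-routing           # hypothetical name
spec:
  hosts:
    - product                     # the single host clients keep using
  http:
    - match:
        - uri:
            prefix: /endpoint1    # the endpoint being strangled right now
      route:
        - destination:
            host: product-new     # hypothetical new Go service
          weight: 10              # start small, then ramp up
        - destination:
            host: product-legacy
          weight: 90
    - route:                      # everything not migrated yet stays on legacy
        - destination:
            host: product-legacy
```

Rolling back is the same edit in reverse: set the new service's weight back to 0 (or drop the first route) and every request for `/endpoint1` immediately goes back to legacy, with no client involvement.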

Conclusion

When you migrate a service, you need a smooth and safe transition with zero downtime, without generating lots of work for your clients, while reserving the ability to roll back smoothly if any incident arises. You also need both systems to co-exist peacefully. We have seen that you can achieve safety and robustness by implementing traffic mirroring, and a smooth, gradual transition by implementing traffic shifting.

For us, mirroring had a couple of disadvantages. First, consider whether your network will be saturated once you mirror 100% of the traffic. We knew that wouldn't be an issue for us, so I cannot speak to the effect it might have on your network in general. Second, our implementation of the checker service means that as long as mirroring is active, the database has to handle 3x the RPS: for each request your database would normally handle, two extra requests are generated by the checker service. If a 300% increase in RPS is an issue, consider setting the mirroring percentage to a lower value you feel comfortable with. The load is tripled precisely because traffic mirroring in istio is fire-and-forget, as we discussed earlier. We haven't experimented with other solutions, so we cannot speak to the details of the alternatives.

Shifting may add some latency, but safety trumps a couple of extra milliseconds. And at the end of the day (realistically, months), once 100% of your requests are served by the new service, you can still deploy it behind a new DNS name and ask your clients to update their config accordingly, if you really hate your old DNS.

As a team of developers, it's a privilege to tackle this kind of engineering problem in a real-life scenario. The istio service mesh was a lifesaver for us; it's super easy to configure. However, none of this would have been as smooth and safe as it was without the Platform and SRE teams helping us. So, a special shoutout to the invisible heroes behind this setup.

We’re always looking for passionate and talented individuals to join our team. Learn more and apply from the link below.
