Behind the new WeTransfer: why strangulation works
I often talk to engineers about what we do, and they tend to be surprised at the amount of engineering that goes into our service. You see, even though our proposition is simple, our scale means there is considerable complexity in our systems. Handling one billion files per month is not so easy. At that volume, the tiniest bottlenecks become serious problems, and consistently shaving two milliseconds off a transaction has a profound impact. This journey of scaling and optimising has led to an intricate set-up of interconnected applications that work together to keep our users in their flow.
Given our enormous user base and our relatively small staff, I feel we are at an extremely exciting growth point for engineers. We are small enough for a single engineer to have a huge impact on the product, and big enough (in terms of our users) for that impact to be meaningful. You ship something great, and complimentary tweets come streaming in within seconds.
Over the past month, we have been slowly rolling out a redesign of the WeTransfer web interface. While this is a huge milestone for us, for the engineers here it was one of the last steps in a much larger project to modernise critical parts of our infrastructure and pay down technical debt (i.e. the shit you should have fixed ages ago, but then life happened).
When I joined WeTransfer in 2015, I found a service that was growing faster than ever and a codebase in dire need of some TLC. The team consisted solely of Ruby engineers, but several critical satellite services were written in PHP. They hadn’t been touched in three years and were beginning to crack under the increasing load. We had to do something.
Replacing busy legacy systems is always a tricky thing. And when you have no documentation, no language experts and little time, it’s even worse. The only viable solution is strangulation, and that is exactly what we did.
Strangulation is the strategy of creating a second system next to the already functioning one, a system that is 100% API-compatible and identical in how it applies its business rules. You slowly shift load to this new system, tuning and tweaking as you go, until the new application is handling all of the traffic and you can switch the old one off. In our case, the opportunity to watch how the application scaled under real-world load was priceless, as it gave us an almost risk-free way to test and optimise the shit out of this thing.
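To make the load-shifting concrete, here is a minimal sketch of what such a routing layer could look like, written as a Rack app since we are a Ruby shop. The class name and the 1% starting weight are illustrative, not our actual production code:

```ruby
# A tiny Rack app that splits traffic between the legacy system and its
# strangler. Start with a sliver of requests on the new system and dial
# the percentage up as confidence grows.
class StranglerRouter
  def initialize(legacy_app, new_app, new_traffic_percent: 1)
    @legacy_app = legacy_app
    @new_app = new_app
    @new_traffic_percent = new_traffic_percent
  end

  def call(env)
    # Both apps are 100% API-compatible, so the caller never notices
    # which one served the request.
    if rand(100) < @new_traffic_percent
      @new_app.call(env)
    else
      @legacy_app.call(env)
    end
  end
end

# config.ru (illustrative):
#   run StranglerRouter.new(LegacyApp.new, NewApp.new, new_traffic_percent: 5)
```

In practice you would key the split on something stable, like a user ID or a cookie, rather than rand, so that a given client consistently hits the same backend while you compare the two.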
This strangulation approach is not sexy. Quite the opposite, in fact: it involves the careful replication of old stuff and forces you to recreate some of the architectural design choices that you have been ranting about internally for years. It means a lot of reading and trying to understand someone else’s old code, usually some swearing, and brief moments of reimplementing the damn thing. You might even find yourself replicating bugs in the old system, just so that you can build something that works in exactly the same way.
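One way to keep yourself honest here is a characterisation test: replay the same request against both systems and assert that the responses match exactly, quirks included. A minimal sketch, assuming hypothetical internal hostnames and an illustrative endpoint:

```ruby
require "minitest/autorun"
require "net/http"

class StranglerParityTest < Minitest::Test
  # Hypothetical internal hostnames, stand-ins for the real systems.
  LEGACY    = "http://legacy.internal:8080"
  CANDIDATE = "http://candidate.internal:8080"

  def test_file_endpoint_matches_legacy_behaviour
    path = "/api/v1/files/123" # illustrative path

    legacy    = Net::HTTP.get_response(URI(LEGACY + path))
    candidate = Net::HTTP.get_response(URI(CANDIDATE + path))

    # The new system has to match the old one exactly, status, body,
    # bugs and all, before it earns any production traffic.
    assert_equal legacy.code, candidate.code
    assert_equal legacy.body, candidate.body
  end
end
```

Run enough recorded production requests through a test like this and the diff between the two systems shrinks towards zero.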
But all of this pain is worth it, simply because you know that it works. Only after you have paid off the technical debt are you in a position to modernise that API and turn that damn GET request into the POST it should have been in the first place.
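As a hedged illustration (hypothetical route and helper, not our actual API), that modernisation can be as small as changing the verb on a Sinatra-style route once the strangler owns all of the traffic:

```ruby
require "sinatra"

# Hypothetical helper standing in for whatever actually removes a transfer.
def delete_transfer(id)
  puts "deleting transfer #{id}"
end

# Before: a state-changing endpoint exposed as a GET, the kind of
# legacy quirk you replicate faithfully while strangling.
get "/transfers/:id/delete" do
  delete_transfer(params[:id])
  204
end

# After: with 100% of traffic on the new system, the verb can finally
# become the POST it should have been all along.
post "/transfers/:id/delete" do
  delete_transfer(params[:id])
  204
end
```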
You’ll be doing it in a pristine codebase that reflects the direction you envision for the product. And of course it has tests and docs in abundance (right?).
Over the last year we have rewritten three services in this way, and all of them now serve 100% of our traffic. The old systems have been turned off and bid farewell.
We’re now in a position where we can again look towards the future, and towards delighting people with new features. And that is where the redesigned UI comes in, as a catalyst for delivering these features to our users.