Taming the Beast — Refactoring to Empower Teams

Pedro Arvela
Onfido Product and Tech
Aug 28, 2020

One day I stumbled onto a ticket from the website team that had been open for several days. The ticket was for a simple change: enabling file compression.

The comments went back and forth. First we would try something, then we would ask someone else to test. Then they would come back and say it didn’t work. This repeated 3 or 4 times until everyone gave up.

I was confused. The request seemed easy to do. I decided to investigate, and that’s when I found The Beast.

Unraveling The Beast

To understand The Beast, you need to understand what happened back then when you opened onfido.com on your phone.

A request would first go to Kubernetes, then to an Nginx server, then to a CloudFront distribution, and finally to an S3 bucket!

Here was the beast

Each step in this chain had a different owner, lived in a different project, and had a different release process. Each team had to verify that changes to the website didn’t break their main service. It took time to make changes.

Not only that, it was also slow. US users had to wait twice as long as European users to see the page.

How did it get to this

To understand how we got to this beast, we need to go back to when it was just a tiny puppy.

Many years ago, the website was built in Rails, and its code was part of our platform. All of this sat behind an Nginx instance.

Life was simple for the young puppy

The developers managed the website, the Rails app handled routing, the Nginx proxy handled requests and caching, and the user saw a website.

Eventually one Rails app became many Rails apps. Nginx started serving the static assets from a folder. The puppy’s fur grew and nobody was there to trim it.

A couple more years and the website became separate, so the marketing team could own it. But because all the paths were intrinsically linked, it continued to be served by Nginx.

A bit more time and Kubernetes was added to the mix, and all of this was moved inside it.

The puppy’s fur grew and grew, and nobody gave it a trim. The puppy grew to become The Beast.

Taming The Beast

So how do you go about taming The Beast?

We had to understand how that devilish Nginx configuration came to be. We had to understand what we had to keep, what we could rearrange and what we could throw away. We wanted to understand what the user workflows were.

So we talked with teams to see what they knew about this beast and analysed the access logs for the website. With that precious information in mind, we were able to propose a new architecture that moved the website out of the way of other teams.

In this new architecture, we start with a CloudFront distribution. This distribution has one single goal: to decide what goes to the website and what goes to Kubernetes.

Proposed taming of The Beast
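
To make the idea concrete, here is a minimal sketch of that routing decision in Python. It is not our actual CloudFront configuration: the path patterns and origin names below are hypothetical, made up only to illustrate how a single entry point can split traffic by path, with the website as the default.

```python
from fnmatch import fnmatchcase

# Hypothetical origin names, used only for this illustration.
WEBSITE_ORIGIN = "s3-website"
KUBERNETES_ORIGIN = "kubernetes"

# Hypothetical path patterns; the real routing rules are not listed in this post.
KUBERNETES_PATTERNS = ["/api/*", "/dashboard/*"]


def pick_origin(path: str) -> str:
    """Return the origin that should serve the given request path."""
    # The first matching pattern wins, like an ordered list of routing rules.
    for pattern in KUBERNETES_PATTERNS:
        if fnmatchcase(path, pattern):
            return KUBERNETES_ORIGIN
    # Anything that matches no pattern falls through to the static website.
    return WEBSITE_ORIGIN


if __name__ == "__main__":
    for path in ["/", "/pricing", "/api/v2/checks"]:
        print(path, "->", pick_origin(path))
```

The design choice worth noting is the default: everything that is not explicitly claimed by a platform route goes to the website, so the website team can add pages without touching anyone else's rules.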

With this new architecture designed and the approval of all the teams, we prepared our new test environment. We made it match the behaviour of the real website as much as possible.

After a few days of trials, we gained confidence that we could proceed. We did a dark launch of production, visible only on our office network. Teams did their final end-to-end tests. Minor hiccups were fixed and we got ready for the big switch.

Launch Day

It was early morning, and the release was scheduled to happen in just a few hours. There was some tension. This could have a large impact, both on our image and on our clients.

All internal stakeholders were gathered, rollout procedures were ready and rollback procedures were prepared in case of a disaster. All eyes were on the various graphs and monitors. The pull request had one single change: onfido.com would stop pointing to the Load Balancer and point to CloudFront instead.

The commit was merged, the CI ran and the changes were applied.

We waited.

No alarms were sounded. All services were normal. The website looked exactly the same. Nobody had noticed the change.

Exactly as intended.

Results

So, visually it looked the same. What changed?

First, we could finally fix that ticket from the start. And not having to go all the way to Europe on every request brought the average page load time in the United States down from 5.59 seconds to 2.01 seconds.

Page load times in the weeks before and after the switch. Regardless of the visitor’s internet connection, load times fell by over a second worldwide.
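
To put the United States numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the two averages quoted above; the variable names are just for illustration.

```python
# Average page load times in the United States, in seconds (from this post).
before_s = 5.59
after_s = 2.01

saved_s = before_s - after_s               # absolute time saved per page load
reduction_pct = saved_s / before_s * 100   # relative reduction in load time
speedup = before_s / after_s               # how many times faster the page loads

print(f"{saved_s:.2f} s saved, {reduction_pct:.0f}% lower, {speedup:.1f}x faster")
# prints "3.58 s saved, 64% lower, 2.8x faster"
```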

But even if all of these metrics had stayed the same, the changes would have been worth it. They enabled our website team to act independently.

With a smaller, self-contained code base, the team was able to own it. They went from opening tickets to change something to opening pull requests themselves. The DevOps team has less load and the website team can change things fearlessly.

Key Takeaways

When faced with a large scary beast, the first instinct is to run! But that beast will not go away! It will terrify everyone who passes by and any change will take an eternity to do.

The proper action is to be fearless and tame the beast! When it is no more than a harmless puppy, teams will be more confident to make changes and will act quicker.

It’s a win-win for everyone, and all it needs is that initial push for change!

Pedro Arvela
Onfido Product and Tech

DevOps Engineer at @Onfido | Past: DevOps Engineer at @Unbabel | Enjoys Travelling