Code Cleanup: When Your Work Is Undoing Other People’s Work

My girlfriend doesn’t come from an engineering background, so it was interesting for her to hear that every six months I spend two weeks deleting an average of 15,000 lines of code. I could guess what she was thinking — It’s funny that part of your job is to delete what other people build.

“But why does it take ten full days, and why doesn’t a junior developer do it?” she asked.

“It’s not demolishing a two-story building with a wrecking ball; it’s defusing tens, if not hundreds, of bombs, when cutting the wrong wire can bring the whole website down,” I responded, smiling like a ninja sharpening his sword.

Mess, Productivity, and Salesmanship

Developers, like other human beings, get excited when they see a justification for their admittedly not-so-positive habits. For example, if you were a developer with a messy desk, you might forward an article with the headline, “Scientists say a messy desk could make you more productive,” to your peers and boss immediately after seeing it. It might be true that a messy desk could make you more productive, but a messy codebase won’t.

A codebase is a shared workplace. It’s where junior developers go to learn. At the end of the day, it’s what you ship to your customers/clients. And even if you wrap the code into a tidy, fully enclosed package, the bad smell of low-quality code can still leak out of it!

If you don’t have a routine code cleanup procedure in your development process, you are tying yourself to increasing tech debt. You can live with it, but because of the snowball effect of tech debt, you’ll eventually find yourself paying more for the interest than the original debt itself. Yes, you can think of its APR as 40% — way more than any credit card you might have!

Similar to most credit debts, your plan might be to wait for the perfect opportunity to pay off all your debt. That’s when you’ll have enough resources (money, and/or time) to pay it back and live happily ever after. Unfortunately, that perfect day will never come. People who manage and prioritize developers’ time usually have a lot of plans for it. They do not want experts spending time on something that doesn’t have any tangible impact, because they can’t measure the outcome in dollars.

That being said, it’s your job as an engineering lead/manager to sell it. Code cleanup is a mandate. Managers should put their best salesmanship into selling these initiatives to those who own the timeline and schedule.

How to Sell a Code Cleanup

If people around you are not familiar with the concept of Performance as a Feature, have them watch this video and observe how Instagram increased their daily active users without dumping any money into marketing/sales, or introducing any new fancy features. That’s a great first step.

Once you’re there, you can open a conversation up by explaining how getting rid of 10% of the codebase can reduce compile time and size of the built artifacts, thus giving a better experience to your end-users. From the development perspective, it can definitely accelerate the ramp-up time for your new hires and prevent them from being exposed to legacy patterns from five years ago. Finally, lowering the level of complexity of the system can significantly reduce the number of bugs and your maintenance costs over time.

Make It Tangible: Define Metrics

To you, as a developer, find . \( -name "*.js" -or -name "*.php" \) -print | xargs wc -l might be a simple option to count the lines of code. It’s great for you because it’s quantifiable to the maximum extent. However, it’s not good for everyone, especially your PMs and those who have a hard time imagining why there are millions of lines of code for a simple website. No offense to them; they simply cannot digest such raw material. It’s your job to chew and bake it for them.
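If the one-liner is awkward to run everywhere (or you want per-language totals for a report), the same count can be sketched in a few lines of Python. This is only an illustration: the function name count_lines and the default extensions are my own choices, not part of any tool.

```python
from collections import Counter
from pathlib import Path


def count_lines(root, extensions=(".js", ".php")):
    """Return a Counter mapping each file extension to its total line count.

    Note: unlike `wc -l`, which counts newline characters, this also counts
    a final line that has no trailing newline.
    """
    totals = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            # Read as bytes so odd encodings in legacy files don't raise.
            with path.open("rb") as handle:
                totals[path.suffix] += sum(1 for _ in handle)
    return totals
```

Calling count_lines(".") from the repository root gives you one number per extension, which is easier to chart over time than a long wc listing.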

The start time of your app/website is usually one of the best metrics for this. If it’s a mobile app, you can consider the artifact size, or the cold start time, along with the actual app start time. Less code should mean fewer bytes to download and fewer CPU operations to run. On top of that, it’s easily quantifiable and measurable. Lastly, it’s not hard to put two and two together if one wants to see the impact on the end-user experience.

Bonus Point: Establish a report or dashboard that shows how these end-user metrics (load time, artifact size) naturally creep up over time, and how your recurring cleanup process brings them back down.

Rolling Up Sleeves? Don’t Cut the Red Wire!

Although I was joking when I used the bomb analogy with my girlfriend, sometimes it truly is as stressful as defusing a bomb. Believe me, you will be seeing the scariest parts of the codebase that no one has touched for years. (Because it works, no one has ever wanted to touch it.) Don’t fix it if it ain’t broken, right? Wrong. You need to clean it up even if it ain’t broken.

Let’s go over some best practices for this bomb-defusing mission…

Rule #1: When in Doubt, Don’t Cut the Wire

Uncertainty. It’s like looking at the engine of a car and guessing how hot it gets when the car runs at 80 mph. You never know… And while you have enforced the best programming practices throughout the years, there are still surprises everywhere. It’s worse when the developer who wrote the code for the first time has left the company and now only God knows how this system works!

My general advice for such cases is: Do Not Over-cleanup. This whole job is a tradeoff: you can clean up 100 more lines, knowing that there’s a 0.1% chance that something catastrophic will break at a random time. Obviously it’s more difficult for perfectionists, but in terms of risk, it’s more efficient to give up on the last 1% of lines you want to clean up. It’s really worth it. Imagine how mad the office will be when you mess up the giant neanderthal machine that was working, albeit slowly. Now it’s dead and YOU are in charge!

The Pareto principle (a.k.a. rule of 80/20) is your friend here. You can even go with 90/10. Try to defuse 90% of the bombs with 10% of the time/effort and 10% of the risk. Dropping page-load time (or app-start time) from 10 seconds to 5.5 seconds with the risk of R is going to be much more appreciated than going all the way to 5 seconds with the risk of 10 times R.
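To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch using the figures from the example. It assumes the "10 times R" is the total risk of going all the way to 5 seconds, so the last half second costs an extra 9R; the function name is hypothetical.

```python
def seconds_saved_per_risk(before_s, after_s, risk_units):
    """Seconds of load time saved per unit of risk R taken on."""
    return (before_s - after_s) / risk_units


# First pass: 10 s -> 5.5 s at a risk of 1R.
first_pass = seconds_saved_per_risk(10.0, 5.5, 1.0)

# Last mile: 5.5 s -> 5 s, assuming it costs the remaining 9R
# (10R in total minus the 1R already spent on the first pass).
last_mile = seconds_saved_per_risk(5.5, 5.0, 10.0 - 1.0)

# The first pass buys roughly 81 times more speed-up per unit of risk.
payoff_ratio = first_pass / last_mile
print(f"first pass: {first_pass:.2f} s/R, last mile: {last_mile:.3f} s/R")
```

Under these assumptions, every unit of risk spent on the first pass buys about 4.5 seconds, while every unit spent on the last mile buys well under a tenth of a second.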

Bonus Advice:

If you ever run into stage 6 of debugging, you might want to suppress your adventurous ego and take a step back. You are not responsible for making the whole world a better place, right now! Some of the items can wait until the whole service/repository is replaced with something more modern.

Rule #2: Focus!

FOCUS! Make sure you work during absolutely distraction-free hours, and bring your most alert brain! You don’t want to make a silly mistake just because someone pinged you in Slack and you accidentally removed the wrong half of an "if (!dontPreventFromNotDoingX) { … } else { … }" block. Remember, you are doing open heart surgery, so focus, focus, and focus!

Rule #3: Don’t Hit and Run, Dive Deep!

Use a good editor and run DFS instead of BFS through your defusing mission.

If you see a big "if (x) { /* Block1 */ } else { /* Block2 */ }" and you know that variable x, which might be an A/B testing flag, has been finalized as true, then don’t blindly delete the Block2 section. Keep going deeper. Is there any function F called or new component C created in the Block2 section? If so, then find usages of those functions or components in the entire codebase. If they are only used from this section, then you can delete them as well. Again, before just deleting F or C, run the same check on whatever they use. Sound familiar? You are basically performing reference counting for recursive garbage collection work.

Which algorithm should you use on this tree of references: BFS or DFS? Well, if you do DFS, the open tabs in your editor act as your recursion stack. With BFS, on the other hand, you would need a global reference-counting table, which is very complicated to manage using tabs in an editor and can easily grow. If you think of BFS implemented with a queue/array, the number of open tabs would be the number of unvisited elements in your queue. That can be huge on a low-height, high-outdegree graph.
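For the curious, the reference-counting walk described above can be sketched in code. This is a toy model, not a real tool: it assumes you already have a map of which symbol references which (in practice, that map is your editor’s "find usages"), and every name in it (find_dead_code, block2, F, C, helper, shared_util) is made up for illustration.

```python
def find_dead_code(references, dead_roots):
    """Return every symbol that becomes unreferenced once dead_roots are removed.

    references maps each symbol to the set of symbols it uses.
    Like any plain reference counting, this will not catch two dead
    symbols that only reference each other (a cycle).
    """
    # Count how many places use each symbol.
    ref_count = {}
    for source, targets in references.items():
        ref_count.setdefault(source, 0)
        for target in targets:
            ref_count[target] = ref_count.get(target, 0) + 1

    deleted = set()
    stack = list(dead_roots)  # the "open tabs" of the DFS
    while stack:
        node = stack.pop()
        if node in deleted:
            continue
        deleted.add(node)
        for target in references.get(node, ()):
            ref_count[target] -= 1
            # Nobody else uses it any more, so go deeper into it.
            if ref_count[target] == 0 and target not in deleted:
                stack.append(target)
    return deleted


# Hypothetical codebase: Block2 is the dead branch of a finalized A/B test.
references = {
    "checkout_page": {"block1", "block2"},
    "block1": {"G"},
    "block2": {"F", "C"},
    "F": {"helper"},
    "C": {"helper", "shared_util"},
    "other_page": {"shared_util"},
}

print(sorted(find_dead_code(references, ["block2"])))
# ['C', 'F', 'block2', 'helper']
```

Notice that shared_util survives because other_page still uses it. That is exactly the check you run by hand before deleting F or C.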

Finally, if you are using version control software, another trick is to look at the revision history of the file to see what else was changed in the same commit. It’s not always ideal, but it can give you good insight.

Rule #4: It’s Not a Massacre, It’s a Job for a Marksman

If you’re not using version control software, then go get some! While it’s cool to come out as a superman and yell “tada!” it’s very unhealthy for this job. Break cleanups into smaller chunks, both in terms of commit and ship to production. This way you’ll minimize the headaches you’ll cause if anything goes wrong after the roll out. (Which might not even be your fault. For instance, it could be a coincidence that the marketing spend went down so you’re getting fewer users than a month ago.)

On that same note, make sure there isn’t any major refactoring going on. Otherwise, resolving those merge conflicts and keeping the branches up-to-date is going to be a nightmare!

So, TL;DR: Go with multiple independently revert-able commits.

Bonus Advice: If a major rewrite of the system is on the roadmap, don’t waste your time. Use your sales pitch for buying a new car instead of raising funds for replacing this broken bike.

Rule #5: Communicate! There Might Be Cross-Team Dependencies.

Regardless of whether you’re using a RESTful API, GraphQL, or something else, there is usually a dependency between client-side and server-side code. Clients can easily delete a piece of code that they don’t need, but on the server-side they probably need to support legacy versions of the native apps and their requests.

The same goes for different repositories on the same side: Is there a CSS dependency in the HTML code that you’re cleaning up? Is there a build-job related to the module you just removed that’s owned by your horizontal architecture/platform team?

How about documentation? Or the patterns that designers use for different components? Let them all know once you’re done and it’s out. This will help the other teams start playing their own part of the orchestra.

It’s Not Done Until it’s Done Done

Alright, you made it to production and you’re so excited that the load/interaction time is faster and you have more users hitting your app/website. Great! We were too.

When we rolled out our massive clean-up, which was getting rid of 20% of the codebase in charge of more than 100 finalized A/B tests, we got a 15% to 20% reduction in page-load time and a boost of roughly 10% in the number of successful landings on our website. You can think of it as going from 5 seconds to 4 seconds, which means that all the people who would wait 4 seconds (but not 5 seconds) for our page/app to load were now seeing it! That was also in line with studies done by Amazon, Walmart, Google, etc.

It’s not the end yet! You have to do some post-cleanup work too.

Post Cleanup Task #1: Share the Results

Let’s be honest, the first time you heard that “10% faster = 10% more visits” you were skeptical! So was I. And, chances are there will be a lot of other people in your company (and even on your very own team) that won’t believe in the case you’ll be making either.

That makes this the best time to share the results in detail and help them believe!

Also, compared to other engineering initiatives and refactorings, this one usually yields the best results. Seize the moment and show this as proof that not all improvements in end-user metrics come from Marketing/Sales and/or new features.

Post Cleanup Task #2: Monitor, and, Even Better, Set Alerts!

In order to show the impact of your cleanup job you’ve definitely defined some metrics that you’ve been tracking. Make them real-time (or close to), and share them with all the teams (not just your own). It’s even better if you can set up alerts, just in case something goes wrong.
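An alert doesn’t have to be fancy. As a minimal sketch, assuming you can pull recent page-load samples from whatever monitoring tool you use, a check like the one below is enough to start with. The function name, the 10% tolerance, and the sample numbers are all made up for illustration.

```python
from statistics import median


def should_alert(recent_load_times_s, baseline_s, tolerance=0.10):
    """Return True when the median recent load time exceeds the baseline
    by more than the given tolerance (10% by default)."""
    if not recent_load_times_s:
        return False  # no data is a separate problem, not a regression
    return median(recent_load_times_s) > baseline_s * (1 + tolerance)


# Hypothetical samples, in seconds, against a 4-second post-cleanup baseline.
healthy = should_alert([4.0, 4.1, 3.9], baseline_s=4.0)
regressed = should_alert([4.6, 4.8, 4.5], baseline_s=4.0)
print(healthy, regressed)
```

The median is used here so that one slow outlier doesn’t wake anybody up; pick whatever percentile your team already reports on.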

This is what we did at Zoosk: When we implemented header bidding for our advertisements those metrics/alerts immediately notified us of a problem. The improvements in performance monitoring could also be considered a side-win of the project.

Post Cleanup Task #3: Set up a Cadence or Threshold for Doing it Again

Subscription services are proven to be successful business models. Specifically, Zoosk has been successful as a subscription-based online dating platform. So sell your PMs on a code cleanup subscription!

Instead of giving them the same presentation every six months, you should agree on a deal. For example: When a certain metric gets to a set threshold, your team will automatically get one week of one developer’s time to do the code cleanup. The threshold can be based on the size of the page/artifact (what the end-user needs to download on their client to run the website/app), lines of code, or number of A/B testing flags the code reads.

At Zoosk, we’ve gone with the third option. We have an enum for all the A/B test flags in our client-side code, and we made a deal that whenever the vertical scrollbar appears on that file it’s time for another bomb-defusing game!
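If you’d rather not rely on someone noticing a scrollbar, the same deal can be turned into an automated check. The sketch below is not Zoosk’s actual code: the enum, its flag names, and the max_flags threshold are hypothetical stand-ins for your own flags file.

```python
from enum import Enum


class ABTestFlag(Enum):
    # Hypothetical flags; in real life this is your existing enum.
    NEW_SIGNUP_FLOW = "new_signup_flow"
    BIGGER_PHOTOS = "bigger_photos"
    ONE_TAP_CHECKOUT = "one_tap_checkout"


def cleanup_due(flag_enum, max_flags):
    """Return True once the number of flags exceeds the agreed threshold."""
    return len(flag_enum) > max_flags


# With a (deliberately tiny) threshold of 2 flags, cleanup is due.
print(cleanup_due(ABTestFlag, max_flags=2))
```

Run it in your build and the "subscription" renews itself: once the check flips to True, the agreed cleanup week goes on the schedule.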

PS. I spoke at FutureStack 2016 about this in a talk entitled, “Love Can’t Wait!” Feel free to watch the video of it here, and find the slides here.