Practical steps to refactor a legacy code base

Published in The Startup · 10 min read · Apr 27, 2020

Problem

Working on a legacy code base is a reality that most developers have to live with. It can be a painful, frustrating and demotivating experience. I am sure many developers fantasize that one day they will bravely pick a fight with this monolithic legacy daemon, find out how the giant secretly works behind its entanglement of interconnected dependencies, and confront the generic module used by hundreds of components, which in turn creates havoc in every corner of the code base. This dream of conquering and taming the daemon generally begins when a developer is new to the code base, but the desire to confront it frequently evolves into reverence. One starts to accept the daemon, at times even stands in awe of how it just works, and comes to terms with the supremacy of the monolith.

This is not a good situation for either the developers or the business. The problem is not just technical; it almost always has business impact. Such a code base makes it difficult and slow to:

add features
modify existing behaviour
fix bugs
implement any ambitious task

The underlying cause of these challenges is most commonly a lack of understanding of how the existing code base really works. This also makes it difficult to adopt TDD-style refactoring practices: without domain knowledge, one doesn’t know what the tests should be to start with.

In this article, I propose a pragmatic way to refactor a legacy code base in small steps, where each step can be released and tested. The goal is to make the code base simpler with every iteration. Ideally, such refactoring runs as a parallel task, without asking the business to halt change requests for a module while its refactoring is in progress. Refactoring in small releasable steps not only gives developers early feedback on their changes and surfaces possible bugs, but also lets them incorporate business requests along the way. This makes it far more feasible to set aside time for refactoring.

Common issues found in legacy code base

A legacy code base can be unclear for many reasons, including:

a common library, method or class that is used by many components.

Reusability in itself is not a bad sign. Generally, when such a reusable component is written, it makes sense for its callers. But with new stories and use cases, the common utility methods and classes get hacked to fit new requirements. As a result, you may end up with a generic component with multiple parameters which do not belong together. These parameters may have strange interdependencies and multiple ‘if-else’ checks. To the reader of the code, the ‘states’ may not be clear, and the code looks like a black box which somehow just works. The result is delicately balanced code which is unreadable, hard to understand and hard to update without fear of side effects.
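To make this concrete, here is a small sketch of what such a helper can look like after a few rounds of ‘just one more flag’. The class, method and flags are all hypothetical and invented for illustration; they are not taken from any real code base.

```java
import java.util.Locale;

// Hypothetical shared helper that grew one boolean flag per new requirement.
final class LegacyTextUtil {

    // The flags do not belong together, and their combinations interact in
    // ways that are hard to guess from the signature alone.
    static String format(String text, boolean isTitle, boolean forCheckout, boolean legacyMode) {
        if (isTitle && !legacyMode) {
            return text.toUpperCase(Locale.ROOT);
        } else if (forCheckout) {
            // Only one flow needs this, yet every caller has to pass the flag.
            return "* " + text;
        } else if (legacyMode && isTitle) {
            return text.trim();
        }
        return text;
    }
}
```

Note that a title on the checkout page silently ignores the checkout flag because the first branch wins. This is exactly the kind of hidden dependency between parameters that makes callers afraid to touch the helper.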

monolith classes which perform multiple things and have mutable state.

It is not uncommon to see huge classes with thousands of lines of code. This problem cannot be blamed on bad developers or on unsophisticated technology companies alone. It can also be seen in reputable open source projects with hundreds of collaborators and a strict review process. As with the previous issue, the monolith is not created at the beginning. With time, the logic in the file keeps growing. The length of the class may not be that big an issue as long as the class strictly keeps its state immutable. Mutable ‘states’ create a problem because developers need to know how the states are updated. As a result, one faces the challenge of knowing exactly in which sequence the ‘states’ have to be set before the class can be used. Unclear and mixed dependencies, along with mutable state, make a monolith class difficult to read, understand and change.
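The ordering problem can be shown with a minimal sketch. Both classes below are hypothetical; the first needs its setters called before use, while the second receives everything at construction, so there is no sequence to get wrong.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical mutable version: callers must know which setters to call first.
final class MutablePriceLabel {
    private String currencySign; // stays null until someone remembers to set it
    private long amountInCents;

    void setCurrencySign(String currencySign) { this.currencySign = currencySign; }
    void setAmountInCents(long amountInCents) { this.amountInCents = amountInCents; }

    String text() {
        // If setCurrencySign() was never called, this quietly yields "null12.50".
        return currencySign + String.format(Locale.ROOT, "%d.%02d", amountInCents / 100, amountInCents % 100);
    }
}

// Hypothetical immutable version: all state is fixed in the constructor.
final class PriceLabel {
    private final String currencySign;
    private final long amountInCents;

    PriceLabel(String currencySign, long amountInCents) {
        this.currencySign = Objects.requireNonNull(currencySign, "currencySign");
        this.amountInCents = amountInCents;
    }

    String text() {
        return currencySign + String.format(Locale.ROOT, "%d.%02d", amountInCents / 100, amountInCents % 100);
    }
}
```

With the immutable version a missing currency sign fails loudly at construction time instead of leaking a wrong label onto the page.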

lack of clear architecture

It can be argued that the above points are side effects of the lack of a clear architecture. A good architecture ensures that classes are focused, that their dependencies are passed in properly, and that modules interact and work together in a predictable way. The importance of good architecture cannot be overstated. It is definitely an important aspect of readable and maintainable code. But before an architecture can be applied, we first need to understand the internal entanglement. Understanding, disentangling and simplifying the code base is a prerequisite for applying a solid architecture, because we must understand how the code works before architecting it.

Goal

Refactoring a legacy code base is a challenging task. Teams are often inclined to postpone it for many reasons, including:

it needs a large rewrite which may break existing features
it is very time consuming
its success cannot be measured easily
domain knowledge of all the modules using the legacy block may not be available
incorporating change requests into a module whose refactoring is in progress may be considered impossible

The simple steps described here allow developers to progress in small iterations and release each one. This kind of refactoring makes it possible to pause and resume the work seamlessly. It also gives developers early feedback while they keep releasing a stable, continuously improving code base.

Simple steps to refactor

Step 1: Identify a flow to simplify.

Refactoring a legacy library or module can be a huge and overwhelming task. A legacy module possibly serves multiple flows and use cases. The first step in refactoring is to narrow the scope of the first iteration.

The refactoring can be done by focusing on one ‘flow’ at a time.

For example, consider a messaging library. It can have several flows or use cases, such as getting messages, sending messages, loading smart-reply options, etc. In the first iteration, focus on a single flow, say getting messages. Refactoring this flow may require changes which affect other flows; we will talk about these challenges in the next steps, where we discuss the refactoring approach.

The next part of identifying the flow is to list the major classes which are part of it.

For example, if the refactoring is done for a messaging application, the interesting classes can be the ‘view’ class in which messages are shown, the ‘repository’ or similar class which fetches or processes the messages, etc. Identify the ‘monolith’ classes in the flow and any painful generic module involved in it.

Step 2: Split monolith classes in the flow

This step requires splitting the large classes identified in the previous exercise into smaller classes. Make a release after splitting each monolith class.

For example, consider the ‘Item-Description’ page of an e-commerce product. Suppose this page is implemented by one ‘View’ class (say, ItemViewPage). It may have logic to:

request item detail from network
use response to show multiple things such as title, price, pictures, description of product, reviews, date of posting, etc. in a specific format (e.g. images might be shown as swiping pages, price is shown by prepending currency sign before the amount, date as per user’s locale, etc.)
orchestrate next actions (e.g. requesting recommendations based on some criteria, marking the product as viewed after a certain time interval, etc.)

A class like this does multiple things, which can be separated into focused classes based on the following functionalities:

request item detail from network
a separate view controller class per widget, for example one widget class each for title, price, description, pictures, etc.
requesting next actions like recommendations, etc.

The split can start with the engineer’s favourite tool: ‘cut’ the code from the monolith class and ‘paste’ it into a new file. The new classes should follow established design principles (from SOLID and ‘Effective Java’) such as:

Single Responsibility Principle: a class should handle a single functionality, to avoid coupling between functionalities.
Dependency Inversion Principle: inject immutable dependencies through the constructor or method parameters. This makes the class easy to test and prevents side effects from other classes.
Favour composition over inheritance: for example, suppose tracking is associated with each view widget. Instead of making each widget class inherit from a ‘trackable widget’, use composition to implement the ‘tracking’ feature. This lets each class be ‘final’ (closed to subclassing) while still using the functionality of the composed class. Also, if the composed class is passed in as a dependency (and it should be), the classes become easy to test.
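As a rough sketch of the composition idea, the tracking capability can be an interface that is handed to the widget through its constructor. The names Tracker and PriceWidget and the event string are assumptions made for this example, not part of any real tracking API.

```java
// Hypothetical tracking abstraction; a test can replace it with a fake.
interface Tracker {
    void track(String event);
}

// Final class: it does not extend a 'trackable widget' base class.
// Tracking is composed in as an injected, immutable dependency.
final class PriceWidget {
    private final Tracker tracker;
    private final String currencySign;

    PriceWidget(Tracker tracker, String currencySign) {
        this.tracker = tracker;
        this.currencySign = currencySign;
    }

    // Formats the price for display and reports that it was shown.
    String show(String amount) {
        tracker.track("price_shown");
        return currencySign + amount;
    }
}
```

In a test, the tracker can simply be a lambda that appends events to a list, so the widget can be verified without any real tracking backend.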

In summary, this exercise consists of the following steps: split a monolith class into separate focused classes, cover them with tests, and release the refactor.
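A possible outcome of such a split for the ItemViewPage example is sketched below. Apart from ItemViewPage itself, every name here (ItemDetail, ItemDetailSource, PriceFormatter) is hypothetical, and the rendering is reduced to a string so that the sketch stays self-contained.

```java
import java.util.Locale;

// Immutable data returned by the network layer (hypothetical shape).
final class ItemDetail {
    final String title;
    final long priceInCents;

    ItemDetail(String title, long priceInCents) {
        this.title = title;
        this.priceInCents = priceInCents;
    }
}

// Responsibility 1: requesting item details; the network call hides behind this interface.
interface ItemDetailSource {
    ItemDetail load(String itemId);
}

// Responsibility 2: formatting the price with a currency sign.
final class PriceFormatter {
    private final String currencySign;

    PriceFormatter(String currencySign) {
        this.currencySign = currencySign;
    }

    String format(long cents) {
        return currencySign + String.format(Locale.ROOT, "%d.%02d", cents / 100, cents % 100);
    }
}

// What remains of the page: it only wires the focused pieces together.
final class ItemViewPage {
    private final ItemDetailSource source;
    private final PriceFormatter priceFormatter;

    ItemViewPage(ItemDetailSource source, PriceFormatter priceFormatter) {
        this.source = source;
        this.priceFormatter = priceFormatter;
    }

    String render(String itemId) {
        ItemDetail detail = source.load(itemId);
        return detail.title + " | " + priceFormatter.format(detail.priceInCents);
    }
}
```

Each piece can now be tested on its own, for example by passing a lambda as the ItemDetailSource instead of making a real network request.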

Splitting even one monolith class can significantly improve readability and understandability. The act of splitting helps one get familiar with the code of the entire monolith class. The exercise also reveals dependencies between components. For example, suppose there is a feature to ‘favorite a product’ and, based on that, the title of the product is shown in a different color. Here the ‘favorite’ state determines the behavior of both the ‘title’ and the ‘favorite’ view widgets. A class split along the above principles works independently of other classes and operates only on the dependencies passed to it. For the title widget, these are the title text and whether the product is favorited.

This exercise splits the whole monolith file; it is not restricted to the ‘methods’ needed for the ‘flow’ identified in step one. Splitting the monolith helps developers understand the code base better, improves code readability, and helps them plan the next steps. It also avoids the problem of figuring out how to improve only the methods needed for the flow when those methods share dependencies with other flows.
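The title widget from this example could look roughly like the sketch below. The class name and the color values are assumptions chosen for illustration.

```java
// Hypothetical title widget: it knows nothing about the 'favorite' widget and
// works only with the values handed to it.
final class TitleWidget {
    private final String title;
    private final boolean favorited;

    TitleWidget(String title, boolean favorited) {
        this.title = title;
        this.favorited = favorited;
    }

    String text() {
        return title;
    }

    // The color is derived from the injected state; the hex values are illustrative.
    String colorHex() {
        return favorited ? "#D4AF37" : "#000000";
    }

    // A state change yields a new widget state instead of mutating this one.
    TitleWidget withFavorited(boolean nowFavorited) {
        return new TitleWidget(title, nowFavorited);
    }
}
```

Whoever owns the ‘favorite’ state decides when to create a new TitleWidget, so the dependency between the two widgets becomes explicit at the call site.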

Step 3: Extract logic for your flow from reusable module

Let’s look at this step with an example. Consider a reusable module which displays pictures in a ‘pager’ format (that is, you can navigate between images by clicking or swiping left and right). The module could be used by many components. Suppose the module performs multiple tasks, such as controlling whether an image can be swiped, showing the image’s page index, deleting or favoriting a picture, tracking a picture, rotating a picture, showing a relevant advertisement below each image, finding the most used color of the picture, recommending filters, etc.

Many of these functionalities are not really generic. For example, showing a relevant advertisement, applying some kind of tracking or applying filters may be relevant only to a specific use case. The issue with overly generalized modules is that, as they evolve, they are misused by adding logic which applies only to specific scenarios. This introduces unnecessary states, dependencies, operations and conditions, which in turn make the module hard to read, maintain and change.

A good heuristic is that code deserves to be in a shared library or module when it is needed by at least three use cases. Another way to decide whether code belongs in a library or shared module is to ask: is the code reusable enough to be open sourced? If not, it probably has internal states and dependencies, and can probably be simplified.

However, this step doesn’t require developers to simplify the generic module itself. That can be too complex, because one has to know all the use cases where the module is used and check each of them for side effects (assuming the existing tests are not effective, which is a common scenario).

In this step, the idea is to extract the logic that is relevant to the flow identified in step one. Then, rewrite it in a simple and clean manner. Don’t forget to cover it with tests. Release this refactor.

For example, if you are focusing on the flow for showing the ‘Item Description’ page, look at which use cases of the module your flow actually needs. Extract the code needed for your flow into a separate file and cover it with tests. Build it in a scalable manner, following the principles from step two, so that dependencies are passed to it and its state is immutable. This class can later be used to build a refactored shared module in subsequent iterations. But for the current iteration, focus on just the part that makes sense for your flow.
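As an illustration, suppose the ‘Item Description’ flow only needs to swipe between images and show the page index. That is an assumption made for this sketch, as is the class name ItemImagePager. The extracted class could then be as small as this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical extraction from the shared pager: only swiping and the page
// index. Ads, filters, rotation and the rest stay behind in the old module.
final class ItemImagePager {
    private final List<String> imageUrls;
    private final int index;

    ItemImagePager(List<String> imageUrls) {
        this(imageUrls, 0);
    }

    private ItemImagePager(List<String> imageUrls, int index) {
        if (imageUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one image is required");
        }
        // Defensive copy keeps the pager immutable even if the caller's list changes.
        this.imageUrls = Collections.unmodifiableList(new ArrayList<>(imageUrls));
        this.index = index;
    }

    String currentImage() {
        return imageUrls.get(index);
    }

    // Label for the page indicator, such as "2/3".
    String pageLabel() {
        return (index + 1) + "/" + imageUrls.size();
    }

    // Swiping returns a new pager; the index is clamped at both ends.
    ItemImagePager next() {
        return new ItemImagePager(imageUrls, Math.min(index + 1, imageUrls.size() - 1));
    }

    ItemImagePager previous() {
        return new ItemImagePager(imageUrls, Math.max(index - 1, 0));
    }
}
```

Because the class has no hidden state and no dependency on the old module, it is straightforward to test and can later become a building block of a refactored shared pager.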

Step 4: Apply an architecture to the flow

After the above two steps, the important monolith classes will have been split, and the part of the convoluted generic module that the flow needs will have been extracted. By this stage, the developer is expected to clearly understand the requirements of the flow, its corner cases and how the different components in the flow work. The flow should now be simplified, split into focused classes and covered with tests. Since these pieces of code have also been released, there should be enough confidence that the refactoring done so far works without side effects.

The next step is to apply an architecture of your choice.

The domain knowledge gained in the previous exercises gives enough insight into which kind of architecture is best suited to the project. The smaller, focused components created earlier will also come in handy when applying any architecture. As in the previous steps, look for early opportunities to release, so that you learn from any potential mistake as soon as possible.

This step requires rewriting a significant portion of the code. That might include reorganizing classes into a separate ‘module’ (or package) and changing behavior to fit the architecture (e.g. designating classes as Repository, Presenter, Controller, etc.). Tests might need to be rewritten.

But this step takes significantly less time at this stage than it would have taken before the previous two steps. It is also less risky, as developers are now aware of all the use cases. The code at this stage is in a much more manageable state to be architected.
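As one possible illustration, not a recommendation of a particular architecture, the ‘getting messages’ flow from step one could be arranged into a repository, a presenter and a view. All interface and class names below are hypothetical.

```java
import java.util.List;

// Hypothetical repository role: fetches or processes messages.
interface MessageRepository {
    List<String> latestMessages();
}

// Hypothetical view role: only knows how to display things.
interface MessageView {
    void showMessages(List<String> lines);
    void showEmptyState();
}

// Hypothetical presenter role: decides what the view should show.
final class MessagePresenter {
    private final MessageRepository repository;
    private final MessageView view;

    MessagePresenter(MessageRepository repository, MessageView view) {
        this.repository = repository;
        this.view = view;
    }

    // Called when the screen opens; loads messages and picks the right view state.
    void onScreenOpened() {
        List<String> messages = repository.latestMessages();
        if (messages.isEmpty()) {
            view.showEmptyState();
        } else {
            view.showMessages(messages);
        }
    }
}
```

The focused classes produced in steps two and three would slot in behind these interfaces, which is why applying the architecture at this stage is mostly a matter of reorganizing rather than untangling.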

Step 5: REPEAT

Repeat the above steps with a new flow. After a few repetitions, developers may feel confident that they understand all the use cases of the generic module identified in step three. The continuous iterations are then a good opportunity to refactor and simplify that module too.

Conclusion

Refactoring is generally considered a long-running task which can take months. Typically, it is done as one huge change and release. Developers have to inform the business that feature requests in that module can’t be implemented because of the ongoing refactoring. This makes it difficult for the team to get enough time allocated for the refactoring it needs. Even if the team does get to do it, the big changes involved carry a high risk of side effects. Side effects produced by weeks of work are hard to fix (for example, what if the entire new architecture missed an important aspect?).

The simple steps described in this article allow developers to refactor a legacy code base in small iterations and release each one. This kind of refactoring makes it possible to pause and resume the work seamlessly. It helps the team incorporate important business requests in the middle of refactoring, as each iteration is supposed to be a stable release.

This is the same idea that agile development applies to product work. But tech improvement and refactoring stories are traditionally still not done with an agile approach.

In essence, the steps described in this article are about applying an agile approach to refactoring, so that continuous progress and deliveries are visible.