
Refactoring Duplicate Code

Talin
Published in Engineering Insights
Aug 30, 2020 · 9 min read

As a software engineer working on a large project, you’ll likely be asked to do cleanup work on the code to make it more readable and more maintainable. There are several different ways in which code can be reorganized, but one of the most common refactoring tasks you’ll be given is the elimination of duplicate code.

In fact, the very word refactor sort of implies elimination of redundancies. That is because the term originally comes from the word factor in the mathematical sense. When we “factor” an equation, we pull out repeated usages of a variable name and combine them into a single instance, so that an equation such as a × b + a × c becomes a × (b + c).

There are a number of benefits to doing this.

One obvious benefit is that it makes the program smaller. Less code means less effort needed to maintain and debug that code.

A somewhat less obvious benefit is that spending time to optimize or improve common, shared code yields a higher payoff. If you have a nifty sort algorithm that is used in 10 different places in your program, and you make it twice as fast, you’ll get a larger overall benefit than if that same sort algorithm was only used once. Similarly, if you fix a bug in a common piece of code, the benefits of that bug fix apply to every user of that code.

It’s easier to think about this benefit if you adopt the perspective that code is always changing and evolving.

You see, when I was a young and naive programmer starting out in my career in the late 70s, I was greatly impressed by the beauty of the idea that code was “forever” — once you solved a problem by expressing that solution in code, it was solved for all time; there was no need to go back and revisit that problem again.

However, I later learned that I was dead wrong. I realized that code is not a static artifact — it’s not runes carved in stone or mathematical formulae. Code is constantly in motion, being improved and updated. Code changes because the world changes, and code has to keep up. There are a few “eternal” algorithms and data structures that have lasted for decades, but even those have to adapt to new programming languages and machine architectures. In fact, I have grown highly skeptical of code that never changes — this is often a sign of abandonment rather than success.

If you think about a function or a class as an ever-evolving stream of improvements rather than a timeless monument of intellect, you can easily imagine that it is much easier to manage a small collection of evolving entities than a larger number of them. Thus, refactoring duplicate code can make this evolution easier.

Some programmers are pretty obsessive about eliminating duplicate code — whenever they see two blocks of code that look suspiciously similar, they get a kind of itch that needs to be scratched, and the way to scratch it is to merge those two blocks of code into one.

Refactoring isn’t always easy

However, eliminating duplicate code is not always as easy as you might imagine. There are several challenges that can arise.

Probably the most common source of difficulty is where you have two blocks of code that are similar, but not identical. In a case like this, you’ll need to generalize — that is, you’ll need to come up with a new function that can handle the work of both of the old code fragments you are trying to replace. Generalization is something that requires creative thought; it can’t just be done mechanically. The more different the two code blocks are, the harder it will be to come up with a general solution that fits both — especially if those differences aren’t merely superficial deviations but abstract and fundamental.
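When the difference between the two fragments is just a value, the generalization can be close to mechanical: turn the value into a parameter. Here is a minimal sketch in C++ (the names are invented purely for illustration):

    #include <string>

    // Two near-duplicate functions that differ only in a constant.
    std::string makeErrorLabel(const std::string& msg) {
        return "[ERROR] " + msg;
    }

    std::string makeWarningLabel(const std::string& msg) {
        return "[WARN] " + msg;
    }

    // Generalized version: the difference becomes a parameter.
    std::string makeLabel(const std::string& severity, const std::string& msg) {
        return "[" + severity + "] " + msg;
    }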

It’s more difficult to generalize over differences if those differences are ones of structure or type, rather than differences of value. The reason is that types can’t be passed in as parameters (well, they can be passed in as template parameters in languages that support generics, but most languages have limits on what templates can do — see type erasure), and structural differences can’t be passed in as parameters at all. Thus, if you have two functions that look similar, but one operates on scalars and the other operates on arrays, you are going to have a hard time combining them into a single function that efficiently handles both.
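To see why, consider a hypothetical pair of functions that do “the same thing” to a scalar and to an array. A template can bridge some type differences, but the structural difference between one element and a sequence of elements still forces the two bodies apart (a sketch, not a recommendation):

    #include <vector>

    double scaleValue(double value, double factor) {
        return value * factor;
    }

    std::vector<double> scaleArray(const std::vector<double>& values, double factor) {
        std::vector<double> result;
        result.reserve(values.size());
        for (double v : values) {
            result.push_back(v * factor);
        }
        return result;
    }

    // A template unifies some type differences, but this body only works for
    // types that support operator* directly; it does nothing for the array case.
    template <typename T>
    T scale(const T& value, double factor) {
        return value * factor;
    }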

Another problem you may have is that the blocks of code may be located in places that are hard to access — deep inside of another class or function, or even in a separate library. Merging those code blocks will require moving the code to some common context where it can be reached by all.

The code in question might also have a lot of dependencies on the context where it lives. For example, a block of code inside a function might depend on local variables defined within the scope of the function body. Refactoring that common block into a separate function will require passing in all those variables as parameters instead. If there are too many such parameters, then the refactored common code will be cumbersome to call.
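As a hypothetical illustration, imagine a formatting block that used to sit inline and quietly read half a dozen locals; once extracted, every one of those locals has to travel through the signature:

    #include <string>

    // Extracted from the middle of a larger function: each local it used to
    // read in place is now a parameter.
    std::string formatOrderSummary(const std::string& userName,
                                   const std::string& region,
                                   int itemCount,
                                   double totalPrice,
                                   bool includeTax,
                                   double taxRate) {
        double total = includeTax ? totalPrice * (1.0 + taxRate) : totalPrice;
        return userName + " (" + region + "): " + std::to_string(itemCount) +
               " items, total " + std::to_string(total);
    }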

Up to this point, we’ve talked about eliminating duplicate code as if it were always a good thing — perhaps challenging at times, but always worth attempting. But is that really true?

In fact, there are cases where eliminating duplicate code is not a good idea, and I want to outline some of those cases.

Divergent Destinies

Remember how I mentioned that code is ever-evolving? Well, you might have two functions that look identical now — but those two functions might be on radically different trajectories. In other words, those functions might be very different in a future version of the code, and tying them together prematurely might stifle the evolution of one or both of them.

A real-world example I can think of: say you have a simple function that appends strings to a buffer. And let’s say that you discover that there’s another function, in another part of the code, that does nearly the same thing. So the logical thing to do is to combine them, right?

Except I haven’t told you the whole story. You see, these two functions live inside of a multi-threaded environment. One of those functions is called in a way that is always thread-safe — that is, the function is only ever called by one thread. The other function, however, is called by many threads.

Now, you notice that this second function is made thread-safe by the fact that all of the calls to it are guarded by a mutex (mutual exclusion) lock. So no problem, right? We can merge the two functions, and then multi-threaded callers can guard any calls to the string append function by acquiring the lock first, while the single-threaded code doesn’t need to bother with locks. So they both can share the same code…right?

Except…mutex locks are expensive in terms of CPU time. Using a mutex lock may be good enough for now. But at some point in the future you might want to switch to something better and faster. This might mean re-writing your string append function to use a “lockless” algorithm that takes advantage of atomic CPU instructions. That’s a much more complicated implementation that requires a total rewrite of the function, but it’s a lot faster than trying to acquire a mutex every time you call it.

However, even atomic instructions aren’t free. They are much faster than a mutex, sure — but they are slower than using regular non-atomic instructions.

If you kept these two functions separate, without refactoring them, then there is no problem — the single-threaded users can continue to call the fast, non-thread-safe version. The multi-threaded users can call the thread-safe version that uses atomic instructions, which is slightly slower, but those callers are willing to pay that minor cost for thread safety.

However, if you combined those two functions into one, now you have a problem. If you re-write the function to use atomic instructions, it will speed up all of the multi-threaded users, since they no longer need to use locking — but it will slow down all the single-threaded ones. By making our function try to serve too many needs at once, we have achieved less than optimal performance.
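To make the trade-off concrete, here is a hedged sketch of what keeping the two paths separate might look like (the names, the fixed-size buffer, and the simplified reservation scheme are all invented for illustration). The single-threaded helper stays as cheap as possible, while the shared helper is free to move from a mutex to an atomic reservation scheme without dragging the other one along:

    #include <atomic>
    #include <cstddef>
    #include <cstring>
    #include <mutex>
    #include <string>

    // Single-threaded path: no synchronization cost at all.
    void appendFast(std::string& buffer, const char* text) {
        buffer.append(text);
    }

    // Multi-threaded path, version 1: callers serialize on a mutex.
    std::mutex g_bufferMutex;
    void appendLocked(std::string& buffer, const char* text) {
        std::lock_guard<std::mutex> guard(g_bufferMutex);
        buffer.append(text);
    }

    // Multi-threaded path, version 2: a simplified lock-free variant that
    // atomically reserves a region of a fixed buffer, then copies into it.
    char g_buffer[4096];
    std::atomic<std::size_t> g_writeOffset{0};
    bool appendLockFree(const char* text) {
        std::size_t len = std::strlen(text);
        std::size_t start = g_writeOffset.fetch_add(len, std::memory_order_relaxed);
        if (start + len > sizeof(g_buffer)) {
            return false;  // out of space; a real implementation needs a policy here
        }
        std::memcpy(g_buffer + start, text, len);
        return true;
    }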

How can you avoid scenarios like these? The problem is that you can’t predict the future of a given block of code just by visually inspecting it. You can, however, get a sense of what changes are likely to happen in the future if you are intimately familiar with the development plans for your whole project. Using the previous example, if you know that thread safety and performance improvement are important goals for your team, you can then predict that someday someone will want to migrate mutex-based code into atomic, lock-free code. This in turn guides your decisions around refactoring that code.

Muddled Meanings

One bad habit I sometimes see is programmers refactoring code based on the incidental details of implementation rather than a clear, common definition of what the code is supposed to mean.

Suppose Dave and Alice each have a list of chores. Dave’s list looks like this:

  • Feed the cat
  • Write a letter to Mom
  • Do laundry
  • Check bank balance
  • Organize bookshelf

While Alice’s list looks like this:

  • Pay bills
  • Write a letter to Mom
  • Do laundry
  • Check bank balance
  • Clean the BBQ grill

A zealous programmer might look at these two lists and spot the fact that items 2–4 on each list are the same. Why not eliminate the duplication by combining those three steps into a separate function?

The problem is that there is no logical connection between tasks 2, 3 and 4. A function that performed these tasks would be logically incoherent — what would you even call it? You wouldn’t want to have a name like WriteLetterToMomAndDoLaundryAndCheckBankBalance. You can’t call it DoChores, since there are other chores that are not included. DoMiscChores or DoCommonChores are also bad name choices because they are vague about what the function does.
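In code form, the extraction would look something like this hypothetical sketch; the body compiles, but the grouping has no meaning beyond “these happened to appear together”:

    // Stub chores, for illustration only.
    void writeLetterToMom() {}
    void doLaundry() {}
    void checkBankBalance() {}

    // The extracted "duplicate" block: what unifying concept does this express?
    void doSharedMiddleChores() {
        writeLetterToMom();
        doLaundry();
        checkBankBalance();
    }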

Unless you can find some unifying concept or principle that defines the meaning of the newly-refactored function, then by refactoring the code you have actually made the program harder to understand rather than easier — which defeats the purpose of refactoring in the first place.

Good naming really helps — if you can come up with a succinct name for the function that accurately describes what it does, then that’s a good sign that you have identified a clear semantic meaning for the refactored code.

Readability ROI

There’s one other aspect to consider when refactoring code, which is that indirection has a mental cost. To get a full understanding of this cost, please refer to footnote one¹ of this essay.

You see what happened? Your reading was interrupted and you had to scroll down to the bottom of the article. The same is true for function calls — there are many instances where the meaning of the code is not clear unless you read each of the executable statements in order, and refactoring things into separate functions can make reading harder.

This is especially true in certain contexts where the sequential nature of the code serves as a kind of log or document — for example, unit test code tends to be very linear, where the various assertions and expectations listed in the test case can also serve as a set of bullet points outlining the programmer’s assumptions. Fancy refactoring can detract from this, making the code harder to read as a linear document.
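For example, here is a small, self-contained sketch of a test written in that linear style; each assertion reads like one bullet point in a list of assumptions, and there is nothing to jump to:

    #include <cassert>
    #include <string>

    // Hypothetical function under test.
    std::string trimLeading(const std::string& s) {
        auto pos = s.find_first_not_of(' ');
        return (pos == std::string::npos) ? "" : s.substr(pos);
    }

    int main() {
        assert(trimLeading("  hello") == "hello");  // leading spaces removed
        assert(trimLeading("hello") == "hello");    // untouched when nothing to trim
        assert(trimLeading("   ") == "");           // all-space input becomes empty
        assert(trimLeading("") == "");              // empty input stays empty
        return 0;
    }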

As with the previous section, having good naming can mitigate this to a great extent. If the name of a function is descriptive enough that you don’t actually need to look inside of it to know what it does, then you can still read the calling code sequentially without excessive jumping around.

Conclusion

Many of these pitfalls that I have outlined can be categorized as “overfactoring”, a term I sometimes use to mean an overly-zealous approach to refactoring, especially in cases where code looks superficially similar, but in fact has deep differences. There is often a lot of pressure in software teams to reduce duplicated code, even in cases where such factoring is not warranted.

So as with most things in software, it’s a balancing act. Eliminating duplicate code is a good thing in the majority of cases; but be careful not to take it too far.

Footnotes

  1. All human writing systems allow text to be scanned and interpreted sequentially; however, the mechanism of footnotes breaks this pattern, requiring the reader to consume the text in a non-linear fashion, “jumping around” from place to place. In literature, footnotes are generally only used for non-essential text, so that the reader has the option to skip over them if they wish.
