The Tempting Trap of Illusory Duplication

The nuanced topic of writing the same code twice

Ryan Palmer
Decoupled
3 min read · Feb 4, 2020


The proclamation that duplication is an undeniable evil in software is seldom contested. It’s not hard to make the case that writing the same thing twice is wasteful, redundant, and perhaps even antithetical to good design. What’s hard is making the case that, sometimes, duplicating code is the correct design decision!

“There is true duplication, in which every change to one instance necessitates the same change to every duplicate of that instance. Then there is false or accidental duplication. If two apparently duplicate sections of code evolve along different paths — if they change at different rates, and for different reasons — then they are not true duplicates.” — Robert C. Martin, Clean Architecture

Wow, mind blown. Duplication is not always wrong? That is revolutionary thinking! But actually, it makes perfect sense, and most of us know this subconsciously. Have you ever written some specialized data structure, or wrapped a library class to serve some slightly more specific purpose? Probably. Did you think what you were doing was groundbreaking? Did you envision yourself patenting your invention, publishing it in a prestigious scientific journal, and ultimately receiving the Turing Award in recognition of your unassailable ingenuity? Probably not. You created a convenient solution for a specific problem — a domain-specific problem. The same solution likely exists elsewhere, but for a different problem in a different domain.

A row of matching houses; superficially similar, yet undeniably independent.

Let’s say you need to describe the “size” of strings in three categories: small, medium, and large. Your requirement states that strings under 1,000 characters in length are considered small. Between 1,000 and 100,000 characters, they’re considered medium. Beyond 100,000, they’re large. For your project, you will use these classifications to decide how to persist data. Small data will be cached, medium data will be written to flat files, and large data will be stored in a remote database.
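To make the requirement concrete, here is a minimal Python sketch of it. The names `StringSize`, `classify`, and `choose_storage` are hypothetical, and treating exactly 1,000 and exactly 100,000 characters as medium is an assumption, since the requirement doesn't say which side the boundaries fall on.

```python
from enum import Enum


class StringSize(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


# Thresholds from the requirement. Boundary handling is an assumption:
# exactly 1,000 and exactly 100,000 characters both count as medium here.
SMALL_LIMIT = 1_000
MEDIUM_LIMIT = 100_000


def classify(text: str) -> StringSize:
    """Classify a string by its length in characters."""
    length = len(text)
    if length < SMALL_LIMIT:
        return StringSize.SMALL
    if length <= MEDIUM_LIMIT:
        return StringSize.MEDIUM
    return StringSize.LARGE


def choose_storage(text: str) -> str:
    """Map a size category to the persistence target for this project."""
    targets = {
        StringSize.SMALL: "cache",
        StringSize.MEDIUM: "flat file",
        StringSize.LARGE: "remote database",
    }
    return targets[classify(text)]
```

Notice that the thresholds only mean something because of what `choose_storage` does with them. That coupling is easy to forget once the classifier lives in someone else's code.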

After spelunking through the codebase for some time, you discover that an engineer on Team X has already created a class called StringSizer for their project. As if by cosmic destiny, they define their sizes the exact same way! Surely, any diligent programmer would not rewrite the exact same code. No, that wouldn’t be DRY. Instead, you reuse the StringSizer from Team X and proceed happily to complete your project. The code works, the stakeholders are satisfied with your deliverable, and the feature ships.

Fast forward to the next release, and Team X has received some feedback from the end-users. Their feature, an archiving tool that uses StringSizer to provide a user-friendly description of log files, has some new requirements. Turns out, most of these log files are much longer than 100,000 characters. Some are even hundreds of millions of characters long! They quickly get to work updating StringSizer with the new specifications: Strings under 1 million characters are small. Between 1 and 100 million, they are medium. Above that, they are large. Again, the code works, the stakeholders are satisfied with the results, and the update is released to the customers.

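Suddenly, complaints start coming in about crashing servers, out-of-memory errors, hard disks flooded with gigantic temporary files, and surprisingly low database activity. Under the new thresholds, data your feature used to send to the database now counts as medium or even small, so it lands in flat files or the cache instead. Team X’s change has caused your feature to flood the heap and the filesystem with an enormous amount of data. Ouch!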
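A rough sketch of what happened, in Python. The helper `make_sizer` and the `STORAGE` mapping are hypothetical stand-ins for the shared class and your persistence logic, and the inclusive upper boundaries are an assumption. Only the thresholds come from the story.

```python
def make_sizer(small_limit: int, medium_limit: int):
    """Build a length classifier from two thresholds (hypothetical helper)."""
    def size_of(length: int) -> str:
        if length < small_limit:
            return "small"
        if length <= medium_limit:
            return "medium"
        return "large"
    return size_of


# Your feature's persistence policy, keyed by the shared size labels.
STORAGE = {"small": "cache", "medium": "flat file", "large": "remote database"}

before = make_sizer(1_000, 100_000)         # thresholds you relied on
after = make_sizer(1_000_000, 100_000_000)  # thresholds after Team X's update

for length in (500, 50_000, 5_000_000):
    print(length, STORAGE[before(length)], "->", STORAGE[after(length)])

# Prints:
# 500 cache -> cache
# 50000 flat file -> cache
# 5000000 remote database -> flat file
```

Nothing in your own code changed, yet a 5-million-character string now goes to a flat file instead of the database, and a 50,000-character string now sits in the cache.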

The problem is, even though the code was exactly the same, the requirements were never the same. The high-level policies of these two projects were never aligned. What appeared to be duplication was just an illusion — a dangerous illusion indeed.
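One way this could have looked instead, sketched in Python with hypothetical names throughout: each project keeps its own thresholds next to the policy they serve, even though the two functions start out nearly identical. The inclusive upper boundaries are, again, an assumption.

```python
# --- Your project: persistence policy ---
CACHE_LIMIT = 1_000        # below this, keep the string in the cache
FLAT_FILE_LIMIT = 100_000  # up to this, write a flat file; beyond, database


def storage_for(text: str) -> str:
    """Decide where to persist a string, based on this project's limits."""
    length = len(text)
    if length < CACHE_LIMIT:
        return "cache"
    if length <= FLAT_FILE_LIMIT:
        return "flat file"
    return "remote database"


# --- Team X: user-friendly log descriptions ---
SMALL_LOG_LIMIT = 1_000_000
MEDIUM_LOG_LIMIT = 100_000_000


def describe_log(text: str) -> str:
    """Describe a log file's size, based on Team X's limits."""
    length = len(text)
    if length < SMALL_LOG_LIMIT:
        return "small"
    if length <= MEDIUM_LOG_LIMIT:
        return "medium"
    return "large"
```

With this layout, a 200,000-character string is a "small" log to Team X and still goes to the remote database in your project. The two functions look like duplicates, but each can now change at its own rate and for its own reasons.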

“Resist the temptation to commit the sin of knee-jerk elimination of duplication. Make sure the duplication is real.” — Robert C. Martin, Clean Architecture
