Managing technical debt in data

Timothy Carbone
5 min read · Sep 18, 2018


In the data world, I find it easy to get comfortable with the current stack and setup. We’re often tempted to keep things as they are because any change can require a serious amount of work and time.

It’s important to remember something: technical debt is accumulating.


Scalability

Scalability is something to think about in every aspect of programming. I find it even more important in data.

If you’re working in a (quickly) growing company, it’s very likely that you won’t be able to predict how much data your system will need to handle in the next 2–3 years. You’ll often underestimate it. And if you don’t, you’ll probably start building something that is way above the current requirements. You’ll be shipping late in an environment that requires projects to be kickstarted rapidly.

There’s a balance to find. For me, generating technical debt is completely OK. You just need to be aware of it and address it as soon as it gets in your way. Never let it grow too much.

Refactoring laziness

In data (and probably not only in data), we’re usually lazy when we truly need to refactor our code, upgrade our system, migrate our data, etc.

This laziness mostly stems from the data volume we’ve accumulated. We know we’ll have to move it around or transform it, and that it will take hours, maybe days. We know we’ll potentially be freezing all data access for the company during the process. People will let us know things don’t work, and we’ll feel bad about it. There are too many constraints to keep your motivation up.

After all … things work now.

Technical debt is accumulating, laziness as well


It’s alright, we all felt this way at some point.

The big problem here is that our laziness is proportional to our technical debt. Until we reach the point of no return, we’ll be accumulating both.

And while generating debt is OK, letting it accumulate is very problematic: we need to act.

We need to find a way to continuously manage technical debt so that the upgrade steps are lighter, the data migrations smoother and the lines of code to refactor fewer. Our data stack needs to evolve as fast as our environment, with baby steps.

Allocate time for maintenance

We’re in a growing startup or environment, and we want to work on awesome creative projects at the edge of our technical field that will challenge our skills and amaze our clients. And we should; it’s exciting!

If we don’t set aside time for the boring tasks of maintaining and upgrading our system, we’ll find out that these projects won’t stay as exciting as they are now. We’ll be working in an environment that feels like it was designed decades ago. To get where we want to be, we’ll have to go through data transformations that are anything but fun.

This can lead to a serious motivation loss for the same work that you initially found super exciting. Also, keep in mind that the more you build on an obsolete architecture, the harder it gets to upgrade or refactor it later on.

So do it. Please. Allocate time for it. Even if you don’t feel like you need to upgrade or maintain your stack, take the time to improve it.

I personally like doing this at a very specific moment: before kickstarting an important project. Laying out everything you would need to make this project easy will immediately show you the weak spots of your architecture. Don’t work around them; take the time to solve them. The solutions we implement now won’t just help with that next project, they’re also going to impact multiple things in the future.

Here’s one process you could go through:

  • What are we going to build or analyze?
  • What is missing in our data stack to make this project easier and fun?
  • Update data stack accordingly first
  • Start project only then

An Unsplash example in 4 steps


What are we going to build or analyze?
We’re working on a pretty huge refactoring of the search engine. To provide the best results, we’re analyzing data that we collect when people search things on our website.

With this refactoring, the main goal is to improve the relevance and quality of the results. To do so, we realized we would need to collect more data.

What is missing in our data stack to make this project easier and fun?
The problem is that we were running an infrastructure based on a warehouse optimized for computation speed, not for storage capacity. Thanks to this warehouse, our whole data stack (from processing to visualization) was fast. The trade-off was that adding more storage space was expensive, and more storage was exactly what we needed to improve our search.

We realized that we had picked this fast warehouse because, at the time, we felt it was scalable enough and because it would improve our stack’s performance: queries and ETL processes would run faster, dashboards would load faster, etc.

But now, we’re realizing that what matters most to us is being able to store more data. Paying for more space in a fast warehouse was one option; creating a second fast warehouse was another…

Update data stack accordingly first
In the end, the best solution we picked was to sacrifice speed to storage space. We moved away from the fast warehouse to implement a large warehouse because today, we think it’s more scalable. We think it’s easier to emulate performance than it is to emulate storage space. We now think storage space allows you to build performance while the opposite is not true.

This had costs …

  • Work time (setting up the architecture, compensating for the performance loss, reconnecting everything, improving ETL processes, etc.)
  • Data migration (data freeze, inaccessible dashboards, data collection delay, etc.)
  • New errors popping up and needing to be solved

… but we feel like we’re in a much better spot today. We’re not afraid of collecting data anymore, and we’re also paying less for our infrastructure. The main change: performance achieved with code has replaced the performance we used to inherit from the warehouse.
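To give a concrete idea of what “performance achieved with code” can look like (the post doesn’t describe the exact approach, so this is a hypothetical sketch with invented table and column names), one common technique is to materialize rollup tables: dashboards then query a small summary instead of scanning the full event table in a storage-optimized warehouse. Here sqlite stands in for that warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw search events: large, append-only, cheap to store but slow to scan.
cur.execute("CREATE TABLE search_events (day TEXT, query TEXT, results INTEGER)")
cur.executemany(
    "INSERT INTO search_events VALUES (?, ?, ?)",
    [
        ("2018-09-01", "panda", 12),
        ("2018-09-01", "brick wall", 7),
        ("2018-09-02", "panda", 15),
        ("2018-09-02", "panda", 9),
    ],
)

# Materialize a daily rollup once (e.g. as an ETL step): this is the
# "performance in code" replacing the warehouse's raw speed.
cur.execute("""
    CREATE TABLE daily_search_rollup AS
    SELECT day, query, COUNT(*) AS searches, AVG(results) AS avg_results
    FROM search_events
    GROUP BY day, query
""")

# Dashboards hit the small rollup instead of scanning every event.
rows = cur.execute(
    "SELECT day, query, searches FROM daily_search_rollup ORDER BY day, query"
).fetchall()
print(rows)
```

The idea scales to any slower, storage-first warehouse: you pay the aggregation cost once per ETL run instead of on every dashboard load.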

We’re absolutely not unhappy with our past choice of running a fast warehouse. It did help us build our stack faster and provide a better data experience to the team. It’s just the right time to move to a more appropriate solution, now that our needs have caught up with reality.

Now that we have our data stack where we wanted it to be, we can start our search refactoring project.

For all the data engineers out there, make this a reminder to always allocate time to improve your data stack.
