Technical Debt in Data Science Series — Part 1

Learning about the Technical Debt aspect of Data Science

Vimarsh Karbhari
Acing AI
5 min read · Aug 7, 2018

A Data Science interview poses different challenges for a potential data scientist. As much as the interview is for the company to decide whether the person is a fit, it is also for the person to decide whether the company is a fit. Judging that fit requires asking the interviewers some important questions and understanding how the data team functions in different areas. Technical Debt in Data Science is one such area.

The AI/Data Science field is an amalgamation of different fields, software development being one of them. This creates interesting technical debt challenges that can kill your AI stack.

Photo by Alice Pasqual on Unsplash

I have been tackling the problem of technical debt in my current role. Having experience building software systems, as well as side projects that use ML packages, has given me a particular perspective on Tech Debt in AI/ML systems. This article concentrates on the Debt that can be easily understood through, or correlated with, the software development aspect of Data Science. The next article will focus on Debt from a Data and External System perspective.

Why Technical Debt in AI/ML?

Machine Learning and AI systems fundamentally have complex code. They also have large system implementations that are highly complex in nature. Unlike normal software systems, ML algorithms feed on large pools of data. This unique combination of code, infrastructure, data and systems contributes to Tech Debt in ML.

How is Technical Debt Incurred in AI/ML?

Technical Debt in Data Science is incurred through four different aspects:

  1. Model Debt
  2. System Debt
  3. Data Debt
  4. External Debt

Photo by Fabian Blank on Unsplash

Model Debt

Model Debt is the Debt associated with the models used in an ML prediction. A model is a statistical representation of a prediction task. You train a model on examples, then use the model to make predictions.

Traditional software engineering practice demonstrates that strong abstraction boundaries, built with encapsulation and modular design, help create maintainable code in which it is easy to make isolated changes and improvements. Abstraction boundaries help express the invariants and logical consistency of the information inputs and outputs of a given component.[1]

Entanglement:

Machine Learning packages and systems mix data with code. This creates entanglement, which causes problems when making future enhancements or upgrades.

Let us take an example to understand this better. Consider classifying a webpage: the ML algorithm has to predict whether a given webpage contains cats. Take x1, “contains the word ‘cat’ ”, as one of the features. Similarly, we have n different features, labeled x1…xn, that help us make a prediction about this webpage. From these we derive a model that predicts, with a certain probability, whether the webpage contains cats. The entire statistical model relies on all features x1…xn to make an accurate prediction. Adding a feature xn+1 means re-evaluating all n existing features, as their importance, weights or usage may all be different now. Similarly, if we remove the feature x1, the same issue arises: everything has to be recomputed to get back to an accurate prediction. Changing the feature x1 to “contains the word ‘cats’ ” has a similar impact. This holds whether the model is retrained fully in batch style or allowed to adapt in an online fashion. No inputs are independent. This is the crux of the entanglement issue of Model Debt.
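
To make the entanglement concrete, here is a minimal sketch in Python. It is not how any particular ML package trains a classifier: a plain least-squares fit stands in for the model, the six toy pages are made up, and the second feature x2, “contains the word ‘kitten’ ”, is a hypothetical addition. The point is only that the weight learned for x1 changes once x2 enters the model.

```python
from fractions import Fraction


def fit_least_squares(rows, labels):
    """Return weights w minimising sum((w . x - y)^2), via the normal equations."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y], kept as exact fractions.
    a = [
        [Fraction(sum(r[i] * r[j] for r in rows)) for j in range(n)]
        + [Fraction(sum(r[i] * y for r, y in zip(rows, labels)))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination; assumes X^T X is invertible.
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


# Toy pages (made up): x1 = contains "cat", x2 = contains "kitten" (hypothetical).
pages = [(1, 1), (1, 0), (1, 1), (0, 1), (0, 0), (1, 0)]
labels = [1, 1, 1, 1, 0, 0]  # 1 = the page is about cats

weights_x1_only = fit_least_squares([(x1,) for x1, _ in pages], labels)
weights_x1_x2 = fit_least_squares(pages, labels)

print([float(w) for w in weights_x1_only])  # [0.75]
print([float(w) for w in weights_x1_x2])    # [0.375, 0.75]
```

In this toy data the weight on x1 halves as soon as x2 is added, even though nothing about x1 itself changed. Anything downstream that was tuned against the old weight, such as a threshold, would need to be revisited.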

Build/Visibility Debt:

This is a classic debt straight from the books of software development. It is also common in programs that consume APIs. Visibility Debt, or build debt, develops in software engineering when source code in a monolithic architecture is referenced in multiple places, making it harder to refactor. Similarly, in the case of Model Debt, an ML model may be consumed by a plethora of systems without any access controls, because its outputs or logs are made available to downstream systems. This creates undeclared dependencies among systems, which makes the model harder to refactor when the need arises.
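
One way to picture the difference between declared and undeclared consumers is the small Python sketch below. The class `ModelOutputFeed` and the consumer names are hypothetical, invented for illustration; it simply assumes that every downstream system must register before it can read model outputs, so the list of dependencies is always known.

```python
class ModelOutputFeed:
    """Hypothetical feed of model predictions with declared consumers."""

    def __init__(self):
        self._consumers = set()
        self._predictions = []

    def register(self, consumer):
        # Declaring a consumer makes the dependency visible to the model owners.
        self._consumers.add(consumer)

    def publish(self, prediction):
        self._predictions.append(prediction)

    def read(self, consumer):
        # Undeclared consumers are refused instead of silently depending on the model.
        if consumer not in self._consumers:
            raise PermissionError(f"{consumer!r} is not a declared consumer")
        return list(self._predictions)

    def consumers(self):
        return sorted(self._consumers)


feed = ModelOutputFeed()
feed.register("ranking-service")  # hypothetical downstream system
feed.publish(0.92)

print(feed.read("ranking-service"))  # [0.92]
print(feed.consumers())              # ['ranking-service']
```

With a list like `feed.consumers()` available, the team refactoring the model at least knows who to talk to before changing its outputs.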

System Debt

This is the Debt associated with the underlying system itself. Just as all code is susceptible to Debt, the system that runs the ML models is no different. Most Software Engineers and Engineering Managers will find this Debt relatable.

Photo by Samuel Zeller on Unsplash

Package Debt:

ML researchers usually develop code and models as packages. The field is currently exploding, but the package APIs may not yet be production grade. As a result, ML code often ends up running on top of system implementation code to reduce the dependency on those APIs. This kind of design leads to a lot of supporting code written just to get data in and out of packages. This is Package Debt.
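
A small Python sketch shows what that supporting code tends to look like. Everything here is an assumption made for illustration: `package_predict` is a stand-in for a call into some third-party ML package, and the record fields are invented. Notice that the actual package call is a single line, while the rest exists only to move data in and out.

```python
def package_predict(matrix):
    """Stand-in for a third-party ML package call (hypothetical, not a real API)."""
    return [sum(row) / len(row) for row in matrix]


def records_to_matrix(records, feature_names):
    # Supporting code: reshape application records into the package's input format.
    return [[float(r.get(name, 0.0)) for name in feature_names] for r in records]


def scores_to_records(records, scores):
    # Supporting code: map raw package output back onto application records.
    return [{"id": r["id"], "score": s} for r, s in zip(records, scores)]


def score_pages(records):
    feature_names = ["has_cat_word", "image_count"]  # invented feature names
    matrix = records_to_matrix(records, feature_names)
    scores = package_predict(matrix)  # the only line that touches the package
    return scores_to_records(records, scores)


pages = [
    {"id": "a", "has_cat_word": 1, "image_count": 3},
    {"id": "b", "has_cat_word": 0},
]
print(score_pages(pages))
# [{'id': 'a', 'score': 2.0}, {'id': 'b', 'score': 0.0}]
```

If the package changes its input format, every helper around the call has to change with it, which is where the Debt accumulates.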

Pipeline Debt:

Data Engineers today spend a lot of time building and maintaining pipelines. When new data points present themselves, these pipelines are upgraded and revamped. Managing these pipelines, detecting errors and recovering from failures are all difficult and costly [2]. Once the new pipelines are ready, there is additional end-to-end testing, downstream system testing and other testing related to the change. This incurs a significant amount of Debt, which can be termed Pipeline Debt.

Dead Experimental Codepaths:

Data Science discoveries are all about experiments. Usually, out of every five to seven experimental tests, one results in a breakthrough idea. Experiments are enabled by testing and tweaking algorithms within the same production code. Flexibility is key here, and the cost of running these experiments is relatively low. That same flexibility leaves dangling code paths in the production code. The result is dead experimental code paths, which are a type of System Debt.
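
Here is a minimal Python sketch of how such paths pile up. The flag names and the scoring logic are invented for illustration. One experiment shipped, one was abandoned, and both branches still live in the production function. Treating a disabled flag as a candidate for cleanup is a simplifying assumption, not a rule.

```python
# Hypothetical experiment flags left behind in production code.
EXPERIMENT_FLAGS = {
    "use_title_boost": False,   # experiment ended, branch never removed
    "use_new_tokenizer": True,  # experiment that shipped
}


def score_page(text, title, flags=EXPERIMENT_FLAGS):
    tokens = text.lower().split()
    if flags.get("use_new_tokenizer"):
        # Shipped experiment: strip trailing punctuation from tokens.
        tokens = [t.strip(".,!?") for t in tokens]
    score = tokens.count("cat") / max(len(tokens), 1)
    if flags.get("use_title_boost"):
        # Dead path: nothing in production turns this flag on any more.
        score += 0.5 if "cat" in title.lower() else 0.0
    return score


def disabled_flags(flags):
    """List flags that are switched off, as candidates for code removal."""
    return sorted(name for name, enabled in flags.items() if not enabled)


print(score_page("A cat. Another cat!", "Cats"))  # 0.5
print(disabled_flags(EXPERIMENT_FLAGS))           # ['use_title_boost']
```

Every extra flag doubles the number of combinations that could, in principle, need testing, so reviewing a list like this periodically keeps the production path readable.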

This concludes Model and System Debt. The next article will cover the two remaining types of Debt, Data and External Debt. Please comment on this article if you would like to know how to tackle Technical Debt in Data Science.

References:

[1] M. Fowler. Refactoring: improving the design of existing code. Pearson Education India, 1999.

[2] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD ’13: Proceedings of the 2013 international conference on Management of data, pages 577–588, New York, NY, USA, 2013.

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young. Machine Learning: The High-Interest Credit Card of Technical Debt
