Data Debt: what it is and how to estimate it

Petr Travkin
May 27, 2020

Why Data Debt?

In my previous article on DataOps I mentioned immature data pipelines as one of the obstacles to deal with on the way to a data-driven organization. In many companies the data solution development is still a handcrafted and largely non-repeatable process with minimal reuse. The result is both a plodding development environment that can’t keep pace with the demands of a data-driven business and an error-prone operational environment that is opaque and slow to respond to change requests.

Immature development and delivery processes force business users to build their own solutions that result in an ever-expanding universe of data silos, intensely fractured data environments and general data heterogeneity.

These examples can be considered Data Debt, which stems naturally from the way companies do business, especially when they run the business as a loosely connected portfolio or make “free rider” decisions about Data Management and Governance.

The Data Debt concept can be used as a strong argument in discussions with key stakeholders for driving the establishment of a company’s new data-related processes and policies. However, to perform this function, Data Debt should be precisely defined and estimated, preferably in terms of financial risks and opportunities.

The Data Governance and DataOps disciplines serve as instruments for paying Data Debt down, reducing it, or (in everyone’s dreams) avoiding its accumulation in the future.

Lean and DataOps perspective

Many Data Engineering or Analytics teams are too busy delivering outputs or maintaining systems to stop and think about how and why they work the way they do. Organizations are thrilled to apply advanced analytics to any business area, but very rarely employ it to improve their own workflows. As a result, the long-established way of doing things is almost always passively accepted. Looking closely, there is always waste in Data Engineering, Data Science, and Analytics efforts. The Lean approach can be used to find and eliminate that waste, thus directly or indirectly dealing with Data Debt.

Excessive, time-wasting processes and artifacts

Extra processes result in unnecessary effort that does not create value. They include duplicating data and transformations in multiple data stores across the organization, or using a complex algorithm when a simpler one would have worked just as well. Unreproducible work due to a lack of configuration management is another major cause of time waste among inexperienced data teams. Documentation that is never used, detailed project plans and estimates that are impossible to adhere to, and status updates that decision-makers ignore can be considered process waste as well.

The DataOps principles “Apply Agile process and software engineering best practices” and “End-to-End Processes and Continuous Improvement” are a good starting point for eliminating wasteful processes.

Waiting for everything

Waiting for people, data, and systems is a massive cause of waste in the data processes. Examples of waste include time spent finding and understanding data, waiting for access approval, waiting for people to be assigned, waiting for software and systems to be provisioned, waiting for data to be made available, and waiting for jobs to run on slow machines.
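To make the scale of this waste tangible, here is a back-of-the-envelope sketch in Python. All of the hours, team size, and rates below are hypothetical placeholders for illustration, not benchmarks:

```python
# Hypothetical annual cost of "waiting" waste for one data team.
# Every figure here is an illustrative assumption, not a measurement.

wait_hours_per_week = {
    "finding_and_understanding_data": 4,
    "access_approvals": 2,
    "environment_provisioning": 1.5,
    "slow_jobs": 3,
}
team_size = 8                  # people affected by the waiting
loaded_hourly_rate = 75        # assumed fully loaded cost per person-hour
weeks_per_year = 46            # working weeks, net of vacation

annual_cost = (
    sum(wait_hours_per_week.values())
    * team_size
    * loaded_hourly_rate
    * weeks_per_year
)
print(f"${annual_cost:,.0f} per year lost to waiting")
# With these placeholder numbers: $289,800 per year
```

Even with modest assumptions, a few wasted hours per person per week compound into a six-figure annual cost, which is exactly the kind of number that makes Data Debt discussions concrete for stakeholders.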

Implement the DataOps “Reuse and Automate” principle: automate wherever possible and reuse existing artifacts to avoid unnecessary rework and repetition.

Incorrect problem definitions and defects

A correct problem definition is surprisingly hard in Analytics and especially in Data Science. Solving the wrong problem, usually due to poor communication and unaligned expectations, is considered defective work. Data, as well as code, can have bugs. Poor-quality data is the number one challenge claimed by data scientists. Defects in data and code lead to wasted effort finding and fixing problems. Writing good code on top of bad data is merely a case of garbage in, garbage out.

General Software Engineering, Lean, and DataOps practices can help here by uncovering defects in the data as early as possible: build mistake-proofing tests so poor-quality data does not enter data pipelines, stop the line when it does, fix the problem, and add a new test so the error cannot cause a problem again.
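A minimal sketch of such a mistake-proofing ("stop the line") test might look like the following. The schema and validation rules here are hypothetical examples, not from any particular pipeline:

```python
# A minimal mistake-proofing data test: refuse to pass bad records downstream.
# The "orders" schema and its rules are hypothetical, for illustration only.

def validate_orders(rows):
    """Collect all violations, then stop the line by raising on any defect."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            errors.append(f"row {i}: missing order_id")
        if not isinstance(row.get("amount"), (int, float)) or row["amount"] < 0:
            errors.append(f"row {i}: invalid amount {row.get('amount')!r}")
    if errors:
        # Poor-quality data never enters the pipeline.
        raise ValueError("; ".join(errors))
    return rows

good = [{"order_id": 1, "amount": 9.99}]
bad = [{"order_id": None, "amount": -5}]

validate_orders(good)          # passes and returns the rows unchanged
try:
    validate_orders(bad)       # stops the line
except ValueError as e:
    print("rejected:", e)
```

When a new defect slips through, the fix is to repair the data and add one more rule to the validator, so the same error cannot recur silently.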

Incomplete work that never goes to production

Partially done work doesn’t help an organization make decisions or improve customer experience. Not thinking about interpretability, or failing to explain solutions clearly to stakeholders, delays or cancels implementation unnecessarily. Partially done work includes documented requests that data engineers or data scientists haven’t started working on, untested code and data pipelines, unmonitored pipelines and models, and unrecognized benefits of a created data product. The biggest waste, though, is work that never makes it into production: work stuck on someone’s laptop because the business never considers it when making a decision.

To deal with these issues, the DataOps principles “Integrate with your customer and deliver business value”, “Collaboration and Communication”, and “Keep it simple” can be used to align any prototype or concept with your stakeholders before further development is planned or carried out.

Multitasking and excess motion of personnel

Any data-related discipline is sophisticated and requires deep focus and concentration to solve problems. Multitasking imposes high switching costs as we move from one task to another and wastes time. Multitasking also wastes the potential early contribution of work. Excess motion across separate and dispersed teams causes waste through handoffs and travel. Every handoff leaves behind tacit knowledge of the data, which is very difficult to capture in a document. This is not even about DataOps or engineering practices; it is simply common sense.

Not fully utilizing expensive data talents

Last but not least among Data Debt related problems is the suboptimal use of data talent. While data storage, computing, and software costs are constantly declining, people are the most expensive and valuable resource in data-related processes. Therefore, making maximum use of their talent should be a priority. Good data analysts, data scientists, and data engineers are hard to find and expensive to hire, yet their skills are often wasted on work below their ability and qualification level, or they are not engaged in decision-making processes.

Data Governance perspective

Cost of poor Data Management

“Data Debt” is a term borrowed from the Agile development world and the concept of “technical debt.” From a financial perspective, Data Debt can be defined as the amount of money required to fix data problems. It is a way to communicate the cost of mismanaging data, stated as an obligation to the future to fix those problems.

Alternatively, a company could look at all of its current data issues and produce a rough estimate of the cost to fix them all. Until the debt is paid, an organization will always pay more to maintain its data landscape than it would by gradually investing in reducing Data Debt.

Data Debt can take various forms and here are just some basic examples:

  • Costs of poor Data Life-Cycle Management. Excessive storage costs (and, in some cases, software licenses) due to data not being archived or deleted in a timely manner. This applies not only to production environments, but also, and especially, to development, testing and user sandboxes.
  • Costs of waiting for data to be delivered and made accessible. This covers everything mentioned in the previous section, from poor data engineering practices to administrative tasks that delay getting data into the hands of business users.
  • Costs of Data Quality issues. These consume enormous cost and resources, and usually serve as the primary manifestation and metric of a functional Data Governance program. It is therefore important that businesses measure the current costs and risks associated with data quality.
  • Costs of data redundancy. The duplication of data elements across the entire data landscape, which can be extremely time-consuming to fix.
  • Costs of data export. What does it cost to extract data from a system and provide a file to users? Most organizations do this, and most have no idea how risky or expensive it really is.
  • Costs of misalignment. It can be as simple as the excessive cost of running three projects to deliver the same data to three departments.
  • Costs of data at risk (personal and other regulated data). The amount of data that can create risk through misuse, poor quality, or compliance issues, which in turn can lead to fines.
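Putting the categories together, a first rough Data Debt estimate can be as simple as summing annualized costs per category. The figures below are hypothetical placeholders, not benchmarks:

```python
# A back-of-the-envelope Data Debt estimate: annualized cost per category.
# All figures are hypothetical placeholders for illustration.

data_debt_costs = {
    "life_cycle_management": 120_000,   # excess storage and licenses
    "waiting_for_data": 250_000,        # hours lost * loaded labor rate
    "data_quality_issues": 400_000,     # rework, bad decisions, support load
    "data_redundancy": 90_000,          # duplicated pipelines and stores
    "data_export": 60_000,              # ad hoc extracts and file handoffs
    "misalignment": 150_000,            # duplicated projects
    "data_at_risk": 200_000,            # expected compliance exposure
}

total_debt = sum(data_debt_costs.values())
print(f"Estimated annual Data Debt: ${total_debt:,}")
# With these placeholder numbers: $1,270,000
```

Even a crude per-category breakdown like this is useful: it shows leadership where the debt concentrates and which Data Governance measures would pay it down fastest.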

With these examples in mind, let’s take a closer look at the Data Monetization areas, all of which can factor into calculating Data Debt and developing measures for paying it off.

Cost of missed opportunities

There is always work being done to highlight what will happen or continue to happen. However, what cannot happen without proper data-related processes and policies is usually overlooked. It is absolutely necessary to recap existing issues with data, reporting, poor content management, scary compliance issues, or the high cost of ownership, not only in terms of factual losses, but also in terms of missed future opportunities.

A good source of metrics for seeing whether Data Governance is working is to categorize business initiatives by their usage of data, including advanced analytics and AI. Any Data Governance activity in support of these data monetization categories can be measured by the amount of benefits received or, from the Data Debt perspective, the amount of benefits lost.

Consider these categories a starting point for estimating the impact of Data Debt on the organization’s ability to use data to improve business outcomes:

  • Processes: improve cycle time, lower cost or improve quality
  • Competitive advantage: gain competitive intelligence and create differentiators
  • Product Development: identify a new product or feature
  • Intellectual Capital: embed knowledge into products and services
  • Human Resources: enable employees to do better work
  • Risk Management: reduce various types of risk (financial, data related or legal compliance)

Looking closely at the list above will help you start gathering metrics that can be used to assess the amount of accumulated Data Debt, reveal to leadership the huge cost of delaying the “right things” with data, and show the potential value of a Data Governance program.

Conclusion

If a business makes decisions without considering their impact on, or use of, data, there are costs that occur in the future when dealing with the resulting lack of consistency, errors, and redundancy. As with any other future obligation, there is interest to be paid: problems become more expensive to fix later than they are now.
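That interest can be illustrated with a small compounding sketch. Both the principal (today's cost to fix) and the growth rate are purely assumed figures for illustration:

```python
# Hypothetical illustration: unaddressed Data Debt compounding over time,
# as new systems and decisions are built on top of bad data.
# Principal and growth rate are illustrative assumptions, not measurements.

principal = 1_000_000          # today's estimated cost to fix data issues
annual_growth = 0.15           # assumed yearly growth of the fix-it cost

for year in range(6):
    cost = principal * (1 + annual_growth) ** year
    print(f"year {year}: ${cost:,.0f}")
```

Under these assumptions the cost to fix roughly doubles within five years, which is the usual argument for paying Data Debt down sooner rather than later.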

With the rise of data-driven organizations, Data Governance has been reborn to show its undeniable value. In measuring that value, and the value of data in general, the concept of Data Debt can be considered a new field of infonomics.

Like any other debt, Data Debt needs to be paid or written off. Data Governance implementation becomes a Data Debt repayment process and companies may use Data Debt as a guidepost to drive its adoption.
