How Shadow Data Teams Are Creating Massive Data Debt

Diogo Silva Santos
7 min read · Mar 22, 2023


The Dark Side of Quick Time to Market

🚀 We just hit the 150-subscriber milestone 🚀

The last article, about the Past, Present and Future of Data Architecture, was very well received by the data community. That gives me extra energy to keep sharing my real-life experiences in the data world with you.

👋If you are a new reader, my name is Diogo Santos. I write about data product principles, the evolution of the modern data stack, and the journey to data mesh (the future of data architecture).

In today’s article, I will talk about shadow data teams: what they are, how they are born, and their impact on data initiatives. The term is my own.

Please consider subscribing if you haven’t already. Reach out on LinkedIn if you ever want to connect.

Tags: Data Engineering | Data Teams Structure | Agile Data Insights

The race to become a data-driven company

Long gone are the days when data leaders and executives, on a mission to ensure data democratization and impact in the organization, were only concerned about:

  • The data professionals they hire. They should bring enough experience to handle the current data stack while also being prepared for the technologies coming next.
  • The data stack. It should ensure every stage of the data lifecycle works properly, from ingestion to storage, to serving and consumption.
  • C-suite buy-in. It secures enough funding to keep the two points above under control and opens doors across the organization for making transformations with data.

With those three ingredients, data teams over the last two decades were perfectly capable of getting all the data they needed, processing it, and making it available for consumption across the organization’s many potential data use cases.

But what changed then?

  1. The arrival of big data turned small, structured datasets into an uncontrollable volume of data being ingested into the lake. This shift created multiple challenges in the lake’s organization: data models and schema design for the data warehouse became much harder to manage. One could even argue that the data warehouse has lost its true purpose since then.
  2. The proliferation of data sources. Data used to flow through ETL processes from operational systems into an operational data store (ODS), which ultimately fed the data warehouse through a CDC (Change Data Capture) process. Nowadays, traditional ETL serves almost no purpose except in early-stage data organizations; ELT is how data gets into the lake. Beyond operational systems, data also arrives from third-party APIs for enrichment, and server-side and client-side events are collected with tools such as Snowplow.
  3. The increase in potential use cases, driven by new types of data such as text, image, and sound; by newly available NLP, CV, and more traditional ML models that could be installed in a notebook with a simple pip install; and by cloud providers that let data teams process data and train models on powerful CPUs and GPUs.
  4. The diversification of data consumers, the result of years of investment in data democratization and of the huge hype around AI in general. Data teams now support marketing, sales, finance, operations, distribution, human resources, and many other departments across the enterprise.
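The ELT pattern mentioned in point 2 can be sketched in a few lines of Python, with an in-memory SQLite database standing in for the cloud warehouse (the table and column names are illustrative, not from any real system):

```python
import sqlite3

# Extract + Load: raw records land in the warehouse untransformed (the "EL").
raw_orders = [
    {"id": 1, "amount": "19.90", "country": "pt"},
    {"id": 2, "amount": "5.00", "country": "PT"},
]

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (:id, :amount, :country)", raw_orders
)

# Transform: cleaning happens *inside* the warehouse with SQL (the "T"),
# after the data is already loaded -- the defining trait of ELT.
conn.execute(
    """
    CREATE TABLE stg_orders AS
    SELECT id, CAST(amount AS REAL) AS amount, UPPER(country) AS country
    FROM raw_orders
    """
)

rows = conn.execute(
    "SELECT id, amount, country FROM stg_orders ORDER BY id"
).fetchall()
print(rows)  # each row now typed and normalized
```

Contrast this with ETL, where the cast and normalization would happen in a separate processing layer before anything reaches the warehouse.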

Do you know what didn’t change?

  1. The need for collaborative data teams. The demanding transformation in the data world strengthens the need for software engineers, data engineers, and data scientists to work together. Yet these groups have never been as separate as they are now. Software engineers (SWEs) remain disconnected from the analytics cycle while producing most of the data ingested into the lake. Data engineers have become middlemen, consuming data from SWEs who have no incentive to provide correct data. Data scientists, who are closest to the business, are forced to work around these collaboration problems and deliver solutions by themselves, with or without trustworthy data.
  2. The importance of understanding the business context. To deliver impactful data products, every team involved must clearly understand the source domain that generates the data and have business domain expertise. This didn’t change. Today, data professionals need to understand the business even more deeply if they want their data products to be adopted successfully.
  3. The time to insights. With the arrival of agile methodologies and the shift in engineering practices, organizations now expect data insights to arrive quickly as well, quicker than before. Yet most data teams still struggle to access the data they need to build ML models, reports, or dashboards.

Organizations that accumulate these problems are creating a monster that I call the shadow data team model.

The different data teams set up

Generally speaking, there are three ways to set up a data practice, and the choice depends on the size and maturity of the company’s data ecosystem. The last of these, decentralized data teams, is the first step toward data mesh adoption.

The relation between insights velocity and governance

Centralized Data Teams — The most common setup in organizations. There is less freedom to operate because governance is rigid and all the data is centralized. Teams are hyper-specialized in technology, and every change follows a specific governance and validation process. Insights take longer to deliver, but the analytics lifecycle is under control. Yet business consumers are not satisfied.

Shadow Data Teams — This setup is also very common these days. It arises when data scientists perform data engineering on top of the data engineering team’s work, or when an organization’s first data hires are data scientists without data engineers, which means no proper data infrastructure gets put in place.

Decentralized Data Teams — This setup comes from the data mesh shift, where data teams are enabled to work more independently but with a central entity to manage governance.

In this article, I will cover only the shadow data team setup. Future articles will explain the differences between these practices and the main challenges of each.

Shadow Data Teams. What are they?

Shadow Data Team Organizational Model

The shadow data team model is characterized by the creation of new teams (often composed of data scientists) that deliver data initiatives directly to production to achieve better time to market.

New data processing and transformation platforms such as dbt are adopted by the analytical team to “replace data engineering work”.

Business stakeholders are happy in this environment because initiatives go live quickly, creating the false expectation that handling data is “easy”.

Establishing a good data practice in this scenario is impossible: there are no rules, and all that matters in this model is delivering as quickly as possible.

How do we get here?

  1. The business requests a new data feature or product.
  2. Data scientists explore the request and discover that the data they need is either unavailable or not in the expected structure or schema.
  3. Data scientists request a new data pipeline from the data engineers.
  4. Data engineers have a huge backlog of requests and take weeks to months to resolve this one.
  5. Data scientists are pressured to deliver the new data feature/product.
  6. Data scientists decide not to wait and access source systems and first-party databases directly, creating a SQL or dbt pipeline without production standards, CI/CD best practices, or clear ownership.
  7. Data scientists get used to this process, since it gives them greater autonomy, and end up generating massive data debt in the organization.

The Vicious Cycle of the Data Scientist Workflow | Source: The Existential Threat of Data Quality
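Step 6 typically boils down to a one-off script of roughly this shape (the tables and names are hypothetical, with in-memory SQLite standing in for both the source system and the warehouse). Note everything it lacks: tests, schema contracts, retries, CI/CD, and a clear owner:

```python
import sqlite3

# A hypothetical "shadow" pipeline: a data scientist reads the production
# source system directly and writes a derived table straight into the
# warehouse. No tests, no contract, no owner -- the data debt pattern.

source = sqlite3.connect(":memory:")     # stand-in for the production DB
warehouse = sqlite3.connect(":memory:")  # stand-in for the lake/warehouse

source.execute("CREATE TABLE users (id INTEGER, signup_date TEXT)")
source.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "2023-01-05"), (2, "2023-02-11"), (3, "2023-02-20")],
)

# The entire "pipeline": one hard-coded query, run and copied by hand.
rows = source.execute(
    "SELECT substr(signup_date, 1, 7) AS month, COUNT(*) "
    "FROM users GROUP BY 1"
).fetchall()

warehouse.execute("CREATE TABLE signups_by_month (month TEXT, n INTEGER)")
warehouse.executemany("INSERT INTO signups_by_month VALUES (?, ?)", rows)

result = warehouse.execute(
    "SELECT month, n FROM signups_by_month ORDER BY month"
).fetchall()
print(result)
```

It works, and it ships fast, which is exactly why the pattern spreads: nothing here breaks until the source schema changes, and by then nobody remembers who owns the script.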

What are the consequences?

The cost of quick time to market in this scenario is data debt. This debt manifests itself later and on a large scale with the following effects:

  • Inconsistent and unreliable data
  • High maintenance costs
  • Data issues that become hard to troubleshoot
  • Difficulty holding people accountable for the data they generate
  • Unreliable machine learning models
  • Additional costs from unmaintained pipelines, tables, and duplicated data
  • A lake and data warehouse that become harder to navigate

In this scenario, the data engineering team loses credibility because the organization sees results being delivered without them. It becomes hard to demonstrate the negative long-term effects while they are not yet visible.

After a few years, reversing course becomes highly complicated once the collateral damage begins to appear.

Even if the head of data wants to implement significant changes, the business is already benefiting from the agility and speed and is not willing to give them up.

What comes next?

If you have been reading this publication, you will have noticed that I keep highlighting the problems we face in the data industry, mostly because we have been dealing with data from a purely technical perspective.

My objective is to generate awareness around these problems and get people to relate to my examples because they are real-life stories.

Thus, in the upcoming weeks, I will be sharing additional content about:

  1. The increase in data debt
  2. Why a data warehouse is the most important data asset in the organization

After those, I will finally start introducing data mesh.

If this is relevant to you, make sure you subscribe. Feel free to suggest in the comments any other topic you would like to see covered; that will help me focus my writing on your problems.

Make sure you follow me on LinkedIn for more weekly posts. Thank you so much for reading and let’s talk again soon.

Originally published at https://dataproducthinking.substack.com on March 22, 2023.


Diogo Silva Santos

Senior Data Leader | On a mission to simplify Data Mesh for Data Teams. Follow me on LinkedIn - https://www.linkedin.com/in/diogo22santos/ - for more content.