The two critical steps to reach domain oriented ownership

Ugo Ciracì
Agile Lab Engineering
12 min read · Jun 6, 2024

Lack of data ownership is the origin of all the issues.

Today there are countless cases where a centralised data team is in charge of the whole data value chain, from operational systems to a serving layer and back to operational systems (reverse ETL).

When this game starts, everyone is very happy about it: a single central data team with the best geeks building amazing pipelines. All is good.

After some months or years, problems start to surface, first around data quality and compliance.

But, why?

The saga of the operational vs. analytical teams

The most popular ownership model is system/technology oriented.
The same data with the same meaning (that is, the same information) can appear multiple times in multiple systems. CRMs, ERPs, etc. are usually managed by system experts (experts in SAP, Oracle, Salesforce, etc.).

In the operational space, there is a strong overlap between technology and functional knowledge. Such systems are huge, and there are generations of professionals specialised in managing a specific sector through a specific tool. This orientation is strong enough that technology has shaped the mindset on how to reason about the business. It is visible even in the language used to define requirements: words and concepts are often inherited from operational platforms.

Data is data. So, the operational vs. analytical split is just a technical problem. We would be very happy if we could attach analytics directly to operational systems without affecting anyone using those systems.

But… that is not the case (yet). Since operational systems are the source of all data, they are probably also the source of all issues.

Why don’t we fix the problem where it occurs?

Operational systems do not need analytical data; it is the other way around.
So, why should operational teams care about analytical data?
Once they extract data for a data team, they are done!

Obviously, this is not true, but ownership is defined by technology boundaries and companies feel safe with this approach. With this mindset, several other issues appear:

  • Budget is not enough. Since operational teams build applications, there is no budget for data analytics.
  • Skills are not sufficient. Tons of consultants who are experts in specific technologies cannot easily handle other technologies.

Actually, the wild habit of copying data here and there started when operational systems had to integrate with each other. Since no one wanted to touch anyone else’s system, there was an implicit agreement that a good compromise was copying data between areas (systems/technologies).

A data copy is requested by a consumer; thus, the consumer is responsible for the data copy.

To resolve this common issue, it could seem reasonable to enable the company through a central data team that takes ownership of data copies, building the necessary data integration among systems.

Nowadays, operational systems address this integration need through microservices, and the copying problem has moved to the boundary between operational and analytical systems. Thus, the same data team can maintain these copies to serve data analytics.

Data ingestion, preparation, enrichment, cleansing, wrangling, and every other sort of data manipulation between operational systems and traditional column-based OLAP systems have flooded the data management literature for generations, going through normalization and denormalization, Kimball, Inmon, and a huge choice of enterprise data management tools.

A central data team doing ingestion from every data source in the company can keep pace with operational changes only if the organization is very small.

Every schema change from operational systems can affect downstream pipelines.
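
To make this concrete, here is a minimal sketch (in Python, with hypothetical column names and types) of the defensive schema check a downstream pipeline ends up running just to notice that something changed upstream:

```python
# Minimal sketch of a schema compatibility check a downstream pipeline could run
# before ingesting a source table. The expected schema and column names are
# hypothetical; the point is that the consumer has to detect the change.

EXPECTED_SCHEMA = {                 # what the pipeline was built against
    "customer_id": "string",
    "order_total": "decimal(10,2)",
    "created_at": "timestamp",
}

def check_schema(actual_schema):
    """Return human-readable incompatibilities; an empty list means compatible."""
    issues = []
    for column, expected_type in EXPECTED_SCHEMA.items():
        if column not in actual_schema:
            issues.append(f"missing column '{column}'")
        elif actual_schema[column] != expected_type:
            issues.append(f"column '{column}' changed type: "
                          f"{expected_type} -> {actual_schema[column]}")
    return issues

# Example: the operational team renamed 'order_total' to 'total_amount'.
incoming = {"customer_id": "string", "total_amount": "decimal(10,2)",
            "created_at": "timestamp"}
for issue in check_schema(incoming):
    print("schema drift detected:", issue)
```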

Who is responsible for it?

Technology-oriented ownership doesn’t balance the forces within a single work stream. In fact, the central data team is subject to every decision made by the operational teams.

Whenever a data quality issue originates in an operational system, the central data team must chase it from where it becomes visible back to the right operational system and table. The cost of troubleshooting is caused by the operational team, but it is charged to the data team. Also, if the operational team doesn’t fix the problem, it will occur again and again on every ingestion from the same data source. In order to mitigate such issues, and to reduce costs and effort, data teams are willing to invent any kind of compromise, including:

  • patching data
  • patching business logic from operational systems.

This is obviously an immense technical debt, tolerated under an implicit company amnesty.

Let’s consider data contracts

Now, consider data contracts as a means to regulate the data flowing from operational systems to analytical ones. Data contracts cannot be a manual practice, otherwise they just don’t scale in the organization. Tackling ownership with manual artifacts makes it even worse: it would be the n-th time-consuming process on top of many others. Let’s see why.

Data contracts should be delivered as a self service capability that enables data producers to provide the right guarantees to data consumers. This looks reasonable at first glance, but digging into it more closely, you should see that we are reversing the logic.

Whoever delivers a data copy is responsible for it.

Do you agree with me that adding data contracts doesn’t automagically solve the ownership problem?

Putting a data contract between technical producers and technical consumers says nothing about the organisation around it.
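
For concreteness, a data contract mostly carries technical guarantees. The sketch below is a hypothetical shape (invented dataset name, fields, and thresholds, expressed in Python for brevity); notice that nothing in it decides which team actually produces those guarantees:

```python
# Hypothetical sketch of a data contract as a declarative, machine-checkable
# artifact published by the producer. Dataset name, fields, and thresholds are
# invented; real formats (often YAML-based) vary by platform.

from dataclasses import dataclass, field

@dataclass
class DataContract:
    dataset: str
    owner: str                       # the accountable team, not a technology
    schema: dict                     # column -> type the producer guarantees
    freshness_minutes: int           # maximum acceptable staleness
    quality_checks: list = field(default_factory=list)

contract = DataContract(
    dataset="billing.invoices",
    owner="billing-team",
    schema={"invoice_id": "string", "amount": "decimal(10,2)",
            "issued_at": "timestamp"},
    freshness_minutes=60,
    quality_checks=["invoice_id is unique", "amount >= 0"],
)
```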

As a central data team, I own the data integration tools and my dear data platform. I want to put a clear boundary between what’s mine and what is not. Thus, I start talking about data contracts around the company to disseminate the practice, and I show how to split accountabilities in the organisation in different ways.

From now on, our SAP team is gonna be responsible for producing data quality metrics, schema validation, and observability metrics for each data set they deliver to the data platform.

And people will start laughing at this. Because the SAP team is a SAP team: they don’t manage other technologies, and there is a technology gap around the data contract.

Having a technology oriented organisation brings exactly this kind of problem.
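
To appreciate the size of the gap, here is a rough, purely illustrative sketch of the per-data-set evidence such a team would be asked to publish (column names and metrics are invented):

```python
# Rough sketch of the per-dataset evidence a producing team would be asked to
# publish: quality metrics, schema validation, observability. Column names and
# the metrics themselves are invented for illustration.

def publish_dataset_evidence(rows, expected_columns):
    """Compute the metrics a data contract would require for one data set."""
    total = len(rows)
    rows_with_nulls = sum(1 for r in rows if any(v is None for v in r.values()))
    schema_valid = all(set(r.keys()) == expected_columns for r in rows)
    return {
        "row_count": total,                                           # observability
        "null_row_ratio": rows_with_nulls / total if total else 0.0,  # quality
        "schema_valid": schema_valid,                                 # schema validation
    }

evidence = publish_dataset_evidence(
    rows=[{"order_id": "42", "amount": 10.5},
          {"order_id": "43", "amount": None}],
    expected_columns={"order_id", "amount"},
)
print(evidence)  # {'row_count': 2, 'null_row_ratio': 0.5, 'schema_valid': True}
```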

The two moves

Two moves are necessary to enable data contracts from an organisational point of view:

  • Reorganisation. Management must agree to steer teams’ skillsets to become cross functional (moving people, hiring, and other non-trivial actions).
  • Technical enablement. The platform team must reduce both cognitive load and effort to the minimum.

Ok, now it should be clear why we need self service capabilities.

Opinion-based decision making

A part of the organisation can take ownership of data only if doing so is extremely easy. Top management will believe in this practice if they see efficiency, more quality, and savings. Of course, this must be measured, and this is a chicken-and-egg problem.

How can I invest money in something (data contracts, self service capabilities) when I don’t know the extent of the problem it is going to solve? As a data executive, why should I bet this is the right move for my company? Why and how should I take this responsibility without numbers to base decisions on?

In fact, no one actually measures anything in the data value chain. And this is a big issue.

Technology over practice

Measuring the costs and efficiency of a technology is relatively easy. Usually, software platforms are equipped with radiators (dashboards) explaining how well you are spending your money by using them. At the same time, competition makes software vendors fight for the lowest price. At the end of the year, it is much easier to evaluate the costs produced by one vendor’s Kafka cluster and compare them with another vendor’s Kafka product than to measure processes and organisations. Since price vs. cost is the only available number, data executives will opt to focus on technologies rather than practices to “improve” the overall status quo.

Pursuing the utopia of information

Cost saving (opting for cheaper tools) provides a local optimisation of the value stream. Process optimisation can improve the overall spending and performance of the value stream, but it requires gauging a certain amount of complexity.

That’s why inheriting the best lessons from lean management is a good idea. If you are a data executive and you are not measuring anything, trust traditional lean management and start measuring. Data contracts and self service capabilities are a means to measure.

Big enterprises are big, and the central data team is often not unique. Most likely, there are many central data teams owning data copies spread across the company, and every central data team owns a data platform. In the era of big data, copying data from operational systems to a data lake to overcome the limits of traditional technologies was fine: multiple operational systems, multiple data systems, local technology oriented data centralisation.

Unfortunately, data teams are always much smaller than operational teams. Thus, data teams suffer the burden of implementation. But the blanket is short: if the data team allocates capacity to building new use cases, it reduces the capacity for resolving data incidents, and vice versa.

This is not easy to prove. As I was saying, in data engineering there is a lack of measurement across the data value chain. It’s difficult to show what your numbers are, and usually the Head of Data, the CDO, the Head of Data Governance, and similar roles rely on their own (gut) feeling rather than on metrics. Long story short, it looks like the data practitioners’ world keeps ignoring the experience of software engineering (for example, DORA metrics).
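
Just as an illustration of what borrowing from software engineering could look like, here is a sketch adapting DORA-style measurements to the data value chain; the incident records, deployment counts, and field names are all hypothetical:

```python
# Sketch of DORA-style measurements adapted to a data value chain. The incident
# records, deployment counts, and field names are hypothetical.

from datetime import datetime, timedelta
from statistics import mean

incidents = [   # data incidents: when detected, when resolved
    {"detected": datetime(2024, 5, 1, 9), "resolved": datetime(2024, 5, 1, 17)},
    {"detected": datetime(2024, 5, 7, 10), "resolved": datetime(2024, 5, 8, 12)},
]
pipeline_deployments = 25   # pipeline changes shipped in the period
failed_deployments = 4      # deployments that caused a data incident

mttr_hours = mean((i["resolved"] - i["detected"]) / timedelta(hours=1)
                  for i in incidents)
change_failure_rate = failed_deployments / pipeline_deployments

print(f"mean time to restore data: {mttr_hours:.1f} h")     # 17.0 h
print(f"change failure rate: {change_failure_rate:.0%}")    # 16%
```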

Also, the cognitive load could reach a certain invisible limit: how many domains of knowledge can a single team be an expert in? Unless the data team expands, this cognitive capacity is a constraint to take into account.

It should be clear to everyone that this approach is not sustainable, especially if we don’t fix the problem at the source.

Anyway, data must be consumable. The path between the data available from operational systems (user-generated data) and data consumers (users, analysts, decision makers) is what I call the data value chain. Whatever we put in the middle has the purpose of making data consumable and more informative.

Every company strives to extract information from data to make good decisions. But data is information and information is data. Thus, reaching perfection is a good utopia to aim at through continuous improvement.

Data analytics and domains

Once everything converges into a single data team and technology, we may ask ourselves how to enable data consumers to use all our data.

A combination of a data management paradigm and data practices is necessary to fill in the data value chain. DWH, Data Mesh, Data Lake, and Data Lake House address the HOW, while data practices tell us WHAT is necessary to make data consumable.

Data warehouses have passed through multiple approaches to data modelling, strongly influencing data architectures from star schemas to Data Vault.

In the era of big data, starting from a data lake architecture, we have been adding multiple layers to make data consumable, while keeping on the central data team the responsibility to apply all the data practices necessary to guarantee that a dataset is consumable. DWH and Data Lake House experiences tend to converge in terms of data modelling techniques, since they are actually addressing the same problems, only for different data volumes and integrity/consistency guarantees.

Anyway, domains are not autonomous in building their own journey from data to data consumers, since they must pass through the central data team. Thus, decentralisation of ownership is not possible. This comes with many other issues. Let’s scratch the surface.

Ingestion and data engineers

The ingestion layer contains raw data. This requires technical integration, where the central data team is responsible for extracting data and loading it into a format different from the one available in the operational source. That is, the central data team is responsible for a format transformation that preserves the original data values. This is not considered a real transformation, since we are neither changing the meaning of the data nor applying any business logic. Usually, this part of the central data team is skilled in data engineering tasks. Still, they must know the meaning of the data at least enough to apply the right technical transformation. Here, data quality is important to check for data source and conversion issues.
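
A minimal sketch of such a format-preserving load, assuming pandas with a Parquet engine available and placeholder file paths:

```python
# Minimal sketch of a format-preserving ingestion step: same values, different
# format (CSV -> Parquet), plus basic checks for source and conversion issues.
# Assumes pandas with a Parquet engine (pyarrow) installed; paths are placeholders.

import pandas as pd

def ingest_raw(source_csv, target_parquet):
    # Read everything as strings so no value is silently reinterpreted.
    df = pd.read_csv(source_csv, dtype=str)

    # Data quality at this stage: source and conversion issues only,
    # no business logic.
    if df.empty:
        raise ValueError("source extract is empty")
    if not df.columns.is_unique:
        raise ValueError("duplicate column names in source extract")

    df.to_parquet(target_parquet, index=False)

# ingest_raw("crm_customers_extract.csv", "raw/crm_customers.parquet")
```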

Curation and data stewards

Getting to the curation layer means being able to make sense of data. In fact, a traditional data governance team starts working from this stage to enrich data sets with business metadata in a catalog. The data governance team should know how to label, through business terms, data sets coming from any area of knowledge of the company. Often, they chase business analysts and development teams to keep pace with the many data sets entering the data management platform. They must understand the data coming from the business, otherwise enrichment is impossible. Dealing with such a massive number of data sets is very difficult, and data stewards are overwhelmed by the amount of work.

Data quality at this stage checks whether normalisation and standardisation work.
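
As a toy illustration of this stage, here is a curation step with an invented glossary term, column name, and standardisation rule (assuming pandas):

```python
# Toy curation step: attach business metadata and verify that standardisation
# worked. The glossary term, column name, and rule are invented; assumes pandas.

import pandas as pd

BUSINESS_GLOSSARY = {   # business terms a data steward would maintain in a catalog
    "cust_cntry": "Customer Country (ISO 3166-1 alpha-2 code)",
}

def curate(df: pd.DataFrame) -> pd.DataFrame:
    # Standardisation: trimmed, upper-case ISO country codes.
    df = df.assign(cust_cntry=df["cust_cntry"].str.strip().str.upper())

    # Data quality at this stage: did normalisation/standardisation work?
    bad = df[~df["cust_cntry"].str.fullmatch(r"[A-Z]{2}")]
    if not bad.empty:
        raise ValueError(f"{len(bad)} rows with non-standard country codes")

    # Business metadata travels with the data set (e.g. pushed to a catalog).
    df.attrs["glossary"] = BUSINESS_GLOSSARY
    return df

curated = curate(pd.DataFrame({"cust_cntry": [" it", "DE", "fr "]}))
```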

Aggregation: the back and forth between business analysts and data engineers

The aggregation layer is the story of business analysts asking data engineers to build data sets the engineers know almost nothing about.
This is real transformation: we apply business logic to build new data from existing data. Data quality checks whether the business logic is effective, whether we provide the right level of integrity, and whether we can rely on it. Usually, the aggregation layer is more informative: data is available in a shape more familiar to the business. Data consumers expect to access the aggregation layer to build their own reports, analytics, and KPIs in the consumption layer.
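
For instance, a toy aggregation step could look like the sketch below; the columns, the business rule, and the reconciliation check are invented:

```python
# Toy aggregation step: business logic builds a new data set from existing data,
# and data quality now checks the logic itself. Columns, the business rule, and
# the reconciliation check are invented; assumes pandas and a datetime column.

import pandas as pd

def monthly_net_revenue(invoices: pd.DataFrame) -> pd.DataFrame:
    # Business logic: net revenue per month, excluding cancelled invoices.
    valid = invoices[invoices["status"] != "cancelled"]
    out = (valid.assign(month=valid["issued_at"].dt.to_period("M"))
                .groupby("month", as_index=False)["amount"].sum()
                .rename(columns={"amount": "net_revenue"}))

    # Quality check on the business logic: the aggregate must reconcile with
    # the sum of the valid source rows.
    assert abs(out["net_revenue"].sum() - valid["amount"].sum()) < 1e-6
    return out
```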

Domain ownership and self service capabilities

From user-generated data to data consumers, the journey is challenging. Large enterprises have a lot of complexity in the middle, spanning multiple methodologies, practices, areas of knowledge, and systems.

Self service platform

A self service platform provides data owners with capabilities to easily ingest, curate, aggregate, and serve data autonomously, without technical complexity. Whoever builds data should be able to care only about data-related concerns and ignore the rest as much as possible. Since metadata is information on how to manage data, it is a primary concern for data owners.

The self service platform is the real technical enabler for data ownership. Without self service capabilities, we cannot move data ownership to areas of knowledge, because we cannot simplify the data generation process.

A self service platform can enable domain oriented ownership independently of the data management paradigm (data lake, data lake house, data mesh, data warehouse, etc.). In fact, if it is easy for a data producer to build and maintain a data set, that is enough to embody the data set within a data lake house layer or a data product.
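
As a purely illustrative sketch, the producer-facing surface of such a platform could be reduced to a small declarative descriptor, with provisioning, scheduling, and policy enforcement handled by the platform; every name and field below is invented:

```python
# Purely illustrative: what a data producer could be asked to declare on a
# self service platform, leaving provisioning, scheduling, and policy
# enforcement to the platform itself. Every name and field below is invented.

dataset_descriptor = {
    "name": "customers.churn_scores",
    "owner": "customers-domain-team",
    "source": "curated/customers",             # platform-managed location
    "transformation": "sql/churn_scores.sql",  # the only logic the producer writes
    "schema": {"customer_id": "string", "churn_score": "double"},
    "sla": {"freshness": "24h", "availability": "99.5%"},
    "classification": "internal",              # drives compliance-by-design policies
}

# A hypothetical platform API would take it from here:
# platform.deploy(dataset_descriptor)
```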

Central data teams do not usually work on self service platforms. They have data capabilities they use to craft use cases from scratch. A data team should act as a platform team, leveraging platform engineering principles through strict standardisation of architectures and data interfaces.

Reorganisation

A self service platform creates an opportunity to exchange value between data producers and data consumers. Given a self service data platform, the only remaining impediment is how the organisation wants to shift the mindset from technology oriented to domain oriented ownership.
This means that rather than having SAP or Salesforce teams, companies should aim at having cross functional teams like Work Order, Billing, Assets, Customers, and Sales.

Resources can be shared or dedicated depending on the demand for a specific skillset and the capacity available to build the business cases. Anyway, it is essential that a data owner is delineated within the knowledge area. Having stable teams is valuable, but it is reasonable that growing in a certain area could require hiring or moving people, rewinding the tape through a path of maturity like the famous forming, storming, norming, and performing group dynamic (ref. Tuckman). In terms of team topologies, we would aim at stream aligned teams rather than complicated subsystem teams (technology oriented), where our north star should be our area of knowledge and anything that makes this knowledge easily consumable (self service capabilities).

Conclusions

In this article I’ve explored why companies struggle with data ownership and which enablers they should invest in to resolve the origin of all the problems:

  • A self service platform is a technical enabler of ownership, putting data producers in a position to build reusable, compliant-by-design data assets;
  • A reorganisation is necessary to reshape human resources around stream aligned teams able to leverage their knowledge to build solid data solutions.



“Once you figure this out, young Bill, you will be well on your way toward understanding the Three Ways” From: Kim, Gene. “The Phoenix Project“