Data Governance is about Control, not Quality

Steve Jones
Collaborative Data Ecosystems
11 min read · Nov 29, 2021

Data Quality doesn’t matter. In fact, Data Quality is a fool’s errand, a genuine anti-pattern when it comes to achieving business goals. I’ve said before that successful MDM programs shouldn’t focus on quality, and in the years since I’ve more and more come to the realization that Data Quality is the problem, not the solution.

There are three types of data quality issues:

  1. Data in the source system doesn’t reflect reality
  2. Master Data fails to uniquely identify ‘the thing’
  3. Someone messes stuff up in a data pipeline

I could argue that point 2 is really point 1, but I think Master Data is sufficiently different from transactional data to justify its own issue. I’ll start with point 3 though.

People choose to mess up the data pipeline

When data is messed up somewhere in the data pipeline it’s either a bug (in which case fix it) or it’s about conflicting controls and authority. I was once in a meeting for a company where the CFO asked all the FDs (Finance Directors) of the different divisions a simple question:

“If every division is reporting over 30% margin, why is the corporate margin 20%?”

The answer was that the corporate margin was audited, while the divisional reporting was operational. In other words, people were going about their day-to-day management with an assumption that they were doing 50% better than they were in reality. This was a deliberate thing. Not malicious, but the result of a series of decisions that helped inflate divisional margin, which drove bonuses and other financial incentives. People focused on things that improved the operational margin, but did nothing when it came to the audited margin. The fix was simple, and painful: margin calculations would, from that point forwards, be done by central finance for all divisions. Cue much exploding from the divisional heads who didn’t want their reported margins to be slashed.
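To make the arithmetic concrete, here is a minimal sketch with entirely made-up numbers (nothing to do with the actual company): each division calculates margin against only the costs it chooses to count, while the audited corporate figure also carries the central costs that nobody claims.

```python
# Toy illustration with made-up numbers: every division reports a 30%+ margin
# against the costs it chooses to count, while the audited corporate margin
# also carries the central costs nobody claims.

divisions = {
    # revenue and the direct costs each division counts against its own margin
    "North": {"revenue": 100.0, "direct_costs": 68.0},
    "South": {"revenue": 150.0, "direct_costs": 103.0},
    "West":  {"revenue": 120.0, "direct_costs": 82.0},
}
central_costs = 43.0  # shared overheads, intercompany effects etc. (hypothetical)

for name, d in divisions.items():
    divisional_margin = (d["revenue"] - d["direct_costs"]) / d["revenue"]
    print(f"{name}: divisional margin {divisional_margin:.0%}")  # all come out above 30%

total_revenue = sum(d["revenue"] for d in divisions.values())
total_costs = sum(d["direct_costs"] for d in divisions.values()) + central_costs
print(f"Corporate (audited) margin: {(total_revenue - total_costs) / total_revenue:.0%}")  # 20%
```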

This wasn’t a data quality issue, it was a question of control: a question of who owned the right to state what the operational margin was, and whether that could differ from the corporate audited margin. Now this is a very clear-cut case, but I’d argue that a lot of data pipeline inconsistency issues stem from exactly the same point. So the sales team calculates LTV (customer lifetime value) differently from the finance team; is this OK? Are the sales people allowed to be more bullish and focus on a different way of viewing the customer that helps drive growth, while the finance team uses the same term for something that focuses more on profitability? My point here is that unless it’s an actual bug, inconsistencies in data pipelines tend to stem from individuals being allowed or able to create those inconsistencies, as the sketch below illustrates.
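Here is a sketch with two hypothetical definitions of “LTV” for the same customer; the formulas, figures and function names are assumptions for illustration, not anyone’s actual calculation. Both are internally consistent, they just answer different questions.

```python
# Two hypothetical definitions of "LTV" for the same customer. Neither is a
# bug; they are two different governance decisions about what the word means.

def ltv_sales(avg_order_value: float, orders_per_year: float, years: float) -> float:
    """Growth-oriented view: projected gross revenue over the relationship."""
    return avg_order_value * orders_per_year * years

def ltv_finance(avg_order_value: float, orders_per_year: float, years: int,
                gross_margin: float, discount_rate: float) -> float:
    """Profitability-oriented view: discounted margin over the relationship."""
    annual_profit = avg_order_value * orders_per_year * gross_margin
    return sum(annual_profit / (1 + discount_rate) ** y for y in range(1, years + 1))

# Same customer, same inputs, two very different numbers labelled "LTV".
print(ltv_sales(200, 6, 5))                                                    # 6000.0
print(round(ltv_finance(200, 6, 5, gross_margin=0.4, discount_rate=0.1), 2))   # ~1819.58
```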

People choose to prioritize process over data

So onto point 2 and master data. If you can’t uniquely identify the ‘thing’, be it a person, location, product or anything else, then it comes down to either a data collection issue at the start of the process or a willful design of the IT landscape to be based not around data but around processes. At many manufacturers I’ve seen both of these cases on a regular basis: the former where the shipping address isn’t captured, because Sales doesn’t need it to be good for Finance to bill, and thus for Sales to get their bonus; or, on one notable occasion, when I had to review a ‘failing’ MDM program and found that the corporate bonus structure paid twice the bonus for ‘new’ customers as for existing ones. Cue lots of ‘new’ customers that weren’t. Change the process so you got 25% of the new-customer bonus in the first month and the remaining 75% at the end of the quarter, only if the customer was truly new (post the MDM process), and guess what… magically the “MDM” issues went away, almost like it wasn’t an MDM issue at all.

The other part is the domination of ‘process’-oriented systems, with Master Data delegated to way, way after a transaction has happened. One way to sound smart if you turn up at a manufacturer is to ask the following question:

“Do you have problems tracking a specific order from lead to cash to service & support?”

You’ll probably get a “How on earth did you know that?” and an assumption that you’ve already got some in-depth knowledge about the business; the reality is that it’s a common challenge. The challenge is normally that different systems create different identifiers, different BOMs from sales to manufacturing to distribution, and generally the ERP systems just act as if they are isolated islands of perfection. Then the poor data team tries to come in and add some order, then there are a few acquisitions, and suddenly you are looking at 3 Product MDMs and 4 PLM systems and a total inability to get decent visibility. I wrote a paper once for the board of a company titled “The Myth of ‘as long as it ships’”, explaining how their lack of control over data, and their process centricity, was ensuring they’d struggle to compete.

Again this isn’t about data quality, it’s about governance, it’s about control. The various business departments all had their own process systems, and all felt very comfortable in not having any horizontal visibility as long as ‘in their silo’ it was good; and if it wasn’t, they could complain about another silo, or best of all about the central IT team trying to herd the mess into something coherent. Either way, it’s an issue that is about control and the authority to make an organization data centric.

People choose to have bad data

The last piece that really underlines how Data Governance isn’t about Quality is where the majority of bad data comes from: source systems. It’s wrong at source because either someone doesn’t want to fund the project to fix the data in the source, or fundamentally doesn’t care about the data errors being created operationally, as the impact is only felt further downstream. The MDM challenge I mentioned earlier was also an example of this, where sales KPIs drove bad data, but I’ve also seen it where R&D teams had a ‘mandatory’ field of “Lead Time” for a part, and because it was mandatory, but irrelevant to them, they just put in the fastest thing to type, which was normally 1 day. This led to massive planning issues on early production runs, and the blame was heaped on R&D.

The reality, though, was that it was dumb to ask R&D to populate that field in the first place; it isn’t their job to know part lead times. So the solution was to default the field to 9999 years, meaning that when someone did a planning run, any part whose lead time had been left to R&D (i.e. the default) would apparently take ten millennia to deliver, and hence people would clean up the data. In this scenario control had been given to people who neither wanted it nor had the authority to care about it, and because everything was process oriented nobody had really thought about the challenge.
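A minimal sketch of that absurd-default trick, assuming a simple parts table; the field names and the sentinel handling are illustrative, not the actual planning system.

```python
# Sketch of the absurd-default trick (field names and values are illustrative):
# instead of forcing R&D to invent a lead time, default it to something a
# planning run cannot silently swallow, so the people who own lead times fix it.

import datetime

DEFAULT_LEAD_TIME_DAYS = 9999 * 365  # deliberately absurd sentinel, roughly ten millennia

parts = [
    {"part": "BRKT-001", "lead_time_days": 14},                      # maintained by procurement
    {"part": "PCB-207",  "lead_time_days": DEFAULT_LEAD_TIME_DAYS},  # nobody has owned this yet
]

today = datetime.date.today()
for p in parts:
    if p["lead_time_days"] >= DEFAULT_LEAD_TIME_DAYS:
        # the sentinel is impossible to ignore in a planning run
        print(f"{p['part']}: lead time never set, nominally available around year "
              f"{today.year + p['lead_time_days'] // 365}; fix the master data first")
    else:
        print(f"{p['part']}: available {today + datetime.timedelta(days=p['lead_time_days'])}")
```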

Another example is the number of CRM systems that don’t have real-time matching and merging capabilities. What better place to identify that the ‘new’ customer is actually a historical customer than when the sales person is actually with them? Yet often the CRM delegates customer identification (the goal of MDM) to a post-transactional ‘data steward’, not treating fast, accurate customer identification as important. Some CRM systems don’t even have merge facilities, so even once a match is identified it’s a cumbersome process to create an accurate operational view.
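Here is a minimal sketch of what matching at the point of entry could look like, using only Python’s standard library; the customer names and similarity cutoff are assumptions, and a real CRM would match on far more than a name.

```python
# Minimal sketch of matching at the point of entry (illustrative data only).

from difflib import get_close_matches

existing_customers = {
    "Acme Industrial Ltd": "CUST-0041",
    "Globex Manufacturing": "CUST-0102",
}

def suggest_existing(entered_name: str, cutoff: float = 0.8) -> list[str]:
    """Return existing customer names that look like duplicates of the entry."""
    return get_close_matches(entered_name, list(existing_customers), n=3, cutoff=cutoff)

candidates = suggest_existing("Acme Industrial Limited")
if candidates:
    print("Possible existing customer(s):",
          [(name, existing_customers[name]) for name in candidates])
else:
    print("No likely match; safe to create a new record.")
```

The point isn’t the matching algorithm; it’s that the question “is this really new?” gets asked while the sales person can still answer it.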

Visibly choosing to have bad data is a good thing

Sometimes data quality isn’t important, sometimes it’s not actually something that drives business performance, and sometimes it’s actually the bad data that helps you address other challenges. As long as you are making a business-level statement that you are not going to have decent data in an area because the benefits are not worth the investment, then that is perfectly acceptable. In fact it tends to be my starting point in any data governance program: if the business doesn’t see the value in improving data quality, then don’t.

Another rule that I believe in is: give the business their bad data fast. And I mean fast in two ways: firstly, the time delay from transaction to analytics should be low; secondly, the time taken from them asking for the data to getting it should be low. Rather than spending weeks or months ‘cleaning’ the data, give them the unvarnished truth of their reality, including a dashboard that says just how bad the data is. I’ve found it quite interesting how quickly certain things can change when the business gets to see how bad the data is. It also changes the dynamic: IT stops being the ones to blame for Data Quality and becomes the people who highlight the issues, pushing the actual blame onto the one group of people who can make the operational changes required.
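A minimal sketch of that idea, with hypothetical records and checks: deliver the raw data as-is, alongside a simple count of how many records fail basic reality checks.

```python
# Sketch of "ship the bad data, with a badness score": deliver the raw records
# untouched, plus simple reality-check counts so the business sees the state
# of its own data. Records and checks are hypothetical.

records = [
    {"customer": "Acme Industrial", "ship_to": "12 Dock Rd, Leeds", "order_value": 1800},
    {"customer": "acme industrial", "ship_to": None,                "order_value": 0},
    {"customer": "Globex",          "ship_to": "Unit 4, Tampere",   "order_value": -50},
]

reality_checks = {
    "missing ship-to address": lambda r: r["ship_to"] is None,
    "non-positive order value": lambda r: r["order_value"] <= 0,
}

print(f"records delivered (unvarnished): {len(records)}")
for check_name, fails in reality_checks.items():
    bad = sum(1 for r in records if fails(r))
    print(f"  {check_name}: {bad}/{len(records)} ({bad / len(records):.0%})")
```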

Another example of where IT can get it wrong on data quality was at a company that had tried to create a single global customer MDM for over a year, but the business didn’t care. Why didn’t the business care? Well, because sales were regional, so there was zero value in a global customer MDM, and because the business models in some regions involved direct-to-consumer while others went via distributors, there was absolutely no way to get folks on board. So the solution? Concentrate on regional customer identification aligned to the business models. Sure, they ended up with “more” Customer MDM instances, but they did so because that matched their business model.

A final example of where bad data is good came from a manufacturing company. We had sensors in testing monitoring how their product was working, and occasionally we’d get a bad read from the sensors, the sort of stuff that was beyond the laws of physics, so we discarded those. Roll on a couple of months and the sensor team turns up and asks “where is our data on the sensor performance?”. In other words, for that team the bad data was actually good data, as it helped them improve the sensors.
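A sketch of the pattern that would have kept both teams happy, assuming simple temperature readings and an illustrative threshold: rather than silently dropping the ‘impossible’ reads, route them to a separate stream for the team that considers them good data.

```python
# Sketch of routing rather than discarding: reads that are physically
# impossible are bad data for product analytics but good data for the team
# improving the sensors, so keep both streams. Thresholds are illustrative.

readings = [
    {"sensor": "S1", "temp_c": 22.4},
    {"sensor": "S1", "temp_c": -420.0},  # below absolute zero: impossible
    {"sensor": "S2", "temp_c": 23.1},
]

def physically_possible(reading: dict) -> bool:
    return reading["temp_c"] >= -273.15  # absolute zero in Celsius

product_stream = [r for r in readings if physically_possible(r)]
sensor_quality_stream = [r for r in readings if not physically_possible(r)]

print("product analytics:", product_stream)
print("sensor performance (the 'bad' reads):", sensor_quality_stream)
```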

Data Stewards and Data Quality efforts in IT are just sticking plaster over a control problem

I’ve set up Data Governance teams multiple times, and by far the most effective have been those, such as Know Your Customer (KYC) in a bank, where there is real authority around the process: not only the ability to push new requirements onto operations, but also a very large regulatory stick which ensures the data stewards and individuals in the team are absolutely empowered in their challenge.

However, at most companies the data stewardship task is a thankless one: post-transactional folks fighting a deluge of upstream issues and getting blamed by folks downstream. That is why, in those cases, I’m a big believer in identifying the source of issues, so when people complain “this data is rubbish” you can say, “totally agree, go and speak to Dave, it’s his sales team”. The point is that a post-transactional data steward team that can only focus on internal pipeline cleansing efforts is indicative of a company that has made a conscious, or unconscious, decision that data doesn’t matter and data certainly won’t be driving the business forwards.

Data Governance is about the authority to approve bad data

So if data has value, and if companies truly want to become data driven, then Data Governance has to stop being about ‘data stewards’ and back-office attempts to patch up the issues. It needs to be about operational governance, and it needs to be based on the principle that:

Principle: The person who has the authority to change the business process to fix the data operationally is the person who should be governing the data organizationally.

And that means changing the direction to being about control of data rather than data quality. The governance model should actively approve bad data. That sounds a bit silly, but it really isn’t, because this is about people. If someone is simply made accountable for data, then it’s easy for it all to fall back into the same back-office areas. What you need is an explicit decision whenever the data isn’t of a sufficient level of quality, where quality has only one measure:

How accurately does the data reflect reality?

Seriously, that is the only measure. The goal of data is to enable you to make decisions based on reality; every other data quality measure is secondary. This doesn’t mean that all data has to be totally accurate, just the data that the business relies upon to make its decisions. Indeed, for some decisions the data quality may not need to be perfect.

These, however, are business decisions in a data-driven organization. A proactive decision that a current set of data is not worth enough to improve should be a visible decision, one that can be communicated as a decision, and one that can be challenged if another business area needs the data improved and can look to fund that improvement.
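One way to make such a decision visible is to record it explicitly. The sketch below is an illustrative structure, not a standard; every field name is my own assumption.

```python
# Illustrative structure only: an explicit, owned, dated record of "we are not
# going to fix this data", which another business area can see and challenge.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class BadDataApproval:
    dataset: str
    known_issue: str
    approved_by: str        # the person with authority over the business process
    rationale: str
    review_by: date         # when the decision is revisited or can be challenged
    challenges: list[str] = field(default_factory=list)

register = [
    BadDataApproval(
        dataset="supplier_lead_times",
        known_issue="many parts have no maintained lead time",
        approved_by="VP Supply Chain",
        rationale="improvement not worth the investment this year",
        review_by=date(2022, 6, 30),
    )
]

# A business area that does need the data can challenge, and offer to fund, it.
register[0].challenges.append("Planning: willing to co-fund cleanup for EU plants")
print(register[0])
```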

So rather than looking at Data Governance as being about improving quality, think about it as being about who can formally approve the bad quality. From experience, people rapidly change their approach when asked to formally say “I don’t care about this data” rather than being able to complain about back-office processes.

Data Ecosystems demand business control of data

Collaborative Data Ecosystems are fundamentally shifting the dynamics of business, with organizations working together to drive new collaborative value.

The business impact of data collaboration

The Capgemini Research Institute has identified significant business advantages from becoming a data collaborator. These are not things that can be resolved by back-office data stewardship roles: if a pharmaceutical company is collaborating with hospitals on COVID response, that isn’t going to work if data quality is left to post-transactional clean-up. If a retailer is working with a digital fitness brand, you need a whole business model established around data collaboration.

The reason why your current data governance process needs to be about control and not quality is that when you work externally it is all about control: about how you work with others, about the trust you have in them to collaborate and the trust that they place in you. Your new Data Governance challenge isn’t to improve data quality, it’s to improve the business control of data and to make data the driver of business value that it needs to be if the organization is to remain relevant.

The Data Driven Business starts with Data Control

Lots of people are saying that they want to be data driven, data powered, AI powered or similar. But most are not really looking at what that means in terms of data control; instead they keep historical power structures and just expect them to work differently. We need to recognize that historically we have had process-driven businesses, with process-driven IT systems, and data was, at best, a secondary concern, and arguably not even that.

Chief Data Officers, CIOs and business leaders themselves need to look at the organization and ask the questions ‘where can I have control of data?’ and ‘who is empowered to take control of data?’. Just like with processes, this won’t be a single person for the organization: just as Sales Officers, Operating Officers, Finance Officers and Supply Chain Officers have historically been accountable for process areas, so the new generation of leaders will need to be accountable first for data control, because without control there is no effective data execution, which means there can be no effective data collaboration.

Choosing to use legacy models to drive new behaviors is doomed to fail.

Data Governance isn’t about Quality, it’s about Control


My job is to make exciting technology dull, because dull means it works. All opinions my own.