5 Reasons Why I’m Paying Attention to Data Governance (and You Should Be Too)

Avoid reaching your “tipping point” by using modern approaches and technology to revisit this topic today.

Kelly Kohlleffel
Hashmap, an NTT DATA Company
10 min read · Feb 8, 2022

“Data Governance” word cloud surrounding a rectangle formed by two arrows on a blue background

Let’s start with a question.

Which of these “governance” solution categories has already been fully solved and successfully implemented within your organization?

  • Metadata Management & Data Catalog
  • Data Observability & Data Monitoring
  • Data Discovery & Data Lineage
  • Data Quality & Data Usability
  • None of the Above
  • All of the Above

If you answered “None of the Above,” then you will find yourself in the majority of respondents in a recent survey that I sent out. Quite frankly, the fact that 64% of respondents answered this way isn’t surprising to me at all. So why am I not surprised by this?

For a data category that has been projecting a market in the billions for some time, and based on the amount of venture capital (VC) funding flowing into each of the categories listed above for both early-stage and more established companies, we should all have the keys to the data governance kingdom by now.

Since the days of “just collecting data” are long gone, everyone is ultimately chasing both usability and trust in their data products and services. At the same time, everyone is looking to simplify the process so they can shorten the window of time to get there.

This leads me to believe that we are at a tipping point (and I’ll come back to that). But first…

Why is this space so confusing?

Well, there are likely quite a few reasons that could leave someone scratching their head or wondering where to go next when exploring the topic of data governance. While there are more than what I’ll be able to list, I can help you get an idea of how to tackle some of the most common data governance questions.

1. Modern Software Companies Are Avoiding the Term Data Governance

Although data governance is a key aspect of the modern data stack, it’s evident that modern data stack “governance” solutions downplay governance as a concept. Just check out a few websites for companies in this area. To be honest, I don’t blame them at all. I actually think that the strategy is sound when you consider the amount of time and money that has been poured into data governance projects over the last 15–20 years. Previous generations of solutions have provided limited return or success — remember when MDM was a “hot” topic?

Beyond time and money, data governance projects have historically required significant process changes in order to succeed. The key, and this isn’t always easy, is to find someone who will stand up and say, “It was worth it.”

2. Be Ready for Category Crossover

There is no all-in-one acronym that defines governance solutions, and there is a significant amount of overlap across categories. Plus, vendors in this space don’t want to be pigeon-holed into a single category or product area, and they want to avoid the stigma of being known as “just” a data catalog or a data governance solution. Both category descriptions have been around for many years, and multiple generations of products have been aligned with previous data approaches (think traditional on-premises, Hadoop, etc.) versus the approach that’s needed for a modern data stack.

When looking at governance solutions for a modern data stack, you’ll see a single technology vendor offering a combination of the various categories. So what do you do with this information? Start by determining if, and how, a particular product fits across a range of capabilities, such as CAT-DISCO-LIN-QUAL-META, OBSERV-QUAL-DISCO, or META-QUAL-USE-CAT. Then, look at how each product compares as a whole.
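To make that comparison a little more concrete, here is a minimal Python sketch of a capability coverage check. The capability tags reuse the shorthand above, but the product names, their capability sets, and the `REQUIRED` list are all placeholders that I made up for illustration. Swap in your own requirements and the vendors on your shortlist.

```python
# Capabilities we care about, using the shorthand tags from the text.
# This particular "required" set is an assumption for the example.
REQUIRED = {"CAT", "DISCO", "LIN", "QUAL", "META"}

# Hypothetical products and the capabilities each one claims to offer.
PRODUCTS = {
    "product_a": {"CAT", "DISCO", "LIN", "QUAL", "META"},
    "product_b": {"OBSERV", "QUAL", "DISCO"},
    "product_c": {"META", "QUAL", "USE", "CAT"},
}


def coverage(required: set, offered: set) -> tuple:
    """Return (fraction of required capabilities offered, set of gaps)."""
    if not required:
        raise ValueError("required must contain at least one capability")
    missing = required - offered
    return (len(required) - len(missing)) / len(required), missing


def rank_products(required: set, products: dict) -> list:
    """Rank products by coverage, best first; ties are broken by name."""
    rows = []
    for name, offered in products.items():
        score, missing = coverage(required, offered)
        rows.append((name, score, sorted(missing)))
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    for name, score, missing in rank_products(REQUIRED, PRODUCTS):
        print(f"{name}: {score:.0%} covered, missing {missing}")
```

A simple coverage fraction like this treats every capability as equally important, which is rarely true in practice. It is meant as a starting point for the "compare as a whole" step, not a scoring method.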

Additionally, if you are just now taking another look at solutions, expect to find your greatest number of options, from both a technical and commercial standpoint, in the cloud and delivered as a SaaS or PaaS offering. Why is that? Well, the market demand for SaaS is only going up. Take a look at another result from our recent survey.

Speaking of category crossover, solutions in this broad category are looking for acceptance from a broader audience than just data “custodians,” and they are angling for user crossover as well. Examples include:

  • Data analysts
  • Data engineers
  • Data scientists
  • Analytics engineers
  • Business analysts
  • Plus Traditional IT & Data Stewards

3. What About My Existing Cloud Data Platforms and Tools?

Relax, you don’t need to throw out your existing data stack. It’s imperative that I point out that many of the modern cloud data platforms and tools that we use today, such as Snowflake, Databricks, BigQuery, Fivetran, Matillion, and dbt, just to mention a few, either have already added, or are incrementally adding, pieces and parts of the data governance puzzle to their offerings. These types of offerings include making metadata more available and accessible, enhancing lineage traceability and end-to-end visibility, and providing more active monitoring and observability.

For instance, check out Snowflake’s private preview feature called Object Dependencies. Snowflake describes an object dependency as meaning that “in order to operate on an object, the object that is being operated on must reference metadata for itself or reference metadata for at least one other object.” Expect to see more examples of services like Snowflake building governance capabilities and features directly into the platform.

4. How Does Data Mesh Fit In?

The introduction of the concept of data mesh is pushing many organizations to contemplate using “decentralized data product ownership” vs “centralized data as an asset” in their approach to data governance. If you haven’t had the chance to dig into data mesh yet, which was introduced by Zhamak Dehghani with Thoughtworks in 2019, I recommend her O’Reilly eBook (which Starburst makes available here). Also, Kent Graziano and I focused on data mesh during a recent Hashmap on Tap podcast, breaking it down and discussing its applicability (listen here).

A core concept of data mesh is that when you move from a centralized data-as-an-asset approach to data as a product, you also move the governance to those individual data products. At a minimum, that includes the metadata associated with each data product.
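As a rough illustration of metadata traveling with the data product, here is a small Python sketch. The `DataProduct` fields and the idea of a "minimum metadata" check are my own assumptions for the example. Data mesh does not prescribe this particular shape, and your organization will have its own list of what counts as required.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A hypothetical descriptor that a domain team publishes with its data."""
    name: str
    owner_team: str
    description: str = ""
    schema: dict = field(default_factory=dict)  # column name -> type name
    tags: list = field(default_factory=list)


def missing_metadata(product: DataProduct) -> list:
    """List the minimum metadata fields (assumed here) that are still empty."""
    gaps = []
    if not product.owner_team.strip():
        gaps.append("owner_team")
    if not product.description.strip():
        gaps.append("description")
    if not product.schema:
        gaps.append("schema")
    return gaps


if __name__ == "__main__":
    orders = DataProduct(
        name="orders",
        owner_team="sales-analytics",
        description="Daily order facts",
        schema={"order_id": "string", "amount": "decimal"},
    )
    print(missing_metadata(orders))  # nothing missing for this product
```

The point of the sketch is ownership: the team that publishes the product is also the team that answers for the gaps, rather than a central group chasing them down after the fact.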

5. Data Quality FTW

With all the new buzzwords and new categories being promoted, one old standby continues to hang in there. Data quality, as in usable, trusted data, is still the top concern overall for organizations, and this is arguably the toughest aspect to get right in a data product and for a data program.

Here’s a survey stat that emphasizes the continued importance that the community puts on data quality:

Not only did data quality rank the highest in terms of importance out of the four categories, but more than half of respondents (57%) said that data quality was their most important concern related to data governance. Next, you’ll see data observability taking the silver by one-tenth of a point over metadata management and data catalog.

Realistically, you could make the case that if your data observability is strong, then you have better data quality, reliability, freshness, etc., and the same thing goes for the other categories. If I’ve got a reliable, trusted data catalog and consistent metadata management, then my data quality score should go up.
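To show how observability signals feed a data quality view, here is a minimal Python sketch of two common checks, freshness and volume. The function names, the six-hour freshness window, and the 20% volume tolerance are all assumptions for illustration. They are not taken from any particular observability product.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_loaded_at, max_age, now=None):
    """True if the dataset was loaded within max_age of now (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= max_age


def volume_within_tolerance(row_count, expected_rows, tolerance=0.2):
    """True if row_count is within +/- tolerance (a fraction) of expected_rows."""
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    return abs(row_count - expected_rows) / expected_rows <= tolerance


def health_report(last_loaded_at, row_count, expected_rows, now=None):
    """Combine both checks into a simple pass/fail summary."""
    checks = {
        # Assumed freshness window of six hours.
        "fresh": is_fresh(last_loaded_at, timedelta(hours=6), now),
        "volume_ok": volume_within_tolerance(row_count, expected_rows),
    }
    checks["healthy"] = all(checks.values())
    return checks


if __name__ == "__main__":
    loaded = datetime.now(timezone.utc) - timedelta(hours=2)
    print(health_report(loaded, row_count=950, expected_rows=1000))
```

Checks like these do not measure quality directly, but a dataset that fails them is unlikely to be trusted, which is why the categories are hard to separate.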

My point is that data quality is highly important, but it shouldn’t necessarily be treated as separate from the other categories. Rather, it’s a foundational aspect of the other areas.

Similarly, take a look at the results of a recent LinkedIn poll that I ran. Data quality is once again in the top importance spot going into 2022, but these results came back a bit different. Metadata management & data catalog ranked second, with data discovery and data lineage taking the bronze.

It is important to note that the LinkedIn poll didn’t allow for definitions for each of the various categories, and it assumed that everyone already had familiarity with the categories and understood the differences. I’m not going to go into detail on that right now, but to help, you can reference the quick highlights below…

Back to the Tipping Point (and Why You Should be Revisiting Data Governance Now)

If you are like most, you are already at the tipping point. Your backlog of data products and services, and the demand for them, continues to grow by the day. None of this is at a steady state in terms of the complexity of data sources, datasets, data types, transformation requirements, overall combinations, and use case patterns.

All of those dimensions continue to get more complex and more demanding, and this can apply even more pressure on the system to deliver usable and trusted datasets, particularly with appropriate governance controls driving those products and services.

If you haven’t checked out some of the recent, more modern companies and their offerings that are making a splash in this area (or you are now at the point of “what’s next?”), I’d like to provide a few solution perspectives and also let you know what’s caught my eye recently.

But first, I want to provide you with some thoughts on criteria that you can use to evaluate offerings.

Start here >>> Check out the Castor blog

It’s impressive what the Castor team has done to contribute to the data community. I highly recommend taking a look at the Castor blog where you’ll find benchmarks, tips & tricks, and other really useful data-focused information.

Additionally, they have an RFI/RFP template available for data catalogs that covers multiple categories, use cases, and features along with commercial alignment.

Included are the following:

  • Product
  • Automation
  • Ease of Use
  • Collaboration
  • Security
  • Deployment
  • Commercial Alignment
  • Plus Discovery-focused vs Control-focused, search and popularity ranking, query history, lineage, tagging, Q&A history, Data Analyst and Data Science friendly

I am also a fan of the thoughtful way they approach their benchmarks, and their Data Observability Benchmark and Key Features is a great example. They consider the following criteria:

  • Deployment support
  • Monitoring framework
  • Threshold setting
  • Interface type
  • High cardinality support
  • Real-time data monitoring
  • Automated features
  • Data Sources
  • Integrations
  • Alert destinations
  • Security
  • Metrics categories tracked
  • Root cause analysis
  • Community
  • Plus: how comprehensive is the monitoring, freshness, volume, distribution, lineage, overall data health, ability to quickly and accurately identify (and potentially prevent) data problems, simple, clean instrumentation, anomaly detection, effective alerting, quick root cause analysis, overall workflow automation, data engineering friendly, real-time monitoring, pipeline testing

There are also some general categories that I suggest you consider regardless of the type of solution that you’re considering and that will apply to just about any component in the modern data stack.

Also >>> Subscribe to Metadata Weekly

Each week, I look forward to reading the perspectives from Prukalpa, Co-Founder at Atlan. In the Metadata Weekly newsletter, Prukalpa curates her recommended reads and shares her thoughts on everything around the modern data stack and metadata. Recently, she focused on the future of the modern data stack including data mesh, metrics layer, reverse ETL, active metadata, and more.

Check out one example, from a recent Weekly, of one of Prukalpa’s viewpoints that I really liked:

“We are entering the third generation of data catalogs. It’s a fundamental transformation from the prevalent old-school, on-premise data catalogs to the modern approach which is built around diverse data assets, big metadata, end-to-end data visibility, embedded collaboration, and the principle that context needs to be available wherever and whenever users need it.”

— Prukalpa, Co-Founder at Atlan

Now What?

I’ve provided you with just a few of these resources so that you can begin to understand the concepts and importance of strong data governance in relation to the modern data stack. Keep an eye out for my next article, where I’m going to clue you in on a few companies that are on my radar for making waves in the data governance space.

My goal is simply to make sure that instead of being overwhelmed by your tipping point, you feel empowered and ready to embrace change and its opportunities for learning and growth.

Additional Resources

Ready to Accelerate Your Digital Transformation?

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges and design and build data products — we can help you shorten time to value!

We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for building new data products as part of our service offerings. We would be glad to work through your specific requirements. Connect with us here.

Kelly leads the Go To Market team (sales, marketing, and alliances) at Hashmap, an NTT DATA Company, and is also the host of Hashmap on Tap, Hashmap’s weekly podcast. He’s been in the data game for over 30 years and prior to Hashmap spent time at both Hortonworks and Oracle. You can connect with Kelly on LinkedIn and follow him on Twitter.
