A Deep Dive into the Data Governance Ecosystem — The Tooling Landscape 2/3

Joshua Olusanya
Illuminate Financial
7 min read · Apr 13, 2023

In the last blog, we discussed the evolution of data governance (DG) as a practice and where we think the ecosystem is heading. If you haven’t had a chance to read it, feel free to check it out here.

This post will be the first of two concerned with dissecting the data governance tooling ecosystem, where we’ll be breaking down several key segments and exploring the major trends shaping the businesses operating in these buckets.

The Data Governance Software Ecosystem

When we were pulling this map together, one thing was very clear — the data governance product landscape, contrary to popular belief, goes beyond just data cataloguing.

It encompasses tools from several adjacent disciplines — data management, data operations, compliance, and data security/privacy — and can be a confusing space to navigate.

What constitutes a data governance solution [or falls under its umbrella] is a very subjective topic, so building an objective landscape was next to impossible. Nevertheless, before plunging in, two things to note:

1. The boundaries are largely artificial. Many of the companies included address multiple categories, and in some cases are expanding to other buckets, which made it difficult to ascribe a perfect label. We’ve endeavoured to place companies in categories that we understand to be their core focus. E.g., we placed Talend in the data integration bucket, but we are aware that they have their own data catalogue, data quality and mapping solutions.

2. There’s an abundance of data governance products in the market, so it goes without saying that this map is not exhaustive, but rather a function of the companies we have engaged with and/or are aware of. If there are any key players that we may have omitted, please let me know and I’ll happily update the landscape.

Without further ado, let’s dive in.

The Data Governance Software Landscape

Data Catalogues

At the core of any data governance product tool kit lies a data catalogue (DC) — a searchable inventory of all the data assets across all data sources within an organisation. These tools use metadata (descriptive information about data) to provide context on those assets, such as storage location, format, owner, and lineage, enabling users to understand what exists in their data estates. DCs generally come in one of two forms: discovery-focused DCs and compliance-focused DCs.

1. Discovery-focused DCs are optimised for analyst productivity and data discovery and are geared towards middle-market businesses.

2. Compliance-focused DCs are optimised for security and access control, which are chief concerns for enterprises. They embed access protocols to ensure the compliant use of data.

The line between the two types has become blurred in recent times, with many tools offering functionality that supports both use cases.
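To make the inventory idea concrete, here’s a minimal sketch of a catalogue entry and a naive search over it. The schema and names (`CatalogEntry`, `search`, the sample assets) are illustrative assumptions, not any vendor’s actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One asset in a searchable data inventory (hypothetical schema)."""
    name: str
    location: str            # e.g. a warehouse table or bucket path
    fmt: str                 # e.g. "table", "parquet"
    owner: str
    tags: list = field(default_factory=list)

def search(catalog, term):
    """Naive full-text search across names, owners and tags."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower()
            or term in e.owner.lower()
            or any(term in t.lower() for t in e.tags)]

catalog = [
    CatalogEntry("orders", "warehouse.sales.orders", "table", "data-eng", ["pii", "sales"]),
    CatalogEntry("clicks", "s3://logs/clicks/", "parquet", "analytics", ["web"]),
]

print([e.name for e in search(catalog, "sales")])  # -> ['orders']
```

Real catalogues harvest this metadata automatically from connected sources rather than hand-curating it, which is precisely the DC 1.0 → 2.0 shift discussed below.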

A Catalog of Data Catalogs by Castor

Data lineage (DL), on the other hand, is concerned with tracking and visually representing the flow of data from source to destination within an organisation. DL tools provide a record of where data assets originated, the changes they have undergone over time, who has accessed them, and their relationships with other assets. These tools exist to ensure organisations can trust their data. Whilst there are a handful of standalone data lineage solutions, like Manta and Datakin, most data catalogues [and some data quality solutions] have lineage functionality baked in.
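At its simplest, lineage is a directed graph from sources to destinations. The sketch below (hypothetical table names, plain Python) shows how a tool can answer “what feeds this asset?” by walking the graph upstream.

```python
# Lineage as a directed graph: each key is a source asset, each value
# lists the downstream assets it feeds. Table names are made up.
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.daily_revenue"],
    "raw.customers": ["mart.daily_revenue"],
}

def upstream(asset, edges):
    """Return every asset that feeds (directly or transitively) into `asset`."""
    sources = set()
    for src, dests in edges.items():
        if asset in dests:
            sources.add(src)
            sources |= upstream(src, edges)
    return sources

print(sorted(upstream("mart.daily_revenue", lineage)))
# -> ['raw.customers', 'raw.orders', 'staging.orders']
```

Production tools build this graph automatically by parsing SQL, pipeline code or query logs; the hard part is the discovery, not the traversal.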

Specific to data lineage, we’ve observed a couple of defining trends. For one, there’s been an emergence of industry specific tooling, like Mapor for financial services and Octopai for healthcare.

Secondly, these industry-specific lineage tools have been hard to adopt because they typically require the whole organisation to conform to a single standard way of mapping lineage.

As one would expect, automating the mapping process (akin to data discovery, but now discovering the lineage of data) is a feature in high demand. Whether we will see a material breakthrough here remains to be seen; more likely, automation will be use-case or industry specific.

Taken holistically, data catalogues have come a long way since their inception in the 1990s. Led by the likes of Informatica and Infogix, they started off as on-prem, static data inventories that relied on manual input to curate and document data (DC 1.0). However, increased data volumes, brought about by the emergence of cloud technology, meant that these traditional tools rapidly became antiquated.

The next wave of solutions aimed to automate a significant portion of the documentation process and introduced additional features (e.g. tracking lineage and managing access), whilst enhancing metadata collection (DC 2.0). These products were obvious upgrades to traditional tooling but were not perfect.

Data teams were rapidly accruing documentation debt (masses of data that need to be catalogued, classified and mapped), as retrofitting data catalogue solutions was a huge pain (especially for legacy firms). As a result, catalogues quickly became out of date and tools were reduced to shelf-ware. It also didn’t help that DCs were hard to embed in workflows: the prospect of using third-party solutions/websites to browse metadata was disruptive to productivity.

We expect that data catalogues [and standalone, industry-agnostic lineage tools] as we know them today will become largely commoditised, as they serve as a natural foundation for a data governance tool kit. We’re already starting to see this trend play out. Some businesses have gone the acquisition route, e.g. Astronomer’s acquisition of Datakin and Collibra’s acquisition of SQLdep, whilst others have gone the in-house product expansion route to supplement their core products — à la Immuta supplementing its data security offering and Tableau its business intelligence offering.

That said, there will be a small crop of promising early stage players with enough critical mass to expand to other data governance/management buckets to open up new revenue streams.

Alongside the consolidation of DC 2.0 tools, we’re at the cusp of a paradigm shift in the data catalogue tooling market. Dubbed data catalogue 3.0, or active metadata management, a new wave of companies such as Castor, Atlan and Select Star is pushing the boundaries set by their predecessors. Beyond current DC product functionality, these tools are laser-focused on productivity and on equipping teams with the data they need to do their jobs.

Features like integrated intelligence (e.g. notifying database owners to describe and label their data when a new base is created), NLP technology (to aid in the documentation process) and better integrations will enable these solutions to become embedded in analyst workflows.

Data Quality & Observability

Data quality (DQ) is arguably one of the oldest data governance competencies and is by no means a new priority, but recent fundraising activity indicates that it still represents a burning problem for data teams. The expansion of data analytics has exposed a multitude of DQ challenges, and considering the sheer volume of resources being poured into big data initiatives, these challenges are no longer tolerable.

Data quality is concerned with the overall accuracy, completeness, and consistency of data within an estate. It is the degree to which data conforms to expected standards, such as accuracy, comprehensiveness, timeliness, legitimacy, and reliability.

Data Quality Metrics — Source

DQ tools often take the shape of a rules-based monitoring system that keeps track of live data in pipelines and delivers an alert when it notices an anomaly.
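A rules-based monitor of this kind can be sketched in a few lines of Python. The rules and column names below are made-up examples; real tools run such checks continuously against live pipelines rather than against a single batch.

```python
def run_checks(rows, rules):
    """Evaluate each rule against a batch of rows; return alert messages
    for every rule that fails (an empty list means the batch is healthy)."""
    return [msg for name, check, msg in rules if not check(rows)]

# A batch with two deliberate problems: a null email and a negative amount.
rows = [
    {"email": "a@example.com", "amount": 10.0},
    {"email": None,            "amount": -5.0},
]

# Hypothetical rules: (name, predicate over the batch, alert message).
rules = [
    ("no_nulls",     lambda rs: all(r["email"] is not None for r in rs), "null emails found"),
    ("non_negative", lambda rs: all(r["amount"] >= 0 for r in rs),       "negative amounts found"),
]

print(run_checks(rows, rules))  # -> ['null emails found', 'negative amounts found']
```

The commercial systems differ mainly in how the rules are defined (learned from historical data vs. hand-written) and in what happens after an alert fires.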

It’s important not to conflate data quality with data observability (DO). At a high level, DO focuses on the overall health of an organisation’s data across its estate. In theory it goes beyond monitoring data estates for anomalies and is also concerned with providing insights on how to resolve breakages and prevent future incidents.

Functionality-wise, DO systems like Monte Carlo Data and BigEye usually comprise tools for automated monitoring, automated root cause analysis, data lineage, and data health insights.

I’ll be the first to admit that the line between the two disciplines can sometimes be blurred, with many data observability and data quality tools offering ostensibly similar functionality. Nevertheless, the snowballing need for greater visibility across data estates has led to an influx of funding in the space, with the likes of Monte Carlo ($236m raised), BigEye ($66m raised) and Anomalo ($39m raised) all completing substantial fundraises in the last 18 months.

Monte Carlo Data $135m Series D — Source

Data quality and data observability tools aren’t perfect. Alert fatigue, for example, is a common challenge with these systems: users are bombarded with notifications about breakages and have no logical way of triaging them. Some enterprises have struggled with this so much that they’ve had to build their own filtering logic on top of these tools to manage the alerts that come through.
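The filtering layer such teams end up building can be as simple as a severity floor plus an allow-list of critical tables. A hypothetical sketch (the alert shape and table names are assumptions, not any vendor’s API):

```python
from collections import namedtuple

Alert = namedtuple("Alert", "table severity message")

def triage(alerts, min_severity=2, critical_tables=()):
    """Surface only alerts above a severity floor or touching critical tables,
    sorted most-severe first, so low-value noise doesn't bury real incidents."""
    kept = [a for a in alerts
            if a.severity >= min_severity or a.table in critical_tables]
    return sorted(kept, key=lambda a: -a.severity)

alerts = [
    Alert("mart.revenue", 3, "row count dropped 40%"),
    Alert("staging.tmp",  1, "schema drift on unused column"),
    Alert("mart.orders",  1, "minor freshness lag"),
]

print([a.table for a in triage(alerts, critical_tables={"mart.orders"})])
# -> ['mart.revenue', 'mart.orders']
```

That this logic is trivial to write is rather the point: it’s table stakes that arguably should ship inside the tools themselves.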

So, it goes without saying that there’s still room for improvement in DO and DQ solutions. The future of this space will likely involve further adoption of AI to automate away these challenges, alongside human-in-the-loop (HITL) methodologies to deal with edge cases. Likewise, the implementation of data quality practices across the entire data lifecycle will become more commonplace as businesses aim to maintain integrity and reliability throughout their estates.

In the next instalment of this deep dive, we’ll break down the data access management, data mesh enablement and data-ops sub-categories.

About Illuminate Financial

Illuminate Financial is a thesis-driven venture capital firm dedicated to fintech and enterprise software companies building technology solutions for financial services.

Illuminate’s LPs and strategic partners include some of the largest and most well-respected banks and financial institutions who supply diverse market and industry knowledge.

Website | LinkedIn | Twitter
