Building Your Connected Data Catalog

Tony Seale
Dec 9, 2022 · 11 min read

How to connect all the data in your organisation together whilst leaving it where it is

What is a Data Catalog?

Gartner defines a data catalog as “an inventory of data assets [that is built] through the discovery, description, and organisation of datasets. A catalogue provides context to enable data analysts, data scientists, data stewards, and other data consumers to find and understand relevant datasets for the purpose of extracting business value.”

People often say that a data catalog delivers an ‘Amazon shopping experience for data’. It is an exciting idea, and most large organisations have already built a data catalog, but as is so often the case with data integration, the promised business value frequently remains unrealised. Why is that?

One of the biggest problems is the lack of standardisation. Different departments have different ways of organising and categorising their data. This can lead to confusion and inefficiency, as users may not know where to find the data they need.

Another problem with data catalogs is that they can become outdated quickly. As data sets are added, modified, or deleted, the metadata in the data catalog may not always be updated to reflect these changes.

Furthermore, data catalogs often lack the ability to integrate with other data management tools and systems. This can make it difficult for users to access the data they need and use it in their analysis and reporting.

Finally, and at a more fundamental level, in the real world everything is connected to everything else. Isolated datasets that do not connect to each other are often of limited use. More often than not, we need several connected datasets to extract any real business value.

We call this the data integration problem and it is a hard nut to crack. AI will play an increasingly important role in integrating our data, but if AI is to really help you, you must first teach it the specifics of your business. This initial tutelage is a human-centric process that is difficult to scale up within large organisations.

Therefore, to kickstart their AI and integrate at scale, organisations must perform a trick: they must invert the data integration problem. Organisations need to transfer the cost of data integration from the central teams collecting the datasets to all the applications that provide them.

The process of inverting data integration starts with creating precise, sharable definitions of the concepts that can deliver business value to your organisation. These standardised definitions are exactly what is missing from most data catalogs.

By providing a standardised vocabulary for structured data, schema.org makes it easier for webmasters to provide clear, precise information about their pages. Schema.org has successfully flipped the integration problem on its head. In another article I explain how to build your own schema.org so you can do the same thing within the boundaries of your organisation.

Surprisingly, having something as simple as a shared definition of what a data catalog actually is can enable standardisation, reduce dataset staleness, pave the way for integration with downstream tools and eventually deliver ‘Lego-like’ dataset connectivity.

Your own schema.org

Let’s ask schema.org for its sharable definition of what a data catalog is: https://schema.org/DataCatalog

We can see that a data catalog is a type of creative work that contains a collection of datasets. This definition is really precise: it gives us properties such as the editor, the publication date and the measurement technique. Some of those properties are complex types in their own right, and we can click on them to see their definitions too. For example, if we click on dataset then we get a detailed definition: https://schema.org/Dataset
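To make that concrete, here is a minimal sketch of what a catalog description using those properties could look like as JSON-LD (the names and values are invented purely for illustration):

```json
{
  "@context": "https://schema.org",
  "@type": "DataCatalog",
  "name": "Market Data Catalog",
  "editor": { "@type": "Person", "name": "Jane Smith" },
  "datePublished": "2022-12-09",
  "dataset": [
    { "@type": "Dataset", "name": "UK Government Bond Yields" }
  ]
}
```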

How can these definitions invert the cost of building and maintaining a data catalog? Well, this definition of a dataset is so precise that you can actually use it to share your datasets on the web. For example (at the time of writing) the Trading Economics website publishes a dataset about UK 10-year bonds in schema.org mark-up that is embedded as an island of JSON-LD within their webpage:
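A simplified sketch of that kind of mark-up is shown below; it sits inside a <script type="application/ld+json"> tag in the page’s HTML, and the values here are illustrative placeholders rather than Trading Economics’ actual mark-up:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "United Kingdom Government Bond 10Y",
  "description": "Yields on United Kingdom 10-year government bonds.",
  "publisher": { "@type": "Organization", "name": "Trading Economics" }
}
```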

You can see that this JSON-LD mark-up has a @type of Dataset and that the definition of this concept can be found in the @context of https://schema.org. In other words, the Trading Economics dataset conforms exactly to the definition published at https://schema.org/Dataset.

Trading Economics has taken on the cost of data integration in return for increased exposure to the search engines; they have done the hard work of mapping into the shared definition provided by schema.org. When a consumer like Google crawls the web and finds one of these datasets, it simply loads the pre-integrated data into its catalog. Job done.

Google can then expose that catalog for everyone to search. You can find Google’s experimental data catalog here, and if I search for 10-year UK bonds I get back a list of datasets including, at the time of writing, the one provided by Trading Economics.

Schema.org provides a sharable definition of a dataset and then individual websites make their datasets available to any search engine by conforming to that shared definition.

The datasets stay on the individual publishers’ web servers, but they are now all discoverable from a central location. This is a scaleless architecture: there is simply no limit to how many datasets you can add. It is this ability to scale the human effort that distinguishes this architectural pattern.

This pattern can be used inside your own organisation to create a data catalog, invert the cost of data integration and, ultimately, connect all of your data together! We can split the process of creating a decentralised data catalog into three phases.

Phase One: Decentralised Dataset Registration

The first phase represents quite a radical paradigm shift for most organisations. ALL applications within your organisation will need to provide a catalog of the important datasets they hold.

In this phase, the applications do not need to provide the data itself, just information about what their datasets contain. You can think of this as providing the menu rather than the meal.

The data catalog schema mark-up can be provided as an island of JSON-LD in the application’s website, returned from the application’s web API, published as a message on a Kafka topic, or even just written to a shared folder on the network. A web API is preferable, but all that really matters is that the data conforms to your shared definition of a data catalog and that the application registers where it can be found with the central team.
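As a sketch (the application name and URLs here are hypothetical), the catalog that a single internal application publishes might look something like this:

```json
{
  "@context": "https://schema.org",
  "@type": "DataCatalog",
  "@id": "https://trade-booking.example.internal/catalog",
  "name": "Trade Booking System Data Catalog",
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "https://trade-booking.example.internal/datasets/client",
      "name": "Clients",
      "description": "All clients that the trade booking system holds positions for."
    },
    {
      "@type": "Dataset",
      "@id": "https://trade-booking.example.internal/datasets/trade",
      "name": "Trades",
      "description": "All trades booked over the last seven years."
    }
  ]
}
```

Note that there are no download links yet: in phase one the application only describes what it holds and registers the location of this document with the central team.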

The central data catalog gives your organisation a place where everyone can search across all datasets in all catalogs, even though the catalogs themselves remain on the various applications’ servers. The storage and query mechanism for the central data catalog does not really matter, but it is simplest to use a Knowledge Graph, as knowledge graphs can ingest JSON-LD directly.

A word of warning: when you start building your data catalog, it will be very tempting to slip back into old habits. Convincing the application teams to publish pre-integrated data will be hard. Superficially, it seems much easier to simply write a couple of scripts yourself, or to build a central page where all the applications can record the details of their datasets. Admittedly, this is a trap that I have fallen into myself.

However, to integrate at scale you must invert the data integration problem by distributing the cost of data integration amongst the data publishers. Technologically, this problem is solved: schema.org has already proven that and, conveniently, you can simply copy their open-source code. In most organisations, though, the ‘data culture’ required for this approach is very far from solved. Sooner or later, you will have to find out whether or not your organisation is capable of such wide-scale cooperation. Not all organisations are, and it is better to fail fast. There is no point spending money on the technology if you can’t achieve the cultural change.

Success at the end of phase one will, in itself, be a significant achievement: a single specialised knowledge graph that indexes the individual data catalogs provided by a high percentage of the applications within your organisation. With this in place, you will be able to see the size of the integration task ahead of you. You will have a measurable roadmap that you can use to chart your organisation’s journey from its current state to full data connectivity. Perhaps most importantly, you will know that your organisation can collaborate at scale.

Phase Two: Modelling

So far, you have been able to reuse the shared dataset and data catalog definitions that have been developed by schema.org. To deliver real value, you will now need to model the specific concepts that impact your particular organisation’s bottom line. Schema.org may be able to give you a starting point, but eventually you will need to extend its model to fit your specific niche.
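As an illustration of what such an extension might look like (the class name and namespace below are hypothetical), your internal vocabulary can define its own terms in the same way that schema.org defines its terms, for example as an RDFS class that specialises an existing schema.org type:

```json
{
  "@context": {
    "schema": "https://schema.org/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "myorg": "https://schema.myorg.example/"
  },
  "@id": "myorg:TradingAccount",
  "@type": "rdfs:Class",
  "rdfs:subClassOf": { "@id": "schema:Thing" },
  "rdfs:label": "Trading Account",
  "rdfs:comment": "An account through which a client can place trades with the firm."
}
```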

The key here is to think big but deliver small. You need to pick a problem that has not been solved by existing IT efforts but still seems feasible: for example, some aspect of customer 360, or providing insight into the IT inventory of applications, servers and software.

Use your newly created data catalog to identify the datasets that will be needed to solve the problem. Problems with a clear business sponsor and 3 to 5 datasets from different authoritative sources make ideal candidates.

Next, you need to get the business users, data scientists and dataset publishers together. This group must work out the shared definitions that will need to be added to your internal schema.org website. Modelling the concepts can be hard, as it requires a deep level of collaboration. External standards can help here and, as a rule, favour simplicity and business understandability.

When the modelling is done, the data publishers should then attempt to break their datasets down so that they are well factored. This means that each dataset should correspond to one main concept defined in your schema.org. For example, you would have a person dataset, a building dataset, a client dataset and so on.

At the time of writing, schema.org lacks a property that can be used to link a dataset to the main concept that it relates to, so I suggest using the property conforms to for this purpose in the meantime.
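One way to express that suggestion (shown as a sketch below, with hypothetical dataset and concept URLs) is to borrow the Dublin Core term conformsTo, which the DCAT vocabulary uses to point a dataset at the standard it conforms to, and aim it at the relevant concept in your internal schema.org:

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "conformsTo": "http://purl.org/dc/terms/conformsTo"
  },
  "@type": "Dataset",
  "@id": "https://crm.example.internal/datasets/client",
  "name": "Clients",
  "conformsTo": { "@id": "https://schema.myorg.example/Client" }
}
```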

Try not to get bogged down in discussions about physical and logical models as this is the siren call of traditional data integration raising its voice again. In this ‘brave new world’ of inverted integration cost, we only care about the shared concepts defined in your schema.org. Each application should publish datasets that map into those definitions — it is as simple as that.

Success at the end of this phase is all the interested parties agreeing upon a model that contains precise, shareable definitions of all the concepts needed to solve the business problem.

Phase Three: Delivery

In phase three we finally move from reading the menu to eating the meal (at least for the data needed to solve this initial business problem). When providing a download of the data, each application must map into the shared concepts that its datasets conform to. Publishers can use the dataset’s distribution property to specify the location of that download:
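A sketch of what that could look like (the URLs are hypothetical), using schema.org’s DataDownload type for the distribution:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "@id": "https://crm.example.internal/datasets/client",
  "name": "Clients",
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "application/ld+json",
    "contentUrl": "https://crm.example.internal/datasets/client/download"
  }
}
```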

Again, perhaps some will be tempted to simply expose the application’s raw data in this download, but ignore the siren call and insist that each publisher exposes their distribution as JSON-LD, CSV-LD, Turtle or any other format that can map to the model defined in your schema.org.

Publishing applications are responsible for making sure that all the identifiers match up with those used in other datasets. Strong URL naming conventions and a central ‘lookup service’ can streamline this ‘entity linking’, something I will cover in a separate article.
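For example (a sketch with hypothetical URLs), two applications can refer to the same client simply by agreeing on its URL; when both downloads are loaded into the knowledge graph, the two statements meet at the same node and the datasets connect without any further mapping:

```json
[
  {
    "@context": "https://schema.org",
    "@id": "https://id.example.internal/client/42",
    "@type": "Organization",
    "name": "Acme Ltd"
  },
  {
    "@context": "https://schema.org",
    "@type": "Invoice",
    "customer": { "@id": "https://id.example.internal/client/42" }
  }
]
```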

With all the datasets available, you can now create a specific Knowledge Graph that loads the data needed to solve this business problem. It is now possible for the delivery team to build the queries, machine learning models and reports that will deliver genuine business value. Success at the end of this phase is the delivery of the use case and perhaps the production of an independent report that summarises the project for your CEO.

Rinse and Repeat

Now you return to phase two and repeat the process: pick another problem to solve, ideally one that has some overlap with the data you already have, and go again. Keep iterating, slowly ramping up by allowing projects to run concurrently. The beauty of this pattern is that different groups can all work at the same time on different parts of the distributed graph.

Use the percentage of datasets with downloadable distributions to measure your progress. By the time you reach 30% coverage there can be little doubt that you will be working in a transformed organisation.

  • Your Connected Data Catalog can be neatly organised by the very concepts that matter most to your organisation, so it will be much easier for users to find the data that they need.
  • There is much less chance of a decentralised data catalog becoming outdated, because the responsibility for publishing each dataset now lives with the same team that provides the data itself.
  • You can auto-generate the Semantic Layer of downstream reporting tools from the concepts defined in your schema.org, making it simple for users to access the data they need and use it in their analysis and reporting.
  • Most importantly, the datasets will all connect to each other, so users can extract real business value by combining several datasets together.

After a few iterations your Connected Data Catalog should begin to affect your machine learning projects, because your AI can use the data catalog to learn more about the specifics of your business.

Eventually your AI should begin automating the modelling and integration process itself. This will create an exponential reinforcing feedback loop that is capable of delivering truly transformative change.
