Data catalogs. Part 1. The Spectrum of data catalogs

Ivan Begtin
4 min read · Jun 5, 2022

Photo by Aleksi Tappura on Unsplash

In the following short articles, I will write about data catalogs in the corporate, scientific, and open data worlds.

The same term, data catalog, is used in very different contexts: scientific data repositories, open data portals, and corporate metadata catalogs. It’s important to know the differences and similarities between them.

There are three types of data catalogs:

  • Scientific data repositories: limited control over primary data sources, achieved through consensus and standardization of data, the gradual creation of reference databases, the formation of ontologies, etc.
  • Catalogs of open government data: focused on open data, mostly common government data, with weak control over primary sources and, as a result, low data quality and heavy “littering” with useless data that has no practical application.
  • Corporate [meta]data catalogs: high control over primary sources, significant work on curating them, access restricted to tightly limited internal ecosystems, and, in some cases, partial publicity through “data marts”.

Scientific data repositories

Scientific data warehouses, data hubs, and data-sharing initiatives have long been typical of science. The need for reproducible research and for discoveries made jointly by scientists from many countries gave rise to many scientific data projects.

One of the earliest scientific initiatives was the World Data Centre, launched in 1957–1958.

The World Data Centre (WDC) system was created to archive and distribute data collected from the observational programmes of the 1957–1958 International Geophysical Year by the International Council of Science (ICSU). Wikipedia

Other projects evolved in parallel and later. Today, most of them are cataloged in the re3data registry.

Re3data is a global registry of research data repositories that covers research data repositories from different academic disciplines. It includes repositories that enable permanent storage of and access to data sets to researchers, funding bodies, publishers, and scholarly institutions. re3data promotes a culture of sharing, increased access and better visibility of research data. The registry has gone live in autumn 2012 and has been funded by the German Research Foundation (DFG). About Re3Data

Today, there are many data catalogs, data repositories, and portals created by scientific organizations. Most of them were built with existing repository software, much of it open-source, like Dataverse, DSpace, EPrints, Fedora, CKAN, and Nesstar. Quite often, scientific data repositories evolved from repositories of scientific publications; EPrints and DSpace are examples of software with that origin.

Most of them have the following features:

  • permanent identifiers, like DOI;
  • scientific/bibliographic metadata standards and protocols like Dublin Core, OAI-PMH, the DataCite schema, DDI, and many others;
  • support for any type of data and documents, including highly specialized scientific file formats.
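To make the first two features more concrete, here is a minimal sketch of how a client might address such a repository: one helper builds an OAI-PMH harvesting request that asks for Dublin Core records, the other turns a DOI into a resolvable link. The endpoint `https://repository.example.org/oai` and the DOI `10.1234/example.dataset` are hypothetical placeholders, and the helper names are my own; real repositories publish their own OAI-PMH base URLs.

```python
from urllib.parse import urlencode, quote


def oai_pmh_url(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL (the protocol passes arguments as query parameters)."""
    query = {"verb": verb, **params}
    return f"{base_url}?{urlencode(query)}"


def doi_url(doi: str) -> str:
    """Turn a DOI into a link that goes through the doi.org resolver."""
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical endpoint and DOI, used only for illustration.
BASE = "https://repository.example.org/oai"

# "ListRecords" with metadataPrefix "oai_dc" asks for records in Dublin Core.
list_url = oai_pmh_url(BASE, "ListRecords", metadataPrefix="oai_dc")
print(list_url)
print(doi_url("10.1234/example.dataset"))
```

The sketch only constructs the URLs; actually fetching and parsing the XML response is left out on purpose.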

Catalogs of open [government] data

Catalogs of open data are also often referred to as open data portals. Government agencies and civic hackers around the world have created hundreds of such data catalogs with the following principles in mind:

  • open by default
  • open licenses
  • free reuse of data

International initiatives like the Open Data Charter and the Open Government Partnership were launched in the last decade to support government openness through data.

Many data portals were created using open-source software like CKAN, DKAN, JKAN, uData, etc. Others use data platforms like Socrata or Opendatasoft.

Open data portals have some similarities with scientific data repositories. Some of them use the same software, like CKAN. They also support most existing data file formats, as well as non-data files.

The open data policies of many governments include rules to publish data in whatever format it exists, so even data that existed only as PDF files with tables was published on open data portals. The quality (maturity) of open data is measured by the 5-star open data scheme and by the European Union’s Open Data Maturity report.
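As a rough illustration of why PDF tables score poorly, the sketch below maps a file name to a level in the 5-star open data scheme. The format-to-star table and the function name are my own simplification: the real scheme's fourth and fifth stars depend on using URIs and linking to other data, not on the file extension alone, so this is only a first approximation.

```python
# Simplified, assumed mapping from file extension to a 5-star open data level.
# Stars 4 and 5 really depend on URIs and links inside the data, so RDF-style
# formats are capped at 4 here.
STARS_BY_FORMAT = {
    "pdf": 1, "jpg": 1, "png": 1,        # on the web, but not machine-structured
    "xls": 2, "xlsx": 2,                 # structured, proprietary format
    "csv": 3, "json": 3, "xml": 3,       # structured, open format
    "rdf": 4, "ttl": 4, "jsonld": 4,     # formats built around URIs
}


def star_level(filename: str, open_license: bool = True) -> int:
    """Estimate the star level of a published file from its extension."""
    if not open_license:
        return 0  # without an open license the scheme does not apply
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    # Unknown formats still earn one star: published under an open license.
    return STARS_BY_FORMAT.get(ext, 1)


for name in ["budget.pdf", "budget.xlsx", "budget.csv"]:
    print(name, star_level(name))
```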

Open data portals rarely support scientific data repository features like permanent identifiers and scientific metadata standards. Instead, separate metadata standards like DCAT were created by the W3C.

The genealogy of open data portals is well explained by Tim Davies in the article “Technology: A genealogy of data portals”.
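For readers who have not seen DCAT, here is a minimal sketch of a dataset description expressed as JSON-LD. The `dcat:` and `dct:` terms are part of the DCAT and Dublin Core Terms vocabularies; the function name, the dataset, and all `example.org` URLs are hypothetical, and real portals usually publish many more fields.

```python
import json


def dcat_dataset(title, description, keywords, download_url, media_type_uri, license_url):
    """Build a minimal DCAT dataset record as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        # A distribution is one concrete, downloadable form of the dataset.
        "dcat:distribution": [{
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
            "dcat:mediaType": {"@id": media_type_uri},
            "dct:license": {"@id": license_url},
        }],
    }


# Hypothetical dataset, used only for illustration.
record = dcat_dataset(
    title="City budget 2022",
    description="Planned city expenses by department.",
    keywords=["budget", "finance"],
    download_url="https://data.example.org/budget-2022.csv",
    media_type_uri="https://www.iana.org/assignments/media-types/text/csv",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(record, indent=2))
```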

Corporate [meta]data catalogs

In recent years, we have seen the rise of corporate data catalogs. Though they are called “data catalogs”, most of them are actually metadata catalogs, with many features that help to collect, document, and analyze multiple data sources and to extract metadata about databases, tables, table columns, and other data and data-processing artifacts like data pipelines, dashboards, etc.

There are many commercial products like Collibra, Atlan, Castor, Alation, and many others. There are also many open-source tools like Amundsen, DataHub, and OpenMetadata.

All of them are database-focused. Even the simplest of the corporate metadata catalogs support dozens of database engines, and most of them support SQL databases.
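The core idea of harvesting metadata from a SQL database can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module and an in-memory database with two made-up tables; it is not how any of the products above is implemented, only an illustration of reading table and column metadata instead of the data itself.

```python
import sqlite3


def harvest_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Hypothetical schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

catalog = harvest_metadata(conn)
for table, columns in catalog.items():
    print(table, columns)
```

Other database engines expose similar information in their own ways, which is part of why a catalog needs a separate connector for each engine it supports.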

Corporate metadata catalogs are not public. They are created for internal use within a company and are accessed by its data analytics and data science teams. Some of these catalogs have advanced features like:

  • integration with modern data stack;
  • automatic data documentation;
  • semantic data types identification.
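Semantic data type identification, the last feature in the list, can be approximated with simple pattern matching over sampled column values. The sketch below is a deliberately naive illustration: the three patterns, the 90% threshold, and the function name are my own assumptions, and production tools typically combine many more rules, dictionaries, or trained models.

```python
import re

# Assumed, intentionally loose patterns for three semantic types.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "url": re.compile(r"^https?://\S+$"),
}


def guess_semantic_type(values, threshold=0.9):
    """Return the first semantic type matched by at least `threshold` of the non-empty values."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return None
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(v))
        if hits / len(samples) >= threshold:
            return name
    return None


# Hypothetical column samples, used only for illustration.
print(guess_semantic_type(["a@example.org", "b@example.com"]))
print(guess_semantic_type(["2022-06-05", "2021-01-31"]))
print(guess_semantic_type(["hello", "world"]))
```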

Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, Data, the Modern Data Stack, and Open Government. Join my Telegram channel: https://t.me/begtin