Dataset search engines as global data discovery tools

Ivan Begtin
4 min read · May 19, 2022

--

Search engines have a long history. You can easily find text, web pages, images, video, news, and other content using global search engines like Google or Bing.

You can also use lesser-known search engines like You.com or DuckDuckGo.

But only a few search engines support searching for datasets. The first of them is Google Dataset Search.

Google Dataset Search

Search for “COVID-19”

It’s a research project created by Google in September 2018 and it’s been out of beta since January 2020.

It’s one of the most complete dataset search engines, and at its core is the Schema.org Dataset type.

The dataset definition is embedded as JSON data (typically JSON-LD) inside the HTML page of each dataset.

Schema.org dataset definition of the https://data.gov.uk/dataset/11581656-7cfc-4ee5-967b-0cc04fcae06b/vocational-qualifications dataset
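To make the mechanism concrete, here is a minimal sketch of how a crawler might pull Schema.org Dataset definitions out of a page. It assumes the definition is published as JSON-LD in a script tag, and it only handles the simple case where "@type" is the plain string "Dataset". The class and function names and the sample page are made up for illustration; this is not how Google's crawler is implemented.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_datasets(html):
    """Return all objects with "@type": "Dataset" found in JSON-LD blocks."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    datasets = []
    for block in parser.blocks:
        try:
            payload = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        items = payload if isinstance(payload, list) else [payload]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Dataset":
                datasets.append(item)
    return datasets


# Made-up sample page, not taken from any real portal.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset", "description": "A made-up record."}
</script>
</head><body></body></html>
"""

found = extract_datasets(SAMPLE_HTML)
print(found[0]["name"])
```

A page without such a block yields an empty list, which is exactly why portals without Schema.org markup stay invisible to this kind of indexing.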

I can’t say how complete the Google Dataset search index is, but it’s quite big, and the most common indexed sources are:

And many others. All of them export a Schema.org Dataset definition for each available dataset.

But you will not find datasets from any open data portal that lacks Schema.org support. One example is data.gov.in, the open data portal of the Indian government. It’s quite big, with more than 534,118 resources, and it’s indexed by Google’s general search engine, but it’s not indexed by Google Dataset Search.

Google dataset search for data.gov.in

The same is true of the Russian government’s open data portal data.gov.ru, the Indonesian data portal data.go.id, and many other government and non-profit open data portals.

The idea that site owners should describe datasets themselves using Schema.org has its limitations. Not every data portal owner knows about this standard, and not every dataset catalog software supports Schema.org definitions.

DataCite Search

Another search engine is DataCite Search (search.datacite.org). It’s much more focused on research use by scientists.

DataCite Search for COVID-19

DataCite assigns DOI identifiers to digital resources provided by its users and indexes research repositories for texts and data. It has a metadata schema to describe datasets and other digital artifacts related to scientific papers.

DataCite indexes data from scientific data repositories like Zenodo, UBC Library Open Collections, Data Inrae, and Harvard Dataverse.

But it doesn’t index datasets outside the DOI ecosystem, so commercial datasets and open government datasets are not covered by DataCite Search.

Search engines limitations

Neither search engine indexes all available datasets, and neither indexes dataset contents. Both search databases are built from basic metadata about a dataset. DataCite uses data repository metadata and its own metadata schema, while Google Dataset Search uses the Schema.org Dataset definition.

But you can’t:

  • search field-level metadata, for example, to find data with a certain data type or field name
  • search dataset contents, which are not indexed right now by any search engine.
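As a rough illustration of what field-level search could look like, the sketch below builds a tiny inverted index from field names and data types to dataset identifiers. The catalog contents and the function names are hypothetical; a real engine would first have to extract this schema information from the data files or from portal APIs.

```python
from collections import defaultdict


def build_field_index(catalog):
    """Map ("name", field_name) and ("type", data_type) keys to dataset ids."""
    index = defaultdict(set)
    for dataset_id, fields in catalog.items():
        for name, dtype in fields:
            index[("name", name.lower())].add(dataset_id)
            index[("type", dtype.lower())].add(dataset_id)
    return index


def search_fields(index, name=None, dtype=None):
    """Return ids of datasets that have a field with the given name and/or type.

    When both criteria are given, they are matched at dataset level:
    the name and the type may belong to different fields of one dataset.
    """
    results = None
    for kind, value in (("name", name), ("type", dtype)):
        if value is None:
            continue
        matches = index.get((kind, value.lower()), set())
        results = matches if results is None else results & matches
    return sorted(results or [])


# Hypothetical catalog: dataset id -> list of (field name, data type).
CATALOG = {
    "schools": [("school_name", "string"), ("postcode", "string"), ("pupils", "integer")],
    "hospitals": [("hospital_name", "string"), ("postcode", "string"), ("beds", "integer")],
    "budget": [("year", "integer"), ("amount", "decimal")],
}

index = build_field_index(CATALOG)
print(search_fields(index, name="postcode"))  # ['hospitals', 'schools']
print(search_fields(index, dtype="decimal"))  # ['budget']
```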

Even metadata search is of limited practical use. It’s easy to cite a dataset that you find, but it’s not as easy to use it. The dataset description could be incomplete, and downloading the data may require authentication or other additional steps.

Data discovery tools are different

A lot of data discovery catalogs exist and are used for corporate needs. Many startups and open source solutions like OpenMetadata, Amundsen, and DataHub help to discover and document database tables and other data engineering artifacts such as workflows, visualizations, and so on. Some of them support file-based datasets too. But there is nothing like that in the world of open data portals or scientific data repositories.

You can search for certain field names, types, and semantic types (glossary terms) in a corporate data catalog, but you can’t do the same for open data. Global data search engines like DataCite Search and Google Dataset Search don’t help with this task.

But it isn’t impossible to implement. The major open data portal products have API specifications: the CKAN API, the Opendatasoft API, and the Socrata API. For example, the Socrata API is actively used by the startup Splitgraph for its public data catalog, with data structured as Postgres databases.
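For instance, CKAN portals expose a package_search action that returns dataset metadata, including the list of resources and their formats. The sketch below builds such a request URL and summarizes a response. The portal address and the sample response are invented for illustration, and the response handling only covers the handful of fields used here.

```python
import json
from urllib.parse import urlencode


def package_search_url(portal_url, query, rows=10):
    """Build a CKAN Action API URL for a dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_results(response_text):
    """Extract name, title and resource formats from a package_search response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        return []
    summary = []
    for package in payload["result"]["results"]:
        formats = sorted(
            {r["format"] for r in package.get("resources", []) if r.get("format")}
        )
        summary.append(
            {"name": package.get("name"), "title": package.get("title"), "formats": formats}
        )
    return summary


# Hypothetical portal address; replace with a real CKAN instance to try it.
url = package_search_url("https://ckan.example.org/", "covid", rows=5)
print(url)

# Made-up response trimmed to the fields used above.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{
            "name": "covid-cases",
            "title": "COVID-19 cases",
            "resources": [{"format": "CSV"}, {"format": "JSON"}, {"format": "CSV"}],
        }],
    },
})
print(summarize_results(SAMPLE_RESPONSE))
```

Fetching the URL with any HTTP client and passing the body to summarize_results would give a harvester the kind of extended metadata that Schema.org-based indexing misses.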

Search engine as a global data discovery tool

In my view, a search engine for datasets should be a data discovery tool. Searching basic metadata is not enough. Extended search should offer the same options as popular corporate data catalogs:

  • Search for extended metadata, extracted from data catalog APIs
  • Search for text/data inside datasets.

Also, search engines shouldn’t be limited to scientific data or to the Schema.org standard. Any global data discovery search engine should support all the ways of indexing datasets: Schema.org, known protocols/APIs, DCAT, OAI-PMH, and many other standards, protocols, and specifications.

I think that it’s important to integrate semantic data types into search engines and to be able to search data with a certain semantic data type.
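A very simplified sketch of the idea: guess a semantic type for a column by checking how many of its values match a known pattern. The three patterns, the labels, and the 90% threshold are arbitrary assumptions chosen for illustration; real semantic type detection needs far more rules and language-specific dictionaries.

```python
import re

# Illustrative patterns only; deliberately loose and far from complete.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def guess_semantic_type(values, threshold=0.9):
    """Return the first label whose pattern matches at least `threshold`
    of the non-empty values, or None if nothing matches well enough."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for label, pattern in SEMANTIC_PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.match(v))
        if hits / len(non_empty) >= threshold:
            return label
    return None


print(guess_semantic_type(["a@example.org", "b@example.com"]))  # email
print(guess_semantic_type(["2022-05-19", "2020-01-23", ""]))    # iso_date
print(guess_semantic_type(["hello", "world"]))                  # None
```

Once each column carries such a label, the label can be indexed like any other field-level attribute and used as a search filter.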

Please share your vision of how dataset search engines should work and what is important for your data discovery tasks.

--

Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, Data, Modern Data Stack, and Open Government. Join my Telegram channel https://t.me/begtin