The Semantic Data Catalog: Unleash the Power of Ontologies and Vector Search to Navigate Your Data Mesh

Published in Labs Notebook · Feb 28, 2023

Keywords: Accenture Labs, Semantic Search, Data Catalog, Vector Search, Data Mesh

By: Nimrod Busany, Hananel Hadad, Dan Klein

As data becomes increasingly foundational to organizational operations, data catalogs have become crucial for managing and utilizing data assets. Data catalogs allow users to search for and discover relevant data based on various criteria, such as a data asset's name, its metadata, and associated business terms. Existing data catalogs often rely on simple string matching techniques, which limits their ability to understand the meanings and relationships of the concepts in the catalog. This can lead to missing potentially useful data assets when searching the catalog.

To add to the complexity, new paradigms for managing data, like the Data Mesh paradigm, require companies to build data products that are organized by domains and managed by different organizational units. This data restructuring introduces new challenges when searching for data, as data products created by different domain experts may express similar concepts differently.

The semantic data catalog offers a solution to the above challenges as it provides a framework for organizations to manage distributed data products, which can be effectively searched from a centralized public catalog using a semantic search engine.

In this blog, we introduce the semantic data catalog framework, which leverages the power of ontologies, ontology embeddings, and vector search to improve data discovery and management using semantic search.

Potential benefits of using a semantic data catalog include:

  • Improved search accuracy and relevance: By understanding the meanings of concepts and the relationships between them, a semantic data catalog can provide more accurate and relevant search results.
  • Enhanced data discovery: A semantic data catalog can help users find data assets that they might not have been aware of or that might not have been easily discoverable using traditional techniques.
  • Better data organization and classification: By leveraging ontologies as a key part of the semantic data catalog, it can assist in the organization and classification of data assets, leading to a more structured and consistent data landscape.
  • Enhanced data governance: By providing a clear understanding of the meanings and relationships of data concepts, a semantic data catalog can help organizations better manage and utilize their data.
  • Greater efficiency and productivity: By making it easier for users to find and access relevant data, a semantic data catalog can help to improve the efficiency and productivity of data-driven tasks.

Below, we present the general approach and provide example code to build a simple semantic data catalog. There are many challenges in building such a catalog; this blog will not solve all of them, but it will highlight the issues that need to be addressed.

Building a Semantic Data Catalog. We start with a high-level overview of the steps to build a semantic data catalog. First, we create an ontology catalog, which holds an ontology for each of the data assets. The ontologies represent the concepts and relationships within and between the data assets. Second, we train an ontology embedding model on the catalog to generate numerical vectors, or "embeddings", that capture the meanings and relationships of the concepts. Third, we load the embeddings into a vector search engine, which allows users to search for data assets within the catalog using textual search queries. Finally, given a query by the user (be it a data engineer, data scientist, or business user), we use the model to embed the query and use the vector search engine to retrieve the most relevant concepts. The figure below presents a high-level architecture of the Semantic Data Catalog.

Unlike traditional methods for searching a catalog based on string matching and pre-defined search criteria, the semantic data catalog can provide more accurate and relevant search results using ontology embedding techniques.

The embeddings can be tuned over time to accommodate changes in the catalog or to further improve the search results, allowing organizations to better utilize their existing data assets.

In this blog, we describe the main steps that are required to implement a semantic data catalog. The running examples are for illustration purposes and not meant to serve as an implementation of a full-fledged catalog.

Step I: Create an Ontology Catalog

Data catalogs typically hold various types of data assets, ranging from structured datasets (e.g., CSV tables), to highly normalized datasets (e.g., Third Normal Form), to semi-structured datasets (e.g., JSON schemas), to unstructured datasets (e.g., documents on a blob storage).

To be able to effectively search across such a large variety of data assets, we create an ontology per data asset. An ontology is a formal representation of a set of concepts and relationships within a domain, typically expressed using a formal language. In our context, we create ontologies to model the concepts and relationships within a data asset, and to link between concepts in different data assets. These ontologies will serve as the semantic layer on top of our data assets.
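To make the idea concrete, here is a minimal sketch of such a per-asset ontology, defined programmatically with owlready2 (the same library we use later for parsing). The data asset, IRI, class names, and properties are hypothetical and serve only to illustrate the semantic layer.

from owlready2 import get_ontology, Thing, ObjectProperty, DataProperty

# A toy ontology for a hypothetical "retail orders" data asset (IRI and names are made up)
onto = get_ontology("http://example.org/retail_orders.owl")

with onto:
    class Customer(Thing): pass
    class Product(Thing): pass
    class Order(Thing): pass

    # relationships between concepts within the asset
    class containsProduct(ObjectProperty):
        domain = [Order]
        range = [Product]

    class placedBy(ObjectProperty):
        domain = [Order]
        range = [Customer]

    # a data property describing a concept
    class orderDate(DataProperty):
        domain = [Order]
        range = [str]

# persist the ontology so it can be added to the ontology catalog
onto.save(file="retail_orders.owl", format="rdfxml")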

Creating an ontology for a data asset requires a data modeler and an expert who understands the data asset to work jointly until a suitable conceptual ontology is achieved. While this process may require significant human labour, it can be greatly accelerated by leveraging tools that automate the ontology creation process (e.g., Anzo and Stardog have some offerings) and by borrowing concepts from public ontology repositories (e.g., FIBO, D3FEND).

Running Example. In our running example, we will use an imaginary company with three data assets, which are represented by three public ontologies: Edas (academic domain), GoodRelations (commerce domain), and Pizza (food domain).

A high-level view of the D3FEND ontology in Protégé

Remark. While several languages for specifying ontologies exist, OWL (Web Ontology Language) by the W3C (World Wide Web Consortium) stands out due to its simplicity, available editors (e.g., Protégé), and support by data management platforms (Anzo, Stardog, Neo4j, TopBraid, to list a few). Most importantly in our context, several state-of-the-art ontology embedding techniques require the ontology to be written in OWL. This is crucial, as we demonstrate in the next section.

Step II: Train an Ontology Embedding Model on Your Catalog

After creating an ontology catalog, we can create an embedding for each of the concepts defined in our ontologies (classes, data properties, object properties, etc.). To this end, we can either use a pre-trained language model (Word2Vec, BERT) or an ontology embedding technique (OWL2VEC*, EL Embedding, Quantum Embedding).

In this blog, we will use OWL2VEC*, an open-source library by Oxford University. OWL2VEC* takes a catalog of ontologies and builds a model that embeds the ontologies. Details on how this is done, and access to the code can be found here.

We place the ontologies in the catalog into a single directory and train the model on the ontologies. After the training is done, the embedding model is stored into a file.

Running Example. We clone the project OWL2VEC* and place our example ontologies into the ./ontologies directory within the cloned project. We configure the tool to use Word2Vec, a pre-trained language model, and run the following command to fine-tune the model according to our ontologies.

# Clone the OWL2Vec* project and follow the installation instructions
# git clone https://github.com/KRR-Oxford/OWL2Vec-Star.git

# 1. Create a directory 'OWL2Vec-Star/ontologies' and place the ontologies in it.
# 2. Create an empty directory 'OWL2Vec-Star/output'.
# 3. In 'OWL2Vec-Star/default_multi.cfg':
#    a. Uncomment the line starting with "pre_train_model =" and set it to the path of a local pre-trained Word2Vec model (include all extracted files in the same directory)
#    b. Comment out the line starting with "cache_dir = "
#    c. Change the training parameters: embed_size = 200; epoch = 200

# The following line uses OWL2VEC* to fine-tune the Word2Vec model on all ontologies in ./ontologies, and saves the trained model in ./output/multi_word2vec
!python OWL2Vec_Standalone_Multi.py --ontology_dir ./ontologies --embedding_dir ./output/multi_word2vec --URI_Doc --Lit_Doc --Mix_Doc
# This may take a few minutes

# Install faiss, a vector search library by Facebook (used in Step III)
# !pip install --upgrade pip
# !pip install faiss-cpu

The resulting model is stored in ./output/multi_word2vec.

Remarks.

  1. By default, OWL2VEC* is trained to create embeddings that are used for link prediction between ontological concepts (inheritance and class membership). Still, we found its embeddings quite effective for search. This is likely due to the underlying language model (Word2Vec) that it uses, which was trained on the entire Wikipedia.
  2. The embedding produced by OWL2VEC* is highly affected by the language model and the ontology embedding configuration parameters. It is therefore advised to experiment until an appropriate embedding is produced.
  3. To test your embedding, you can create a small test set of queries and matching concepts. To measure the accuracy of the embedding, you may use common measures for search engines, such as Mean Reciprocal Rank and Hit Rate; see the sketch below.
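The snippet below is a minimal sketch of such a test, assuming the trained model has already been loaded (as shown in the next step). The test set is hypothetical: it maps queries to the concept IRIs we expect to retrieve. It ranks every IRI known to the model by cosine similarity to the averaged word embedding of the query and reports MRR and Hit Rate over the top 5 results.

import numpy as np

# Hypothetical test set: query -> IRI of the concept we expect to retrieve
test_set = {
    'spicy': 'http://www.co-ode.org/ontologies/pizza/pizza.owl#SpicyTopping',
    'onion': 'http://www.co-ode.org/ontologies/pizza/pizza.owl#OnionTopping',
}

def avg_embedding(text: str) -> np.ndarray:
    # average the vectors of the words the model knows
    vecs = [model.wv.get_vector(w) for w in text.lower().split() if w in model.wv.index_to_key]
    return np.mean(vecs, axis=0)

def rank_iris(query: str) -> list:
    # rank every IRI in the model's vocabulary by cosine similarity to the query embedding
    q = avg_embedding(query)
    scores = {}
    for key in model.wv.index_to_key:
        if not key.startswith('http'):  # skip plain words, keep entity IRIs
            continue
        v = model.wv.get_vector(key)
        scores[key] = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
    return sorted(scores, key=scores.get, reverse=True)

k, mrr, hits = 5, 0.0, 0
for query, expected in test_set.items():
    top_k = rank_iris(query)[:k]
    if expected in top_k:
        hits += 1
        mrr += 1.0 / (top_k.index(expected) + 1)
print(f"MRR@{k}: {mrr / len(test_set):.2f}  Hit Rate@{k}: {hits / len(test_set):.2f}")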

Step III: Load the Ontology Embeddings to a Vector Search Engine

After creating an embedding for our ontologies, we are ready to load the model and the ontologies into a vector search engine. To this end, we can either use a vector search package like FAISS or a full-fledged vector database like Pinecone. In a vector search engine, we create searchable indices. Each index stores a set of vectors that can be searched efficiently.

To populate our index, we iterate over the entities of our ontologies and use the model to generate embeddings which are added to the index.

Running Example. In the following code snippet, we create an index that includes the classes of all ontologies, as well as a separate index per ontology.

First, we define some helper functions to normalize and embed text.

import re
from typing import List, Tuple, Dict, Union
import numpy as np
from gensim.models.word2vec import Word2Vec

# split a string in camel notation into a list of words, and convert it to lower case as words in our model are lower-cased
def parse_camel_plus(string: str) -> List[str]:
    matches = re.finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', string)
    words = [m.group(0) for m in matches]
    return [word.lower() for word in words]

# produce an embedding for a sentence by averaging the word embeddings
def embed_text(sentence: Union[str, List[str]], model: Word2Vec) -> np.ndarray:
    # make sure the sentence is a list of words
    if isinstance(sentence, str):
        sentence = sentence.split()

    # init a vector of zeros to sum word embeddings, and define a known_words counter
    emb_sum = np.zeros(model.vector_size, dtype=np.float32)
    known_words = 0

    # sum up embeddings of words known by the model
    for word in sentence:
        if word in model.wv.index_to_key:
            emb_sum += model.wv.get_vector(word)
            known_words += 1

    # if there was at least one known word, return the average embedding of all words
    if known_words:
        return emb_sum / known_words
    else:
        raise ValueError("no words recognized in the given sentence")

Then, we prepare the embeddings. We load the model and the ontologies. We use owlready2 to parse and extract the classes from the ontologies, and use the model to embed the classes. We create two embeddings for each class, one based on the encoding of its IRI, and another based on its label.

import os
import re
import gensim
import owlready2

# use the environment in which OWL2VEC* is installed and load the model and ontologies
model = gensim.models.Word2Vec.load('./output/multi_word2vec')

ontologies: List[owlready2.Ontology] = [owlready2.get_ontology('./ontologies/' + onto_path).load()
                                        for onto_path in os.listdir('./ontologies')]

# Create an embedding dictionary per ontology.
# Map each ontology to a list of mappings for its classes, from class identifiers (IRIs) to vector embeddings.
# Create two embeddings per class: one based on the IRI and one based on its label.
onto_iri2embeds: Dict[str, List[Tuple[str, np.ndarray]]] = dict()
for onto in ontologies:
    onto_iri2embeds[onto.base_iri] = list()
    for cls in onto.classes():
        onto_iri2embeds[onto.base_iri].append((cls.iri, model.wv.get_vector(cls.iri)))
        try:
            cls_name = ' '.join(parse_camel_plus(cls.name))
            onto_iri2embeds[onto.base_iri].append((cls.iri, embed_text(cls_name, model)))
        except ValueError:
            pass

Finally, we create several search indices to support textual search by criteria. Each index is created using FAISS.

import faiss
from faiss.swigfaiss import IndexFlatIP
from itertools import chain

# Create a faiss index out of a list of embedding mappings
def create_index(embeddings: List[Tuple[str, np.ndarray]], model: Word2Vec) -> IndexFlatIP:
    # make a 2D array of all embeddings
    vectors = [embedding[1] for embedding in embeddings]
    embeddings_2d = np.stack(vectors, axis=0)
    # create a faiss search index, and add the given embeddings to it
    search_index = faiss.IndexFlatIP(model.vector_size)
    search_index.add(embeddings_2d)
    return search_index

# Create a dictionary of faiss indices:
# one index for all classes, and one index per ontology
name2faiss_index: Dict[str, IndexFlatIP] = dict()
all_embeddings = list(chain.from_iterable(onto_iri2embeds.values()))
name2faiss_index['All'] = create_index(all_embeddings, model)
for ontology_iri in onto_iri2embeds:
    name2faiss_index[ontology_iri] = create_index(onto_iri2embeds[ontology_iri], model)

Remarks.

  1. To support filtered search, we defined multiple indices. Pinecone offers a vector database that supports filtered search out of the box, which eliminates the need to create a dedicated index per filter. A simple alternative using a single FAISS index is sketched after these remarks.
  2. To include other ontological entities (such as individuals, data properties, or object properties), simply add their embeddings to the index.
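As an illustration of the first remark, here is a minimal sketch of filtered search over a single FAISS index: instead of building an index per ontology, we keep per-vector metadata on the side and filter the results after the search. The helper name and the over-fetch factor are our own choices, not part of OWL2VEC* or FAISS.

# metadata aligned with the 'All' index: the ontology each embedding came from
embedding_ontology = [onto_iri
                      for onto_iri, embeds in onto_iri2embeds.items()
                      for _ in embeds]

def filtered_search(query: str, ontology_iri: str, k: int = 5) -> List[str]:
    # over-fetch from the single index, then keep only results from the requested ontology
    query_emb = embed_text(query, model).reshape(1, -1)
    _, indexes = name2faiss_index['All'].search(query_emb, k * 10)
    results = []
    for ind in indexes[0]:
        if ind != -1 and embedding_ontology[ind] == ontology_iri:
            results.append(all_embeddings[ind][0])
        if len(results) == k:
            break
    return results

# example: restrict a query to the pizza ontology without a dedicated index
print(filtered_search('onion', 'http://www.co-ode.org/ontologies/pizza/'))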

Step IV: Build a wrapper around your search engine indices and run queries

To expose our search engine, we can implement an API with query hooks or a CLI that returns the top matching ontological entities for a textual query.

Given a search query and a search criterion, we retrieve the most relevant entities from the matching index. We will use our model to embed the query and use the search engine to return the top matching entities.

Running Example. Below is a simple code snippet that runs a textual query using our trained model with the relevant index.

First, we use our helper function and embed the query. Then, we use the relevant FAISS index (“All” in our example) to get the top matching entities.

Here is an example query over all the catalog ontologies:

# query = 'spicy' over the entire catalog
query = 'spicy'
query_emb = embed_text(query, model).reshape(1, -1)  # reshape because that's how faiss expects to receive it
scores, indexes = name2faiss_index['All'].search(query_emb, 5)  # faiss returns similarity scores and indexes of the closest vectors
results = [all_embeddings[ind][0] for ind in indexes[0]]  # get entity IRIs of the closest vectors
for result in results:
    print(result)
http://www.co-ode.org/ontologies/pizza/pizza.owl#SpicyTopping
http://www.co-ode.org/ontologies/pizza/pizza.owl#SpicyPizza
http://www.co-ode.org/ontologies/pizza/pizza.owl#SauceTopping
http://edas#MealMenu
http://www.co-ode.org/ontologies/pizza/pizza.owl#Food

As can be seen, we get four hits from the pizza ontology and one from the edas ontology.

Here is another query, 'onion', this time searched only over the pizza ontology.

# query = 'onion' over the pizza ontology only
query = 'onion'
query_emb = embed_text(query, model).reshape(1, -1)  # reshape because that's how faiss expects to receive it
scores, indexes = name2faiss_index['http://www.co-ode.org/ontologies/pizza/'].search(query_emb, 5)  # faiss returns similarity scores and indexes of the closest vectors
ontology_embeddings = onto_iri2embeds['http://www.co-ode.org/ontologies/pizza/']
results = [ontology_embeddings[ind][0] for ind in indexes[0]]  # get entity IRIs of the closest vectors
for result in results:
    print(result)
http://www.co-ode.org/ontologies/pizza/pizza.owl#OnionTopping
http://www.co-ode.org/ontologies/pizza/pizza.owl#RedOnionTopping
http://www.co-ode.org/ontologies/pizza/pizza.owl#GarlicTopping
http://www.co-ode.org/ontologies/pizza/pizza.owl#ChickenTopping
http://www.co-ode.org/ontologies/pizza/pizza.owl#Soho
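Putting the pieces together, below is a minimal sketch of the CLI wrapper mentioned at the start of this step. The function and argument names are our own, for illustration only; a production catalog would expose this behind a proper API.

import argparse

def search_catalog(query: str, index_name: str = 'All', top_k: int = 5) -> List[str]:
    # embed the query and return the IRIs of the top matching entities from the chosen index
    query_emb = embed_text(query, model).reshape(1, -1)
    _, indexes = name2faiss_index[index_name].search(query_emb, top_k)
    embeddings = all_embeddings if index_name == 'All' else onto_iri2embeds[index_name]
    return [embeddings[ind][0] for ind in indexes[0] if ind != -1]

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Semantic data catalog search')
    parser.add_argument('query', help='free-text search query')
    parser.add_argument('--index', default='All', help="index name: 'All' or an ontology IRI")
    parser.add_argument('--top-k', type=int, default=5, help='number of results to return')
    args = parser.parse_args()
    for iri in search_catalog(args.query, args.index, args.top_k):
        print(iri)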

Challenges

In this blog, we presented the main steps to build a semantic data catalog and included a running example for illustration. We added remarks about considerations in the different steps. Still, it is important to note that there are many more considerations and challenges that need to be addressed. First, training a high-quality ontology embedding model and maintaining its accuracy as new data assets are added requires time and computational resources. Additionally, evaluating the effectiveness of the embeddings and updating the search indices as the catalog evolves is key to ensuring that relevant results are retrieved. Addressing these challenges can require significant effort. We hope that some of them will be addressed over time as vector search databases become more mature technologies. Still, these challenges must be considered and addressed in order to fully realize the potential of the semantic data catalog.
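To give a flavour of the index maintenance challenge, here is a minimal sketch of adding a newly onboarded data asset to the catalog without rebuilding everything. It assumes the new ontology's class labels are already covered by the trained model's vocabulary; in practice, genuinely new vocabulary typically means re-training or fine-tuning the embedding model as well.

# Sketch: register a new ontology and extend the existing indices (assumes its class labels
# are already in the trained model's vocabulary; otherwise the model must be re-trained)
def add_ontology_to_catalog(onto_path: str) -> None:
    onto = owlready2.get_ontology(onto_path).load()
    new_embeds: List[Tuple[str, np.ndarray]] = []
    for cls in onto.classes():
        try:
            cls_name = ' '.join(parse_camel_plus(cls.name))
            new_embeds.append((cls.iri, embed_text(cls_name, model)))
        except ValueError:
            pass  # skip classes whose labels are entirely unknown to the model
    if not new_embeds:
        return

    # extend the catalog-wide structures and the 'All' index
    onto_iri2embeds[onto.base_iri] = new_embeds
    all_embeddings.extend(new_embeds)
    name2faiss_index['All'].add(np.stack([e[1] for e in new_embeds], axis=0))

    # and give the new ontology its own index
    name2faiss_index[onto.base_iri] = create_index(new_embeds, model)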

Concluding remarks. In this blog, we described the semantic data catalog, which offers a powerful and effective way to manage and discover data within a Data Mesh. Using ontologies assists in the organization and classification of data assets and improves data governance, while ontology-based vector search can provide more accurate and relevant search results to improve data utilization. The semantic data catalog also poses some new challenges, some of which we listed in this blog. If you are looking to improve your data management strategy, consider implementing a semantic data catalog to realize the full potential of your data assets. Whether you are a data engineer, data scientist, or business user, a semantic data catalog can provide a wealth of benefits that help you better leverage your data to drive insights and success.

Many thanks to Eslin Malki and Gil Rosenblum for their valuable feedback.
