Semantic Search via Knowledge Graphs in Python

All you need to gain a basic intuition on why you should care about them + a Python spaCy snippet

Witold Wydmanski
6 min readDec 10, 2022

Semantic search is an art similar to fuzzy search — with the main difference being that instead of searching for an approximate string match, semantic search looks for an approximate concept match. An ideal semantic search engine wouldn’t stop at looking past mere spelling errors — it would be able to understand the intent of the query, even if the queried documents didn’t contain the specific phrases used to formulate the information need, or if those phrases appeared in different contexts as false friends.

This blog post is structured as follows:

  • Semantic search and why it matters
  • What are knowledge graphs?
  • Building knowledge graphs in Python
  • PubMed’s application of knowledge graphs
  • Conclusion
A knowledge graph, at least according to Stable Diffusion

Semantic search and why it matters

To give a concrete example of what semantic search should accomplish, consider the following:

  • Query: lesion
  • Collection:
    – D1: Tumor microenvironment
    – D2: To mock a mockingbird
    – D3: Early settings of the Kyrie eleison and the problem of genre definition
    – D4: Second neoplasm in patients with head and neck cancer
  • Relevance function R:
    – R(D1) := 1
    – R(D2) := 0
    – R(D3) := 0
    – R(D4) := 1

For a human it’s a fairly simple query, the only prerequisites being the knowledge that tumor is another word for neoplasm, and that both are subcategories of lesions (and that lesion has nothing to do with eleison, despite the small Levenshtein distance). However, we still need a method of encoding this knowledge inside our search function.
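
To make this concrete, here is a minimal sketch of what such an encoding could look like. A hand-written concept table stands in for the knowledge graphs discussed below; the RELATED table and relevance function are toy assumptions, not a real search engine:

# A toy concept table standing in for a real knowledge source
RELATED = {
    "lesion": {"lesion", "tumor", "neoplasm"},
}

docs = {
    "D1": "Tumor microenvironment",
    "D2": "To mock a mockingbird",
    "D3": "Early settings of the Kyrie eleison and the problem of genre definition",
    "D4": "Second neoplasm in patients with head and neck cancer",
}

def relevance(query, text):
    # Expand the query to related concepts, then do a plain substring match
    concepts = RELATED.get(query.lower(), {query.lower()})
    return int(any(concept in text.lower() for concept in concepts))

for doc_id, title in docs.items():
    print(doc_id, relevance("lesion", title))  # 1 for D1 and D4, 0 for D2 and D3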

What are knowledge graphs?

One of the most useful forms of knowledge representation is a knowledge graph. A knowledge graph consists of a set of labeled nodes and a set of labeled edges connecting them. Each edge is labeled with a property — for example, acne could be connected to neoplasm via the fact that both are subcategories of lesions.

A knowledge graph with non-transitive relations. Even though Mr A is president of Country B and Citizen C, he doesn’t have a defined relationship with Country D

A knowledge graph can contain some additional information as well, such as the type of relationship between nodes (for example, whether one node is an instance of another). It can also be annotated with metadata such as who created it or where it was created. The relationships are rarely symmetric, and they can be either transitive or non-transitive. This means that we not only need to differentiate between the two ends of an edge by making it directed, but we also have to store information about the mechanics of a specific graph in its metadata.
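
A minimal sketch of such a graph, using networkx for directed, labeled edges and graph-level metadata (the relation names below are illustrative assumptions):

import networkx as nx

# Directed edges keep the two sides of each relationship apart
kg = nx.DiGraph(created_by="example curator")  # graph-level metadata

kg.add_edge("tumor", "neoplasm", label="synonym_of")
kg.add_edge("neoplasm", "lesion", label="subcategory_of")
kg.add_edge("acne", "lesion", label="subcategory_of")

# Which relations behave transitively is a property of this particular
# graph, so we record it in the metadata as well
kg.graph["transitive_relations"] = {"subcategory_of"}

print(list(kg.edges(data=True)))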

Fragment of PubMed’s knowledge graph of MeSH headings. The relationships between nodes are unlabeled, but transitive. Ear is a descendant of all the nodes above it — so not only Sense Organs and Head, but also Body Regions and Anatomy. Lumbosacral Region, on the other hand, is a descendant of Body Regions, but not of Anatomy.
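
With a transitive relation, ancestry can be queried directly; here is a sketch in networkx, with the edge set reconstructed from the caption’s description rather than from MeSH itself:

import networkx as nx

# An edge X -> Y means "Y belongs to X"
mesh_fragment = nx.DiGraph()
mesh_fragment.add_edges_from([
    ("Anatomy", "Sense Organs"),
    ("Sense Organs", "Ear"),
    ("Body Regions", "Head"),
    ("Head", "Ear"),
    ("Body Regions", "Lumbosacral Region"),
])

# Transitivity means ancestry follows every path upward
print(nx.ancestors(mesh_fragment, "Ear"))
# {'Anatomy', 'Sense Organs', 'Body Regions', 'Head'}
print(nx.ancestors(mesh_fragment, "Lumbosacral Region"))
# {'Body Regions'}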

Building knowledge graphs in Python

The most trustworthy method of building a knowledge graph is using human specialists. However, what we gain in precision, we lose in recall: there is a finite number of concepts that humans can analyze and find connections within, and the number of potential edges grows quadratically with the number of concepts. Hand curation of each possible edge is therefore impractical for expert-based graph creation.

To circumvent this issue, we can use crowdsourcing: harnessing the collective intelligence of a crowd, where people are either paid for their contributions or participate as incentivized volunteers. The participants are given access to a large knowledge graph of labeled nodes and edges, and analyze the concepts and relationships to determine which nodes are related and how. This method has been used successfully in many domains, such as medicine or geography.

We can also choose another, vertex-based approach. Instead of curating edges, we input as much information about as many specific nodes as we can, merging vertices that are redundant between different entities, as sketched below. This method lowers our recall, since we may forget to add existing connections, but it is much more scalable.
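
A minimal sketch of that merging step, assuming a toy alias table for normalizing entity names:

from collections import defaultdict

# Node records contributed by different sources (illustrative data)
records = [
    {"name": "Tumor", "source": "A"},
    {"name": "tumour", "source": "B"},  # same concept, British spelling
    {"name": "Neoplasm", "source": "C"},
]

ALIASES = {"tumour": "tumor"}  # toy normalization table

def canonical(name):
    # Lowercase the name and resolve known aliases to one canonical form
    name = name.lower()
    return ALIASES.get(name, name)

# Merge records that resolve to the same canonical vertex
vertices = defaultdict(list)
for record in records:
    vertices[canonical(record["name"])].append(record["source"])

print(dict(vertices))  # {'tumor': ['A', 'B'], 'neoplasm': ['C']}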

Finally, there are many automatic approaches to choose from. The most commonly used ones are based on natural language processing (NLP) algorithms and a knowledge base representation that stores each connection as a (subject, predicate, object) triplet. We can extract these triplets from text sources using a combination of NLP techniques: entity recognition, entity linking, dependency parsing and part-of-speech tagging.

To use spaCy for this purpose, you will first need to install the library and download a pre-trained model. This can be done using the following commands:

pip install spacy
python -m spacy download en_core_web_md

Once you have spaCy installed and the pre-trained model downloaded, you can start extracting knowledge triplets from text. Here is an example of how this can be done:

import spacy
nlp = spacy.load('en_core_web_md')

def chunk_text(token, doc):
    # Return the full noun chunk containing the token, or the bare token text
    for chunk in doc.noun_chunks:
        if chunk.start <= token.i < chunk.end:
            return chunk.text
    return token.text

def extract_triplets(sentence):
    doc = nlp(sentence)  # Parse the sentence with spaCy
    triplets = []
    for noun_chunk in doc.noun_chunks:  # Iterate over the noun chunks in the sentence
        # Only a noun chunk acting as the subject of a verb starts a triplet
        if noun_chunk.root.dep_ != "nsubj":
            continue
        subject = noun_chunk.text
        # The verb governing the subject becomes the predicate of the triplet
        verb = noun_chunk.root.head
        for token in verb.children:
            if token.dep_ == "dobj":
                # A direct object of the verb becomes the object of the triplet
                triplets.append((subject, verb.text, chunk_text(token, doc)))
            elif token.dep_ == "prep":
                # Objects attached through a preposition ("jumps over ...") also count
                for pobj in token.children:
                    if pobj.dep_ == "pobj":
                        triplets.append((subject, verb.text, chunk_text(pobj, doc)))
    return triplets

# Define a sentence to parse
sentence = "The quick brown fox jumps over the lazy dog."

# Extract the triplets from the sentence
triplets = extract_triplets(sentence)

# Print the triplets
for triplet in triplets:
    print(triplet)

This code finds each noun chunk that acts as the subject of a verb, takes that verb as the predicate, and uses the verb’s direct or prepositional object as the object of a triplet, returning them as a list. For the given sentence, it will output the following triplet:

('The quick brown fox', 'jumps', 'the lazy dog')

Keep in mind that this is just a simple example, and the actual implementation of your triplet extraction logic will depend on your specific use case and the text you are working with.

PubMed’s application of knowledge graphs

In the domain of biomedical publications, PubMed’s search engine uses knowledge graphs for both synonym and dependency recognition. Its knowledge graph (including, but not limited to, Medical Subject Headings, MeSH) is stored in the form of a tree, part of which is shown in the figure above. This KG doesn’t attach a relationship type to its edges, because all of them represent the same relationship: an edge pointing from X to Y means “Y belongs to X”.

PubMed’s search engine uses its KG both when indexing new documents and when querying preexisting collections. When new biomedical articles appear in the MEDLINE database, they are analyzed using the PubTator annotation service. This way each document is mapped to the predefined MEDLINE vocabulary of MeSH tags and subheadings, as well as

  • different spellings,
  • drug brand and generic names,
  • publication types,
  • pharmacologic action terms,
  • supplementary concept and substance names,
  • synonyms.

At query time PubMed uses the same approach, this time to enhance the user’s request with synonymous vertices of the knowledge graph. For example, were a user to search for “odontalgia”, PubMed would translate it into the following query: “toothache”[MeSH Terms] OR “toothache”[All Fields] OR “odontalgia”[All Fields] OR “odontalgias”[All Fields]
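
A toy version of this expansion, with a plain dictionary standing in for PubMed’s knowledge graph (the table contents and clause format are assumptions modeled on the example above):

# Maps a term to its synonym list; the first entry is the MeSH term
SYNONYMS = {
    "odontalgia": ["toothache", "odontalgia", "odontalgias"],
}

def expand_query(term):
    variants = SYNONYMS.get(term.lower(), [term.lower()])
    clauses = [f'"{variants[0]}"[MeSH Terms]']
    clauses += [f'"{variant}"[All Fields]' for variant in variants]
    return " OR ".join(clauses)

print(expand_query("odontalgia"))
# "toothache"[MeSH Terms] OR "toothache"[All Fields]
# OR "odontalgia"[All Fields] OR "odontalgias"[All Fields]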

However, for a word to be substituted, it needs to appear in the list of synonyms in exactly the same form — so, for example, aching tooth wouldn’t get recognized as a synonym for toothache. Fortunately, recent advances in NLP help alleviate this problem.
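
For instance, the word vectors shipped with the en_core_web_md model we downloaded earlier can score such paraphrases by similarity, with no synonym list involved. This is a sketch of the idea, not PubMed’s actual mechanism:

import spacy

nlp = spacy.load("en_core_web_md")

# Vector similarity catches paraphrases that an exact-match synonym list misses
query = nlp("aching tooth")
for candidate in ("toothache", "kyrie eleison"):
    # Higher scores indicate closer meaning
    print(candidate, query.similarity(nlp(candidate)))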

Conclusion

In summary, this blog post discussed semantic search and its potential benefits: an ideal semantic search engine understands the intent behind a query even when the queried documents do not contain the specific phrases used in it. It also discussed the role knowledge graphs can play in improving semantic search by encoding the relationships between concepts.

We walked through several methods for building knowledge graphs, including human curation, crowdsourcing, and automatic approaches based on natural language processing, and saw how PubMed’s search engine uses its knowledge graph to map documents to a predefined vocabulary and to enhance user queries with synonyms.

Follow me on Twitter for more machine learning!


Witold Wydmanski

PhD candidate in ML at GMUM UJ, bioinformatician at MCB UJ