Supporting search engines with knowledge and context — Part 1: Knowledge Graphs
The most common way to find answers on the Internet is to enter keywords into a web search engine (Google, Yandex, etc.) and browse the returned results. However, in some cases the user is looking for a short and precise answer (e.g. “Capital of France”, “Birth date of Obama”). Instead of having to navigate through many web pages, they could hope to obtain the answer directly in the form of a text description, as in the following case:
Apart from the traditional “ten blue links”, present-day search engines (SE) are increasingly being enriched with additional context that often comes from structured knowledge sources to enhance the user experience. Knowledge Graphs (KGs), which store world knowledge in the form of facts, are the most prominent structured knowledge source for search engines.
First, let’s briefly look at what a search engine is, what a knowledge graph is, and how the two work together.
Search engines, e.g. Google, are in charge of taking a query and retrieving an adequate answer for it. However, queries and questions are expressed in natural language, and like any question they can be phrased in many different ways while meaning the same thing, so a search engine must understand each query semantically, just like a human does, rather than merely matching keywords. A knowledge graph (KG) is one way to store real-world entities (i.e. people, organizations, events, or concepts) and the relationships between them. In the grand scheme of things, a KG is built by linking many knowledge bases into one single graph in which entities are connected by edges; this relationship information is usually stored in a graph database (e.g. storing facts as RDF triples <Subject, Predicate, Object>) and visualized as a graph structure.
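For the curious, here is a minimal sketch of what storing and querying one such fact looks like in Python with the rdflib library. The namespace and property names are made up for illustration; a real KG such as Wikidata or DBpedia uses its own vocabularies.

```python
# A minimal sketch of storing the fact <Bill Gates, FounderOf, Microsoft>
# as an RDF triple. The "http://example.org/" namespace is hypothetical.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace for this sketch
g = Graph()

# subject, predicate, object
g.add((EX.Bill_Gates, EX.founderOf, EX.Microsoft))

# Query the graph: who founded Microsoft?
for subj in g.subjects(predicate=EX.founderOf, object=EX.Microsoft):
    print(subj)  # -> http://example.org/Bill_Gates

# Serialize the whole graph in Turtle syntax
print(g.serialize(format="turtle"))
```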
Now that we got that out of the way and understand each of our components (search engine, knowledge graph), it’s time to see how they work together, and in particular how a KG supports a search engine in answering the user’s question.
Imagine we want to know who founded Microsoft. We go to Google and simply query “Who is the founder of Microsoft?”, and Google answers something like “Bill Gates is an American business magnate and the principal founder of Microsoft Corporation”. And we are done, we got what we asked for. Yeaah !!
Wait! 👀 Let’s see how this was done.
First, at the time of writing this article, Google has a large knowledge graph, the so-called Google Knowledge Graph (GKG), with 570 million nodes (entities) and 18 billion edges (relations). GKG entities are mapped back to the web text corpus from which they were extracted when the GKG was built.
For now, remember these four steps:
- Extract knowledge from the text query
- Map the extracted knowledge to GKG
- Complete the triplet with the missing entity
- Convert the triplet to human-readable text
The figure below walks you through each step:
Indeed, to expand on the four steps above: Google’s SE extracts a knowledge fact (triplet) from our query (our first how) and then maps it to the GKG (second how). After matching the knowledge between our query and the GKG, it retrieves and ranks the documents that are relevant to answering our question (third how). Finally, it shows the answer in a human-readable form (fourth how), mostly as a text description, just like the figure above illustrates.
It was quick, right? Don’t worry, we will dissect each one of our how(s) in nitty-gritty detail!
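To make the four steps concrete, here is a toy, self-contained Python sketch. Everything in it (the tiny in-memory KG, the helper names, the hard-coded question pattern) is invented purely for illustration; Google’s real pipeline is of course far more sophisticated.

```python
# Toy illustration of the four steps: extract a partial triplet from the query,
# map it to the KG, complete the missing slot, and verbalize the answer.

# Step 0: a miniature knowledge graph as (subject, predicate, object) triples.
KG = {
    ("Bill Gates", "FounderOf", "Microsoft"),
    ("Paul Allen", "FounderOf", "Microsoft"),
    ("Paris", "CapitalOf", "France"),
}

def extract_triplet(question: str):
    """Step 1: extract a partial triplet <?, predicate, entity> from the query.
    Only one hard-coded pattern here; real systems parse the question."""
    if question.lower().startswith("who is the founder of"):
        entity = question.rstrip("?").rsplit(" ", 1)[-1]
        return (None, "FounderOf", entity)
    raise ValueError("unsupported question in this toy example")

def complete_triplet(partial, kg):
    """Steps 2-3: map the partial triplet to the KG and fill the missing slot."""
    _, pred, obj = partial
    return [(s, p, o) for (s, p, o) in kg if p == pred and o == obj]

def verbalize(triplet):
    """Step 4: turn the completed triplet back into human-readable text."""
    s, p, o = triplet
    return f"{s} is a founder of {o}." if p == "FounderOf" else f"{s} {p} {o}."

partial = extract_triplet("Who is the founder of Microsoft?")
for fact in complete_triplet(partial, KG):
    print(verbalize(fact))
# -> Bill Gates is a founder of Microsoft.
# -> Paul Allen is a founder of Microsoft.
```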
But wait, what if there is no text on the web that matches our query, oh no!
The problem
Indeed, it’s true that given a KG fact extracted from the user’s query, we can retrieve the textual description of the fact from our text corpus, but this assumes that a relevant sentence already exists in the underlying text corpus. What happens when there is no explicit answer in the text corpus? How does a search engine generate this description then?
When we index content for search, it’s natural to think in terms of documents. But searchers aren’t necessarily looking for documents. They’re looking for information. That information may be a small part of a document or may be distributed across documents. And it’s up to the search engine to aggregate all of it into a short text that answers the user.
Furthermore, KG fact descriptions often contain mentions of other, related facts that provide additional context and thus increase the user’s understanding of the fact as a whole (e.g. Bill Gates founded Microsoft with Paul Allen). Given the large size of KGs, many facts could potentially be relevant to the fact of interest, thus how do we find those facts? How can we contextualize a KG query fact by retrieving other, related KG facts?
All these questions will be tackled gradually, step by step, in this series of articles. In this one, we will focus on the first step of this Search Engine Question Answering pipeline: given a text question like “Who is the founder of Microsoft?”, how can we extract the RDF triplet (i.e. <?, FounderOf, Microsoft>) so that we can map it to our KG, complete the missing part of the triplet to get <Bill Gates, FounderOf, Microsoft>, and then answer the user by converting the output triplet into a human-readable answer (text), as explained before.
Good?
Let’s get started😊!
Theoretically, the natural way to map questions into normal forms is to analyze the grammatical structure of the sentence. Some algorithms use the concrete syntax tree (or parse tree) representation to perform this task (Rusu et al., 2007; Defazio, 2009). Another popular representation is the dependency tree produced by the Stanford Parser (de Marneffe and Manning, 2013).
We will not go into detail on how a dependency tree is extracted from text with this approach. As an overview, simply remember that from a given sentence query we build a dependency tree using its grammatical structure, like in this example:
“Where was the president of the United States born?”
and then simplify it through some basic transformations (Merge, Remove, Replace) until we wind up with a triplet structure, as shown in this figure.
As you can see, after transformation our tree becomes more helpful in the sense that we split our search: first complete the triplet <?, president, United States>, whose result is then involved in another triplet to identify the missing part (in this case the place) in <Joe Biden, birthplace, ?>.
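If you want to play with the dependency tree yourself, here is a minimal sketch using spaCy (the same library the coreference step below runs on). The `en_core_web_sm` model choice is just an assumption for illustration; any English pipeline with a parser would do.

```python
# Minimal sketch: print the dependency tree of the example question with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Where was the president of the United States born?")

# Each token points to its syntactic head with a dependency label;
# the transformations (Merge, Remove, Replace) are then applied on this tree.
for token in doc:
    print(f"{token.text:<10} --{token.dep_:<6}--> {token.head.text}")
```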
If you are interested in how these transformations are performed, feel free to check this article, which can walk you through the nitty-gritty details of this approach 😊!
And if you want to know more about dependency parsing and associated algorithms in NLP, you can also check these: here1, here2!
However, in what follows we will focus on how to practically extract knowledge from text, so-called Information Extraction, so that we can build our own RDF extractor. We will go through the 3 main steps of information extraction:
- ✅ Coreference Resolution
- ✅ Entity Linking
- ✅ Relationship Extraction
What is an information extraction pipeline?
To put it in simple terms, information extraction is when you extract information from unstructured data such as text.
Pronouns like he, she, and it are essential in writing to avoid redundancy. Right, but how can our extractor understand that “he” in this sentence, “Benzema is the best football player in 2022, he actually deserves it”, refers to Benzema? That’s where the coreference resolution step comes in!
1. ✅ Coreference Resolution
There are a lot of projects that can be used for finding and resolving references in text. I can mention the one most used in production, NeuralCoref from HuggingFace 🤗, which runs on top of spaCy! You can give it a try here in the demo!
The model actually replaces the pronoun “he” with its reference “Benzema”! With this simple step, you can make your Information Extraction pipeline more accurate!
To know more about how that was done, check here!
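Here is a minimal sketch of that resolution step with NeuralCoref. Keep in mind that NeuralCoref only runs on top of spaCy 2.x (it is not compatible with spaCy 3), so the pinned versions below are an assumption for illustration.

```python
# Minimal sketch: resolve "he" -> "Benzema" with NeuralCoref on top of spaCy 2.x.
# Assumes: pip install spacy==2.1.0 neuralcoref
#          python -m spacy download en_core_web_sm
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  # adds the coreference component to the spaCy pipeline

doc = nlp("Benzema is the best football player in 2022, he actually deserves it")

print(doc._.has_coref)       # True
print(doc._.coref_clusters)  # e.g. [Benzema: [Benzema, he]]
print(doc._.coref_resolved)  # "Benzema is the best football player in 2022, Benzema actually deserves it"
```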
Wait 👀! What if we changed Benzema to Ronaldo and queried our engine with:
“How many golden ball Ronaldo has?”, how can a search engine know which Ronaldo we are talking about?
That’s where the Entity Linking step comes in 😊.
2. ✅ Entity Linking
Named Entity Linking, also called Entity Linking: we start by recognizing all the entities in the text. Once the Named Entity Recognition process finishes, we try to link those entities to a target knowledge base. Usually, the target knowledge bases are Wikipedia or DBpedia. This process is also known as Wikification!
As for the question of which Ronaldo, Entity Linking also performs Entity Disambiguation.
You can have a look at this figure and see how many footballers whose name contains “Ronaldo” there are in Wikipedia:
Actually, there are many models that can be used in this regard; I can mention the Wikifier API!
When we input our text, Wikifier will be in charge of recognizing the parts of speech, as shown in this figure:
And afterward, we get our annotations:
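For reference, here is a minimal sketch of how such annotations can be requested from the Wikifier API with plain Python. You need your own (free) user key from wikifier.org, and the field names below (“annotations”, “title”, “wikiDataItemId”, “pageRank”) follow the JSON the service returned at the time of writing, so treat them as an assumption.

```python
# Minimal sketch: annotate a query with the Wikifier web API (standard library only).
import json
import urllib.parse
import urllib.request

def wikify(text: str, user_key: str, lang: str = "en"):
    """Send the text to Wikifier and return the parsed JSON response."""
    data = urllib.parse.urlencode({
        "userKey": user_key,
        "text": text,
        "lang": lang,
    }).encode("utf-8")
    url = "http://www.wikifier.org/annotate-article"
    with urllib.request.urlopen(urllib.request.Request(url, data=data)) as resp:
        return json.loads(resp.read().decode("utf-8"))

response = wikify("How many golden ball Ronaldo has?", user_key="YOUR_KEY")  # hypothetical key
for ann in response.get("annotations", []):
    print(ann["title"], ann.get("wikiDataItemId"), ann.get("pageRank"))
```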
The output then recognizes Ronaldo as a Person, and also as the Brazilian footballer. The great thing about the wikification process is that it also returns the corresponding WikiData ids for the entities along with their titles. That’s great because you can always map your query to your Knowledge Graph using these ids, which will help us in future articles to map our extracted triplet to our KG. Otherwise, for those wondering why this particular Ronaldo was recognized, that is actually due to several reasons; I can mention some of them:
- First, because there was not enough context in our query to guide our linking to the Portuguese Cristiano Ronaldo 😄!
- Because the wikification process uses only limited metrics to rank candidates, such as PageRank, and since Cristiano’s page ranks lower than the Brazilian Ronaldo’s, the Brazilian was more qualified for this reason!
- Because of some SEO tricks!
Furthermore, here are all the candidates that our entity Ronaldo, based on the query we inputted, can be linked to, as shown in this table:
The first column refers to the PageRank score. Briefly, the PageRank score tells you how important your page is, i.e. how many “significant” pages on the web point to it (think graph 🧠), e.g. Machine Learning’s Wikipedia page will be scored more highly than Learning’s! Baam!
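To get an intuition for how such a score behaves, here is a tiny, made-up example with networkx; the pages and links are invented purely for illustration.

```python
# Toy PageRank illustration: pages pointed to by other pages score higher.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("Learning", "Machine Learning"),
    ("Deep Learning", "Machine Learning"),
    ("Statistics", "Machine Learning"),
    ("Machine Learning", "Learning"),
])

scores = nx.pagerank(g)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page:<16} {score:.3f}")
# "Machine Learning" ends up with the highest score because more pages link to it.
```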
In conclusion, in reality it’s a bit more complicated: Google uses more than one metric to rank candidates in Entity Linking; I can mention the KPI of how many clicks a link gets compared to another link. I believe that if we put the same query into Google’s search engine we would get Cristiano as a result, because the latter is more famous and the web is flooded with his content, so it’s more probable that the searcher is looking for Cristiano than for the other Ronaldos :) But in general, keep in mind that not giving enough context to our EL, especially if we use an API such as Wikifier that depends on limited metrics (PageRank, similarity, etc.), leads to poorer results; we must provide more context in the query so that similarity helps link to the adequate knowledge base entities.
If we come back to the first example we used at the beginning of this article, <Paris, CapitalOf, France>:
Now that we have our entities with the right reference to our knowledge base, it’s time to complete our triplet; we need the relation that links our entities! So!
3. ✅ Relationship Extraction
The last part of an information extraction pipeline is to extract the relation between entities.
Relation extraction is the task of extracting from natural language text relational triples such as:
(founders, SpaceX, Elon_Musk)
(has_spouse, Elon_Musk, Talulah_Riley)
(worked_at, Elon_Musk, Tesla_Motors)
If we can accumulate a large knowledge base (KB) of relational triples, we can use it to power question answering and other applications. Building a KB manually is too slow and expensive, but much of the knowledge we’d like to capture is already expressed in abundant text on the web. The aim of relation extraction, therefore, is to accelerate the construction of new KBs — and facilitate the ongoing curation of existing KBs — by extracting relational triples from natural language text.
Additionally, extracting relation triplets from raw text also enables multiple applications such as populating or validating knowledge bases, fact-checking, and other downstream tasks.
In fact, to build a knowledge graph from text, we typically need to perform two steps: extract entities, a.k.a. Named Entity Recognition (NER), to obtain the nodes of the knowledge graph, and extract relations between these entities, a.k.a. Relation Classification (RC), to link them and end up with the entities + edges of the knowledge graph. Good! These multi-step pipelines often propagate errors or are limited to a small number of relation types. Recently, end-to-end approaches have been proposed to tackle both tasks simultaneously. This task is usually referred to as Relation Extraction (RE). One of the great projects in this respect is an end-to-end model called REBEL, from the paper Relation Extraction By End-to-end Language generation.
To get an idea of how REBEL works, check the paper, or simply this Medium article, and also this one for the code part!
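As a quick taste, here is a minimal sketch of running REBEL through the HuggingFace transformers pipeline, adapted from the usage shown on the Babelscape/rebel-large model card; the example sentence is our own.

```python
# Minimal sketch: extract relation triplets with REBEL via transformers.
# Requires: pip install transformers (and torch)
from transformers import pipeline

triplet_extractor = pipeline(
    "text2text-generation",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large",
)

# Decode manually so REBEL's special markers (<triplet>, <subj>, <obj>) are kept.
output = triplet_extractor(
    "Bill Gates founded Microsoft with Paul Allen.",
    return_tensors=True,
    return_text=False,
)
decoded = triplet_extractor.tokenizer.batch_decode(
    [output[0]["generated_token_ids"]]
)[0]
print(decoded)
# The generated string interleaves <triplet>/<subj>/<obj> markers that can be
# parsed into (head, relation, tail) triples, e.g. (Microsoft, founded by, Bill Gates).
```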
To wrap up, there is an interesting lecture about Relation Extraction on Stanford’s YouTube channel!
Considerations and next steps
First, thank you for making it this far. We now have an overview of the first part, as an introduction to making structured knowledge more accessible to the user, by discussing how we can turn an unstructured question into a structured query of the knowledge graph (an RDF triplet <Subject, Predicate, Object>). In the next steps, we will see how we can retrieve textual descriptions of a given fact from a given text corpus. After that, how we can automatically generate a textual description of the fact in the absence of an existing description. Furthermore, we will see how we can contextualize a Knowledge Graph query fact by retrieving other, related KG facts that help the user access more structured knowledge about a given query in the search engine!
End of this part! :)
References
- https://yassine-hamoudi.github.io/files/other/RDFTriples.pdf
- https://web.stanford.edu/~jurafsky/slp3/14.pdf
- https://medium.com/data-science-in-your-pocket/dependency-parsing-associated-algorithms-in-nlp-96d65dd95d3e
- https://towardsdatascience.com/how-to-find-shortest-dependency-path-with-spacy-and-stanfordnlp-539d45d28239
- https://wikifier.org/
- https://colab.research.google.com/drive/12gwdua-Fs7H31HIc5frzavT6ofyeLWux?usp=sharing
- https://www.youtube.com/watch?v=pO3Jsr31s_Q