Text to Knowledge Graph

Knowledge Extraction Pipeline with Transformers

Prasad Y
The Startup
12 min read · Nov 14, 2020


Introduction

Unstructured data in the form of natural language text is a valuable source of ‘knowledge’. Industry experts highlight the value of harvesting the text assets that accumulate in the enterprise. However, the prevailing language processing solutions are incomplete and not suitable for wider deployment.

Neural language models like the recent ‘Transformers’ are an exciting advancement in Natural Language Processing; they promise huge improvements over current language processing solutions.

Image by author

This article discusses an automated pipeline based on neural language models that extracts knowledge from text and populates a Semantic Knowledge Graph.

The plan of the article is as follows:

a. Knowledge, Knowledge Graph & Text — The Why

b. Information Extraction: Entities, Relations, Linking — The How

c. A Language Processing Pipeline with Transformers — The What

d. Conclusion

Knowledge

Let us begin with the question, what is knowledge?

A working definition (without getting metaphysical 😊) is that knowledge is a set of entities or things that we deal with in the real world, together with their properties and the relations between them. As it turns out, this models the real world more naturally and accurately! Here is a sample nugget of knowledge:

Diagram of Knowledge — Entities (Nodes), Relations & Properties (Edges) — Image by author

Entities: SJOBS, NEXT, APPLE, PIXAR

Properties & Relations: birth-date, founded-by, name

Values: “02–24–1955”, “Next, Inc.”, “Steven Paul Jobs”, “Apple, Inc”, “Pixar Animation”

As shown above, entities, relations and properties form a graph of nodes and edges, making the graph structure a natural representation of knowledge. Knowledge structured in this form is an effective way to organize human thought: it clarifies intended meaning and supports easy integration of ‘knowledge’ accessed over different channels and modes of communication. The graph abstraction of knowledge enables efficient algorithms and software applications!
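
To make the graph abstraction concrete, here is a minimal sketch in Python using the rdflib library that encodes the nugget above as RDF triples; the ex: namespace and the identifiers (SJOBS, APPLE, NEXT, PIXAR) are illustrative placeholders, not an established vocabulary.

```python
# Minimal sketch: the knowledge nugget above as RDF triples (Python + rdflib).
# The ex: namespace and identifiers are made up for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/kg/")
g = Graph()
g.bind("ex", EX)

# Entities are nodes; properties and relations are edges.
g.add((EX.SJOBS, EX.name, Literal("Steven Paul Jobs")))
g.add((EX.SJOBS, EX["birth-date"], Literal("1955-02-24", datatype=XSD.date)))
g.add((EX.APPLE, EX.name, Literal("Apple, Inc")))
g.add((EX.APPLE, EX["founded-by"], EX.SJOBS))
g.add((EX.NEXT, EX.name, Literal("Next, Inc.")))
g.add((EX.NEXT, EX["founded-by"], EX.SJOBS))
g.add((EX.PIXAR, EX.name, Literal("Pixar Animation")))
g.add((EX.PIXAR, EX["founded-by"], EX.SJOBS))

print(g.serialize(format="turtle"))
```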

Knowledge Graph (Labeled Property or Semantic)

A working definition of ‘Knowledge Graph’ is entities, properties and relations stored in a graph database as nodes and edges. Knowledge, i.e. entities, properties and relations extracted from data sources such as text, can be structured as nodes and edges and loaded into graph databases with minimal impedance mismatch, unlike relational databases.

In practice, Knowledge Graphs (KGs) are planned to support a specific class of applications targeting a domain or, more generally, the enterprise. KGs are designed with suitable Ontologies (more below) to store the pertinent knowledge that supports these applications, i.e. knowledge about all the things relevant to the domain or the enterprise, as the case may be.

Two types of graph databases are used to build knowledge graphs: 1) the Semantic Graph (SG) and 2) the Labeled Property Graph (LPG). LPGs are optimized for efficient graph algorithms, while Semantic Graphs are optimized for Semantic Query & Search and Knowledge Management.

Knowledge, besides the external structure of its constituent entities, properties and relations, has an internal structure that constrains the semantic types the entities belong to. This internal structure, which constrains and ensures the consistency and accuracy of knowledge, is the ‘Ontology’: it stipulates the semantic types, their properties and the relations between the types. (A detailed explanation of ontologies is beyond the scope of this article.) Semantic Graphs, unlike LPGs, handle knowledge through their support for ontologies. Semantic Graphs are supersets of LPGs, and some vendor triple store implementations support labeled properties natively.
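
For a flavor of what an ontology stipulates, here is a hedged sketch (again Python with rdflib) of a tiny ontology fragment; the classes and the property below are invented for illustration and are not part of any standard vocabulary.

```python
# Tiny ontology fragment: two semantic types and one relation between them,
# with domain and range constraints. Vocabulary is illustrative only.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/kg/")
onto = Graph()
onto.bind("ex", EX)

# Semantic types (classes)
onto.add((EX.Person, RDF.type, OWL.Class))
onto.add((EX.Organization, RDF.type, OWL.Class))

# A relation between the types, constrained by domain and range
onto.add((EX.foundedBy, RDF.type, OWL.ObjectProperty))
onto.add((EX.foundedBy, RDFS.domain, EX.Organization))
onto.add((EX.foundedBy, RDFS.range, EX.Person))

print(onto.serialize(format="turtle"))
```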

Semantic Graph stores or ‘Triple stores’ are based on W3C standards for data encoding, serialization, querying and modeling. These standards are: URI (Uniform Resource Identifier), RDF (Resource Description Framework), SPARQL (the RDF query language) and OWL (Web Ontology Language). Semantic Knowledge Graphs require ‘shared common references or context-independent references’ as entity identifiers in the form of RDF URIs.

Example: The Amazon River <http://dbpedia.org/resource/Amazon_River>

Semantic technology uniquely enables data linking, relationship analysis, pattern detection and inference of new facts from semantically enriched datasets.
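
As an illustration of semantic querying over such shared references, the following sketch uses the SPARQLWrapper package to query the public DBpedia endpoint for facts about the Amazon River resource mentioned above (it assumes the package is installed and the public endpoint is reachable).

```python
# Query the public DBpedia SPARQL endpoint for a few facts about Amazon_River.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?p ?o
    WHERE { <http://dbpedia.org/resource/Amazon_River> ?p ?o . }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["p"]["value"], "->", row["o"]["value"])
```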

It is interesting to note that the industry limelight that was on machine learning has since been shining on ‘Knowledge Graphs’ as the next exciting thing in AI! Knowledge Graphs are gaining visibility and growing popular in recent times; Gartner’s Hype Cycle of July 2020 places them near the peak.

The rest of the article deals with Semantic Knowledge Graphs as the target Knowledge Graph for the extractions from text.

Text as Source of Knowledge

Where do the entities, properties and relations that make up knowledge come from? One source of knowledge in the enterprise is corporate relational databases, consisting of tables holding entities and attributes, with relations expressed as foreign keys to tables that hold other entities. These entities, attributes and relations are easily turned into the nodes and edges of a graph of knowledge and consumed in applications. Similarly, any structured data source has the necessary meta-information to guide the (re)structuring of data into nodes and edges of a graph of knowledge. Besides the structured sources, the other important source of knowledge is text (unstructured data).

Enterprises accumulate copious amounts of text in the form of memos, contracts, articles, emails, product reviews, manuals, reports, news, etc., which are a trove of knowledge. It is not surprising, as humans are the ultimate producers and consumers of knowledge, that knowledge (entities, properties and relations) shows up in spoken and printed language (text). Text is the de facto source, laden with ‘knowledge’! For example, here is the text of a sentence: “George Martin, 72, lives in Santa Fe, New Mexico, with his wife, Paris McBride.” This sentence contains entities, relations and properties, making it a source of knowledge.

Entities: “George Martin”, “Santa Fe”, “Paris McBride”, “New Mexico”

Relations & Properties: “lives in”, “wife”

Values: “72”

Note that with appropriate linguistic rules we can generate the “located in” and “age” relations, which are not explicitly present in the source sentence.
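
As a quick illustration of the surface-level tagging step, the sketch below runs the sentence through spaCy’s small English model; the exact labels depend on the model version, and this covers only entity tagging, not the derived relations.

```python
# Entity tagging of the example sentence with spaCy
# (assumes the 'en_core_web_sm' model is installed; labels may vary by model version).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("George Martin, 72, lives in Santa Fe, New Mexico, with his wife, Paris McBride.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Expected (roughly): George Martin -> PERSON, 72 -> CARDINAL/DATE,
# Santa Fe -> GPE, New Mexico -> GPE, Paris McBride -> PERSON
```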

Diagram of the (Unlinked) knowledge — Image by author

For the knowledge extracted from the sentence to be added into a semantic knowledge graph, other things must be accomplished. For example, the string “George Martin” must be resolved to its correct reference in our prior knowledge while disambiguating among multiple references, if they exist! (This resolution in humans underscores recognition and understanding!) Entity linking, which addresses this problem, is discussed later in this article. As we can see, end-to-end language processing pipelines must be capable of extracting the conveyed meaning beyond what is explicitly stated in the text.

Information Extraction

Knowledge extraction falls under the broader subject of Information Extraction. Information Extraction (IE) deals with extracting structured information related to a specific topic from structured, semi-structured and unstructured data sources such as data entry forms, databases, websites, text documents, etc.

At its simplest, IE involves extracting fillers for template slots such as data entry forms, job applications, sign-up forms, etc. It is more involved when extracting information from web scrapes or extracting concrete data such as human protein codes, GS1 product codes, zip codes, IP addresses, email addresses, etc.

On the other hand, IE is considerably more complicated when extracting knowledge, as in the example sentence in the previous section. The process consists of several broad stages in a pipeline (a minimal skeleton in code follows the list). These stages are:

1. Preprocess

2. Resolve co-references, classify entities (NER)

3. Extract Relations

4. Link to Knowledgebase

5. Ingest into target Knowledge Graph
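
Here is a minimal skeleton in Python of how the five stages might be chained; the function names are hypothetical placeholders, not an actual implementation.

```python
# Hypothetical skeleton of the five-stage knowledge extraction pipeline.
def preprocess(text):                        # 1. sentence splitting, tokenization, POS tags
    ...

def resolve_and_classify(doc):               # 2. co-reference resolution + NER
    ...

def extract_relations(doc, entities):        # 3. relation classification between entity pairs
    ...

def link_entities(entities, relations, kb):  # 4. resolve mentions to knowledgebase URIs, build triples
    ...

def ingest(triples, graph_store):            # 5. load RDF triples into the target knowledge graph
    ...

def run_pipeline(text, kb, graph_store):
    doc = preprocess(text)                               # stage 1
    entities = resolve_and_classify(doc)                 # stage 2
    relations = extract_relations(doc, entities)         # stage 3
    triples = link_entities(entities, relations, kb)     # stage 4
    ingest(triples, graph_store)                         # stage 5
```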

These end-to-end pipelines include rule-based processing, machine learning (ML) classifiers, and static configurations. A big part of the solution is resolving surface word forms to the correct conceptual entity (shared common reference) while disambiguating similar words that are actually different conceptual entities. The linking phase of the pipeline adds semantic annotations to the entities (see the Entities & Relations section below) based on the target semantic knowledge graph. Types of entities are mapped to matching RDF classes and instance URIs. For example, the strings “Gates”, “Founder of Microsoft”, or “the Philanthropist” must resolve to the URI of “Bill Gates”, and the word “Apple” must resolve either to the fruit or to the company that produces the “iPhone”.

Open Information Extraction utilizes open domain data sources such as Wikipedia as sources of training data and as target knowledgebase for the linking phase. Enterprises might additionally use internal databases.

Entities & Relations

Entity extraction (or Named Entity Recognition, NER) involves the extraction and tagging of sequences of word tokens from the text into pre-defined classes. Recognizing and tagging correctly involves resolving ‘co-references’, disambiguating similar mentions, etc. As in the example in the previous section: “Apple” can refer to the organization “Apple Inc.” or the fruit, depending on the context. At different stages in the processing, syntactic and domain-specific semantic rules guide the identification of entities and the disambiguation of meaning and context. Extracted entities are classified as people, places, organizations, events, products, numerals (dates, currency amounts, phone numbers, times, durations, frequencies, weights, etc.), and so on, which helps machines understand what the text is about. This makes the tagged information answer fundamental questions such as the ‘who, what, when, etc.’ of the text.

Similarly, relation extraction targets deriving meaningful associations between entities from spans of words that (may) include verbs. Approaches that use lookup lists of indicative phrases to match known relations are ineffective and do not scale: natural language accommodates a limitless variety of ‘indicative phrases’ for a specific relationship, and ML classifiers work better in these settings. For example, the relation ‘works at’ between a ‘Person’ and an ‘Organization’ can appear in an arbitrarily complex variety of expressions in natural language: “Satya Nadella of Microsoft…”, “Microsoft CEO, Satya Nadella…”, “Satya Nadella and his colleagues at Microsoft…”; you get the point! Large pre-trained models fine-tuned on learning tasks like language entailment or inference produce reasonable predictions of relation tags.
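
One way to approximate such relation tagging with a model fine-tuned on language entailment (NLI) is Hugging Face’s zero-shot classification pipeline; the model choice and the candidate relation labels below are illustrative assumptions, not the setup described later in this article.

```python
# Zero-shot relation tagging via an NLI-fine-tuned model (Hugging Face transformers).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = "Microsoft CEO, Satya Nadella, announced the new cloud service."
candidate_relations = ["works at", "founded", "born in", "located in"]

result = classifier(sentence, candidate_labels=candidate_relations)
print(list(zip(result["labels"], result["scores"])))
# 'works at' should receive the highest score for this sentence.
```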

Linking to target Knowledgebase

Entity linking involves matching the mentions (surface forms) to their corresponding entries in a target repository (knowledgebase). The linking process disambiguates and resolves strings of words to their correct entities in the target knowledgebase.

In open-domain information extraction, open data repositories like DBpedia serve as the target knowledgebase against which the extracted entities are resolved. DBpedia Spotlight is a web service that automatically annotates ‘textual mentions’ of DBpedia resources; in the case of a private enterprise, however, the target knowledgebase is likely to be a proprietary data repository. Curated dictionaries are used for entity linking as well: for some known entities, like mentions of countries, lists of all countries are available in the public domain. Solutions support discovery of new entities through additional curation process flows.
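
For reference, a mention can be sent to the public DBpedia Spotlight web service roughly as shown below; the endpoint URL, parameters and response fields reflect the service as I understand it and should be verified before relying on them.

```python
# Sketch: annotate a short text against DBpedia via the public Spotlight service.
import requests

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "George Martin lives in Santa Fe, New Mexico.", "confidence": 0.5},
    headers={"Accept": "application/json"},
)
for res in resp.json().get("Resources", []):
    print(res["@surfaceForm"], "->", res["@URI"])
```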

There are various ways to perform entity linking, ranging from string matching, to applying sophisticated rules for exact and fuzzy matches, to using machine learning classifiers in a supervised setting, to semantic resolution techniques that use sub-graph matching.
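
At the simplest end of that spectrum, fuzzy string matching against a curated dictionary can be sketched with nothing more than the Python standard library; the knowledgebase entries and URIs below are made up for illustration.

```python
# Toy entity linking by fuzzy string matching against a curated dictionary.
import difflib

knowledgebase = {
    "George R. R. Martin": "http://example.org/kg/GMARTIN",
    "Santa Fe":            "http://example.org/kg/SANTA_FE",
    "New Mexico":          "http://example.org/kg/NEW_MEXICO",
}

def link(mention, kb, cutoff=0.6):
    """Return the URI of the closest known surface form, or None if no match clears the cutoff."""
    match = difflib.get_close_matches(mention, kb.keys(), n=1, cutoff=cutoff)
    return kb[match[0]] if match else None

print(link("George Martin", knowledgebase))  # -> http://example.org/kg/GMARTIN
```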

Below is the output graph at the end of the pipeline processing of the sentence: “George Martin, 72, lives in Santa Fe, New Mexico, with his wife, Paris McBride.”

Named Entities: “George Martin”, “Santa Fe”, “Paris McBride”, “New Mexico”

Relations & Properties: “lives in”, “wife”

Numerical Value: “72”

Image by author

The figure on the left above shows the tagged extractions from the text. The two unmentioned relations ‘located-in’ and ‘age’ are generated using a special set of linguistic rules. The linking stage matches the tagged extractions to the correct entries in the knowledgebase and ultimately produces the graph in the figure on the right. (Labels in ALL-CAPS under the icons are unique object IDs in the knowledgebase.)

A Language Processing Pipeline with Transformers

Rule-based pipeline logic works in environments where the input text has limited variability, but it doesn’t scale to the noisy ambiguity of natural text. ML techniques are more suitable for dealing with text, and ML classifiers are effective at extracting entities and relations. In practice, a combination of rules and ML models is better suited to complex knowledge extraction tasks: rules are hand-crafted and derived from linguistic theory, while ML classifiers are trained on suitable datasets.

Natural Language Processing (NLP) techniques detect structure, identify sentences and the parts-of-speech tags within them, such as ‘noun’, ‘pronoun’, ‘adjective’, etc., as well as higher-level tags such as the ‘subject’ or ‘object’ of the sentence, while machine learning handles the task of classifying word tokens into categories such as ‘person’, ‘organization’, ‘location’, etc. Similarly, with labeled training data, such as pairs of ‘person’ and ‘organization’ for a ‘works-for’ relationship, an ML classifier learns the patterns to tag the relation appropriately. However, the training data needed for ML classifiers is a bottleneck!

Neural language models based on Transformers mitigate the problem of hard-to-get training datasets. This new class of language models, based on transfer learning, is huge, with hundreds of millions if not billions of parameters, pre-trained on hundreds of gigabytes of text such as Wikipedia, BookCorpus (10,000+ books of different genres), etc. An essential aspect of these models is that they can be fine-tuned, i.e. the weights of the models are updated by training on new datasets specific to the target use case. Pre-trained on a large general text corpus, these models need only small amounts of task-specific training data during the fine-tuning process. XLNet, BERT and other similar language models work well as named entity and relation classifiers with appropriate learning objectives. NLP toolkits such as Stanford CoreNLP, Stanza and spaCy, which make use of linguistic knowledge, are apt for the preprocessing stage of the pipeline.
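
As one concrete illustration, a BERT model that has already been fine-tuned for NER can be used through the Hugging Face pipeline API as sketched below; the model name is a publicly available example, not the classifier used in the pipeline described next.

```python
# NER with a publicly available BERT model fine-tuned on an NER dataset.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for ent in ner("Satya Nadella and his colleagues at Microsoft met in Redmond."):
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))
# Expected tags: Satya Nadella -> PER, Microsoft -> ORG, Redmond -> LOC
```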

TextDistil from Lead Semantics is an end-to-end language processing pipeline that takes text documents in English as input and outputs triples with entities, properties and relations in the standard RDF N-Quads format (N-Quads is a line-based serialization format for RDF triples, or facts, with an optional named-graph context). This output can be readily ingested into standard RDF triple stores. The TextDistil pipeline consists of modules that use CoreNLP for preprocessing, BERT and XLNet classifiers for named entity and relation classification, and a custom module for entity linking. During the linking phase, extracted entities and relations are mapped to RDF class instances and predicates (datatype and object properties) of the target knowledge graph.
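
To give a sense of how such an N-Quads file can be consumed downstream, here is a minimal rdflib sketch; the file name is illustrative, and a production load would more likely use the triple store vendor’s bulk loader or SPARQL endpoint.

```python
# Load an RDF N-Quads file into an in-memory rdflib graph for inspection.
from rdflib import ConjunctiveGraph

g = ConjunctiveGraph()
g.parse("textdistil_output.nq", format="nquads")  # illustrative file name

print(len(g), "triples loaded across", len(list(g.contexts())), "named graphs")
```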

In a test run, TextDistil processed 10,000 business news articles from the public Reuters-21578 corpus and output an RDF N-Quads file with roughly 41,000 triples about 955 PERSONs, 10,392 ORGANIZATIONS, 1,146 GEO_POLITICAL_ENTITIES, 17 ARTIFACTS+PRODUCTS, 4 EVENTS and 13 PREDICATES. (Caveat: in a typical enterprise situation, these triples would be subject to further validation procedures before being promoted to the production knowledge graph!)

Below is a screenshot of the verification tool called ‘FactCheck’, which shows text articles processed through the pipeline along with their output as an interactive RDF graph. FactCheck highlights the corresponding sentence from the text document(s) when a triple in the graph is selected, helping to verify whether the triple is valid.

Key observations of the test are:

  • A large number (~0.19 million) of candidate triples whose predicted relations did not match during linking were discarded! There may well be correct triples in this pile, but the pipeline may need to be modified to recover them
  • FactCheck has proven to be invaluable in debugging the pipeline. At the time of this writing we had completed visual verification of 15 documents (~ 250 sentences)
  • The output RDF N-Quads file with 41,437 triples was loaded into instances of the target Knowledge Graph in three different vendor triple stores: Amazon Neptune, Franz Inc. AllegroGraph and Cambridge Semantics AnzoGraph. Differences between RDF and RDF* were considered, and each store needed its own ‘Ontology’ (simple, but slightly different from the others)
  • End to end automated knowledge extraction pipelines with neural language models are practical for use in production environments!
Image by author

Conclusion

Industry experts highlight the value of harvesting the text assets that accumulate in the enterprise. Natural language is messy, and it is nearly impossible to achieve highly accurate extraction of the knowledge it carries.

Prevailing language processing solutions are incomplete for wider deployments. Recent neural language models like the ‘Transformers’ enable improved and more automated ‘knowledge’ extraction pipelines like TextDistil. Visual tools like FactCheck are invaluable for debugging end to end language processing pipelines.

Knowledge Graphs based on Semantic Technology are better suited for Enterprise Knowledge Management, which is critical to organizations. Mature, W3C-standards-based semantic triple stores have been in production use for a number of years and are well suited for Enterprise Knowledge Graphs.

In summary, end-to-end language processing pipelines with the latest neural models, like TextDistil, are practical and work well as automatic knowledge feeds from unstructured sources into Enterprise Knowledge Graph(s).


Prasad Y
The Startup

CEO, Lead Semantics (leadsemantics.com) — an NLP & Semantic Graph company focused on distilling Knowledge from Text