Knowledge extraction from unstructured texts
There is an unreasonable amount of information that can be extracted from what people publicly say on the internet. At Heuritech we use this information to better understand what people want, which products they like and why. This post explains, from a scientific point of view, what knowledge extraction is and details a few recent methods for doing it.
What is knowledge extraction?
Highly structured databases are easy to reason over and can be used for inference. For example, in WikiData or YAGO, entities are isolated and linked together with relations. However, most human knowledge is expressed as unstructured text, which is very hard to reason over and draw conclusions from. Consider the example here:
The raw text on the left contains a lot of useful information in an unstructured form, such as birthday, nationality and activity. Extracting this information is a challenging problem in Natural Language Processing, which may require sentence parsing (mapping natural language to machine-interpretable representations), entity detection and multi-reference resolution to aggregate information about the same entity. One motivation for knowledge extraction is Question Answering: in a structured knowledge base, one can issue a query and get the requested information back. Another application is arbitrarily complex reasoning, performed by finding paths in a graph of extracted knowledge. Within knowledge extraction, one can be interested in hypernymy, where an entity is included within a broader one, or in relation extraction.
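To make the query and path-finding ideas concrete, here is a minimal sketch in Python. It assumes facts are stored as (entity, relation, entity) tuples; the toy facts, the relation names (`born_in`, `capital_of`, `occupation`, `is_a`) and the helper functions `query` and `find_path` are all invented for this illustration and do not come from WikiData, YAGO or any particular system.

```python
from collections import deque

# Toy facts stored as (head entity, relation, tail entity).
# Relation names are made up for this illustration.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
    ("Marie Curie", "occupation", "physicist"),
    ("physicist", "is_a", "scientist"),  # a hypernymy-style relation
]


def query(triples, head=None, relation=None, tail=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (h, r, t)
        for h, r, t in triples
        if (head is None or h == head)
        and (relation is None or r == relation)
        and (tail is None or t == tail)
    ]


def find_path(triples, start, goal):
    """Breadth-first search along directed edges.

    Returns the list of triples linking start to goal,
    or None if goal is not reachable from start.
    """
    # Index outgoing edges by head entity.
    outgoing = {}
    for h, r, t in triples:
        outgoing.setdefault(h, []).append((r, t))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for r, t in outgoing.get(node, []):
            if t not in visited:
                visited.add(t)
                queue.append((t, path + [(node, r, t)]))
    return None


# Question answering as a pattern query: where was Marie Curie born?
print(query(TRIPLES, head="Marie Curie", relation="born_in"))

# Simple multi-hop reasoning: how is Marie Curie connected to Poland?
print(find_path(TRIPLES, "Marie Curie", "Poland"))
```

A real knowledge base would use a dedicated store and query language rather than a Python list, but the principle is the same: once text has been turned into structured facts, answering a question reduces to matching a pattern or following edges.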
The purpose of this blog post is to review methods that make it possible to acquire and extract structured information, either from raw texts or from pre-existing Knowledge Graphs. More precisely, we aim at semantically parsing a text in order to extract entities and/or relations. We define a triplet in a sentence as a relation r between two entities e1 and e2: (e1, r, e2). A Knowledge Graph (KG) denotes a collection of triplets that together form a graph: vertices are entities and edges are relations. Most of the articles presented below assume that entities are identified and disambiguated. In practice, this…