Challenges of Knowledge Graphs

From Strings to Things — An Introduction

Sebastien Dery
Dec 1, 2016 · 7 min read

#tldr Search and even semantic search are simply not enough these days. Users request condensed information that is easy to ingest in order to make sense of an ever more complicated world. This requires a new approach to uncovering and presenting information, one that relies on aggregated facts and knowledge. This post begins by exploring the typical workflow and pain points faced by teams diving into the highly challenging task of extracting and organizing what is currently known about the world.

Deep Learning is stealing the spotlight in Silicon Valley, no illusion. But it’s actually a whole different beast feeding and enriching the daily results proposed by your favorite search engine. From the moment you click search, multiple attempts are made at understanding what your query is all about. After all, Google’s core business relies on answering people’s questions as accurately and rapidly as possible. So how do you teach your system, and ultimately your users, that Machine Learning and Data Science are related? Or that Noam Chomsky is an American linguist? What about the height of Mount Makalu? Or that word embedding is part of natural language processing? Easy: you figure out the answers and tell them about it. Or at least that’s the idea behind Knowledge Graphs (KG).

In 2012 Google officially announced their own version, creatively titled Knowledge Graph, as a first step towards searching not only for pages that match your query terms but also for the “entities” that the words describe. The goal is to give you the known facts about various things in one neat little card, taking you from zero to hero in a single read-through. Computers don’t need to be symbolic communicators but humans apparently do.

Beyond the obvious searching and displaying of information about entities, these interconnected units of knowledge actually power and enhance multiple backend features:

  • Disambiguating and recognizing entities in context

How does it work?

KGs come in different shapes and sizes: they emerge from both companies and open source communities, are human curated or automatically generated, and have either a fixed ontology or one that is continuously expanded. Regardless of their differences, most KGs follow a simple principle: organizing information in a structured way by explicitly describing the relations among entities.

Simple enough, isn’t it?

From strings to things, knowledge graphs aim to structure what is known about the world. From powering up search to providing quick summaries of known entities, they make information that much easier to discover and enable world-aware inferences.

Creating a large knowledge base is actually quite a challenge, in part due to the difficulties and subtleties of language and the ethereal, transient nature of knowing something (facts and knowledge are continuously evolving). For all the language understanding thrown around in conferences and recent developments in deep learning (e.g. dynamic memory networks), there is still no universal algorithm for parsing text and distilling a thorough, unambiguous understanding of it. We still have a long way to go, and until then, current algorithms have definite difficulties deciphering the meaning and intent behind words. Take for example:

Difficulties of language you say?

This is even more obvious when you consider the wealth of cultural and historical information each of your users brings to the search box. For example, you may consider “to be or not to be” an obvious quote from Shakespeare, but someone not familiar with his work is likely to think it’s just a quirky amalgam of words. Your system needs to be aware of these scenarios! And if it could recommend similar works of literature that’d be great, thank you very much.

There are of course various techniques to help you on your journey, but you soon realize the difficulties are not purely algorithmic and cut across multiple domains of engineering. This post will outline some of the major pain points along with potential solutions for you to consider.

At its most basic, the end game of any knowledge mining is a list of entities (i.e. recognizable sequences of characters with a specific meaning) and triplets, often simply described as a subject, a predicate, and an object (or entity-attribute-value, among many other variants…).

The basic interpretation of a triplet is a subject, object, and a predicate linking the two.
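To make the idea concrete, here is a minimal sketch in Python of triplets as plain data, using facts from the examples earlier in this post. The predicate names (`occupation`, `part_of`, and so on) and the `Triple`, `build_index` and `describe` helpers are made up for illustration; real knowledge graphs each define their own vocabulary and storage.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: a subject, an object, and the predicate linking the two."""
    subject: str
    predicate: str
    obj: str


# Facts borrowed from the questions asked earlier in the post.
# The predicate names are hypothetical, not taken from any real ontology.
TRIPLES = [
    Triple("Noam Chomsky", "occupation", "linguist"),
    Triple("Noam Chomsky", "nationality", "American"),
    Triple("word embedding", "part_of", "natural language processing"),
    Triple("machine learning", "related_to", "data science"),
]


def build_index(triples):
    """Group triples by subject so every fact about an entity is one lookup away."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append((t.predicate, t.obj))
    return index


def describe(index, entity):
    """Return the (predicate, object) pairs known for an entity, or an empty list."""
    return index.get(entity, [])


index = build_index(TRIPLES)
card = describe(index, "Noam Chomsky")
unknown = describe(index, "Mount Makalu")  # nothing stored about it yet
```

Grouping by subject is roughly what lets a system assemble one of those neat little cards: everything known about an entity comes back in a single lookup, and an entity with no stored facts simply comes back empty.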

All we need to do is extract entities, uniquely resolve them, and link them together. Is it as easy as it sounds? … Does it sound easy?
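Here is a deliberately naive sketch of those three steps in Python: one hand-written pattern for extraction, a lookup table for resolution, and a set for linking without duplicates. Everything in it is an assumption made for illustration, including the `ALIASES` table, the `ent:` identifiers and the `instance_of` predicate. Real systems replace each step with far more robust machinery, which is exactly where the difficulty lies.

```python
import re

# Hypothetical alias table mapping surface strings to canonical entity IDs.
ALIASES = {
    "noam chomsky": "ent:chomsky",
    "chomsky": "ent:chomsky",
    "linguist": "ent:linguist",
    "mount makalu": "ent:makalu",
    "makalu": "ent:makalu",
    "mountain": "ent:mountain",
}

# A single toy pattern: "<X> is a/an <Y>" becomes (X, instance_of, Y).
IS_A = re.compile(r"^(?P<subj>.+?) is an? (?P<obj>.+?)\.?$", re.IGNORECASE)


def resolve(mention):
    """Map a mention to its canonical ID, or None when the mention is unknown."""
    return ALIASES.get(mention.strip().lower())


def extract_triples(sentences):
    """Extract, resolve and link: keep only triples whose both ends resolve."""
    triples = []
    for sentence in sentences:
        match = IS_A.match(sentence.strip())
        if not match:
            continue  # the pattern does not cover this sentence
        subj = resolve(match["subj"])
        obj = resolve(match["obj"])
        if subj and obj:
            triples.append((subj, "instance_of", obj))
    return triples


sentences = [
    "Noam Chomsky is a linguist.",
    "Chomsky is a linguist.",   # same fact, different surface form
    "Makalu is a mountain.",
    "To be or not to be.",      # no pattern match, silently skipped
]
triples = extract_triples(sentences)
unique = sorted(set(triples))   # resolution lets the duplicate collapse
```

The two Chomsky sentences collapse into one fact only because both mentions resolve to the same identifier. Change the wording, the language, or the alias table and this toy falls apart.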

For readability I’ll be splitting this post into a multi-part series diving into various aspects of the challenges involved in building and maintaining knowledge graphs (from algorithms to storage) while sharing some code and tips to get you started.

Projects you should know about

Knowledge Graphs are not a new idea that came out of the blue (you can actually trace them back to the late 1960s). A lot of research and many projects have been dedicated to this idea. With the wealth of projects out there it’s easy to feel some kind of vertigo; there always seems to be something worth knowing, a new or old obscure project you’ve never heard of (but for some reason everybody else has). Here’s a quick list of projects you should know about to get you up to speed. By all means shout out if you feel one is missing that you’d like to see (this is really not an exhaustive list).

Knowledge graph of Knowledge graphs
  • Never-Ending Language Learning (NELL): Research project from Carnegie Mellon University attempting to create a computer system that learns over time to read the web (over 50 million candidate beliefs).

What’s the big difference between them? Time, money, domain, approach and supporting organization. At the end of the day, extracting, disambiguating and linking the entities of the world remain open problems. There is no one-size-fits-all solution that truly and automatically makes sense of the knowledge embedded within natural language with all of its subtleties. This has led to a considerable amount of exploration and specialization. After all, the struggle of making sense of information is one that every individual shares.

Part 2–? coming soon…

Beyond the obvious, the area of Knowledge Graphs entertains a number of open research questions, namely related to: acquisition, growth, aggregation, storage, veracity, time-dependency, search, crowdsourcing, cold start, deduplication, standardization, harmonization, and language support (to name a few). Stay tuned for the following posts, where I’ll attempt to distill some solutions from current practices and point out where we are still failing.
