Challenges of Knowledge Graphs
From Strings to Things — An Introduction
#tldr Search and even semantic search are simply not enough these days. Users request condensed information that is easy to ingest in order to make sense of an ever more complicated world. This requires a new approach to uncovering and presenting information, one that relies on aggregated facts and knowledge. This post begins by exploring the typical workflow and pain points faced by teams diving into the highly challenging task of extracting and organizing what is currently known about the world.
Deep Learning is stealing the spotlight in Silicon Valley, no illusion. But it’s actually a whole different beast feeding and enriching the daily results proposed by your favorite search engine. From the moment you click search, multiple attempts are made at understanding what your query is all about. After all, Google’s core business relies on answering people’s questions as accurately and rapidly as possible. So how do you teach your system, and ultimately your users, that Machine Learning and Data Science are related? Or that Noam Chomsky is an American linguist? What about the height of Mount Makalu? And that word embedding is part of natural language processing? Easy, you figure out the answers and tell them about it. Or at least that’s the idea behind Knowledge Graphs (KG).
In 2012 Google officially announced its own version, creatively titled Knowledge Graph, as a first step towards searching not only for pages that match your query terms but also for the “entities” that the words describe. The goal is to give you the known facts about various things in one neat little card, taking you from zero to hero in a single read-through. Computers don’t need to be symbolic communicators but humans apparently do.
Beyond the obvious searching and displaying of information about entities, these interconnected units of knowledge actually power and enhance multiple backend features:
- Disambiguating and recognizing entities in context
- Data expansion to enrich semantic search
- Connecting entities to content and data sources
- Recommendation engine for related information
- Entity-centric user interfaces
- Inferential reasoning
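To make one of the features above a little more concrete, here is a minimal Python sketch of data expansion: widening a search query with terms related to an entity it mentions. The tiny `RELATED` table and the `expand_query` helper are hypothetical, invented for this post, and stand in for a lookup against a real knowledge graph.

```python
# Hypothetical, hand-written slice of a knowledge graph: entity -> related terms.
RELATED = {
    "machine learning": ["data science"],
    "word embedding": ["natural language processing"],
}


def expand_query(query: str) -> list[str]:
    """Return the original query followed by terms related to any entity it mentions."""
    terms = [query]
    lowered = query.lower()
    for entity, related in RELATED.items():
        if entity in lowered:
            # Add each related term once, preserving order.
            terms.extend(t for t in related if t not in terms)
    return terms


print(expand_query("Intro to word embedding"))
```

A search backend could then run the extra terms alongside the original query. Plain substring matching is only good enough for a sketch; recognizing entities reliably is one of the harder problems discussed below.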
How does it work?
KGs come in different shapes and sizes: they emerge from both companies and open source communities, are human curated or automatically generated, and have a fixed ontology or one that is continuously expanded. Regardless of their differences, most KGs follow a simple principle: organizing information in a structured way by explicitly describing the relations among entities.
Simple enough, isn’t it?
Creating a large knowledge base is actually quite a challenge, in part due to the difficulties and subtleties of language and the ethereal, transient nature of knowing something (facts and knowledge are continuously evolving). For all the language understanding thrown around in conferences and recent developments in deep learning (e.g. dynamic memory networks), there is still no universal algorithm for parsing and distilling a thorough, unambiguous understanding of text. We still have a long way to go and until then, current algorithms have definite difficulties deciphering the meaning and intent behind words. Take for example:
This is even more obvious when you consider the wealth of cultural and historical information each of your users brings to the search box. For example, you may consider “to be or not to be” an obvious quote from Shakespeare but someone not familiar with his work is likely to think it’s just a quirky amalgam of words. Your system needs to be aware of these scenarios! And if it could recommend similar works of literature that’d be great, thank you very much.
There are of course various techniques to help you on your journey but soon you realize the difficulties are not purely algorithmic and transcend multiple domains of engineering. This post will outline some of the major pain points along with potential avenues of solutions for you to consider.
At its most basic, the end game of any knowledge mining effort is a list of entities (i.e. recognizable sequences of characters with a specific meaning) and triplets, often described simply as subject, predicate, and object (or entity-attribute-value, among many other variants…).
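As a rough sketch of what such triplets can look like in code, the Python snippet below stores a few facts from the examples earlier in this post and queries them by pattern. The `Triple` type, the `match` helper, and the predicate names (`occupation`, `part_of`, and so on) are assumptions made for illustration, not a standard vocabulary.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Facts borrowed from the questions asked earlier in this post.
TRIPLES = [
    Triple("Noam Chomsky", "occupation", "linguist"),
    Triple("Noam Chomsky", "nationality", "American"),
    Triple("word embedding", "part_of", "natural language processing"),
    Triple("Machine Learning", "related_to", "Data Science"),
]


def match(subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> list[Triple]:
    """Return every triple that agrees with the given fields (None is a wildcard)."""
    return [
        t for t in TRIPLES
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything we know about one entity.
for t in match(subject="Noam Chomsky"):
    print(t.predicate, "->", t.obj)
```

A flat list scanned linearly is fine for a handful of facts; real systems index triples by subject, predicate, and object so that the same pattern query stays fast at scale.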
All we need to do is extract entities, uniquely resolve them, and link them together. Is it as easy as it sounds? … Does it sound easy?
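To show how deceptively short those three steps look on paper, here is a deliberately naive Python sketch. Everything in it is an assumption made for illustration: the `ALIASES` table, the `ent:` identifiers, and the `co_occurs_with` predicate are hypothetical, and exact string matching stands in for real entity recognition.

```python
import re
from itertools import combinations

# Hypothetical alias table: lowercase surface form -> canonical entity ID.
ALIASES = {
    "natural language processing": "ent:nlp",
    "nlp": "ent:nlp",
    "word embedding": "ent:word_embedding",
    "machine learning": "ent:machine_learning",
    "data science": "ent:data_science",
}


def extract(sentence: str) -> list[str]:
    """Step 1: return known surface forms in order of appearance."""
    lowered = sentence.lower()
    hits = []
    for alias in ALIASES:
        m = re.search(r"\b" + re.escape(alias) + r"\b", lowered)
        if m:
            hits.append((m.start(), alias))
    return [alias for _, alias in sorted(hits)]


def resolve(mentions: list[str]) -> list[str]:
    """Step 2: map each mention to its canonical ID, dropping duplicates."""
    ids = []
    for mention in mentions:
        entity_id = ALIASES[mention]
        if entity_id not in ids:
            ids.append(entity_id)
    return ids


def link(entity_ids: list[str]) -> list[tuple[str, str, str]]:
    """Step 3: naively link every pair of entities that share a sentence."""
    return [(a, "co_occurs_with", b) for a, b in combinations(entity_ids, 2)]


text = "Word embedding is part of NLP. Machine learning and data science are related."
for sentence in text.split(". "):
    print(link(resolve(extract(sentence))))
```

Each step cuts a corner: exact matching misses any name not already in the table, an alias table cannot tell two entities with the same name apart, and co-occurrence says nothing about what kind of relation holds. Those corners are where the real difficulty starts.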
For readability I’ll be splitting this blog into a multi-part series diving into various aspects of the challenges involved with building and maintaining knowledge graphs (i.e. from algorithms to storage) while sharing some code and tips to get you started.
Projects you should know about
Knowledge Graphs are not a new idea that came out of the blue (you can actually trace it back to the late 1960s). A lot of research and many projects have been dedicated to this idea. With the wealth of projects out there it’s easy to feel some kind of vertigo; there always seems to be something out there worth knowing, a new or old obscure project you’ve never heard of (but for some reason everybody else has). Here’s a quick list of projects you should know about to get you up to speed. By all means shout out if you feel there’s a missing one you’d like to see (this is really not an exhaustive list).
- Never-Ending Language Learning (NELL): Research project from Carnegie Mellon University attempting to create a computer system that learns over time to read the web (over 50 million candidate beliefs).
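- Freebase: Deprecated since Aug 31 2016. Large collaborative knowledge base consisting of data composed mainly by its community members. Downloadable data dumps are still available.
- Probase: Probabilistic taxonomy of concepts developed at Microsoft Research; a separate project from Freebase.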
- Metaweb: The company that developed Freebase, described as an “open, shared database of the world’s knowledge”. It was acquired by Google in 2010, and most of the Freebase data was subsequently made available to Wikidata.
- Cyc: Common sense knowledge base holding vast quantities of fundamental human knowledge: facts, rules of thumb, and heuristics for reasoning about the objects and events of everyday life. Started in 1984 by Douglas Lenat. Partial open-source version available through OpenCyc.
- GDELT: Monitors news outlets from nearly every corner of every country and identifies the people, locations, organizations, events, etc., thus creating a free open platform for computing on the entire world. Supported by Google Jigsaw.
- DBpedia: Open, free and comprehensive knowledge base constantly improved through a crowd-sourced community effort to extract structured information from Wikipedia.
- YAGO: Semantic knowledge base from the Max-Planck Institute, derived from Wikipedia, WordNet, and GeoNames.
- Wikidata: Project of the Wikimedia Foundation: a free, collaborative, multilingual, secondary database, collecting structured data to provide support for all other Wikimedia projects, and beyond.
- LinkedIn’s Knowledge Graph: Built upon “entities” on LinkedIn, such as members, jobs, titles, skills, companies, geographical locations, schools, etc., forming an ontology of the professional world. Not publicly available.
- OpenIE: Quality information extraction at web scale; toolkit originating from the University of Washington.
- PROSPERA: Hadoop-based scalable knowledge-harvesting engine built around pattern-based gathering of relational fact candidates.
- Google Knowledge Vault: Knowledge base created by Google.
- ConceptNet: Freely available semantic network that originated from the crowdsourcing project Open Mind Common Sense, launched in 1999 at the MIT Media Lab.
- WordNet: Nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept.
What’s the big difference between them? Time, money, domain, approach and supporting organization. At the end of the day, extracting, disambiguating and linking the entities of the world remains an open problem. There is no one-size-fits-all solution that truly and automatically makes sense of the knowledge embedded within natural language with all of its subtleties. This has led to a considerable amount of exploration and specialization. After all, the struggle of making sense of information is one that every individual shares.
Part 2–? coming soon…
Beyond the obvious, the area of Knowledge Graphs entertains a number of open research questions related to acquisition, growth, aggregation, storage, veracity, time-dependency, search, crowdsourcing, cold start, deduplication, standardization, harmonization, and language support (to name a few). Stay tuned for the following posts, where I’ll attempt to distill some solutions from current practices and point out where we are still failing.