An Open Domain, General Knowledge Graph Database and Question Answering (Q&A) system, with Natural Language Understanding and Artificial General Intelligence aspirations (maybe)…

Wayne Powell
10 min read · Sep 1, 2019


Background

A few years back, after evaluating a number of leading-edge AI platforms, I wondered how difficult it would be to create my own Q&A system. Given the increasing amount of Open Source KB content and AI/ML-related code now freely available, I thought that I would attempt to build a system of this type. The following article details my privately funded work. The resulting system is primarily an inference capability, using pre-trained AI/ML software from a number of Open Source contributors, plus my bespoke ‘glue’ to hold everything together.

This article provides a brief, shallow dive summary of the resulting system.

Experiment

I have spent a couple of years experimenting with Open Source Knowledge Bases (mainly DBpedia dumps) plus other Knowledge Base sources. After writing a large number of data cleansing and manipulation filters, I loaded the resulting data into a large Graph DB.

Once I had created a number of effective data filters, it became increasingly easy to convert and ingest more and more data from various internet sources. There was a natural tendency to ingest just about everything that I came across, even when it was of quite low value. However, this soon became self-defeating, as the quality of search query results from the Graph dropped in line with the quality of the data being ingested. Too much low-quality ‘stuff’ != a better Graph.

So, rule number one: use as high quality of a data set as possible!

Another rule that I quickly learnt during my initial experiments was to provide as much richness as possible, in terms of relationships and relationship naming. This really comes into its own later on when performing inference and query functions. It is much quicker and easier to make sense of the data through rich relationships than through bare interconnected nodes.

Late last year, spurred on in part by a number of Medium articles, I decided to start again with a mainly clean sheet. I modified my ingest and formatting scripts to add in as much extra value as possible when building the Graph, and I began to write a user interface into the Graph, allowing admin functions as well as Natural Language (NL) queries to be run.

So, regarding the approach: the Graph is constructed using Neo4J, which IMHO provides an excellent platform for AI/ML activities, being both performant and parallel. My ingest scripts, after cleansing and formatting the data, write Cypher code; lots of it. The Cypher code creates constructs such as:

Node_A →[unique_rel]→ Node_B, where ‘unique_rel’ might be a relationship type such as type_of, located_in or also_known_as.

I have over 1000 relationship types which are automatically created by the ingest scripts. Any node can be linked to any other node and each node is unique and may contain many properties.
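To give a flavour of this, below is a minimal sketch of the kind of generation step the ingest scripts perform, turning one cleansed (subject, relation, object) triple into a line of Cypher. The `Entity` label, the `emit_cypher` helper and the example triple are illustrative assumptions, not my actual ingest code.

```python
# Minimal sketch: turning one cleansed triple into a Cypher statement.
# Labels, property names and the helper itself are illustrative only;
# real ingest code would also escape quotes or use query parameters.

def emit_cypher(subject: str, relation: str, obj: str) -> str:
    """Emit a MERGE statement so each node stays unique and the relationship
    type (type_of, located_in, also_known_as, ...) becomes a first-class edge."""
    return (
        f"MERGE (a:Entity {{name: '{subject}'}}) "
        f"MERGE (b:Entity {{name: '{obj}'}}) "
        f"MERGE (a)-[:{relation}]->(b);"
    )

print(emit_cypher("Brussels", "located_in", "Belgium"))
```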

Ingest scripts are bespoke to each data type, i.e. DBpedia, OpenAI and many other Open Source KB resources. Once the ingest scripts have created their Graph input files (which comprise billions of lines of Cypher code), the input files are read into the Graph. I normally run 50 to 60 ingest streams in parallel. Due to the amount of data being imported, the ingest scripts take about a month to create the Graph input files and a further six weeks to load these into the Graph DB. Although this sounds like a lot of time and effort, this approach makes it relatively easy to monitor the Graph for issues as it is being constructed. Such problems are normally identified by using the Neo4J viewer to look for anomalies. This is, of course, a bit of a ‘needle in a haystack’ approach, but we are mainly looking for endemic issues.
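As a rough illustration of the parallel loading stage, the sketch below runs a pool of worker processes, each feeding one pre-generated Cypher file into Neo4j via the official Python driver. The file layout, credentials and naive statement splitting are assumptions for illustration; my actual ingest streams are bespoke scripts.

```python
# Sketch: loading pre-generated Cypher input files in parallel streams.
# Connection details, file layout and the ';' split are illustrative.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from neo4j import GraphDatabase

URI, AUTH = "bolt://localhost:7687", ("neo4j", "password")

def load_file(path: Path) -> str:
    """Run every ';'-terminated Cypher statement found in one input file."""
    driver = GraphDatabase.driver(URI, auth=AUTH)
    with driver.session() as session:
        for statement in path.read_text().split(";"):
            if statement.strip():
                session.run(statement)
    driver.close()
    return path.name

if __name__ == "__main__":
    files = sorted(Path("graph_input").glob("*.cypher"))
    # 50 to 60 ingest streams in parallel, as described above.
    with ProcessPoolExecutor(max_workers=50) as pool:
        for done in pool.map(load_file, files):
            print("loaded", done)
```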

Whenever I find new sources of data to ingest, I test the formatting and ingest scripts in a development environment first, before committing to the production database. Backing out isn’t really an easy option!

Another thing worthy of note is that the system is multi-lingual, with around 10 languages incorporated into the DB (English being the main language).

In addition, the system can also ingest specific texts; for each such text it creates a specific domain, which links out to the wider DB. Questions and queries can be directed specifically at these domains, and domains may be added or removed without any interference to the main DB.

Once the main ingests are complete, we are left with in the region of 70 million nodes, 10 million (and growing) node properties and 1.3 billion relationships.

Ongoing AI/ML Processing

Rather than having a static database, I was keen to construct a number of ongoing processing activities to enhance the DB as far as possible. These activities can be summarized as:

  1. The addition of images to nodes. Node names are run through the Google image search engine, the top-ranking image is downloaded into an Object Store and then linked to the source node as a property link.
  2. The application of the Stanford NLP suite to the nodes. Each node is scanned to determine Sentiment, produce Lemmas and run Named Entity Recognition (NER). The output of this processing creates additional nodes, links to existing nodes and relationships, all linked back to the source node. This function can greatly increase the capability and effectiveness of the Q&A system (a rough sketch of this step follows the list).
  3. The addition of RSS News feeds. The system continually scans high-value global news feeds, creating news summaries to which the Stanford NLP processing is applied (as in item 2). These news articles are stored in a ‘Temporal’ form in the graph.
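To make item 2 concrete, here is a rough sketch of annotating one node’s text against a locally running Stanford CoreNLP server. The endpoint, the sample sentence and the shape of the returned dictionary are assumptions; only the CoreNLP HTTP interface itself is real. The extra nodes and relationships derived from this output are then written back to the Graph in the same MERGE style as the ingest scripts.

```python
# Sketch: Sentiment, Lemma and NER annotation via a local CoreNLP server.
# The endpoint and the returned structure are illustrative assumptions.
import json
import requests

CORENLP = "http://localhost:9000"
PROPS = {"annotators": "tokenize,ssplit,pos,lemma,ner,sentiment",
         "outputFormat": "json"}

def annotate(text: str) -> dict:
    """Return per-sentence sentiment, lemmas and named entities for a node's text."""
    resp = requests.post(CORENLP, params={"properties": json.dumps(PROPS)},
                         data=text.encode("utf-8"))
    doc = resp.json()
    lemmas, entities, sentiments = [], [], []
    for sentence in doc["sentences"]:
        sentiments.append(sentence.get("sentiment"))
        for tok in sentence["tokens"]:
            lemmas.append(tok["lemma"])
            if tok["ner"] != "O":          # keep only recognised entities
                entities.append((tok["word"], tok["ner"]))
    return {"sentiment": sentiments, "lemmas": lemmas, "entities": entities}

print(annotate("Jimi Hendrix was a skilled guitarist from Seattle."))
```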

This ongoing processing ensures that the system is ‘current’ in terms of its facts and, through the NER function, attempts to further ‘understand’ the contents of the DB.

Question and Answer Interface

This is where Stanford University’s NLP suite comes to the rescue again, with its CoreNLP server. I have also written bespoke code to ‘glue’ the system together. In summary, questions are entered in Natural Language, tokenised, and scanned for NER and Lemmas. The Graph is then searched for Entity nodes that relate to these tokens. A fuzzy approach is applied, as not all tokens will produce a ‘hit’. The system then accumulates the resulting nodes and compares them to the question, trying to decide how relevant each node is to the question. Nodes may in fact be questions in their own right, with answers linked by a relationship; in this event, the question will be matched with the answer. First and second order relationships may be considered where practical. All of the responses are ranked and the highest ranked is selected.
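The sketch below captures the overall shape of this flow, re-using the annotate() helper from the earlier CoreNLP sketch. The Cypher query, the node properties and the simple string-similarity ranking are illustrative assumptions; the real glue code is considerably more involved (fuzzy matching, first and second order relationships, question-to-answer links, and so on).

```python
# Sketch of the Q&A flow: tokenise the question, fuzzy-match Entity nodes,
# then rank candidates against the question. Purely illustrative scoring.
from difflib import SequenceMatcher
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def candidate_nodes(tokens):
    """Fuzzy-match question tokens against node names in the Graph."""
    cypher = ("MATCH (n:Entity) "
              "WHERE any(t IN $tokens WHERE toLower(n.name) CONTAINS toLower(t)) "
              "RETURN n.name AS name, n.description AS text LIMIT 500")
    with driver.session() as session:
        return [r.data() for r in session.run(cypher, tokens=tokens)]

def answer(question: str) -> str:
    ner = annotate(question)["entities"]          # from the CoreNLP sketch above
    tokens = [word for word, _ in ner] or question.split()
    ranked = sorted(
        candidate_nodes(tokens),
        key=lambda n: SequenceMatcher(None, question.lower(),
                                      (n["text"] or "").lower()).ratio(),
        reverse=True)
    return ranked[0]["text"] if ranked else "no answer found"
```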

The system’s ability to answer questions is greatly improved by the ongoing Lemma and NER processing. This ongoing learning and ‘understanding’ approach allows the system to improve its ability to answer questions over time, which for me is a key benchmark and example of ML and AI at work. The system maintains a list of test questions that it runs from time to time, and I monitor the answers to gauge its rate of learning. Of course, things aren’t often that straightforward. It’s a complex system, and sometimes it gets answers wrong that it previously got right. This is often due to shorter, but incorrect, answers being propagated up the rankings stack, or sometimes to system load. A single question may result in hundreds of queries being submitted into the system concurrently, and may take from under a second to twenty minutes to run. There are many factors which influence this run-time that I won’t go into in this article, but it is worth noting that the entire Graph fits into the machine’s memory, so IO rates aren’t an issue.
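As an illustration of the self-test, something along these lines re-runs a fixed question set and tracks the hit rate over time; the two sample questions are taken from the examples later in this article, and the harness itself is a hypothetical sketch rather than my monitoring code.

```python
# Sketch: periodic self-test to gauge the system's rate of learning.
import datetime

TEST_QUESTIONS = [
    ("what is the capital minnesota ?", "st paul"),
    ("are plums purple ?", "yes"),
]

def run_self_test(answer_fn) -> float:
    """Re-run the fixed question set and report the fraction answered correctly."""
    correct = sum(1 for q, expected in TEST_QUESTIONS
                  if expected in answer_fn(q).lower())
    score = correct / len(TEST_QUESTIONS)
    print(f"{datetime.date.today()}: {correct}/{len(TEST_QUESTIONS)} "
          f"({score:.0%}) test questions answered correctly")
    return score
```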

Sometimes queries will timeout and therefore the correct result (answer) will never make it back to the ranking stack. *Perhaps I need a bigger computer?!*

Graphical Output

I use the excellent vasturiano/3d-force-graph code to produce GUI output as a side-product of questions/queries. In the example below, the system is asked to summarize the news of the past #days:

Temporal News Query output

In another example, when the system is asked to ‘think’ about the Earth, this is what it sees:

Specific Node entity view of ‘Earth’

This model allows the user to rotate, fly through, expand, zoom, etc. the Graph output. Hovering over the nodes reveals the node description. If a node has an image property, then it is displayed; if not, then just a sphere is displayed. Relationships are also displayed and annotated.
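For the curious, 3d-force-graph consumes a simple { nodes, links } structure, so the GUI side mostly reduces to exporting a query result in that shape. The sketch below shows how such an export might look (the Cypher query and property names are assumptions); it is not the actual GUI code.

```python
# Sketch: exporting a node's neighbourhood as the { nodes, links } JSON that
# vasturiano/3d-force-graph expects. Query and property names are assumptions.
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def export_subgraph(root: str) -> str:
    cypher = ("MATCH (a:Entity {name: $root})-[r]->(b) "
              "RETURN a.name AS src, type(r) AS rel, b.name AS dst, b.image AS img")
    nodes, links = {}, []
    with driver.session() as session:
        for rec in session.run(cypher, root=root):
            nodes.setdefault(rec["src"], {"id": rec["src"]})
            nodes.setdefault(rec["dst"], {"id": rec["dst"], "img": rec["img"]})
            links.append({"source": rec["src"], "target": rec["dst"],
                          "label": rec["rel"]})
    return json.dumps({"nodes": list(nodes.values()), "links": links})

# The resulting JSON can be handed to ForceGraph3D().graphData(...) in the browser.
```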

Example Questions and Answers

The following raw output shows the extent and nature of the type of questions that can be answered by the system (or not in some cases):

Question is: what is belgium ? The answer is: a kingdom in w europe on the north sea its independence was established in 1831 it is 11792 square miles with a population of 9868000

Question is: what is the population of the usa ? The answer is: the united states has about 300 million people

Question is: what is the capital minnesota ? The answer is: st paul

Question is: what is progress ? The answer is: advancement

Question is: what is business value ? The answer is: something that can make a profit

Question is: was jimi hendrix a skilled guitarist ? The answer is: yes

Question is: are plums purple ? The answer is: yes

Question is: which university has developed an automated process that can track and map active neurons using a a deep learning algorithm for us in behavioral studies ? The answer is: biomedical engineers at duke university have developed an automated process that can track and map active neurons as accurately as a human can but in a fraction of the time this new technique based on a deep learning algorithm addresses a critical roadblock in neuron analysis allowing researchers to rapidly gather and process neuronal signals for real-time behavioral studies

Question is: what have landfill engineers come up with ? The answer is: each year billions end up in landfill but engineers have come up with a way to recover the plastic and other materials inside them

Question is: Is Animalia an illustrated childrens book? The answer is: animalia is an illustrated childrens book by graeme base

Question is: Which person has a cambridge title and a widely recognised expert in the field of Artificial Intelligence? Answer is: taylor a cambridge physicist and a widely recognised expert in the field of artificial intelligence in europe

Question is: Who is set to crush the Jeopardy record with his strategy? Answer is: sports bettor james holzhauers unusual strategy is paying off as he crushes jeopardy records

Question is: What year was the horror film Carrie first screened? Answer is: 1976

And a couple of examples of obvious fails:

Question is: Would Boris Johnson be a good Prime Minister ? Answer is: boris johnson would be a disaster as prime minister

Question is: What happened in Orkney Harbour this week? Answer is: columbus arrived in america

On the whole, I normally see a 90%+ success rate on non-open-ended questions.

Emotional Model

I have started to construct the beginnings of an ‘Emotional Model’ to help drive the system. Although this is very much early work, it seemed clear to me that, rather than just being a Q&A system, an AGI would require high-level goals as well as constraints.

In terms of goals, the system should:

  1. Continually ingest as much quality knowledge as possible, through the ingestion of ongoing news articles.
  2. Refine this knowledge through NLP processing activities.
  3. Supplement the data through the ingestion of related images.

*Okay, so what about constraints?*

So, for my system, there are some practical constraints:

  1. Temperature — The system must sit within a certain temperature envelope at all times; it has a PUE of 1, as it is located in a facility with no air-conditioning. It therefore has to auto-throttle its processing to remain at safe temperatures.
  2. Utilisation — As this is a shared system, it cannot consume all system resources, and therefore has to throttle itself accordingly.

The system constantly scores itself against how well it is doing regarding the above activities.

So, if the system is reading plenty of news articles, the sentiment of those articles is generally positive, the system’s temperature is low and there isn’t much other work in the system (so its utilization is good), then the system’s sentiment is generally very positive. There is nothing worse than a grumpy AI system; in that case, it decides that it doesn’t really want to answer resource-hungry NL questions!
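A very rough sketch of how such a ‘mood’ score might be combined is shown below; the weights, thresholds and signal names are all illustrative assumptions rather than the model actually in use.

```python
# Sketch of the 'Emotional Model' scoring: reward fresh, positive news and
# penalise high temperature and high utilization. Weights are assumptions.

def system_mood(articles_ingested: int, avg_news_sentiment: float,
                cpu_temp_c: float, utilization: float) -> float:
    """Return a mood score roughly between -1 and 1; negative means 'grumpy'.
    avg_news_sentiment is assumed to be on a -1..1 scale, utilization on 0..1."""
    knowledge = min(articles_ingested / 100.0, 1.0)   # ingesting enough knowledge?
    comfort = 1.0 if cpu_temp_c < 70 else -1.0        # inside the thermal envelope?
    idle = 1.0 - utilization                          # spare capacity on the shared box
    return 0.4 * knowledge + 0.3 * avg_news_sentiment + 0.2 * comfort + 0.1 * idle

def will_answer_question(mood: float) -> bool:
    # A grumpy system declines resource-hungry NL questions.
    return mood > 0.0
```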

Future Aspirations

I have the following ideas for future enhancements:

  1. The addition of Temporal Memories: identifying ‘Novelty’ and therefore recording a high-amplitude memory event.
  2. Image analysis: providing additional properties that can be linked to the database.
  3. Broadening the ‘Emotional Model’ and using it to further drive aspects of the system.
  4. The addition of new, high-quality data sources, i.e. broadening its knowledge further.
  5. Voice input as a question interface, plus real-time video object classification such that the system can recognize me.

Collaboration

Due to its size (in the region of 1TB for data sources, Cypher and the Graph DB), plus the broad nature of its data sources, it is difficult to imagine this project being placed into the Open Source community as it currently exists.

However, I would consider any offers of help and collaboration from interested parties who have experience in the areas mentioned in this article and perhaps wider.

I’m always keen to hear about any innovative ideas regarding enhancements or the application of this technology, so if you have any comments then please don’t hesitate to get in touch @ wayne.powell2001@gmail.com


Wayne Powell

A computer scientist with a particular passion for AI, ML, HPC and grand-challenge problems