Graphs Part 2: How Graphs are used in Unsupervised Language Learning

In this second part of the series, we take a deeper look at how graphs are used in SingularityNET, OpenCog, and Aigents.

Aigents with Anton Kolonin
SingularityNET
5 min read · Jul 23, 2018


Introduction

The aim of this four-part research series is to cover how we visualize data using graphs, both as network diagrams and as data structures.

In the first part of this series: How SingularityNET Will Leverage OpenCog & Aigents, we provided an introductory overview of the different types of graphs. We then covered some unique capabilities of OpenCog, used in SingularityNET’s infrastructure as a generic graph storage and processing system. Finally, we presented some basic graph visualization options supported by the Aigents Graphs framework.

In this part of the series, we explain how graphs are used in the Unsupervised Language Learning (ULL) projects of SingularityNET, OpenCog, and Aigents. Our research is based on an initial idea of Ben Goertzel and Linas Vepstas, which has since been advanced by the work of SingularityNET’s AI team.

We hope that this article will be of interest not only to computational linguists and those in the Natural Language Processing (NLP) domain, but that it will prove enlightening to a wider audience. Before diving in, it is essential that we first explain the basic concepts behind the Link Grammar Parser.

Our approach, at its core, is based on the use of “disjuncts,” a term coming from Link Grammar theory. Simply put, a disjunct is the set of connector types of a word: the ways it can link to other words. A disjunct can represent either a specific lexical entry (like the word “cat”) or an abstract one (like any animal, or any noun) in a data structure that relates it to the other lexical entries associated with it in the language of study.

Example of “disjuncts,” identifying different lexical entries for articles, nouns, proper nouns, and verbs in different tenses, along with an illustration of how these “disjuncts” can be composed in a sentence.
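To make the idea of a disjunct more concrete, here is a minimal sketch in Python. The class name, field names, and connector labels (`Ds-`, `Ss+`, following common Link Grammar conventions for determiner and subject links) are our own illustrative choices, not code from the ULL project:

```python
# Illustrative sketch: a disjunct as the ordered sets of connector types
# a word uses to link to words on its left and on its right.
from dataclasses import dataclass

@dataclass(frozen=True)
class Disjunct:
    left: tuple   # connectors linking to words on the left, e.g. ("Ds-",)
    right: tuple  # connectors linking to words on the right, e.g. ("Ss+",)

# "cat" as a singular noun: it takes a determiner on its left (Ds-)
# and links as a subject to a verb on its right (Ss+).
cat_noun = Disjunct(left=("Ds-",), right=("Ss+",))
print(cat_noun.left, cat_noun.right)
```

An abstract lexical entry (say, “any singular noun”) would carry the same disjunct, which is exactly what lets disjuncts generalize from specific words to word categories.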

While the project is being actively developed on GitHub, we have created the ULL project website to provide more details about it. With the help of the Aigents Graphs framework, we were able to develop the Unsupervised Language Learning Graphs Demo. We will be referencing and sharing screenshots from this Demo throughout the rest of the article.

Graphs and Unsupervised Language Learning

The first step, in our current ULL approach, is the collection of Mutual Information (MI) between word pairs from the corpus of texts used for language grammar acquisition. The graph below renders a network of mutual information links between the word nodes.

Such a graph can be created online here.

A sub-graph of mutual information links around the verb “is” and pronoun “his” in a sample English corpus, with the width of a visual link representing the amount of mutual information between the words.
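The MI collection step can be illustrated with a small sketch. This toy example counts adjacent word pairs and applies the standard pointwise mutual information formula, MI(a, b) = log2(P(a, b) / (P(a) · P(b))); the corpus, the adjacency-only pairing, and the function names are our simplifications, and the actual ULL pipeline does its counting inside OpenCog:

```python
# Toy sketch of collecting mutual information between word pairs.
from collections import Counter
from math import log2

sentences = [["the", "cat", "is", "small"],
             ["the", "dog", "is", "big"]]

word_counts, pair_counts = Counter(), Counter()
total_words = total_pairs = 0
for s in sentences:
    word_counts.update(s)
    total_words += len(s)
    for a, b in zip(s, s[1:]):       # count adjacent word pairs
        pair_counts[(a, b)] += 1
        total_pairs += 1

def mi(a, b):
    """Pointwise mutual information of the ordered pair (a, b)."""
    p_ab = pair_counts[(a, b)] / total_pairs
    p_a = word_counts[a] / total_words
    p_b = word_counts[b] / total_words
    return log2(p_ab / (p_a * p_b))

print(round(mi("the", "cat"), 3))
```

In the rendered graph, an MI value like this would become the width of the visual link between two word nodes.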

An optional step in the ULL pipeline is the Word Sense Disambiguation (WSD) process. In this process, the linguistic contexts of every occurrence of every word are taken into consideration, and then an attempt is made to identify the specific sense of a given occurrence. To better understand the process, consider the examples below, which were used to render the accompanying graph.

To understand why the WSD process may be necessary, let us consider some examples of ambiguity. For instance, there is the semantic ambiguity of the word “board,” which can refer to a “board of directors,” the action “to board a ship,” or a “white writing board.”

We may also have grammatical ambiguity, where a single word form admits two or more grammatical readings. For example, the word “saw” may refer either to the noun indicating a carpenter’s tool or to the past tense of the verb “see.”

The graph example of these ambiguities is shown below. Such a graph can also be created online over here.

Sub-graph identifying three different semantic senses of the word “board” and two different grammatical senses of the word “saw,” for the same sample English corpus.
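The intuition behind WSD can be sketched as follows: each occurrence of an ambiguous word is represented by its neighboring words, and occurrences with similar contexts are assumed to share a sense. The tiny corpus and the `contexts` helper below are hypothetical illustrations, not the project's actual WSD code:

```python
# Sketch of the WSD intuition: collect the context of every occurrence
# of an ambiguous word; similar contexts suggest a shared sense.
sentences = [
    ["the", "board", "of", "directors", "met"],
    ["passengers", "board", "the", "ship"],
    ["write", "on", "the", "white", "board"],
]

def contexts(word):
    """Return the immediate neighbours of every occurrence of `word`."""
    result = []
    for s in sentences:
        for i, w in enumerate(s):
            if w == word:
                result.append(tuple(s[max(0, i - 1):i] + s[i + 1:i + 2]))
    return result

print(contexts("board"))
```

Here the three occurrences of “board” produce three distinct contexts, which a clustering step could then group into the three semantic senses shown in the graph above.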

The step after the Mutual Information computation involves the Minimum Spanning Tree (MST) Parsing using the OpenCog MST Parser.

The Parser uses the MI values computed during the first step and tries to parse each input sentence in the corpus, finding the spanning tree that maximizes the total mutual information across all links in the parse tree.

The graph below renders the cumulative amounts of word-to-word linkages across multiple parsed sentences, and links between the individual words and input sentences.

Such a graph can be created online here.

Sub-graph of word-to-word linkages for several parses produced by the MST Parser. Most links connect the “LEFT-WALL” (the marker for the beginning of a sentence) and the end-of-sentence period (“.”) as well as the verbs “is”, “was” and “likes”; fewer links occur between these verbs and the adverbs “before” and “now.”
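The core of MST parsing can be sketched with Prim's algorithm run to maximize rather than minimize edge weight. The sentence, the pairwise MI values, and the variable names below are made up for illustration; the real OpenCog MST Parser works on corpus-derived MI and handles sentence structure more carefully:

```python
# Illustrative maximum spanning tree parse: choose word-to-word links
# that connect every word while maximizing total mutual information.
words = ["the", "cat", "is", "small"]
mi = {                      # assumed, symmetric pairwise MI scores
    ("the", "cat"): 2.4, ("cat", "is"): 1.8, ("is", "small"): 2.1,
    ("the", "is"): 0.3, ("cat", "small"): 0.9, ("the", "small"): 0.1,
}

def score(a, b):
    return mi.get((a, b), mi.get((b, a), 0.0))

# Prim's algorithm, maximizing instead of minimizing edge weight.
in_tree, links = {words[0]}, []
while len(in_tree) < len(words):
    a, b = max(((a, b) for a in in_tree for b in words if b not in in_tree),
               key=lambda e: score(*e))
    in_tree.add(b)
    links.append((a, b))

print(links)
```

With these scores the tree keeps the high-MI links the–cat, cat–is, and is–small, and discards the weak the–small link, which is exactly the behavior the parse-tree objective calls for.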

After the parsing phase is done, the produced linkages are used to create disjuncts, which are fed into the Grammar Learner component of the ULL pipeline to try to build a category tree of the words used across the parses.
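The step from parse linkages to disjuncts can be sketched as follows. The data shapes here are assumed for illustration: for each word we simply gather its link partners on the left and on the right, producing the raw material the Grammar Learner clusters into word categories:

```python
# Sketch: derive per-word disjuncts (left/right link partners)
# from the linkages of one MST parse.
links = [("the", "cat"), ("cat", "is"), ("is", "small")]
order = {w: i for i, w in enumerate(["the", "cat", "is", "small"])}

disjuncts = {}
for a, b in links:
    for w, other in ((a, b), (b, a)):
        side = "left" if order[other] < order[w] else "right"
        disjuncts.setdefault(w, {"left": [], "right": []})[side].append(other)

print(disjuncts["cat"])
```

Aggregated over many parses, and with concrete words replaced by their categories, such left/right partner sets become the connector-based disjuncts described earlier.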

The following graph provides one of the possible options for rendering such a categorial tree; other options can be tried online as well.

Hierarchical graph representing a categorial tree (not entirely precise, as the current ULL technology is still maturing) for the sample English corpus. One can see that “daughter” and “son” are correctly gathered under a super-category, which is in turn included in a higher-level category together with “dad” and “mom”; the other groups of nouns are also clearly identifiable.

The final stage of the ULL pipeline, also performed by the Grammar Learner, is grammar induction, which infers a grammar for the language represented by the input corpus and exports it in the Link Grammar dictionary format. Details of the dictionary format can be found, along with the code and documentation of the OpenCog Link Grammar project, over here.
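For a feel of the export target, a tiny dictionary in the Link Grammar format might look as follows. This is a hand-written sketch, not actual Grammar Learner output; the connector names follow common Link Grammar conventions (D for determiner links, S for subject links, O for object links, A for adjective links):

```
% Illustrative Link Grammar dictionary fragment (not real learner output).
% Words left of ":" share the disjunct expression on the right;
% "+" connectors link rightward, "-" connectors link leftward.
"the": D+;
"cat" "dog": {@A-} & Ds- & Ss+;
"is" "was": Ss- & (O+ or A+);
```

Each line groups words into a category and states, via its disjunct expression, which connectors those words may use, mirroring the category and connector layers of the graph described below.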

The graph below renders one of the possible grammars inferred using the existing ULL Grammar Learner code. The graph can be tested online over here.

Graphical representation of the Link Grammar for a sample dictionary learned programmatically by the ULL Grammar Learner. The graph consists of four horizontal layers. The bottom layer contains words, while the layer above it contains categories of words corresponding to Link Grammar rules or lexical entries. The third layer contains disjuncts agglomerating combinations of grammatical connections between categories of words (called “connectors”). The topmost layer contains the connectors themselves, connecting different word categories in the contexts identified by their respective disjuncts.

All of the graphs above have been rendered with SingularityNET ULL Graphs powered by the Aigents Graphs, available on GitHub under the Open Source MIT License.

For more information on ULL Graphs, watch the following introductory video.

Unsupervised Language Learning Graphs are based on the Aigents Graphs framework used in SingularityNET and OpenCog.

In a subsequent publication, we will learn about OpenCog hyper-graphs and meta-graphs. For instance, the hierarchy of OpenCog Atoms can be explored with this online graph demo, partially rendered in the following sub-graph.

Partial view of the OpenCog Atom Type hierarchy, rendered directly from the sources in the OpenCog GitHub repository.

What’s next?

In the next parts of this series, we will discuss Reputation Graph analysis and talk more about representations with Aigents Graphs and Aigents Reputations.

You can visit our Community Forum to chat about the research mentioned in this post. Over the coming weeks, we hope to not only provide you with more insider access to SingularityNET’s groundbreaking AI research but also to share with you the specifics of our development.

For any additional information, please refer to our roadmaps and subscribe to our newsletter to stay informed regarding all of our developments.
