Neo4j: Natural Language Processing (NLP) in Cypher

David Allen
Neo4j Developer Blog
7 min read · Nov 15, 2017

In Neo4j 3.0, the database introduced user-defined procedures that you can call directly from the Cypher language. Cypher was already a pretty good language before this, but things really started blowing up once people began writing code on top of Neo4j that lets you do just about anything from Cypher directly. This article gives an example of using Natural Language Processing (NLP) inside of Cypher to show how you can draw meaning out of text in graphs, and is aimed at people who may be new to NLP.

As an example, we’ll work through how to find positive and negative sentences in Donald Trump’s Twitter feed, showing techniques that can be used on any text. To do this, I used both the APOC procedures for Neo4j and GraphAware’s neo4j-nlp procedures, and installed the relevant JARs into Neo4j’s plugins directory.

Neo4j-NLP Setup: make sure to follow the directions on the GitHub page. You will need at least 4 JARs, and to add some configuration to neo4j.conf. Finally, after starting the database you’ll need to create a default pipeline, which is covered in their setup documentation.
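To give a rough sense of what that configuration looks like, the neo4j.conf additions in GraphAware’s README looked approximately like the lines below when I set this up — treat this as a sketch and copy the exact, current lines from their documentation:

```
# Expose the GraphAware server extension and enable the GraphAware runtime
dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true

# Bootstrap the NLP module so the ga.nlp.* procedures become available
com.graphaware.module.NLP.2=com.graphaware.nlp.module.NLPBootstrapper
```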

For this tutorial, the steps we’ll go through:

  1. Load data on tweets
  2. Break up hashtag / user replies and link them to the tweets
  3. Apply some basic NLP approaches to tag which words and concepts he’s tweeting about
  4. Apply some sentiment analysis provided by neo4j-nlp to determine what he’s feeling positive about, and what he’s not so positive about.

Step 1: The apoc.load.json function lets us get data into neo4j directly from JSON. The data URLs come from the excellent Trump Twitter Archive and are kept up to date with all of his tweets. By unwinding an array of data URLs, we can load all of the files in a single shot.

/* Indexes to speed up the MERGE and MATCH operations used below */
CREATE INDEX ON :User(name);
CREATE INDEX ON :Tweet(text);
CREATE INDEX ON :Hashtag(name);

/* Load one JSON file per year, creating a Tweet node per entry
   and linking it to the source app it was sent from */
UNWIND [
'http://www.trumptwitterarchive.com/data/realdonaldtrump/2019.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2018.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2017.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2016.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2015.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2014.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2013.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2012.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2011.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2010.json',
'http://trumptwitterarchivedata.s3-website-us-east-1.amazonaws.com/data/realdonaldtrump/2009.json'
] AS url
CALL apoc.load.json(url) YIELD value AS t
MERGE (s:Source { name: t.source })
CREATE (tweet:Tweet {
    id_str: t.id_str,
    text: t.text,
    created_at: t.created_at,
    retweets: t.retweet_count,
    favorites: t.favorite_count,
    retweet: t.is_retweet,
    in_reply: coalesce(t.in_reply_to_user_id_str, '')
})
CREATE (tweet)-[:from]->(s)
RETURN count(t);

This gives us all 37,000+ tweets that Donald Trump’s account sent, starting in 2009. Our graph right now is extremely simple, only connecting a single tweet to the source it was sent from, like this:
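For instance, once the load finishes, a one-line aggregation over the :from relationship we just created shows how many tweets came from each source app:

```cypher
MATCH (t:Tweet)-[:from]->(s:Source)
RETURN s.name AS source, count(t) AS tweets
ORDER BY tweets DESC;
```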

Step 2: We can go ahead and punch up the data a little bit by extracting hash tags and user mentions from the data, creating those as separate nodes, and linking them to the relevant tweets, like this:

/* Hashtag Analysis: extract every #hashtag and link it to its tweet.
   apoc.text.regexGroups returns one list per match, where index 1 is
   the capture group, so we pull g[1] from each match to get all
   hashtags in the tweet, not just the first. */
MATCH (t:Tweet)
WHERE t.text =~ ".*#.*"
WITH t,
     [g IN apoc.text.regexGroups(t.text, "(#\\w+)") | g[1]] AS hashtags
UNWIND hashtags AS hashtag
MERGE (h:Hashtag { name: toUpper(hashtag) })
MERGE (h)<-[:hashtag { used: hashtag }]-(t)
RETURN count(h);

/* User Mention Analysis: the same approach for @mentions */
MATCH (t:Tweet)
WHERE t.text =~ ".*@.*"
WITH t,
     [g IN apoc.text.regexGroups(t.text, "(@\\w+)") | g[1]] AS mentions
UNWIND mentions AS mention
MERGE (u:User { name: mention })
MERGE (u)<-[:mention]-(t)
RETURN count(u);

Looking at an individual tweet, we can now see it’s linked appropriately. By hashtag and by user mention, then, we can look at whom Donald Trump tweets at most often, and which topics he tweets about most.

A simple tweet related to hashtags and user mentions
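Answering those “most often” questions is now a simple aggregation over the relationships we just created — for example, the top ten mentioned users and the top ten hashtags:

```cypher
MATCH (u:User)<-[:mention]-(t:Tweet)
RETURN u.name AS user, count(t) AS mentions
ORDER BY mentions DESC LIMIT 10;

MATCH (h:Hashtag)<-[:hashtag]-(t:Tweet)
RETURN h.name AS hashtag, count(t) AS uses
ORDER BY uses DESC LIMIT 10;
```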

Step 3: OK, to get some meaning out of this raw text, we’ll first need to “annotate” the language. What’s happening here is that the text of the tweet gets broken up into individual words, certain common words are eliminated, English grammar is used to work out what’s a noun and what’s a verb, and the structure of each sentence is stored in the Neo4j graph. All of this is simple because of the neo4j-nlp extension in Cypher, and works like this:

/* Detect language and update each tweet with that information */
MATCH (t:Tweet)
CALL ga.nlp.detectLanguage(t.text)
YIELD result
SET t.language = result
RETURN count(t);
/* Annotate all text that's detected as English, as the underlying library may not support things it detects as non-English */
MATCH (t:Tweet { language: "en" })
CALL ga.nlp.annotate({text: t.text, id: id(t)})
YIELD result
MERGE (t)-[:HAS_ANNOTATED_TEXT]->(result)
RETURN count(result);

All of the heavy lifting is done by GraphAware’s ga.nlp.annotate procedure. This will create a large number of new nodes in your graph. Every Tweet will be associated with an AnnotatedText node, which in turn will be linked further to Tag and Sentence nodes. For example, if we annotate the sentence “See you in the Supreme Court!”, it will be broken down into “tags” like “see” (which is a verb) and “Supreme Court” which is a noun. We can tell because the associated tag node has a property called “pos” (Part of Speech) which has the value NNP (a proper noun). Here’s what the graph looks like for a relatively simple sentence:

Showing the results of annotating text

Tags also occur at particular spots in sentences, so you’ll also see TagOccurrence nodes which indicate where in the sentence each tag appears.
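You can browse the annotation graph for yourself with a query like the one below. Note the CONTAINS_SENTENCE and HAS_TAG relationship names here are my recollection of the neo4j-nlp schema and may differ in your version (the HAS_ANNOTATED_TEXT relationship is the one we created ourselves above):

```cypher
MATCH (t:Tweet)-[:HAS_ANNOTATED_TEXT]->(:AnnotatedText)
      -[:CONTAINS_SENTENCE]->(s:Sentence)-[:HAS_TAG]->(tag:Tag)
RETURN s.text AS sentence, collect(tag.value) AS tags
LIMIT 5;
```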

OK, this gets us the words and their functions within the sentence, but it still doesn’t say much about meaning. We need to go two steps further. The first is to “enrich” the concepts, using the ConceptNet5 API.

MATCH (n:Tag)
CALL ga.nlp.enrich.concept({tag: n, depth:2, admittedRelationships:["IsA","PartOf"]})
YIELD result
RETURN count(result);

This yields a set of additional relationships that relate the tags together, so we can tell what’s similar to what. It also labels a lot of nodes with additional categories, such as NER_Person, NER_Location, NER_Organization, and so on, which allows us to classify our tags by what kind of thing they are. At this point we’re better off: instead of just knowing “Supreme Court” is a noun, we know it refers to an organization.
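A quick way to see how the enrichment categorized things is to count tags per NER label. This sketch only assumes the three NER_* labels mentioned above; a tag can carry more than one, so we simply report the first matching label:

```cypher
MATCH (t:Tag)
WHERE t:NER_Person OR t:NER_Location OR t:NER_Organization
RETURN [l IN labels(t) WHERE l STARTS WITH 'NER_'][0] AS category,
       count(t) AS tags
ORDER BY tags DESC;
```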

This lets us write queries like the below, which will fetch all of the tweets where the language in the tweet is talking about the person concept “Clintons”.

MATCH (:NER_Person {value: 'clintons'})-[]-(s:Sentence)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
RETURN distinct tweet.text;
Tweets about the Clintons

Step 4: OK. At this point we’ve broken down the language, and we have some sense of the concepts being discussed. The last step is sentiment analysis. This can be a very in-depth topic, but for the purposes of this article, it’s quite simple: it just applies a positive, neutral, or negative label to each Sentence node in our graph.

MATCH (t:Tweet)-[]-(a:AnnotatedText) 
CALL ga.nlp.sentiment(a) YIELD result
RETURN result;

With this in place, querying for positive and negative tweets is easy. We can simply extend our previous query and tack on a label to find all negative tweets about a topic:

MATCH (:NER_Person {value: 'clintons'})-[]-(s:Sentence:Negative)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
RETURN distinct tweet.text;

Step 5: Let’s pull it all together. What can we learn about Donald Trump’s tweeting patterns using Neo4j and NLP techniques? Let’s try a broader query to find the most frequently mentioned people in negative contexts. We intentionally exclude tags that could be multi-purpose depending on sentence context, to focus on just the people.

MATCH (tag:NER_Person)-[]-(s:Sentence:Negative)-[]-(:AnnotatedText)-[]-(tweet:Tweet) 
WHERE
NOT tag:NER_Organization AND
NOT tag:NER_O
RETURN distinct tag.value, count(tweet) as negativeTweets ORDER BY negativeTweets DESC;

The results are probably not surprising for those who are familiar with the raw tweets.

Persons mentioned in negative contexts

Although it is interesting that “Donald Trump” is the first result. This is an artifact of a couple of things in our dataset; first that Donald Trump talks about and retweets about himself quite a lot. And second, the sentiment analyzer isn’t perfect, and may rate as negative tweets like “@KingOf_Class: @realDonaldTrump Honestly,you can’t find anyone more real than Donald Trump!”.
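It’s easy to spot-check these misfires directly by reusing the query pattern from Step 4. The tag value below is a guess at how the pipeline normalizes the name, so adjust it to match what’s actually in your graph:

```cypher
MATCH (:NER_Person {value: 'donald trump'})-[]-(s:Sentence:Negative)
      -[]-(:AnnotatedText)-[]-(tweet:Tweet)
RETURN distinct tweet.text LIMIT 10;
```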

Lastly, let’s take a broad view of positive and negative sentiments about various organizations:

MATCH (t:NER_Organization)-[]-(s:Sentence)
WHERE
  /* Parentheses matter here: AND binds tighter than OR, so without
     them we'd match all negative sentences regardless of label count */
  ('Negative' IN labels(s) OR 'Positive' IN labels(s))
  AND size(labels(t)) = 2
RETURN distinct t.value AS orgName,
       s.text AS sentence,
       labels(s) AS sentiment;
Organizations and the sentiment of the sentences they were mentioned in.

Conclusion

Using NLP techniques inside of Neo4j, there is quite a lot that you can do. GraphAware has already used this library as part of building Amazon Alexa skills, where the NLP component gets used to decide which skill the user’s input phrase most closely resembles. In this article, we’ve shown a simple way you can programmatically understand the gist of text using the same approach.

Many other fun applications are possible. A while ago, the Planet Money podcast built a bot that trades stocks based on positive or negative sentiment about companies seen in Donald Trump’s Twitter feed, using similar approaches (although I’m not certain whether they used Neo4j or not). The limits are only how creative you can get.

In this article, I have glossed over a couple of points; in particular, the quality of the results from the sentiment analyzer can be spotty depending on your input data. For top-notch results, you’ll find yourself going deeper into the world of NLP, including training your own sentence analyzer. It’s a deep rabbit hole, but it’s also rewarding and fun if you’re interested in learning modern approaches to text analysis.

For a different approach to doing NLP together with Neo4j, make sure to check out Will Lyon’s post on finding Russian Twitter trolls with neo4j and NLP approaches.
