How to create a Knowledge Graph from your Slack Archive with Neo4j

How to turn your Slack messages into a source of insight and information about conversations, topics, experts and questions using graph queries and analytics.

Adam Cowley
Neo4j Developer Blog
14 min read · Aug 28, 2020


TL;DR: All of the code is available on Github. Follow the steps to export your messages and load them into Neo4j using the import scripts.

If you want to watch me working on this, you can check out this week’s twitch stream.

Earlier in the year, other members of the EMEA Professional Services team and I put together our own Enterprise Knowledge Graph based on information from our internal Slack Workspace, supplemented with additional information from other sources including Google Drive, GitHub and Salesforce.

Not the original slack UI but our own clone, based on the graph of slack data

The original blog post published in April 2020 went into detail about the whys and the whats, but I thought it was about time to cover the how and explain how you can get started building your own.

Original Blog Post “Discovering Hidden Skills with an Enterprise Knowledge Graph”

What is a Knowledge Graph?

There is a lot of hype and mystery surrounding the term Knowledge Graph, and there are people who can articulate what they are much more eloquently than I can, so I will leave the formal definitions to them.

To me, Knowledge Graphs are just graphs: a set of entities linked together to form a taxonomy that represents some kind of knowledge. Commonly, these taxonomies are learned through unsupervised Machine Learning techniques.

In our case, we used Natural Language Processing to create a taxonomy of Skills, Programming Languages, Frameworks, and Technologies from the unstructured text contained in messages, and from the structure of the data around each message.

The internal Neo4j Knowledge Graph Data Model

For example, if a user frequently mentioned a particular term within a channel dedicated to a particular skill, we could infer with some confidence that the user had knowledge of that term. Likewise, if a user is active within a channel dedicated to a particular client or project, they would be a good person to approach for anyone who has questions about that project.

By supplementing that information with data from other sources, for example Wikidata entries or StackOverflow Tags, we could take a term like ‘React’ and place it in a hierarchy under Javascript and Frameworks. Someone who is an expert with React may also be able to answer questions about Javascript in general, or more specifically about ES6 or Typescript.

In this article I will outline some of the basic steps that you can take to import the data, then give a short introduction to how terms can be extracted with NLP techniques.

Obtaining the Data

Any Admin user for your Slack workspace can export the message data. A comprehensive guide to exporting data can be found on the Slack Help Centre, but the TL;DR is that you can request an export by clicking on the Workspace name in the top left-hand corner of the Slack UI, then going to Settings and Workspace Settings (or heading to <your-workspace>.slack.com/admin/settings).

If you have the correct permissions, you should see an Import/Export data button to the right of the Settings & permissions header. On the Export data page you can request an export for the last 24 hours, 7 days, 30 days, a specific date range, or the entire history.

The export can take some time but you should receive an email when everything is ready for you.

Previously requested exports are also listed on that page, ready for you to download as a zip archive.

Zip Archive Structure

The zip archive contains a set of JSON files organised into folders which represent their channel. The root also contains some additional JSON files which include information on the workspace, most importantly the list of Channels (channels.json) and Users (users.json).

Channel information — channels.json

The channels.json file contains an array of JSON objects, each representing a channel.

An example channel may look a little like this:

{
  "id": "C08J11DC6",
  "name": "help-cypher",
  "created": 1438676304,
  "creator": "U08J18KHB",
  "is_archived": false,
  "is_general": false,
  "members": [
    "U08J18KHB",
    "U08J1KK9T",
    "U08J2HAE7",
    "U08J2TL6Q"
  ],
  "pins": [
    {
      "id": "1456911012.000419",
      "type": "C",
      "created": 1456911025,
      "user": "U08J18KHB",
      "owner": "U08J18KHB"
    }
  ],
  "topic": {
    "value": "Ask Cypher Questions in <https:\/\/community.neo4j.com\/c\/neo4j-graph-platform\/cypher>",
    "creator": "U08J18KHB",
    "last_set": 1535300193
  },
  "purpose": {
    "value": "Help with all questions around Cypher for LOAD CSV. For more questions &amp; support -&gt; Neo4j Community Forums <https:\/\/community.neo4j.com\/c\/neo4j-graph-platform\/cypher>",
    "creator": "U7X490V7X",
    "last_set": 1535300190
  }
}

The information varies slightly by channel but the important features are:

  • id: Slack’s Unique ID for the Channel
  • name: The name given to the channel
  • created: The time that the channel was created (in seconds since Epoch)
  • members: A list of IDs that correspond to the members of the channel (as listed in users.json)
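
The created value (and the last_set timestamps inside topic and purpose) are plain epoch seconds. Later in the import we will convert these to temporal values with Cypher's datetime() function, for example:

RETURN datetime({epochSeconds: 1438676304}) AS createdAt
// 2015-08-04T08:18:24Z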

User information — users.json

users.json contains an array of objects, each of which corresponds to a member of the workspace. If we take Michael as an example, you can see how the information relates to his public profile on the Neo4j Users Slack.

{
  "id": "U08J18KHB",
  "team_id": "T08J10GA2",
  "name": "michael.neo",
  "deleted": false,
  "color": "9f69e7",
  "real_name": "Michael Hunger",
  "tz": "Europe\/Amsterdam",
  "tz_label": "Central European Time",
  "tz_offset": 3600,
  "profile": {
    "title": "Caretaker General Community Neo4j",
    "real_name": "Michael Hunger",
    "real_name_normalized": "Michael Hunger",
    "display_name": "michael.neo",
    "display_name_normalized": "michael.neo"
  },
  "is_bot": false
}

I find it hard to believe that Michael isn’t a Robot.

The majority of the interesting information is held in the root of the object, but there is also additional information contained inside the profile object.

Messages — {channel}/{date}.json

The messages are organised into folders by channel and split into files by date. For example, to find the messages in the #announce channel on 14 August 2020, I would need to look for announce/2020-08-14.json. The message objects can vary wildly (especially the attachments), so I have made the conscious decision to only import the data that remains consistent:

  • type (and subtype): Denotes the type of message — for example a regular message or a message to say that a user has joined or left the channel.
  • text: Is the content of the message.
  • ts: Timestamp (seconds since epoch) — This also seems to act as a unique ID for the message.
  • reactions: Reactions may be useful to judge the sentiment for the message or even detect which users interact consistently with each other.
  • attachments: The attachments array may contain links to helpful articles, knowledge base entries that answer the question, github repositories, etc.

The first message in the neo4j-users workspace was:

{
  "type": "message",
  "text": "Hi <@U08J1KK9T> :simple_smile:",
  "user": "U08J18KHB",
  "ts": "1438677071.000003",
  "team": "T08J10GA2",
  "user_team": "T08J10GA2",
  "source_team": "T08J10GA2",
  "user_profile": {
    "avatar_hash": "g522bc835d74",
    "image_72": "https:\/\/secure.gravatar.com\/avatar\/f522bc835d745f319a9567d63f5c0cc9.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0016-72.png",
    "first_name": "Michael",
    "real_name": "Michael Hunger",
    "display_name": "michael.neo",
    "team": "T08J10GA2",
    "name": "michael.neo",
    "is_restricted": false,
    "is_ultra_restricted": false
  }
}

Deriving a Data Model

Slack defines the majority of the terms for us already: Users are members of a Channel, and members can post Messages to any Channel that they belong to. Those messages can have Attachments. Members can reply to messages to form a thread, and they can also react to messages using emojis.

After a quick scan of the data I ended up with a model similar to the diagram to the left of this text.

Importing the Data

The APOC library provides us with some useful procedures for importing data. In this case, apoc.load.json() provides the ability to load a JSON file into a Cypher statement. It takes a single argument: the location of the JSON (this can either be a remote URI, for example a REST API, or a local path prefixed with file:/// and relative to the import directory as set in neo4j.conf).

CALL apoc.load.json($filename)
YIELD value
// Do something...

For now, the full history can be loaded from the export, but in the future it might make sense to load the deltas from the Slack API rather than constantly requesting exports.

Note: The scripts below assume that the JSON files have already been copied into Neo4j’s import folder. If you are using a Neo4j Desktop instance, you can find the import folder by selecting ‘Manage’ from the menu at the top of the appropriate card within the Desktop UI, then clicking the Open Folder button under the database name and navigating to the import folder. The top-level entry should be called slack, with the exported files below it.
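
Before running the import statements, it is also worth creating uniqueness constraints on the id properties that the queries below MERGE on, so that each MERGE lookup is backed by an index. This step is my own recommendation rather than something taken from the export; a minimal sketch using the CREATE CONSTRAINT syntax current at the time of writing:

CREATE CONSTRAINT ON (u:User) ASSERT u.id IS UNIQUE;
CREATE CONSTRAINT ON (c:Channel) ASSERT c.id IS UNIQUE;
CREATE CONSTRAINT ON (m:Message) ASSERT m.id IS UNIQUE;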

Importing Users

As with all of the queries, this starts with a CALL to the apoc.load.json() procedure. Because the file contains an array of objects, each object is streamed as its own row, which can be accessed via value.

CALL apoc.load.json("file:///slack/users.json")
YIELD value

First, we can MERGE a node with a :User label: Cypher will attempt to find a node that matches the pattern and, if none exists, create it. The SET step then uses a Map Projection to extract certain properties from the root value object.

MERGE (u:User {id: value.id})
SET u += value { .name, .title, .color, .real_name }
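
Note that in the users.json example above, fields such as title and display_name sit inside the nested profile object rather than at the root, so the map projection above would leave them empty. If that is the case for your export, they can be set explicitly (the property names here are my own choice):

SET u.title = value.profile.title,
    u.displayName = value.profile.display_name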

For certain boolean properties, it makes sense to assign a Label rather than a property. For example, assigning an :Admin label will allow us to quickly look up all Admin Users in the Workspace instead of having to scan through all :User nodes and filter on a property.

For this, I have used the (admittedly old-school) FOREACH hack, but you could just as easily use the apoc.do.when() procedure to execute conditional logic. The hack involves using a CASE expression to return a collection containing a single item if the condition is true, or an empty collection otherwise.

The FOREACH statement then iterates over this collection — executing the containing statement once if the condition is true, otherwise not executing the statement at all.

The following statement will check the is_admin property on the object and if true, set the :Admin label.

FOREACH (_ IN CASE WHEN value.is_admin THEN [1] ELSE [] END | SET u:Admin)
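
For comparison, a rough sketch of the apoc.do.when() alternative might look like the following (the exact call is my own illustration, not what the import script uses); the params map passes u into the inner query, and the yielded value is aliased to avoid clashing with the value streamed from apoc.load.json():

CALL apoc.do.when(
  value.is_admin,
  'SET u:Admin RETURN u',
  'RETURN u',
  {u: u}
) YIELD value AS conditionalResult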

Additionally, it may be interesting to see how users interact across timezones. If the user has a timezone listed, we can extract that out into its own node and merge a relationship between the user and timezone nodes.

FOREACH (_ IN CASE WHEN value.tz IS NOT NULL THEN [1] ELSE [] END |
  MERGE (t:TimeZone {id: value.tz})
  // the export stores the offset as tz_offset
  ON CREATE SET t.offset = value.tz_offset, t.label = value.tz_label
  MERGE (u)-[:IN_TIMEZONE]->(t)
)

The full script is available on Github

Importing Channels

Importing channels is similar, so I won’t go into detail. You can view the full script in the Github Repository.
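
For reference, a minimal sketch of what a channel import could look like is shown below; the :MEMBER_OF relationship type and the exact properties set are my own assumptions rather than a copy of the script in the repository:

CALL apoc.load.json("file:///slack/channels.json")
YIELD value
MERGE (c:Channel {id: value.id})
SET c.name = value.name,
    c.createdAt = datetime({epochSeconds: value.created})
// Link each member listed in the export to the Channel node
FOREACH (memberId IN coalesce(value.members, []) |
  MERGE (u:User {id: memberId})
  MERGE (u)-[:MEMBER_OF]->(c)
)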

Importing Messages

Importing messages is a little more complicated because of the way the files are split. Essentially, you’ll need to read the directory structure to extract the name of the channel, then pass this along with the filename as parameters to a Cypher query that contains an apoc.load.json() call.

I’ve written a Node.js script to do this but you could also do this in [your language of preference].

// Assumes `path` points at the extracted slack folder inside Neo4j's
// import directory and `importDir` is the import directory itself
const fs = require('fs')

// Read the directory to get a list of all entries in the path
const files = fs.readdirSync(path)
  // Convert each channel name into a full path
  .map(channel => [channel, `${path}/${channel}`])
  // Filter the array so only directories are included
  .filter(([channel, channelPath]) => fs.lstatSync(channelPath).isDirectory())
  .map(([channel, channelPath]) => {
    // Read each channel directory to get the files for each day,
    // keeping the path relative to the import directory
    const dayFiles = fs.readdirSync(channelPath)
      .map(file => `${channelPath}/${file}`.replace(importDir, ''))

    return [channel, channelPath, dayFiles]
  })
  // Run a reduction to produce a flat array of objects
  // containing the channel name and file location
  .reduce((acc, [channel, channelPath, dayFiles]) => acc.concat(
    dayFiles.reduce((fileAcc, file) => fileAcc.concat({ channel, file }), [])
  ), [])

Once the directory listing has been generated, a while loop takes the next item from the array until none are left, awaiting the successful execution of the Cypher statement for each file before moving on to the next:

// Get the total number of files to calculate the percentage
const total = files.length

// While there are items in the array, take the next item
// and execute the import Cypher query
while (files.length) {
  const next = files.splice(0, 1)[0]

  console.log(next, files.length,
    `${(100 - (files.length / total) * 100).toFixed(4)}%`)

  await session.run(cypher, next)
}

console.log('done')

// Once all files have been processed, close the driver to exit the process
await driver.close()

The Cypher statement itself uses the $channel parameter to find the Channel node, prefixes the filename ($file) with file:// to load the correct file, and then merges the appropriate nodes and relationships.

const cypher = `
  MATCH (c:Channel {name: $channel})
  CALL apoc.load.json('file://' + $file) YIELD value
  WHERE value.user IS NOT NULL
  MERGE (u:User {id: value.user})
  MERGE (m:Message {id: value.ts})
  SET m += value {
    .type,
    .subtype,
    .text,
    createdAt: datetime({epochSeconds: toInteger(value.ts)})
  }
  MERGE (u)-[:POSTED]->(m)
  MERGE (m)-[:IN_CHANNEL]->(c)
  // ...
`

The full import script (available on GitHub) produces an output similar to the following, with the last number representing the percentage of files processed so far:

$ node index.js
{ channel: 'announce', file: '/slack/announce/2015-08-04.json' } 27226 0.0037%
{ channel: 'announce', file: '/slack/announce/2015-08-05.json' } 27225 0.0073%
{ channel: 'announce', file: '/slack/announce/2015-08-06.json' } 27224 0.0110%
{ channel: 'announce', file: '/slack/announce/2015-08-07.json' } 27223 0.0147%
{ channel: 'announce', file: '/slack/announce/2015-08-08.json' } 27222 0.0184%
{ channel: 'announce', file: '/slack/announce/2015-08-09.json' } 27221 0.0220%
{ channel: 'announce', file: '/slack/announce/2015-08-10.json' } 27220 0.0257%
{ channel: 'announce', file: '/slack/announce/2015-08-11.json' } 27219 0.0294%

The full import script is available on Github.

Extracting Knowledge

So far you may be thinking, “Sure, that’s a graph. But where’s the knowledge?” and that’s a fair question. We could run a simple query to find out who is the most active poster:

MATCH (u:User)
RETURN u.name, size((u)-[:POSTED]->()) AS size
ORDER BY size DESC

It was always going to be Michael…

+----------------------------+
| u.name | size |
+----------------------------+
| "michael.neo" | 23495 |
| "chris.graphaware" | 8567 |
| "andrew.bowman" | 8052 |
| "maxdemarzi" | 5737 |
| "tomasi" | 4759 |
+----------------------------+

Because a Node is aware of its relationships, Neo4j will return a “Degree count” based on the relationship type rather than expanding any records to find this out.

Or, say we have a question on Cypher: we could work out the best channel to ask the question in, and also which users to @mention:

MATCH (u:User)-[:POSTED]->(m)-[:IN_CHANNEL]->(c)
WHERE m.text contains 'cypher'
RETURN u.name, c.name, count(*) AS messages
ORDER BY messages DESC LIMIT 5

+--------------------------------------------+
| u.name | c.name | messages |
+--------------------------------------------+
| "andrew.bowman" | "help-chat" | 168 |
| "michael.neo" | "help-chat" | 149 |
| "maxdemarzi" | "help-chat" | 116 |
| "michael.neo" | "help-cypher" | 93 |
| "michael.neo" | "neo4j-apoc" | 70 |
+--------------------------------------------+

If I had a question on Cypher, Andrew would 100% be my first port of call.

But these queries would be trivial in most databases. How about we find the most popular users in the graph? A common rule of thumb is that relational databases start to slow down once a query involves more than three joins, but a multi-hop traversal like the one below is no problem for a graph.

MATCH (u:User)-[:POSTED]->(m:Message)<-[:TO_MESSAGE]-(r:Reaction)<-[:REACTED]-(u2)
WHERE u.name <> 'michael.neo' // Let's exclude Michael
RETURN u.name AS name, count(r) AS reactions,
count(distinct u2) as reacters
ORDER BY reacters DESC
LIMIT 5

+-------------------------------------------+
| name | reactions | reacters |
+-------------------------------------------+
| "maxdemarzi" | 718 | 324 |
| "andrew.bowman" | 531 | 238 |
| "chris.graphaware" | 577 | 234 |
| "tomasi" | 226 | 112 |
| "lyonwj" | 214 | 91 |
+-------------------------------------------+

Max has received 718 reactions from 324 different users. This goes to show how useful Max’s contributions are to the wider audience as a whole.

Identifying n-grams with Cypher

One of the more basic NLP techniques is to extract n-grams: sequences of text (or tokens) of length n. An n-gram with a length of 1 is known as a unigram, a length of 2 as a bigram, a length of 3 as a trigram, and so on. The more frequently an n-gram occurs, the more interesting it becomes.

We can identify n-grams with Cypher and APOC using apoc.text.split(), a function that splits text by a regular expression. For example, calling apoc.text.split(“abc.def ghi, jlk|mno”, “\\W+”) will return an array of [“abc”, “def”, “ghi”, “jlk”, “mno”].
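
You can verify this directly in Neo4j Browser:

RETURN apoc.text.split("abc.def ghi, jlk|mno", "\\W+") AS tokens
// ["abc", "def", "ghi", "jlk", "mno"]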

Some frequently mentioned terms may also be uninteresting, for example greetings like hello there, hey there or, in my case, hello mate. We can exclude these by removing stopwords. For now, let’s assume that there is a Cypher parameter called $stopwords representing these words as an array of strings.

A Cypher query to extract bigrams (n-grams with a length of 2) may look like the following:

MATCH (m:Message) 
WHERE not exists(m.subtype) AND m.text IS NOT NULL
// Split the message text into sentences
UNWIND split(toLower(m.text), '.') AS sentence
// Split sentences into words
WITH m, apoc.text.split(sentence, "\\W+") AS words
// where the sentence is more than 3 words long
WHERE size(words) > 3
// and all words are between 5 and 19 characters long and not a stopword
AND all(w in words WHERE 4 < size(w) < 20 AND NOT w IN $stopwords)
// ensure there are no duplicate words (eg. really really)
AND NOT apoc.coll.containsDuplicates(words)
// Produce a sliding window of results of 2 tokens
UNWIND range(0, size(words) - 2) AS start
WITH m, words, start, start + 2 AS end
// For each n-gram, get the number of occurrences
// and return the top 100 results
RETURN words[start..end] AS ngram, count(*) AS occurrences
ORDER BY occurrences DESC
LIMIT 100

With these stopwords:

:param stopwords => ["hello", "hi", "it", "it's", "its", "append", "thanks", "http", "https", "next_tick", "users", "documents"]

After filtering for n-grams that start with neo4j, we can see many rows that may signify a Neo4j product. By combining this data with another data source (say, a CSV file of product names), we could quite easily build a product hierarchy, as sketched after the table below.

+---------------------------+
| ngram |
+---------------------------+
| ["neo4j", "community"] |
| ["neo4j", "shell"] |
| ["neo4j", "admin"] |
| ["neo4j", "desktop"] |
| ["neo4j", "googlegroups"] |
| ["neo4j", "supports"] |
| ["neo4j", "friends"] |
| ["neo4j", "rstats"] |
| ["neo4j", "graph"] |
| ["neo4j", "lacks"] |
| ["neo4j", "though"] |
| ["neo4j", "spatial"] |
| ["neo4j", "letmein"] |
| ["neo4j", "surveillance"] |
| ["neo4j", "password"] |
| ["neo4j", "import"] |
| ["neo4j", "contrib"] |
| ["neo4j", "encrypted"] |
+---------------------------+
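
As a rough sketch of that idea, the query below loads a hypothetical products.csv (its columns, the :Product and :Category labels, and the :MENTIONS relationship are all my own naming, not part of the import scripts) and links the messages that mention each product:

// Hypothetical CSV with columns: name,category
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS row
MERGE (p:Product {name: toLower(row.name)})
MERGE (cat:Category {name: row.category})
MERGE (p)-[:IN_CATEGORY]->(cat)
WITH p
// Naively link every message that mentions the product name
MATCH (m:Message)
WHERE toLower(m.text) CONTAINS p.name
MERGE (m)-[:MENTIONS]->(p)

This naive CONTAINS match scans every message once per product, so on a real dataset you would more likely join against the extracted n-grams or use a full-text index instead.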

By spotting users who are consistently asking questions around a particular topic, an astute salesman could try and up-sell a training course or a Slackbot could analyse the message and suggest potential Knowledge Base articles or Blog Posts which may help answer the user’s question.

Next Steps

These queries so far only scratch the surface of what is possible. Third-party NLP libraries like Stanford NLP, OpenNLP or services such as the Google Cloud Natural Language API could be used for more intelligent Entity Extraction. If you are more interested in the subject, head over to the Developer Guide on NLP with Neo4j or check out the NLP Functionality built into APOC.

If you are interested in combining additional data sources, then I’d recommend reading the guide on Importing Wikidata into Neo4j using Neosemantics.

View your own data!

As I’ve mentioned, all of the import scripts are available on Github. I have also created a copy of the Slack UI as a Graph App so you can view the data as if it was the real Slack app.

The Graph App is built with Vue.js and has also been added to the repository. You can install the Graph App to Neo4j Desktop using the following deeplink:

neo4j-desktop://graphapps/install?url=https://registry.npmjs.org/@graphapps/slack

The public neo4j-users slack data is available in a read-only Neo4j database on demo.neo4jlabs.com with username/password/database slack, which you can also connect to by choosing the ‘Connect to another graph’ option on the start screen of the Slack Graph App.

Keep in Touch

We would love to hear how you are using Neo4j to build your Knowledge Graphs. Let us know by posting in the AI, ML, NLP category on the Neo4j Community Site.

Happy slacking, Adam.
