Mapping JSON to graph with GraphXR

Weidong Yang
Kineviz
Aug 17, 2022


JSON is a very popular and versatile data format. Its nested tree structure is essentially a graph, and JSON is often used to store data that is graph-like in nature. How wonderful it would be if we could represent JSON data directly as a graph and visualize it! In this post we discuss some techniques for mapping JSON to a graph, and the challenges that come with it.

Compared to CSV, where the header row dictates the columns for every entry, the JSON format does not restrict the fields in each object. A programmer may still want to enforce some sort of schema to keep the data consistent, which makes it possible to maintain the long-term stability of any process that consumes that data stream.

Technically, the JSON data structure allows a field to be added whenever it is needed. This addresses one of the limitations of the CSV format. For example, the US Department of Transportation releases a 2-week delayed CSV of all flight data. Because some flights get diverted, a few of them multiple times, the CSV table needs multiple “diversion” columns to record those edge cases. Even though the majority of flights are never diverted, every row has to carry those “diversion” columns. When the data is stored as JSON, this is no longer necessary.
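To make the contrast concrete, here is a minimal Python sketch. The flight records and field names are made up for illustration and do not follow the actual Department of Transportation schema. Only the diverted flight carries a "diversions" field, and the reader treats a missing field as "never diverted".

```python
# Hypothetical flight records; field names are illustrative only.
flights = [
    {"flight": "AA100", "origin": "JFK", "dest": "LAX"},
    {
        "flight": "UA200",
        "origin": "SFO",
        "dest": "ORD",
        # Only diverted flights carry this field, with as many entries as needed.
        "diversions": [{"airport": "DEN"}, {"airport": "OMA"}],
    },
]


def diversion_airports(record):
    """Return the diversion airports of a flight, or an empty list if none."""
    return [d["airport"] for d in record.get("diversions", [])]


for record in flights:
    print(record["flight"], diversion_airports(record))
```

A CSV version of the same data would need a fixed number of diversion columns on every row, most of them empty.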

While CSV is a flat table, JSON is nested like a tree. However, data stored as JSON often has high redundancy. Twitter’s streaming API provides a good example. A tweet object contains a user and text content. A retweet contains a user, text content, and the original tweet’s user and text content, even though the original tweet also exists as an independent record. This redundancy makes data retrieval more efficient, but it creates extra work in keeping the data consistent. Say an object A is embedded in many other objects. If one field in object A needs to be updated, we need to look up all the objects that contain a copy of it and update every one of them.
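The cost of that redundancy can be sketched in a few lines of Python. The two records below are made-up, heavily trimmed stand-ins for tweet objects, and rename_user is a hypothetical helper, not part of any Twitter or GraphXR API. Renaming one user means walking every record and touching every embedded copy.

```python
# Made-up, trimmed tweet objects: the retweet embeds a full copy of the original.
original = {"id": 1, "text": "hello", "user": {"id": 10, "name": "alice"}}
retweet = {
    "id": 2,
    "text": "RT hello",
    "user": {"id": 20, "name": "bob"},
    "retweeted_status": {"id": 1, "text": "hello", "user": {"id": 10, "name": "alice"}},
}
records = [original, retweet]


def rename_user(records, user_id, new_name):
    """Update every embedded copy of a user; return how many copies changed."""
    changed = 0
    stack = list(records)
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            user = node.get("user")
            if isinstance(user, dict) and user.get("id") == user_id:
                user["name"] = new_name
                changed += 1
            # Keep walking: copies can be nested at any depth.
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return changed


copies = rename_user(records, 10, "alice_renamed")
print(copies)
```

In a graph, the user would be a single node, so the same change would be a single update.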

Redundancy aside, JSON objects can often be interpreted as a graph. The retweet example above can be expressed in Cypher as:

(t:Tweet)-[:RETWEET]-(t1:Tweet)

One common use of the JSON data structure is logging. Each record (line) is a nested JSON object, with new records appended at the end of the file. Converting such log files into a graph helps reveal insights visually. Data received through the Twitter streaming API is a great example. Here are a few US presidential election-related tweets logged on 2020-06-30:

Each line is a logged tweet. Now looking into a specific tweet:

We can see that, at a high level, the data is expressing:

(t1:Tweet)-[:BY_USER]->(u:USER)

(t1:Tweet)-[:IS_RETWEET_OF]->(t2:Tweet)

Here the original tweet of a retweet is stored under the “retweeted_status” field.
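Outside of GraphXR, the same two patterns can be pulled out of a log with a short script. The following is a minimal sketch: the log lines are made-up and trimmed to a handful of fields, and tweet_edges is a hypothetical helper name. It reads one JSON object per line and emits (source, relationship, target) triples.

```python
import json

# Made-up log lines; real tweet objects carry many more fields.
log_lines = [
    '{"id": 1, "text": "original", "user": {"id": 10, "screen_name": "alice"}}',
    '{"id": 2, "text": "RT original", "user": {"id": 20, "screen_name": "bob"}, '
    '"retweeted_status": {"id": 1, "text": "original", '
    '"user": {"id": 10, "screen_name": "alice"}}}',
]


def tweet_edges(tweet):
    """Yield (source, relationship, target) triples for one tweet object."""
    src = ("Tweet", tweet["id"])
    yield (src, "BY_USER", ("User", tweet["user"]["id"]))
    original = tweet.get("retweeted_status")
    if original is not None:
        yield (src, "IS_RETWEET_OF", ("Tweet", original["id"]))
        # The embedded original is itself a tweet, so recurse into it.
        yield from tweet_edges(original)


edges = set()
for line in log_lines:
    edges.update(tweet_edges(json.loads(line)))

for edge in sorted(edges):
    print(edge)
```

Collecting the triples in a set already removes the duplicate edge that comes from the embedded copy of the original tweet.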

Let’s take 10 tweets and use GraphXR to map them into a graph. Along the way we will see some challenges posed by the lack of schema constraints. Although we use GraphXR to demonstrate the approach, the technique described here is by no means limited to GraphXR. In fact, you can use any programming tool to perform the same process. GraphXR just provides a no-code approach to it.

Let’s drag and drop a log file with 10 tweets into GraphXR. We name the file Tweet.json. This creates ten Tweet nodes, each with an attached tree representing the full contents of its log record.

We notice that many categories and relationships are created. They represent fields in the JSON object that are themselves JSON objects, i.e., nested objects. Not all of them are useful. For example, “sizes” is probably not of any interest. Let’s select those nodes and their children and delete them.
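For readers who prefer code, the import-then-prune step can be approximated in plain Python. This is a sketch under our own assumptions, not GraphXR's implementation: json_to_tree and delete_category are hypothetical helpers, each nested object becomes a node whose category is the field name, scalar fields become properties, and lists of scalars are simply dropped.

```python
def json_to_tree(obj, category, nodes, edges, parent=None):
    """Create one node per nested object; scalar fields become properties."""
    node_id = len(nodes)
    props = {k: v for k, v in obj.items() if not isinstance(v, (dict, list))}
    nodes.append({"id": node_id, "category": category, "props": props})
    if parent is not None:
        edges.append((parent, node_id))
    for key, value in obj.items():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                json_to_tree(child, key, nodes, edges, node_id)
    return node_id


def delete_category(nodes, edges, category):
    """Drop all nodes of a category together with their descendants."""
    doomed = {n["id"] for n in nodes if n["category"] == category}
    grew = True
    while grew:
        extra = {dst for src, dst in edges if src in doomed} - doomed
        doomed |= extra
        grew = bool(extra)
    kept_nodes = [n for n in nodes if n["id"] not in doomed]
    kept_edges = [(s, d) for s, d in edges if s not in doomed and d not in doomed]
    return kept_nodes, kept_edges


# Made-up, trimmed tweet with a nested "sizes" object under media.
tweet = {
    "id": 1,
    "text": "hi",
    "user": {"id": 10},
    "entities": {
        "media": [
            {"id": 5, "sizes": {"thumb": {"w": 150, "h": 150}, "large": {"w": 1024, "h": 768}}}
        ]
    },
}

nodes, edges = [], []
json_to_tree(tweet, "Tweet", nodes, edges)
kept_nodes, kept_edges = delete_category(nodes, edges, "sizes")
print([n["category"] for n in kept_nodes])
```

Deleting "sizes" also removes the "thumb" and "large" nodes beneath it, which mirrors selecting a category and its children before deleting.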

That is cleaner. Next, entities and extended_entities are just holders that link to urls, media, and hashtags. In other words, they merely connect (Tweet, extended_tweet, retweeted_status) to (url, media, hashtag).

Using Transform=>Shortcut, we can change

(t:Tweet)-[:TWEET_ENTITIES]->(e:Entity)-[:ENTITY_URL]->(u:Url)

to

(t:Tweet)-[:HAS_URL]->(u:Url)

and then remove the entities/extended_entities nodes.
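The idea behind a shortcut is simple enough to sketch in Python. This is our own illustration, not GraphXR's Shortcut implementation: the shortcut function, the node ids, and the ENTITY_HASHTAG and HAS_HASHTAG relationship names are hypothetical, and the sketch assumes holder nodes are never chained to one another.

```python
def shortcut(nodes, edges, middle, rename):
    """Bypass every node of category `middle`.

    nodes:  {node_id: category}
    edges:  list of (source_id, relationship, target_id)
    rename: {old outgoing relationship: new direct relationship}
    """
    # Edges that never touch a holder node are kept unchanged.
    direct = [e for e in edges if nodes[e[0]] != middle and nodes[e[2]] != middle]
    for src, _, mid in edges:
        if nodes[mid] != middle:
            continue
        # Connect the holder's parent straight to each of the holder's children.
        for start, rel, dst in edges:
            if start == mid:
                direct.append((src, rename.get(rel, rel), dst))
    kept = {node_id: cat for node_id, cat in nodes.items() if cat != middle}
    return kept, direct


nodes = {"t1": "Tweet", "e1": "Entity", "u1": "Url", "h1": "Hashtag"}
edges = [
    ("t1", "TWEET_ENTITIES", "e1"),
    ("e1", "ENTITY_URL", "u1"),
    ("e1", "ENTITY_HASHTAG", "h1"),
]
rename = {"ENTITY_URL": "HAS_URL", "ENTITY_HASHTAG": "HAS_HASHTAG"}

new_nodes, new_edges = shortcut(nodes, edges, "Entity", rename)
print(new_edges)
```

The holder node disappears and the tweet ends up linked directly to its url and hashtag.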

This is much improved. We notice that nodes in the retweeted_status category should be in the Tweet category. We can use Transform=>Extract to move the retweeted_status category into the Tweet category with “Inherit Relationships” checked:

Here we have Tweets highlighted, showing the retweet and quote relationships.
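The net effect of folding retweeted_status into Tweet can be sketched as a relabeling. This is only an approximation of the outcome, not a description of how GraphXR's Extract works internally. The fold_category helper, the node ids, and the RETWEETED_STATUS relationship name are hypothetical. Because edges refer to nodes by id, every relationship survives the category change, which is roughly what we rely on “Inherit Relationships” for.

```python
def fold_category(nodes, old, new):
    """Return a copy of `nodes` with category `old` renamed to `new`."""
    return {node_id: (new if cat == old else cat) for node_id, cat in nodes.items()}


nodes = {"t2": "Tweet", "r1": "retweeted_status", "u10": "user"}
edges = [("t2", "RETWEETED_STATUS", "r1"), ("r1", "USER", "u10")]

folded = fold_category(nodes, "retweeted_status", "Tweet")
# Edges are untouched: they still point at the same node ids.
print(folded)
```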

As mentioned earlier, JSON data is often highly redundant, so here we have duplicate Tweets, users, media, etc. We can use Transform=>Merge to clean them up.
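A merge of this kind can be sketched in Python as follows. It is an illustration under our own assumptions rather than GraphXR's Merge: merge_duplicates is a hypothetical helper, nodes count as duplicates when they share a category and an "id" property, the first copy survives, and later copies only fill in properties the survivor lacks.

```python
def merge_duplicates(nodes, edges, key="id"):
    """Merge nodes that share a category and key property; rewire their edges."""
    survivors = {}  # (category, key value) -> merged node
    remap = {}      # original uid -> surviving uid
    for node in nodes:
        value = node["props"].get(key)
        # Nodes without the key property are never merged with anything.
        if value is not None:
            identity = (node["category"], value)
        else:
            identity = ("__unique__", node["uid"])
        if identity in survivors:
            for name, prop in node["props"].items():
                survivors[identity]["props"].setdefault(name, prop)
        else:
            survivors[identity] = {**node, "props": dict(node["props"])}
        remap[node["uid"]] = survivors[identity]["uid"]
    # A set drops edges that become identical after rewiring.
    merged_edges = sorted({(remap[s], rel, remap[d]) for s, rel, d in edges})
    return list(survivors.values()), merged_edges


nodes = [
    {"uid": "a", "category": "Tweet", "props": {"id": 1, "text": "original"}},
    {"uid": "b", "category": "User", "props": {"id": 10}},
    {"uid": "c", "category": "Tweet", "props": {"id": 2}},
    {"uid": "d", "category": "Tweet", "props": {"id": 1}},  # embedded copy of "a"
    {"uid": "e", "category": "User", "props": {"id": 10, "screen_name": "alice"}},
]
edges = [("a", "BY_USER", "b"), ("c", "IS_RETWEET_OF", "d"), ("d", "BY_USER", "e")]

merged_nodes, merged_edges = merge_duplicates(nodes, edges)
print(len(merged_nodes), merged_edges)
```

After the merge, the retweet points at the one surviving copy of the original tweet instead of at a private duplicate.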

Now we have a graph that shows all the connections and relationships. The graph can be simplified further, but we get the gist of it.

Thoughts:

Because of the flexible, nested nature of JSON, mapping it to a graph is not always direct or easy. Users have to make judgment calls about what to keep, what to change, and what to merge, so unfortunately it is a case-by-case process. Nevertheless, by following the steps of Shortcut, Merge, and Delete you can arrive at a highly informative graph.

JSON belongs to a class of data structures that are close to graphs and are commonly used in many applications, such as project management, case management, and design diagrams. There are significant benefits to mapping them to a managed graph model such as a labeled property graph (LPG). We will explore more examples in later blog posts.

Although the process presented here is tedious when done by hand, it can be programmed and automated with the Grove extension in GraphXR, which is significantly easier and faster. It can also serve as a managed pipeline to ETL log data into a graph database, such as Neo4j, on a continuous basis.
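As a rough idea of what the loading end of such a pipeline could look like, the sketch below turns one extracted edge into a parameterized Cypher MERGE statement. It does not use Grove or a Neo4j driver. The edge_to_cypher helper and the triple layout are our own assumptions, and the resulting query and parameters would still have to be handed to whatever client you use.

```python
def edge_to_cypher(edge):
    """Build an idempotent Cypher statement and its parameters for one edge.

    edge: ((source_label, source_id), relationship, (target_label, target_id))
    """
    (src_label, src_id), rel, (dst_label, dst_id) = edge
    # Labels and relationship types cannot be passed as Cypher parameters,
    # so they are checked before being placed in the query text.
    for name in (src_label, rel, dst_label):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    query = (
        f"MERGE (a:{src_label} {{id: $src_id}}) "
        f"MERGE (b:{dst_label} {{id: $dst_id}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )
    return query, {"src_id": src_id, "dst_id": dst_id}


query, params = edge_to_cypher((("Tweet", 2), "IS_RETWEET_OF", ("Tweet", 1)))
print(query)
print(params)
```

Using MERGE rather than CREATE means replaying the same log line does not create duplicate nodes or relationships, which suits a continuously appended log.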

Weidong is an entrepreneur, scientist, programmer and artist. He founded Kineviz and Kinetech Arts.