Create a Data Marvel: Develop a Full-Stack Application with Spring and Neo4j — Part 2

Jennifer Reif
Dec 5, 2018 · 8 min read

Last week, we covered the start of this project — the background, the API evaluation process, and the data model creation. If you haven’t and want to catch up on that information, you can read the post for .

This week, I want to talk about how we actually imported the data. Data is extremely messy, so I will admit that these steps in the process took the longest to wade through and finally come up with import statements that worked and produced clean, unified data to match our data model (image below for review).

Marvel comics data model in Neo4j

Data Import

Now that we created a sensible data model, we needed to import the data from the API into Neo4j to see if the model actually worked or if we needed to adjust it. Often, you will not know for certain if a data model works until you start reviewing the real-world data in that model. This is where Neo4j’s flexible data modeling is really valuable, allowing us to create and tweak the model, as well as refactor the data to match any model changes.

For our project, the data resided in an API, and we needed to call the endpoints to retrieve segments and insert them into our Neo4j model. Enter API difficulties!

Some public APIs will limit the number of calls a user can make per day in order to ensure the data is accessible and service remains stable for everyone. Marvel is no exception to this rule. They limit each user to 3,000 calls per day. This number may not seem small, but when you look at the number of comics in the system (around 43,122) and each of the other entities attached to those, that number quickly depletes!

So, we needed to be a bit more creative in our approach to pulling data. We could look at comics starting with each letter of the alphabet. That would mean 26 calls (26 letters in English alphabet) to retrieve them in batches. However, when we ran some testing in Marvel’s interactive documentation, we found that each call would only return a maximum of 100 results, and there was no way to page through for more.

Since there are over 43,000 comics in the database, divided by 26 letters in the alphabet, there would be massively more than 100 results — especially for letters like “T” (for comics starting with ‘the’) and other common ones. Pulling by comic was not going to work.

Goal: we are trying to get maximum information within 100 results and under 3,000 call limit!

What if we pulled by comic character? There would be fewer characters than comics, since a single character often has many comics associated with them. There would definitely be less than 100 superheroes for many letters of the alphabet. We checked each letter to see how many total results it found, and there were only a couple of letters that found more than 100 characters (“M” and “S”, to be specific). This gets us as close as we can to everything!

Retrieving characters wouldn’t allow us to retrieve information for other entities, but it got our initial group for retrieving the rest of the data. Great! Now what do we use to get the data from one source to another?

Tools for Neo4j Import

There are a variety of tools available for all kinds of import scenarios. One of the best options is to use the (Awesome Procedures on Cypher) standard library that includes utility procedures and functions for various capabilities. For this API import, we used apoc.load.json procedure, which reads the JSON-formatted responses from the API and uses Cypher statements you write for specifying how you want to insert or update the data in Neo4j.

Because APOC is an extension library, you can either download the GitHub project or install it within the Neo4j Desktop application and use it just as you would any other Cypher procedure or function. For this scenario, we opened and read some sample data from the API to test the results. Then, we added Cypher statements to craft the data into the shape of our graph model.

Let’s Import Characters!

Before we start importing data, though, we want to set our API keys (private and public key for Marvel API) as parameters, so that we can use variables in our calls and not copy/paste the key each time or accidentally expose the keys to others reviewing our code. This is easy enough in Neo4j. Simply call the params keyword and type in the parameter name you want and the value you want it to be. The statement is shown below.

:params {marvel_public:<your public API key here>,
marvel_private:<your private API key here>}

* Tip: if you are calling the same API endpoint often, you could also set up a URL parameter and reference that variable in any Cypher statements. It didn’t make sense in our case because we call various endpoints to retrieve different pieces of information, so the URL changes.

Finally, we are ready to start loading the first bit of data for the comic characters! The entire statement is in the code block below. We will walk through each section of this statement in the next paragraph.

WITH apoc.date.format(timestamp(), “ms”, ‘yyyyMMddHHmmss’) AS ts
WITH “&ts=” + ts + “&apikey=” + $marvel_public +
“&hash=” + apoc.util.md5([ts,$marvel_private,$marvel_public]) as suffix
CALL apoc.periodic.iterate(
‘UNWIND split(“ABCDEFGHIJKLMNOPQRSTUVWXYZ”,””) AS letter RETURN letter’,
‘CALL apoc.load.json(“http://gateway.marvel.com/v1/public/characters?nameStartsWith=“+letter+"&orderBy=name&limit=100"+$suffix)
YIELD value
UNWIND value.data.results as results
WITH results, results.comics.available AS comics
WHERE comics > 0
MERGE (char:Character {id: results.id})
ON CREATE SET char.name = results.name,
char.description = results.description,
char.thumbnail = results.thumbnail.path +
”.” + results.thumbnail.extension,
char.resourceURI = results.resourceURI’,
{batchSize: 1, iterateList:false, params:{suffix:suffix}});

The first line of the code above starts by setting a couple of parameters for the timestamp and some required tokens for the end of the URL string. The WITH keyword allows you to pass along results or values from one part of a query to the next. In this case, we are setting these parameters and passing those to the main part of the query in the next couple of lines.

The next section of code calls the APOC procedure apoc.periodic.iterate. This was suggested by my colleague to list all of the letters of the alphabet and let the procedure loop through them automatically (rather than executing a statement for each letter manually). The procedures takes the first statement within the parentheses and, for each result returned from that statement, executes the second statement. In this case, it unwinds all the letters in the alphabet (first statement: UNWIND...RETURN letter) and retrieves data from the API for each letter (second statement: CALL apoc.load.json...).

Let’s dive into a bit more detail about the apoc.load.json and what that bit of code is doing. First, the procedure expects a few parts to its structure. We start with the API endpoint url we need for characters and include a few url parameters Marvel needs to retrieve the data.

The nameStartsWith= url parameter allows us to insert the starting letter from our previous statement loop and retrieve characters who have a name starting with that letter. We also sort the characters in alphabetical order and include the limit of 100. The API defaults to 20 in the return results, so we want to ensure it pulls the maximum number and not the default.

The YIELD value syntax brings back the JSON object from the called endpoint. Now, we need to sift through the object and insert the data to Neo4j. The Marvel API result gives us some high-level details (number retrieved, call status, etc) and nests the character data under a results[] section, so the next line unwinds that object and navigates the nested structure to get the subsection that has the character data.

Once we have gotten here, we pass that object (using WITH) to the WHERE criteria to check if the characters have any comics or not. Our goal is to look at Marvel’s comics and how the comics relate to other types of entities. If a character does not have any comics, then it will not have any relationships to other entities, and therefore is not meaningful data for this project. For our project, we only cared about the characters who have comics because we wanted to focus on the relationships between entities.

The next section of code actually inserts the data into Neo4j using Cypher statements. MERGE does a select-or-insert function where it checks if the node exists in the database before trying to create it. It looks for a node of type Character based on the id coming from the API. If the node exists, it just returns it. If the node does not exist, it creates the new node and sets any properties. ON CREATE SET only executes if the node does not already exist and adds the properties to a newly-created node. This ensures no duplicates get inserted for characters.

Finally, the last code line has some config options you can set for the apoc.periodic.iterate procedure we used earlier in the statement. You can check out the configuration options in the . The configuration I want to point out is the params one. This simply allows the procedure to use the parameter set outside the procedure call (suffix was set outside the CALL apoc.load.json()).

* Tip: If you forget to add parameters to the params config that you want the internal statement to use, you will see an error.

Running that statement will insert all the characters! If you run a couple of quick queries like the ones below, you can verify that data was inserted and that the values look good and the translation worked.

MATCH (n:Character)
RETURN n LIMIT 25;
MATCH (n:Character)
RETURN count(n);

Ok, everything looks good, and we now have plenty of characters to work with!

What I Learned

After quite a bit of research and query tests, the query given and explained above is what we used to get the first round of character data from a finicky API hosted by Marvel into our local instance of Neo4j as a graph data model. Below are my key takeaways from this part of the data import process.

  1. It took a lot of time to maneuver through the Marvel Developer Portal restrictions on their API and find the best approach to gathering as much data as possible within the bounds.
  2. My Cypher improved a LOT from seeing the data go from one source to another and finding where I hadn’t expressed the Cypher correctly.
  3. You need a practical example to apply what you have learned. Even finding something small and simple (which this was not) will allow you to experiment with a data set hands-on and gain deeper understanding.

Next Steps

In the next post, we will cover the remaining steps to import the rest of the Marvel data set and show the fully-populated database. Things get even more fun once we have all the data to play with. Stay tuned for more info!

Resources

Neo4j Developer Blog

Developer Content around Graph Databases, Neo4j, Cypher, Data Science, Graph Analytics, GraphQL and more.

Jennifer Reif

Written by

Jennifer Reif is an avid developer and problem-solver. She enjoys learning new technologies, sometimes on a daily basis! Her Twitter handle is @JMHReif.

Neo4j Developer Blog

Developer Content around Graph Databases, Neo4j, Cypher, Data Science, Graph Analytics, GraphQL and more.