
Create a Data Marvel — Part 3: Hydrating the Model

Jennifer Reif
Neo4j Developer Blog

--

*Update*: All parts of this series are published and related content available.

Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8, Part 9, Part 10

Completed GitHub project (+related content)

Over the last couple of weeks, we have shown the early steps of this project — the background, building the data model, and the start of the data import from an API to Neo4j. If you haven’t seen those posts and want to catch up, you can read Part 1 and Part 2.

Today, we will continue the process of getting as much data as possible from the Marvel Developer Portal API into Neo4j. We saw the initial import Cypher statement last week, but the trickiest piece is the next import statement and the issues we uncovered in running it.

To recap our limitations and foundations so far, I’ve included an image of our Neo4j graph data model for the Marvel data (yours could differ), as well as some highlights.

  • The Marvel API enforces limitations such as 3,000 calls per day and 100 results per call, so our goal was to get as much data as possible within those limits.
  • Retrieving all comics (43,000+) was too much, so we decided to import characters first.
  • To do that, we used the APOC library to pull characters by the first letter of their names, one letter of the alphabet at a time.
  • Next, we need to retrieve the rest of the data (comics, series, events, stories, creators) to finish populating our graph!
Marvel comics data model in Neo4j

More Importing — “Hydrating” the Model

One of the best things about APOC is the possibilities it opens for different aspects of data import. Remember that apoc.periodic.iterate procedure we used in our last query to loop through each letter of the alphabet and select the characters that start with that letter? Well, we are going to use that procedure again, but in a slightly different way.
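Before we look at the full query, here is the general shape of the procedure, reduced to a sketch with placeholder label and property names:

//general shape of apoc.periodic.iterate (placeholder queries for illustration)
CALL apoc.periodic.iterate(
'MATCH (n:SomeLabel) RETURN n',
'SET n.processed = true',
{batchSize: 20, retries: 2});

The first statement streams the items to process, and the second statement runs once per item, committed in batches according to the configuration map.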

This time, we will use the first statement within that procedure to select the characters we added to the database in our previous query, and the second statement will call the API to retrieve the comics for each of those characters. The next query is below with a walkthrough of the syntax. It might look complicated, but don’t worry. It simply builds upon syntax we have already discussed above.

WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate('MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100',
'CALL apoc.util.sleep(2000)
CALL apoc.load.json(c.resourceURI + "/comics?format=comic&formatType=comic&limit=100" + $suffix)
YIELD value
WITH c, value.data.results AS results
WHERE results IS NOT NULL
UNWIND results AS result
MERGE (comic:ComicIssue {id: result.id})
ON CREATE SET comic.name = result.title,
comic.issueNumber = result.issueNumber,
comic.pageCount = result.pageCount,
comic.resourceURI = result.resourceURI,
comic.thumbnail = result.thumbnail.path + "." + result.thumbnail.extension
WITH c, comic, result
MERGE (comic)-[r:INCLUDES]->(c)
WITH c, comic, result WHERE result.series IS NOT NULL
UNWIND result.series AS comicSeries
MERGE (series:Series {id: toInt(split(comicSeries.resourceURI,"/")[-1])})
ON CREATE SET series.name = comicSeries.name,
series.resourceURI = comicSeries.resourceURI
WITH c, comic, series, result
MERGE (comic)-[r2:BELONGS_TO]->(series)
WITH c, comic, result, result.creators.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (creator:Creator {id: toInt(split(item.resourceURI,"/")[-1])})
ON CREATE SET creator.name = item.name,
creator.resourceURI = item.resourceURI
WITH c, comic, result, creator
MERGE (comic)-[r3:CREATED_BY]->(creator)
WITH c, comic, result, result.stories.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (story:Story {id: toInt(split(item.resourceURI,"/")[-1])})
ON CREATE SET story.name = item.name,
story.resourceURI = item.resourceURI,
story.type = item.type
WITH c, comic, result, story
MERGE (comic)-[r4:MADE_OF]->(story)
WITH c, comic, result, result.events.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (event:Event {id: toInt(split(item.resourceURI,"/")[-1])})
ON CREATE SET event.name = item.name,
event.resourceURI = item.resourceURI
MERGE (comic)-[r5:PART_OF]->(event)',
{batchSize: 20, iterateList:false, retries:2, params:{suffix:suffix}});

To help process this lengthy query, we will break it up into sections and explain each block. Let us start with the first two sections together.

//First section
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate('MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100',
'CALL apoc.util.sleep(2000)
CALL apoc.load.json(c.resourceURI + "/comics?format=comic&formatType=comic&limit=100" + $suffix)
YIELD value

Just as with our initial load query, we start the query using the WITH clause to set up and pass the timestamp and URL suffix parameters that we will use further down. The next lines of code call the familiar apoc.periodic.iterate to pull all the characters in Neo4j (the MATCH statement). Notice the criteria starting from the WHERE clause. We check the Character nodes to see if the resourceURI field contains a value. Marvel puts the URL path for most entities in the resourceURI field, so this is a simple check to see if the Character has a URL path for us to retrieve data. If it doesn’t, our call will fail because there is no endpoint to query.
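As a side note, if you are curious what those first two WITH lines actually produce, here is the same construction with fake keys substituted in for illustration:

//what the suffix evaluates to, using fake keys for illustration
WITH '20190101120000' AS ts
RETURN '&ts=' + ts + '&apikey=fake_public_key&hash=' + apoc.util.md5([ts, 'fake_private_key', 'fake_public_key']) AS suffix
//returns: &ts=20190101120000&apikey=fake_public_key&hash=<32-character md5 string>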

* Hint: checking the resourceURI first is also a good way to trim the number of API calls. If we know a call will fail, then we should not waste precious resources on it. :)

The next criterion checks whether a relationship of type INCLUDES already exists for the node. This tells us whether we have already retrieved and inserted comics for a character. If a relationship exists, then we do not pull the comic info for that character again. This avoids duplicate calls for entities where we have already added that information.

Finally, we add a LIMIT 100 to that query to pull only 100 characters at a time from our Neo4j database. We ran across issues where queries would time out because the server would stop responding. Marvel’s server instances probably have a timeout value to ensure users do not hog resources. Or, it could simply be that they need to bolster the architecture a bit to support heavier requests. ;)

Either way, we wanted to reduce the time taken to pull in batches of data, so my colleague suggested a LIMIT clause to create smaller units of work for each call. While this would increase the number of calls made to the database, it was better than larger batches failing frequently.

* Note: at this point, we had made 26 calls to load the characters (one per letter of the alphabet), which gave us around 1,000 characters in our Neo4j instance. Pulling 100 characters at a time means a maximum of 11 runs of this query, each making up to 100 API calls (one per character in the batch).
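If you want to check how many characters are still waiting on their comics between runs, the same filter works in a quick count query:

//how many characters still need their comics retrieved?
MATCH (c:Character)
WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-())
RETURN count(c) AS remaining;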

The second statement within the apoc.periodic.iterate adds 2 seconds of sleep between calls to the API for each character. This helps us avoid Marvel’s server timing out in the middle of one of our calls. Once we have waited, we use apoc.load.json to hit the API endpoint for the comics pertaining to that character, using the resourceURI field on our Character nodes. Again, we yield back the JSON object (YIELD value).

//Second section
WITH c, value.data.results AS results
WHERE results IS NOT NULL
UNWIND results AS result
MERGE (comic:ComicIssue {id: result.id})
ON CREATE SET comic.name = result.title,
comic.issueNumber = result.issueNumber,
comic.pageCount = result.pageCount,
comic.resourceURI = result.resourceURI,
comic.thumbnail = result.thumbnail.path + "." + result.thumbnail.extension
WITH c, comic, result
MERGE (comic)-[r:INCLUDES]->(c)

To start the next code paragraph, we check that the subsection containing our comics is not null and unwind it so we can work with each comic result. Just as we did with the loading of characters, we use MERGE on the comic id to find or create the ComicIssue node and the ON CREATE SET clause to set property values if the node is created. The one thing that is different is the WITH statement and the MERGE line after it. Because we need to create a relationship between the newly-created node and a Character node, we need to pass the ComicIssue node to the next MERGE statement. We do this using WITH.

* Note: The creation of the node and the relationship between a comic issue and a character is not done in a single MERGE because Cypher would merge on the entire pattern. If both nodes (with all properties) AND the relationship do not exist, then Cypher creates the entire pattern new, often causing duplicates. To avoid this, you need to match any existing entities, then create a new relationship or node using the existing information.
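A simplified sketch (with hypothetical ids) shows the difference:

//risky: if any part of the pattern is missing, Cypher creates the WHOLE pattern, possibly duplicating the comic
MERGE (comic:ComicIssue {id: 123})-[:INCLUDES]->(c:Character {id: 456})

//safe: find-or-create each node on its own, then merge only the relationship
MERGE (comic:ComicIssue {id: 123})
MERGE (c:Character {id: 456})
MERGE (comic)-[:INCLUDES]->(c)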

//Third section on Series, Creators, Stories, Events
WITH c, comic, result WHERE result.series IS NOT NULL
UNWIND result.series AS comicSeries
MERGE (series:Series {id: toInt(split(comicSeries.resourceURI,"/")[-1])})
ON CREATE SET series.name = comicSeries.name,
series.resourceURI = comicSeries.resourceURI
WITH c, comic, series, result
MERGE (comic)-[r2:BELONGS_TO]->(series)
WITH c, comic, result, result.creators.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (creator:Creator {id: toInt(split(item.resourceURI,"/")[-1])})
ON CREATE SET creator.name = item.name,
creator.resourceURI = item.resourceURI
WITH c, comic, result, creator
MERGE (comic)-[r3:CREATED_BY]->(creator)
WITH c, comic, result, result.stories.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (story:Story {id: toInt(split(item.resourceURI,"/")[-1])})
ON CREATE SET story.name = item.name,
story.resourceURI = item.resourceURI,
story.type = item.type
WITH c, comic, result, story
MERGE (comic)-[r4:MADE_OF]->(story)
WITH c, comic, result, result.events.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (event:Event {id: toInt(split(item.resourceURI,"/")[-1])})
ON CREATE SET event.name = item.name,
event.resourceURI = item.resourceURI
MERGE (comic)-[r5:PART_OF]->(event)',

The next paragraphs of code follow a similar pattern for each of the Series, Creator, Story, and Event node labels. One thing that is different is setting the id field for each of those nodes. The toInt(split(item.resourceURI,"/")[-1]) code trims the id out of the URL path that Marvel provides for each entity with split(item.resourceURI,"/")[-1] and then converts that value to an integer for the id field in Neo4j using the toInt() function.
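For example, with a made-up resourceURI (the URL shape matches Marvel’s, but the id is illustrative):

//pulling the id off the end of a resourceURI
WITH 'http://gateway.marvel.com/v1/public/series/1234' AS uri
RETURN toInt(split(uri, '/')[-1]) AS id
//split produces a list ending in "1234"; [-1] grabs it, and toInt() converts it to the integer 1234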

The rest of the syntax should be familiar from our load of characters earlier. Now we have some of each entity in the database, but I want to add a few more details to a couple of node labels.

Filling in More Details

If you take another look at our last Cypher block, you can see that the Series, Creator, Story, and Event node types only have basic information. Each has an id field, as well as the resourceURI and name fields. This doesn’t give us as much info as Marvel offers, and we could run some interesting queries on some of the other fields provided.

For our project, the only additional fields I wanted were on the Series and Event nodes for some images and start and end dates. Let’s go ahead and get those imported to Neo4j!

These statements are much simpler than the previous two, and we can use the same syntax and logic from our previous queries. The two queries to load these details are shown below, along with a brief explanation.

//load any extra Series data
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate(
'MATCH (s:Series) WHERE s.resourceURI IS NOT NULL
AND NOT exists(s.startYear) RETURN s LIMIT 100',
'CALL apoc.util.sleep(2000)
CALL apoc.load.json(s.resourceURI + "?limit=100" + $suffix)
YIELD value
WITH value.data.results AS results
WHERE results IS NOT NULL
UNWIND results AS result
MERGE (series:Series {id: result.id})
SET series.startYear = result.startYear,
series.endYear = result.endYear,
series.rating = result.rating,
series.thumbnail = result.thumbnail.path + "." + result.thumbnail.extension',
{batchSize: 20, iterateList: false, params: {suffix:suffix}});

//load any extra Event data
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate(
'MATCH (event:Event) WHERE event.resourceURI IS NOT NULL
AND NOT exists(event.start) RETURN DISTINCT event LIMIT 100',
'CALL apoc.util.sleep(2000)
CALL apoc.load.json(event.resourceURI + "?limit=100" + $suffix)
YIELD value
UNWIND value.data.results AS result
MERGE (e:Event {id: result.id})
SET e.start = result.start,
e.end = result.end',
{batchSize: 20, iterateList:false, params: {suffix:suffix}});

Both of these queries use the WITH clause to set up the timestamp and suffix parameters, just as we did with our other queries. The next code block in each query unwinds the JSON object, finds the nested data object, and uses MERGE to find or create the nodes. This time, we simply use the SET clause to update or add any values that may not have existed before. Finally, the last line of each query sets the configuration parameters.
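To make that distinction concrete, here is a minimal sketch with a hypothetical id: ON CREATE SET only fires when the MERGE creates a brand-new node, while plain SET writes the properties whether the node was found or created.

//ON CREATE SET: properties are set only if this MERGE creates the node
MERGE (s:Series {id: 1234})
ON CREATE SET s.name = 'Placeholder Name';

//SET: properties are written whether the node already existed or not
MERGE (s:Series {id: 1234})
SET s.startYear = 2015;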

What I Learned

We now have plenty of data in our Neo4j graph database, and we can start to run some queries to check out the information that is there!
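For a quick sanity check, counting the nodes per label confirms what made it in:

//count nodes by label to verify the import
MATCH (n)
RETURN labels(n)[0] AS label, count(*) AS total
ORDER BY total DESC;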

After much trial, error, and troubleshooting, the queries given and explained above and in the Part 2 post are what I used to get the data from a finicky API hosted by Marvel into my local instance of Neo4j as a graph data model. Below is a list of my key takeaways from the total data import process up to this point.

  1. It took a lot of messed up Cypher queries and clearing out the database multiple times to start the import over from a clean slate.
  2. It took several questions to expert teammates to tweak and optimize what I wanted to get and determine that some issues were related to the API on Marvel’s end and not poorly constructed Cypher on my end.
  3. I learned a LOT from working with a real data set. I wasn’t playing with a cherry-picked data set that was clean, with roadblocks removed for easy learning. I had to come up with ways around new, inventive issues based on the data set I was dealing with.
  4. I was surprised how long the import step took. It wasn’t the actual import execution that was problematic. It was determining how I wanted the data to look and how to craft the Cypher to meet that expectation.
  5. Until you deal with a data set and technology hands-on, it is hard to break past a certain point of learning. I could read and read and read, but I needed to actually work through these problems and see what my queries did with the data to comprehend how the procedures worked and what the data looked like in Neo4j.

Next Steps

While the initial stages of this project may not have been overly glamorous, they are necessary to every project you encounter, whether personal or occupational. In all honesty, I enjoyed the learning process in these steps.

The next posts will start looking at crafting the application and how to use Spring Data Neo4j to produce powerful, capable, and pretty code and webpages! Stay tuned!


Jennifer Reif is an avid developer and problem-solver. She enjoys learning new technologies, sometimes on a daily basis! Her Twitter handle is @JMHReif.