Create a Data Marvel — Part 3: Hydrating the Model
*Update*: All parts of this series are published and related content available.
Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8, Part 9, Part 10
Completed Github project (+related content)
Over the last couple of weeks, we have shown the early steps of this project — the background, building the data model, and the start of the data import from an API to Neo4j. If you want to catch up on that information, you can read the posts for Part 1 and Part 2.
Today, we will continue the process of getting as much data as possible from the Marvel Developer Portal API into Neo4j. We saw the initial import Cypher statement last week, but the trickiest piece is the next import statement and the issues we uncovered in running it.
To recap our limitations and foundations so far, I’ve included an image of our Neo4j graph data model for the Marvel data (yours could differ), as well as some highlights.
- The Marvel API enforces limits such as 3,000 calls per day and 100 results per call, so our goal was to get as much data as possible within those limits.
- Retrieving all comics (43,000+) was too much, so we decided to import characters first.
- To do that, we used the APOC library to pull characters by the first letter of their names.
- Next, we need to retrieve the rest of the data (comics, series, events, stories, creators) to finish populating our graph!
More Importing — “Hydrating” the Model
One of the best things about APOC is the possibilities it opens for different aspects of data import. Remember that `apoc.periodic.iterate` procedure we used in our last query to loop through each letter of the alphabet and select the characters that start with that letter? Well, we are going to use that procedure again, but in a slightly different way.
This time, we will use the first statement within that procedure to select the characters we added to the database in our previous query, and the second statement will call the API to retrieve the comics for each of those characters. The next query is below with a walkthrough of the syntax. It might look complicated, but don't worry. It simply builds upon syntax we have already discussed.
```cypher
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate(
 'MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100',
 'CALL apoc.util.sleep(2000)
  CALL apoc.load.json(c.resourceURI+"/comics?format=comic&formatType=comic&limit=100"+$suffix)
  YIELD value
  WITH c, value.data.results AS results
  WHERE results IS NOT NULL
  UNWIND results AS result
  MERGE (comic:ComicIssue {id: result.id})
   ON CREATE SET comic.name = result.title,
    comic.issueNumber = result.issueNumber,
    comic.pageCount = result.pageCount,
    comic.resourceURI = result.resourceURI,
    comic.thumbnail = result.thumbnail.path + "." + result.thumbnail.extension
  WITH c, comic, result
  MERGE (comic)-[r:INCLUDES]->(c)
  WITH c, comic, result WHERE result.series IS NOT NULL
  UNWIND result.series AS comicSeries
  MERGE (series:Series {id: toInt(split(comicSeries.resourceURI,"/")[-1])})
   ON CREATE SET series.name = comicSeries.name,
    series.resourceURI = comicSeries.resourceURI
  WITH c, comic, series, result
  MERGE (comic)-[r2:BELONGS_TO]->(series)
  WITH c, comic, result, result.creators.items AS items
  WHERE items IS NOT NULL
  UNWIND items AS item
  MERGE (creator:Creator {id: toInt(split(item.resourceURI,"/")[-1])})
   ON CREATE SET creator.name = item.name,
    creator.resourceURI = item.resourceURI
  WITH c, comic, result, creator
  MERGE (comic)-[r3:CREATED_BY]->(creator)
  WITH c, comic, result, result.stories.items AS items
  WHERE items IS NOT NULL
  UNWIND items AS item
  MERGE (story:Story {id: toInt(split(item.resourceURI,"/")[-1])})
   ON CREATE SET story.name = item.name,
    story.resourceURI = item.resourceURI,
    story.type = item.type
  WITH c, comic, result, story
  MERGE (comic)-[r4:MADE_OF]->(story)
  WITH c, comic, result, result.events.items AS items
  WHERE items IS NOT NULL
  UNWIND items AS item
  MERGE (event:Event {id: toInt(split(item.resourceURI,"/")[-1])})
   ON CREATE SET event.name = item.name,
    event.resourceURI = item.resourceURI
  MERGE (comic)-[r5:PART_OF]->(event)',
 {batchSize: 20, iterateList: false, retries: 2, params: {suffix: suffix}});
```
To help process this lengthy query, we will break it up into sections and explain each block. Let us start with the first two sections together.
```cypher
//First section
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate(
 'MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100',
 'CALL apoc.util.sleep(2000)
  CALL apoc.load.json(c.resourceURI+"/comics?format=comic&formatType=comic&limit=100"+$suffix)
  YIELD value
```
Just as with our initial load query, we start the query using the `WITH` clause to set up and pass the timestamp and URL suffix parameters that we will use further down. The next lines of code call the familiar `apoc.periodic.iterate` to pull all the characters in Neo4j (the `MATCH` statement). Notice the criteria starting from the `WHERE` clause. We check the `Character` nodes to see if the `resourceURI` field contains a value. Marvel puts the URL path for most entities in the `resourceURI` field, so this is a simple check to see if the `Character` has a URL path for us to retrieve data. If it doesn't, our call will fail, and it won't find data.
* Hint: this is also a good way to trim the number of API calls. If we know a call will fail, then we should not waste precious resources on it. :)
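To make the URL-suffix setup concrete, here is a small Python sketch of how the Marvel authentication suffix is assembled. Marvel requires `hash = md5(ts + privateKey + publicKey)`, which is what the `apoc.util.md5([ts, $marvel_private, $marvel_public])` call in the Cypher produces; the function name and key values below are hypothetical.

```python
import hashlib

def marvel_auth_suffix(ts: str, public_key: str, private_key: str) -> str:
    # Marvel expects hash = md5(ts + privateKey + publicKey); this mirrors
    # apoc.util.md5([ts, $marvel_private, $marvel_public]) in the Cypher query.
    digest = hashlib.md5((ts + private_key + public_key).encode()).hexdigest()
    return "&ts=" + ts + "&apikey=" + public_key + "&hash=" + digest
```

Appending this suffix to every endpoint URL is what lets each `apoc.load.json` call authenticate against the API.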
The next criterion checks whether an `INCLUDES` relationship already exists for the node, which tells us if we have already retrieved and inserted comics for a character. If a relationship exists, then we do not pull the comic info for that character again. This avoids duplicate calls for entities where we have already added that information.
Finally, we add a `LIMIT 100` to that query to pull only 100 characters at a time from our Neo4j database. We ran across issues where queries would time out because the server would stop responding. Marvel's server instances probably have a timeout value to ensure users do not hog resources. Or, it could simply be that they need to bolster the architecture a bit to support heavier requests. ;)
Either way, we wanted to reduce the time taken to pull in batches of data, so my colleague suggested a `LIMIT` clause to create smaller units of work for each call. While this would increase the number of calls made to the database, it was better than larger batches failing frequently.
* Note: at this point, we had 26 calls to load all of the characters for each alphabet letter. That gave us around 1,000 characters in our Neo4j instance. If we pull 100 at a time, we could have a maximum of 11 batches of up to 100 calls (one for each character in a batch).
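The batch arithmetic in the note above can be sketched in a few lines of Python. This is just the back-of-the-envelope math, with the post's assumptions baked in: each `apoc.periodic.iterate` run pulls up to 100 characters, and each character costs one API call for its comics.

```python
import math

def outer_batches(num_characters: int, per_run_limit: int = 100) -> int:
    # Each outer run matches up to `per_run_limit` characters, and every
    # character in a run triggers one comics call against the API.
    return math.ceil(num_characters / per_run_limit)
```

With "around 1,000" characters in the database, this lands at 10 or 11 runs depending on the exact count, matching the estimate in the note.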
The second statement within the `apoc.periodic.iterate` adds 2 seconds of sleep between calls to the API for each character. This helps avoid the Marvel server timing out in the middle of one of our calls. Once we wait, we use `apoc.load.json` to hit the API endpoint for the comics pertaining to that character, using the `resourceURI` field on our `Character` nodes. Again, we yield back the JSON object (`YIELD value`).
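The sleep-then-fetch pattern can be sketched outside of Cypher as well. This is a hedged Python analogy, not part of the actual import: the `fetch` and `sleep` callables are injected so the pacing logic stands alone, mirroring the `apoc.util.sleep(2000)` placed ahead of each `apoc.load.json` call.

```python
import time
from typing import Callable, Iterable, List

def fetch_with_pause(uris: Iterable[str], fetch: Callable[[str], dict],
                     pause_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    # Pause before every call so we don't hammer the rate-limited server,
    # just like apoc.util.sleep(2000) before apoc.load.json in the query.
    results = []
    for uri in uris:
        sleep(pause_s)
        results.append(fetch(uri))
    return results
```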
```cypher
//Second section
WITH c, value.data.results AS results
WHERE results IS NOT NULL
UNWIND results AS result
MERGE (comic:ComicIssue {id: result.id})
 ON CREATE SET comic.name = result.title,
  comic.issueNumber = result.issueNumber,
  comic.pageCount = result.pageCount,
  comic.resourceURI = result.resourceURI,
  comic.thumbnail = result.thumbnail.path + "." + result.thumbnail.extension
WITH c, comic, result
MERGE (comic)-[r:INCLUDES]->(c)
```
To start the next code block, we check that the subsection containing our comics is not `null`, and unwind the object to get each result. Just as we did with the loading of characters, we use `MERGE` on the comic id to find or create the `ComicIssue` node and the `ON CREATE SET` clause to set property values if the node is created. The one thing that is different is the `WITH` statement and the `MERGE` line after it. Because we need to use the newly-created node and create a relationship between it and a `Character` node, we need to pass the `ComicIssue` node to the next `MERGE` statement. We do this using `WITH`.
* Note: the creation of the node and the relationship between a comic issue and a character is not done in a single `MERGE` because Cypher would merge on the entire pattern. If both nodes (with all properties) AND the relationship do not already exist, then Cypher creates the entire pattern from scratch, often causing duplicates. To avoid this, you match any existing entities first, then create the new relationship or node using the existing information.
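A rough in-memory analogy of why the node and relationship are merged separately: find-or-create each node by id first, then find-or-create the relationship between the already-matched nodes. The dict/set "graph" and the names below are illustrative only, not how Neo4j stores anything.

```python
def merge_node(nodes: dict, node_id: int, props: dict) -> dict:
    # MERGE on {id} alone: reuse the node if it exists, otherwise create it.
    # Properties are applied only on creation, like ON CREATE SET.
    if node_id not in nodes:
        nodes[node_id] = {"id": node_id, **props}
    return nodes[node_id]

def merge_rel(rels: set, start_id: int, rel_type: str, end_id: int) -> None:
    # MERGE the relationship separately, between the already-matched nodes;
    # a set makes the second identical merge a no-op instead of a duplicate.
    rels.add((start_id, rel_type, end_id))
```

Merging the whole pattern at once would be like checking `(node, rel, node)` as a single unit: any mismatch in the existing data creates a fresh copy of everything.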
```cypher
//Third section on Series, Creators, Stories, Events
WITH c, comic, result WHERE result.series IS NOT NULL
UNWIND result.series AS comicSeries
MERGE (series:Series {id: toInt(split(comicSeries.resourceURI,"/")[-1])})
 ON CREATE SET series.name = comicSeries.name,
  series.resourceURI = comicSeries.resourceURI
WITH c, comic, series, result
MERGE (comic)-[r2:BELONGS_TO]->(series)
WITH c, comic, result, result.creators.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (creator:Creator {id: toInt(split(item.resourceURI,"/")[-1])})
 ON CREATE SET creator.name = item.name,
  creator.resourceURI = item.resourceURI
WITH c, comic, result, creator
MERGE (comic)-[r3:CREATED_BY]->(creator)
WITH c, comic, result, result.stories.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (story:Story {id: toInt(split(item.resourceURI,"/")[-1])})
 ON CREATE SET story.name = item.name,
  story.resourceURI = item.resourceURI,
  story.type = item.type
WITH c, comic, result, story
MERGE (comic)-[r4:MADE_OF]->(story)
WITH c, comic, result, result.events.items AS items
WHERE items IS NOT NULL
UNWIND items AS item
MERGE (event:Event {id: toInt(split(item.resourceURI,"/")[-1])})
 ON CREATE SET event.name = item.name,
  event.resourceURI = item.resourceURI
MERGE (comic)-[r5:PART_OF]->(event)',
```
The next blocks of code follow a similar pattern for each of the `Series`, `Creator`, `Story`, and `Event` node labels. One thing that is different is setting the `id` field for each of those nodes. The `toInt(split(item.resourceURI,"/")[-1])` code trims the id out of the URL path that Marvel gives for each entity with `split(item.resourceURI,"/")[-1]` and then converts that value to an integer for the id field in Neo4j using the `toInt()` function.
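The same id extraction is a one-liner in Python, shown here just to make the Cypher expression easier to read. The example `resourceURI` value is illustrative.

```python
def id_from_resource_uri(resource_uri: str) -> int:
    # Equivalent of toInt(split(item.resourceURI, "/")[-1]) in Cypher:
    # take the last path segment and convert it to an integer.
    return int(resource_uri.split("/")[-1])
```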
The rest of the syntax should be familiar from our load of characters earlier. Now we have some of each entity in the database, but I want to add a few more details to a couple of node labels.
Filling in More Details
If you take another look at our last Cypher block, you can see that the `Series`, `Creator`, `Story`, and `Event` node types only have basic information. Each has an `id` field, as well as the `resourceURI` and `name` fields. This doesn't give us as much info as Marvel offers, and we could run some interesting queries on some of the other fields provided.
For our project, the only additional fields I wanted were on the `Series` and `Event` nodes for some images and start and end dates. Let's go ahead and get those imported to Neo4j!
These statements are much simpler than the previous two, and we can use the same syntax and logic from our previous queries. The two queries to load these details are shown below, along with a brief explanation.
```cypher
//load any extra Series data
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate(
 'MATCH (s:Series) WHERE s.resourceURI IS NOT NULL
  AND NOT exists(s.startYear) RETURN s LIMIT 100',
 'CALL apoc.util.sleep(2000)
  CALL apoc.load.json(s.resourceURI+"?limit=100"+$suffix)
  YIELD value
  WITH value.data.results AS results
  WHERE results IS NOT NULL
  UNWIND results AS result
  MERGE (series:Series {id: result.id})
  SET series.startYear = result.startYear,
   series.endYear = result.endYear,
   series.rating = result.rating,
   series.thumbnail = result.thumbnail.path + "." + result.thumbnail.extension',
 {batchSize: 20, iterateList: false, params: {suffix: suffix}});
```

```cypher
//load any extra Event data
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate(
 'MATCH (event:Event) WHERE event.resourceURI IS NOT NULL
  AND NOT exists(event.start) RETURN DISTINCT event LIMIT 100',
 'CALL apoc.util.sleep(2000)
  CALL apoc.load.json(event.resourceURI+"?limit=100"+$suffix)
  YIELD value
  UNWIND value.data.results AS result
  MERGE (e:Event {id: result.id})
  SET e.start = result.start,
   e.end = result.end',
 {batchSize: 20, iterateList: false, params: {suffix: suffix}});
```
Both of these queries use the `WITH` clause to set up the timestamp and suffix parameters, just as we did with our other queries. The next code block in each query unwinds the JSON object, finds the nested data object, and uses `MERGE` to find or create the nodes. This time, we just use the `SET` clause to update or add any values that may not have existed before. Finally, the last line sets the configuration parameters.
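The difference between plain `SET` here and the `ON CREATE SET` used earlier can be shown with the same dict-based analogy as before (again illustrative only): `SET` updates properties whether or not the node already existed, while `ON CREATE SET` only fires for brand-new nodes.

```python
def merge_and_set(nodes: dict, node_id: int, props: dict) -> dict:
    # MERGE on {id} then SET: properties are updated (or added) regardless
    # of whether the node already existed, unlike ON CREATE SET, which
    # only applies properties to newly created nodes.
    node = nodes.setdefault(node_id, {"id": node_id})
    node.update(props)
    return node
```

That is exactly why these hydration queries use `SET`: the `Series` and `Event` nodes already exist, and we want to layer new properties onto them.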
What I Learned
We now have plenty of data in our Neo4j graph database, and we can start to run some queries to check out the information that is there!
After much trial, error, and troubleshooting, the queries given and explained above and in the Part 2 post are what I used to get the data from a finicky API hosted by Marvel into my local instance of Neo4j as a graph data model. Below is a list of my key takeaways from the total data import process up to this point.
- It took a lot of messed up Cypher queries and clearing out the database multiple times to start the import over from a clean slate.
- It took several questions to expert teammates to tweak and optimize what I wanted to get and determine that some issues were related to the API on Marvel’s end and not poorly constructed Cypher on my end.
- I learned a LOT from working with a real data set. I wasn't playing with a cherry-picked data set that was clean, with roadblocks removed for easy learning. I had to come up with ways around new, inventive issues based on the data set I was dealing with.
- I was surprised how long the import step took. It wasn’t the actual import execution that was problematic. It was determining how I wanted the data to look and how to craft the Cypher to meet that expectation.
- Until you deal with a data set and technology hands-on, it is hard to break past a certain point of learning. I could read and read and read, but I needed to actually work through these problems and see what my queries did with the data to comprehend how the procedures worked and what the data looked like in Neo4j.
Next Steps
While the initial stages of this project may not have been overly glamorous, they are necessary to every project you encounter, whether personal or occupational. In all honesty, I enjoyed the learning process in these steps.
The next posts will start looking at crafting the application and how to use Spring Data Neo4j to produce powerful, capable code and pretty webpages! Stay tuned!
Resources
- Part 1 of this blog series
- Part 2 of this blog series
- Follow the duo on Twitter to see what’s coming: @mkheck and @jmhreif
- Neo4j Data Import
- APOC standard library
- Download Neo4j
- Spring Data Neo4j docs
- Spring Data Neo4j Guide