Photo by Aditya Vyas on Unsplash

Exploring the U.S. National Bridge Inventory with Neo4j — Part 2: Importing and Storing Raw Data in Neo4j

Michael McKenzie
Sep 6

Moving forward, this series assumes a basic understanding of graph databases and Neo4j. In a nutshell, Neo4j is a labeled property graph database that consists of Nodes, Labels, Directional Relationships, and Properties. You can learn more about graph databases and Neo4j here.
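
To make those four concepts concrete, here is a minimal, illustrative Cypher snippet (my own example, foreshadowing the schema we build below): two labeled nodes, a directional relationship, and a property.

// Two labeled nodes (File, Row), one directional relationship (CONTAINS), one property (name)
CREATE (file:File {name: "AZ19.txt"})-[:CONTAINS]->(row:Row)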

Read the introduction (Part 1).

NBI Files

The National Bridge Inventory (NBI) is a database of all bridges in the United States, including the District of Columbia and Puerto Rico. Two versions of each file are available: delimited and non-delimited. Our exploration will start with the comma-delimited (CSV) files, as they make importing much easier. Additionally, rather than download and store every file locally, I created a Google Sheet to manage all of the file URLs.

Image: the Google Sheet of NBI file URLs

The decision to import and store the raw data in the graph itself has several pros and cons:

Pros:

  • Simplicity: Storing and managing the file URLs in a single document means managing one file, rather than all of the individual files for each year. Additionally, Neo4j’s LOAD CSV functionality allows us to import and create the nodes directly from that sheet. If a file is updated, we need only re-import it directly from its URL.
  • Speed: Importing and storing the raw data in the graph means only having to worry about importing CSV data once. Granted, the initial import can be a bit slow depending on the throughput of your connection and the power of the machine running the import. (For reference, I am running Neo4j Desktop on a 2019 MacBook Pro with 16GB RAM and 500GB storage. I have configured 4GB heap and 4GB page cache for this version 4.1.1 database instance; the corresponding neo4j.conf settings are shown after these lists.) It is faster to work with and manipulate the data within the graph than to re-import raw data each time it needs to be read and added to the graph structure. Additionally, if at any point a large-scale restructuring of the schema is needed, all of the data are readily available.
  • Traceability: As we will see later in this post, we can connect the files and rows to more easily track how data have changed with each annual report.

Cons:

  • Storage: By far the biggest, and only real, drawback I have encountered thus far is that the raw data takes up A LOT of space in the graph. As of this post, I have imported the raw data from all delimited files for every state from 1992 to 2019.
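
For reference, the heap and page cache configuration mentioned under Speed corresponds to the following neo4j.conf entries (the keys are standard for Neo4j 4.x; tune the values to your own machine):

# Memory configuration used for this import (neo4j.conf)
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=4G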

So now we decide on a schema and import!

Importing Raw Data

Deciding on the graph structure for the raw data is pretty straightforward. Following Neo4j modeling principles, we simply take the statement “Each file contains rows of data, where each row represents a single bridge.” Converting that to Cypher we get (:File)-[:CONTAINS]->(:Row). We then set additional properties on the File and Row nodes that make them unique.

We provide additional context for each node by assigning properties. For the File nodes we’ll assign the properties name (the name of the file), url (the URL where the file is stored on the NBI website), folder (this property, in addition to the DelimitedFile label, designates the file node as containing delimited data), year (the year of the data reported), and createdOn (to signify when the file node was added to the graph). Similarly, since we are using CSV files that contain headers, we utilize the header mapping functionality of apoc.load.csv to assign the Row properties from the files.

In addition to the properties mapped from the CSV file, we add two more to each Row: file (which we set equal to the name property on the File node, and which is how we will connect each Row to its File), and createdOn (to signify when the row was added to the graph).

Given the large number of rows to import, we break the import, node creation, and relationship connections into three steps. This also helps us avoid eager query plans:

Step 1: File Node Importing

// Create nodes for each file URL (Delimited Files)
// Data loaded from https://www.fhwa.dot.gov/bridge/nbi/ascii.cfm via the file URLs stored in the Google Sheet
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row
// Skip sheet rows that do not describe a file
WITH row
WHERE NOT row.Year IS NULL
WITH row, row.URL AS fileUrl
// MERGE on url so re-running the import does not create duplicates
MERGE (file:File:DelimitedFile {url: fileUrl})
ON CREATE SET file.folder = row.Folder,
file.name = row.File,
file.year = toInteger(row.Year),
file.createdOn = datetime(),
// "temporary" label consumed by Step 2
file:ImportFileRows
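
One optional addition that is not part of the original three steps: creating a uniqueness constraint on url before importing backs the MERGE above with an index, which keeps re-imports fast. A sketch using Neo4j 4.1 syntax (file_url is an arbitrary constraint name):

// Optional: make MERGE on url index-backed with a uniqueness constraint
CREATE CONSTRAINT file_url ON (file:File) ASSERT file.url IS UNIQUE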

Step 2: Row Node Importing

For this step, we are using APOC’s apoc.periodic.iterate and apoc.load.csv procedures:

// Iterate through files and import rows (Delimited)
MATCH (file:ImportFileRows)
// Consume the "temporary" label so each file is only processed once
REMOVE file:ImportFileRows
WITH collect([file.url,file.name]) AS files
UNWIND files AS file
CALL apoc.periodic.iterate(
'
// quoteChar \u0000 effectively disables quote handling for the raw NBI data
CALL apoc.load.csv($url,{header:true, quoteChar:"\u0000"}) YIELD map AS row
RETURN row
','
CREATE (fileRow:Row:DelimitedRow)
SET fileRow += row,
fileRow.file = $name,
fileRow.createdOn = datetime(),
// "temporary" label consumed by Step 3
fileRow:ConnectRowToFile
',
{batchSize:10000, parallel:false, params:{url: file[0], name: file[1]}}) YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
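
As a quick sanity check of my own (not one of the three steps), the imported row counts per file can be compared against the source CSVs:

// Sanity check: imported row counts per file
MATCH (fileRow:DelimitedRow)
RETURN fileRow.file AS file, count(*) AS rows
ORDER BY file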

Step 3: Connecting File Nodes to Row Nodes

// Connect rows to files (Delimited)
CALL apoc.periodic.iterate(
'
MATCH (row:ConnectRowToFile)
RETURN row
','
// Consume the "temporary" label so each row is only processed once
REMOVE row:ConnectRowToFile
WITH row
// row.file holds the name of the file the row came from
MATCH (file:DelimitedFile)
WHERE file.name = row.file
MERGE (file)-[:CONTAINS]->(row)
',
{batchSize:10000, parallel:false}) YIELD batches, total
RETURN batches, total
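
To confirm the step completed (again, a check of my own), count any rows left without an incoming CONTAINS relationship; the result should be zero:

// Verify every row was connected to its file
MATCH (row:Row)
WHERE NOT (:File)-[:CONTAINS]->(row)
RETURN count(row) AS unconnectedRows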

Notes:

  • There are additional ETL tools available that would make importing all of this data a bit quicker when starting with a fresh database. I chose instead to focus on writing Cypher intended to process new files and data as they are released, so it can easily be reused as needed.
  • The base node labels that will be referenced are File and Row. However, in the three code snippets each File and Row node is also assigned an additional label: DelimitedFile and DelimitedRow, respectively. These come into play in a later post. For now we’ll use DelimitedFile in lieu of File, and DelimitedRow in lieu of Row.
  • Two other “temporary” labels can also be spotted in the code snippets above: ImportFileRows and ConnectRowToFile. These labels are used to help process the data in steps (briefly discussed in the next section).

Breaking the importation into multiple steps stems from some lessons I have learned in my career using Neo4j with large datasets that are added to on a regular basis:

  1. Breaking Cypher into smaller chunks allows for more focused tuning where needed. Big, long Cypher statements are quite beautiful to look at, but can be challenging to improve over time. By dividing the processing of data into smaller Cypher statements, it is possible to more easily manage importation and manipulation of data.
  2. It is possible to create a pattern of importing and manipulation that is easy to replicate. In our exploration of the bridge data, new data are added annually. Similarly, I have worked with new data that must be added on a more frequent basis. In situations where data are constantly being added to the graph, using a series of steps to process data allows for more fine-tuned control of data processing.
  3. Using “temporary” labels to process the data makes it easier to track what data are being processed (see the sketch after this list). This can be valuable in tracking the performance of queries as you build.
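
Here is the sketch mentioned in point 3, using the ConnectRowToFile label from Steps 2 and 3. Because each step consumes its “temporary” label, counting the nodes that still carry it shows exactly how much work remains:

// Progress check: rows created in Step 2 but not yet processed by Step 3
MATCH (row:ConnectRowToFile)
RETURN count(row) AS remainingRows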

Now that we have run these three queries we have a graph structure that looks like:

Image: the (:DelimitedFile)-[:CONTAINS]->(:DelimitedRow) schema

The graph currently contains the raw data from all delimited files from 1992 to 2019. We have 1,456 File nodes, plus 18,579,580 each of Row nodes and CONTAINS relationships. That is a total of 18,581,036 nodes and 18,579,580 relationships, equating to approximately 122 GB of data!!!
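
If you want to check these totals in your own database, APOC offers a convenient procedure (my suggestion, not necessarily how the figures above were produced):

// Graph-wide node and relationship totals
CALL apoc.meta.stats() YIELD nodeCount, relCount
RETURN nodeCount, relCount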

The next two steps we’ll take in adding to our graph model come down to preference. Due to the flexibility of Neo4j and the usability of Cypher, we can already query and return File and Row nodes according to their properties, sorting and filtering as needed (see the sketch below). However, for this exploration, explicitly adding relationships between each related file and each associated row provides a good visual of how bridge conditions progress over time. Additionally, given the number of Row nodes connected to each File node, it becomes easier to find all connected Row nodes. The only drawback to consider is that these relationships take up more storage. For me, that is a price I am willing to pay.
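
For a concrete sense of that property-based alternative, the following sketch orders one bridge’s rows by file year without any explicit row-to-row relationships (STRUCTURE_NUMBER_008 is introduced below; $structureNumber is a placeholder parameter):

// Property-based alternative: one bridge's rows in year order
MATCH (file:DelimitedFile)-[:CONTAINS]->(row:DelimitedRow)
WHERE row.STRUCTURE_NUMBER_008 = $structureNumber
RETURN row
ORDER BY file.year ASC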

ONWARD!

Connecting Successive Files

To model the files in successive order, we want to establish what the next file after any given file is, via a NEXT_FILE relationship. We need to utilize two of the five properties on the File node: name and year. It is important to only connect files that represent data from the same state. For that we’ll use the first two letters of the name property, since the file names all follow the same naming convention. For example, “AZ18.txt” and “AZ19.txt” are both files for Arizona, whose state abbreviation is AZ, so we can find all of a state’s files by the first two letters of name. Using name and year, we connect the File nodes with the following Cypher:

// Create (:File)-[:NEXT_FILE]->(:File) for each state in ascending order
MATCH (file:File)
// One row per state, taken from the first two letters of the file name
WITH DISTINCT left(file.name,2) AS stateAbbr
ORDER BY stateAbbr
MATCH (df:DelimitedFile)
WHERE left(df.name,2) = stateAbbr
WITH stateAbbr, df
ORDER BY df.year ASC
// Collect each state's files in year order, then link consecutive pairs
WITH stateAbbr, collect(df) AS fileArray
UNWIND range(0,size(fileArray)-2,1) AS index
WITH fileArray[index] AS start, fileArray[index+1] AS end
MERGE (start)-[:NEXT_FILE]->(end)
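
A quick spot-check of one state’s chain (my own query, using Arizona as the example):

// Spot-check: Arizona's NEXT_FILE chain in year order
MATCH (a:DelimitedFile)-[:NEXT_FILE]->(b:DelimitedFile)
WHERE left(a.name,2) = "AZ"
RETURN a.name, b.name
ORDER BY a.year ASC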

Now we have a chain connecting the File nodes for each state in ascending order by year:

Image: (:File)-[:NEXT_FILE]->(:File) chains by state

Encoding Document

Before we connect rows, let’s pause and take a quick look at the Recording and Coding Guide for the Structure Inventory and Appraisal of the Nation’s Bridges. This document is the key to decoding and understanding the data stored on each Row node. It provides an explanation of the values stored in each property, as well as the information needed to convert the data into more usable form. For example, condition ratings are reduced to a single alphanumeric character in the files. The corresponding description of that character provides additional context that helps guide our analysis. We’ll dive into this more in successive posts.
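
As a small illustration of the decoding the guide enables, here is a sketch of my own (DECK_COND_058 is the Item 58 deck condition field in the delimited files; the descriptions paraphrase the guide’s codes):

// Decode the Item 58 deck condition code for a handful of rows
MATCH (row:DelimitedRow)
WITH row LIMIT 5
RETURN row.STRUCTURE_NUMBER_008 AS bridge,
row.DECK_COND_058 AS code,
CASE row.DECK_COND_058
WHEN "9" THEN "Excellent condition"
WHEN "7" THEN "Good condition"
WHEN "5" THEN "Fair condition"
WHEN "4" THEN "Poor condition"
WHEN "N" THEN "Not applicable"
ELSE "See the coding guide"
END AS description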

Connecting Successive Rows

To connect rows we’ll focus on just one of the properties stored on the Row node, STRUCTURE_NUMBER_008, as it is unique for every bridge within a particular state. Since we have already connected the File nodes in order, we’ll use those NEXT_FILE relationships to connect the related rows in the same order as well. Because we are learning as we go, in hindsight this could have been a good place to use another “temporary” label to help process the data. For the time being, we’ll use what we have and run the Cypher to connect the rows:

// Create (:Row)-[:NEXT_ROW]->(:Row) by state for each record in ascending order by file year
MATCH (file:File)
WITH DISTINCT left(file.name,2) AS stateAbbr
ORDER BY stateAbbr
MATCH (df:DelimitedFile)
WHERE left(df.name,2) = stateAbbr
WITH stateAbbr, df
ORDER BY df.year ASC
WITH stateAbbr, collect(df) AS fileArray
UNWIND range(0,size(fileArray)-2,1) AS index
// Process consecutive pairs of files, passing the file names as parameters
WITH fileArray[index].name AS startFile,
fileArray[index+1].name AS endFile
CALL apoc.periodic.iterate(
'
MATCH (:DelimitedFile {name: $startFile})-[:CONTAINS]->(startRow:DelimitedRow)
MATCH (:DelimitedFile {name: $endFile})-[:CONTAINS]->(endRow:DelimitedRow)
// The same structure number in consecutive files means the same bridge
WHERE endRow.STRUCTURE_NUMBER_008 = startRow.STRUCTURE_NUMBER_008
RETURN startRow, endRow
','
MERGE (startRow)-[:NEXT_ROW]->(endRow)
',
{batchSize:1000, parallel:false, params:{startFile: startFile, endFile: endFile}}) YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
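
With the chains in place, one bridge’s full history is a single traversal. A sketch ($structureNumber is a placeholder, and the left(...) filter narrows to one state, since structure numbers are only unique within a state):

// Walk one bridge's records from its earliest file to its latest
MATCH (first:DelimitedRow {STRUCTURE_NUMBER_008: $structureNumber})
WHERE left(first.file,2) = "AZ" AND NOT ()-[:NEXT_ROW]->(first)
MATCH path = (first)-[:NEXT_ROW*]->(last:DelimitedRow)
WHERE NOT (last)-[:NEXT_ROW]->()
RETURN [row IN nodes(path) | row.file] AS files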

Now that the File nodes and Row nodes are connected, we have an updated schema that looks like:

Image: the updated schema with NEXT_FILE and NEXT_ROW relationships added

Up next… connecting bridges to states!

Part 1: Background
Part 3: Connecting Bridges to States
