Photo by Aditya Vyas on Unsplash

Exploring the U.S. National Bridge Inventory with Neo4j — Part 2: Importing and Storing Raw Data in Neo4j

Michael McKenzie
Sep 6

Moving forward, this series assumes a basic understanding of graph databases and Neo4j. In a nutshell, Neo4j is a labeled property graph database that consists of Nodes, Labels, Directional Relationships, and Properties. You can learn more about graph databases and Neo4j here.
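
To make those four concepts concrete, here is a minimal, illustrative Cypher snippet (my own example, foreshadowing the schema we build below): two labeled nodes, a directional relationship, and a property.

// Two labeled nodes (File, Row), one directional relationship (CONTAINS), one property (name)
CREATE (file:File {name: "AZ19.txt"})-[:CONTAINS]->(row:Row)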

Read the introduction (Part 1).

NBI Files

The National Bridge Inventory (NBI) is a database of all bridges in the United States, including the District of Columbia and Puerto Rico. Two versions of each file are available: delimited and non-delimited. Our exploration will start with the comma-delimited (CSV) files, as they make importing much easier. Additionally, rather than download and store every file locally, I created a Google Sheet to manage all of the file URLs.

Image: the Google Sheet of NBI file URLs

The decision to import and store the raw data in the graph itself has several pros and cons:

Pros:

  • Simplicity: Storing and managing the file URLs in a single document means managing one file, rather than all of the individual files for each year. Additionally, Neo4j’s LOAD CSV functionality allows us to import and create the nodes directly from that sheet. If a file is updated, we need only re-import it directly from its URL.
  • Speed: Importing and storing the raw data in the graph means only having to worry about importing CSV data once. Granted, the initial import can be a bit slow depending on the throughput of your connection and the power of the machine running the import. (For reference, I am running Neo4j Desktop on a 2019 MacBook Pro with 16GB RAM and 500GB storage. I have configured 4GB heap and 4GB page cache for this version 4.1.1 database instance; the corresponding neo4j.conf settings are shown after these lists.) It is faster to work with and manipulate the data within the graph than to re-import raw data each time it needs to be read and added to the graph structure. Additionally, if at any point a large-scale restructuring of the schema is needed, all of the data are readily available.
  • Traceability: As we will see later in this post, we can connect the files and rows to more easily track how data have changed with each annual report.

Cons:

  • Storage: By far the biggest, and only real, drawback I have encountered thus far is that the raw data takes up A LOT of space in the graph. As of this post, I have imported the raw data from all delimited files for every state from 1992 to 2019.
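
For reference, the heap and page cache configuration mentioned under Speed corresponds to the following neo4j.conf entries (the keys are standard for Neo4j 4.x; tune the values to your own machine):

# Memory configuration used for this import (neo4j.conf)
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=4G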

So now we decide on a schema and import!

Importing Raw Data

Deciding on the graph structure for the raw data is pretty straightforward. Following Neo4j modeling principles, we simply take the statement “Each file contains rows of data, where each row represents a single bridge.” Converting that to Cypher we get (:File)-[:CONTAINS]->(:Row). We then set additional properties on the File and Row nodes that make them unique.

We provide additional context for each node by assigning properties. For the File nodes we’ll assign the properties name (the name of the file), url (the URL where the file is stored on the NBI website), folder (this property, in addition to the DelimitedFile label, designates the file node as containing delimited data), year (the year of the data reported), and createdOn (to signify when the file node was added to the graph). Similarly, since we are using CSV files that contain headers, we utilize the header mapping functionality of apoc.load.csv to assign the Row properties from the files.

In addition to the properties mapped from the CSV file, we add two more to each Row: file (which we set equal to the name property on the File node, and which is how we will connect each Row to its File), and createdOn (to signify when the row was added to the graph).

Given the large number of rows to import, we break the import, node creation, and relationship connections into three steps. This also helps us avoid eager query plans:

Step 1: File Node Importing

// Create nodes for each file URL (Delimited Files)
// Data loaded from https://www.fhwa.dot.gov/bridge/nbi/ascii.cfm via the file URLs stored in the Google Sheet
LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row
// Skip sheet rows that do not describe a file
WITH row
WHERE NOT row.Year IS NULL
WITH row, row.URL AS fileUrl
// MERGE on url so re-running the import does not create duplicates
MERGE (file:File:DelimitedFile {url: fileUrl})
ON CREATE SET file.folder = row.Folder,
file.name = row.File,
file.year = toInteger(row.Year),
file.createdOn = datetime(),
// "temporary" label consumed by Step 2
file:ImportFileRows
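
One optional addition that is not part of the original three steps: creating a uniqueness constraint on url before importing backs the MERGE above with an index, which keeps re-imports fast. A sketch using Neo4j 4.1 syntax (file_url is an arbitrary constraint name):

// Optional: make MERGE on url index-backed with a uniqueness constraint
CREATE CONSTRAINT file_url ON (file:File) ASSERT file.url IS UNIQUE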

Step 2: Row Node Importing

For this step, we are using APOC’s apoc.periodic.iterate and apoc.load.csv procedures:

// Iterate through files and import rows (Delimited)
MATCH (file:ImportFileRows)
// Consume the "temporary" label so each file is only processed once
REMOVE file:ImportFileRows
WITH collect([file.url,file.name]) AS files
UNWIND files AS file
CALL apoc.periodic.iterate(
'
// quoteChar \u0000 effectively disables quote handling for the raw NBI data
CALL apoc.load.csv($url,{header:true, quoteChar:"\u0000"}) YIELD map AS row
RETURN row
','
CREATE (fileRow:Row:DelimitedRow)
SET fileRow += row,
fileRow.file = $name,
fileRow.createdOn = datetime(),
// "temporary" label consumed by Step 3
fileRow:ConnectRowToFile
',
{batchSize:10000, parallel:false, params:{url: file[0], name: file[1]}}) YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
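
As a quick sanity check of my own (not one of the three steps), the imported row counts per file can be compared against the source CSVs:

// Sanity check: imported row counts per file
MATCH (fileRow:DelimitedRow)
RETURN fileRow.file AS file, count(*) AS rows
ORDER BY file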

Step 3: Connecting File Nodes to Row Nodes

// Connect rows to files (Delimited)
CALL apoc.periodic.iterate(
'
MATCH (row:ConnectRowToFile)
RETURN row
','
// Consume the "temporary" label so each row is only processed once
REMOVE row:ConnectRowToFile
WITH row
// row.file holds the name of the file the row came from
MATCH (file:DelimitedFile)
WHERE file.name = row.file
MERGE (file)-[:CONTAINS]->(row)
',
{batchSize:10000, parallel:false}) YIELD batches, total
RETURN batches, total
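
To confirm the step completed (again, a check of my own), count any rows left without an incoming CONTAINS relationship; the result should be zero:

// Verify every row was connected to its file
MATCH (row:Row)
WHERE NOT (:File)-[:CONTAINS]->(row)
RETURN count(row) AS unconnectedRows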

Notes:

  • There are additional ETL tools available that would make importing all of this data a bit quicker when starting with a fresh database. I chose instead to focus on writing Cypher intended to process new files and data as they are released, so it can easily be reused as needed.
  • The base node labels that will be referenced are File and Row. However, in the three code snippets each File and Row node is also assigned an additional label: DelimitedFile and DelimitedRow, respectively. These come into play in a later post. For now we’ll use DelimitedFile in lieu of File, and DelimitedRow in lieu of Row.
  • Two other “temporary” labels can also be spotted in the code snippets above: ImportFileRows and ConnectRowToFile. These labels are used to help process the data in steps (briefly discussed in the next section).

Breaking the importation into multiple steps stems from some lessons I have learned in my career using Neo4j with large datasets that are added to on a regular basis:

  1. Breaking Cypher into smaller chunks allows for more focused tuning where needed. Big, long Cypher statements are quite beautiful to look at, but can be challenging to improve over time. By dividing the processing of data into smaller Cypher statements, it is possible to more easily manage importation and manipulation of data.
  2. It is possible to create a pattern of importing and manipulation that is easy to replicate. In our exploration of the bridge data, new data are added annually. Similarly, I have worked with new data that must be added on a more frequent basis. In situations where data are constantly being added to the graph, using a series of steps to process data allows for more fine-tuned control of data processing.
  3. Using “temporary” labels to process the data makes it easier to track what data are being processed (see the sketch after this list). This can be valuable in tracking the performance of queries as you build.
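
Here is the sketch mentioned in point 3, using the ConnectRowToFile label from Steps 2 and 3. Because each step consumes its “temporary” label, counting the nodes that still carry it shows exactly how much work remains:

// Progress check: rows created in Step 2 but not yet processed by Step 3
MATCH (row:ConnectRowToFile)
RETURN count(row) AS remainingRows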

Now that we have run these three queries we have a graph structure that looks like:

Image: the (:DelimitedFile)-[:CONTAINS]->(:DelimitedRow) schema

The graph currently contains the raw data from all delimited files from 1992 to 2019. We have 1,456 File nodes, plus 18,579,580 each of Row nodes and CONTAINS relationships. That is a total of 18,581,036 nodes and 18,579,580 relationships, equating to approximately 122 GB of data!!!
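
If you want to check these totals in your own database, APOC offers a convenient procedure (my suggestion, not necessarily how the figures above were produced):

// Graph-wide node and relationship totals
CALL apoc.meta.stats() YIELD nodeCount, relCount
RETURN nodeCount, relCount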

The next two steps we’ll take in adding to our graph model come down to preference. Due to the flexibility of Neo4j and the usability of Cypher, we can already query and return File and Row nodes according to their properties, sorting and filtering as needed (see the sketch below). However, for this exploration, explicitly adding relationships between each related file and each associated row provides a good visual of how bridge conditions progress over time. Additionally, given the number of Row nodes connected to each File node, it becomes easier to find all connected Row nodes. The only drawback to consider is that these relationships take up more storage. For me, that is a price I am willing to pay.
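
For a concrete sense of that property-based alternative, the following sketch orders one bridge’s rows by file year without any explicit row-to-row relationships (STRUCTURE_NUMBER_008 is introduced below; $structureNumber is a placeholder parameter):

// Property-based alternative: one bridge's rows in year order
MATCH (file:DelimitedFile)-[:CONTAINS]->(row:DelimitedRow)
WHERE row.STRUCTURE_NUMBER_008 = $structureNumber
RETURN row
ORDER BY file.year ASC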

ONWARD!

Connecting Successive Files

To model the files in successive order, we want to establish what the next file after any given file is, via a NEXT_FILE relationship. We need to utilize two of the five properties on the File node: name and year. It is important to only connect files that represent data from the same state. For that we’ll use the first two letters of the name property, since the file names all follow the same naming convention. For example, “AZ18.txt” and “AZ19.txt” are both files for Arizona, whose state abbreviation is AZ, so we can find all of a state’s files by the first two letters of name. Using name and year, we connect the File nodes with the following Cypher:

// Create (:File)-[:NEXT_FILE]->(:File) for each state in ascending order
MATCH (file:File)
// One row per state, taken from the first two letters of the file name
WITH DISTINCT left(file.name,2) AS stateAbbr
ORDER BY stateAbbr
MATCH (df:DelimitedFile)
WHERE left(df.name,2) = stateAbbr
WITH stateAbbr, df
ORDER BY df.year ASC
// Collect each state's files in year order, then link consecutive pairs
WITH stateAbbr, collect(df) AS fileArray
UNWIND range(0,size(fileArray)-2,1) AS index
WITH fileArray[index] AS start, fileArray[index+1] AS end
MERGE (start)-[:NEXT_FILE]->(end)
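
A quick spot-check of one state’s chain (my own query, using Arizona as the example):

// Spot-check: Arizona's NEXT_FILE chain in year order
MATCH (a:DelimitedFile)-[:NEXT_FILE]->(b:DelimitedFile)
WHERE left(a.name,2) = "AZ"
RETURN a.name, b.name
ORDER BY a.year ASC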

Now we have a chain connecting the File nodes for each state in ascending order by year:

Image: (:File)-[:NEXT_FILE]->(:File) chains by state

Encoding Document

Before we connect rows, let’s pause and take a quick look at the Recording and Coding Guide for the Structure Inventory and Appraisal of the Nation’s Bridges. This document is the key to decoding and understanding the data stored on each Row node. It provides an explanation of the values stored in each property, as well as the information needed to convert the data into more usable form. For example, condition ratings are reduced to a single alphanumeric character in the files. The corresponding description of that character provides additional context that helps guide our analysis. We’ll dive into this more in successive posts.
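
As a small illustration of the decoding the guide enables, here is a sketch of my own (DECK_COND_058 is the Item 58 deck condition field in the delimited files; the descriptions paraphrase the guide’s codes):

// Decode the Item 58 deck condition code for a handful of rows
MATCH (row:DelimitedRow)
WITH row LIMIT 5
RETURN row.STRUCTURE_NUMBER_008 AS bridge,
row.DECK_COND_058 AS code,
CASE row.DECK_COND_058
WHEN "9" THEN "Excellent condition"
WHEN "7" THEN "Good condition"
WHEN "5" THEN "Fair condition"
WHEN "4" THEN "Poor condition"
WHEN "N" THEN "Not applicable"
ELSE "See the coding guide"
END AS description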

Connecting Successive Rows

To connect rows we’ll focus on just one of the properties stored on the Row node, STRUCTURE_NUMBER_008, as it is unique for every bridge within a particular state. Since we have already connected the File nodes in order, we’ll use those NEXT_FILE relationships to connect the related rows in the same order as well. Because we are learning as we go, in hindsight this could have been a good place to use another “temporary” label to help process the data. For the time being, we’ll use what we have and run the Cypher to connect the rows:

// Create (:Row)-[:NEXT_ROW]->(:Row) by state for each record in ascending order by file year
MATCH (file:File)
WITH DISTINCT left(file.name,2) AS stateAbbr
ORDER BY stateAbbr
MATCH (df:DelimitedFile)
WHERE left(df.name,2) = stateAbbr
WITH stateAbbr, df
ORDER BY df.year ASC
WITH stateAbbr, collect(df) AS fileArray
UNWIND range(0,size(fileArray)-2,1) AS index
// Process consecutive pairs of files, passing the file names as parameters
WITH fileArray[index].name AS startFile,
fileArray[index+1].name AS endFile
CALL apoc.periodic.iterate(
'
MATCH (:DelimitedFile {name: $startFile})-[:CONTAINS]->(startRow:DelimitedRow)
MATCH (:DelimitedFile {name: $endFile})-[:CONTAINS]->(endRow:DelimitedRow)
// The same structure number in consecutive files means the same bridge
WHERE endRow.STRUCTURE_NUMBER_008 = startRow.STRUCTURE_NUMBER_008
RETURN startRow, endRow
','
MERGE (startRow)-[:NEXT_ROW]->(endRow)
',
{batchSize:1000, parallel:false, params:{startFile: startFile, endFile: endFile}}) YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams
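
With the chains in place, one bridge’s full history is a single traversal. A sketch ($structureNumber is a placeholder, and the left(...) filter narrows to one state, since structure numbers are only unique within a state):

// Walk one bridge's records from its earliest file to its latest
MATCH (first:DelimitedRow {STRUCTURE_NUMBER_008: $structureNumber})
WHERE left(first.file,2) = "AZ" AND NOT ()-[:NEXT_ROW]->(first)
MATCH path = (first)-[:NEXT_ROW*]->(last:DelimitedRow)
WHERE NOT (last)-[:NEXT_ROW]->()
RETURN [row IN nodes(path) | row.file] AS files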

Now that the File nodes and Row nodes are connected, we have an updated schema that looks like:

Image: the updated schema with NEXT_FILE and NEXT_ROW relationships added

Up next… connecting bridges to states!

Part 1: Background
Part 3: Connecting Bridges to States
