Flights Search Application with Neo4j — Dockerizing (Part 1)

How to build Neo4j Docker Image with Database import using neo4j-admin import tool

Target

In this series of articles, I will share my experience of building a simple web application that you can use to search for flights. The application was built for Neo4j Workshop in Bangkok, November 5, 2019.

The general idea is to have a page with a search box where you can define a place of departure and destination, choose the date, and voila: a list of available flights appears below. Our needs are limited to one-way flights only, with no class options and no other fancy extras. A brutal one-way flight search for the MVP. Solving this task means working through smaller sub-tasks one by one.

Flights Search Application is a perfect use case for the GRANDstack framework, and the “how to do it” will be covered in the third part of this series.

Flights Search context is based on the graph-oriented problem: traverse through all possible routes to find all matched paths. We will cover Domain Model exploration, Cypher querying, and basic query performance improvements in future articles.

In this article, I will share an example of how to build Neo4j Docker Image with a Flights Database. Data Import will be achieved by using the neo4j-admin import tool.

If you are planning to build an amazing application using the Neo4j technology stack, I hope this series of articles is going to be very useful to you.

Architecture for the Flights Search application

Plan

The more complex and graceful the plan, the more likely it will fail. (Murphy’s Law)

Docker Image

One of the most critical features for effective development is easy setup and fast development loop: build, run, test. This is why Docker was chosen as an approach to build and run all system modules.

Database Schema

I do not plan to reinvent the wheel in the flights area, which is why I decided to follow Max De Marzi’s article about Flights Search data modeling and reuse his Database Schema.

Data

Initial data of Airlines, Airports, and Routes can be downloaded from openflights.org — a resource of worldwide flight information.

Data Import

The Database Schema requires many more entities aside from what openflights.org provides. Additionally, to simulate a search, the application needs extra data, for example future flights for one month ahead. Running tons of “LOAD CSV” statements does not seem to be the best strategy for such imports.

When the size of the import data is very big, it is better to use the neo4j-admin import tool. The neo4j-admin import tool allows you to do an offline data import at great speed. Once the data is imported, you can start the database and set and generate indexes. Data Import can be part of Docker Image build.

One More Application

Using the neo4j-admin import tool requires us to prepare specially formatted .csv files. So, I will create a file generator application. It will prepare all nodes and relationships in a .csv file format that can be used by the neo4j-admin import tool. I will use a .NET Core console app written in C#, but you can go with any other programming language.

One More Docker Image

With data in the database, I can now write a query to achieve our main goal — finding a list of flights. I know for sure that we’ll want to have APOC on board to facilitate the application.

So I thought, why not prepare the “all-plugins-installed” Docker Image first, and then build the Flights Docker Image on top of it. This Custom Neo4j Docker Image could also be reused in any other future development.

Design

The picture below illustrates how all the pieces of this puzzle are related to each other. So, in the end, we can build a Docker Image with Flights Database data inside.

Building a Docker Image with Flights Database inside

It looked like a challenge from the beginning. But, maybe you already know — I love challenges. Let’s talk about the implementation step by step and see what we can learn from this lesson.

Customized Neo4j Docker Image

For a powerful version of the Docker Image, it is better to build on top of an Official Neo4j Docker Image (I decided to use the last stable 3.5.x version). Now let’s add the first useful plugin: Awesome Procedures On Cypher (APOC). APOC contains tons of must-have procedures that are very useful for different kinds of queries.

Plugin installation is actually nothing more than downloading a particular .jar file to the /plugins folder plus minimal configuration. You can find the available release versions of the plugin on GitHub.

Dockerfile

FROM neo4j:3.5.12
ENV APOC_VERSION=3.5.0.5
ENV APOC_URI=https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/${APOC_VERSION}/apoc-${APOC_VERSION}-all.jar
ADD --chown=neo4j:neo4j ${APOC_URI} plugins
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]

The ENV NEO4J_AUTH=neo4j/test statement sets fixed development credentials. You can also use ENV NEO4J_AUTH=none to remove authorization completely. You should not normally do this, but it is OK for the development phase on a local machine, where it can be handy.

docker build . -t=vladbatushkov/neo4j-apoc:dev

Replace my name with your docker hub username (if you plan to push it to the public hub) and enjoy your first Custom Neo4j Image.

docker run -it --rm -p 7474:7474 -p 7687:7687 vladbatushkov/neo4j-apoc:dev

For some useful Neo4j Docker basics, you can read the Developer Guide.

Go to http://localhost:7474/ and check that APOC is available. For example, you can list all APOC procedures and functions or call any of them:

CALL apoc.help("apoc");
RETURN apoc.coll.sum([1,2,3]);

There is another, much simpler way to achieve the same result. The APOC plugin is ready to use through the NEO4JLABS_PLUGINS environment option.

Dockerfile

FROM neo4j:3.5.12
ENV NEO4JLABS_PLUGINS='["apoc"]'
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]

It seems like my idea to build a customized Neo4j Docker Image is not a big deal: we can install any supported plugin with one line. Here are the plugins used in this project, all ready to use:

Dockerfile

FROM neo4j:3.5.12
ENV NEO4JLABS_PLUGINS='["apoc", "graph-algorithms", "graphql"]'
ENV NEO4J_dbms_unmanaged__extension__classes=org.neo4j.graphql=/graphql
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]

Magic, isn’t it? A super powerful and simple Docker Image with three plugins:

docker build . -t=vladbatushkov/neo4j-apoc-algo-graphql:dev

In case something goes wrong and the out-of-the-box plugins do not work as expected, you can always do all the necessary things the “old-fashioned” way. Just know that there is no magic.

Dockerfile

FROM neo4j:3.5.12
ENV APOC_VERSION=3.5.0.5
ENV APOC_URI=https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/${APOC_VERSION}/apoc-${APOC_VERSION}-all.jar
ENV ALGO_VERSION=3.5.4.0
ENV GRAPH_ALGORITHMS_URI=https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases/download/${ALGO_VERSION}/graph-algorithms-algo-${ALGO_VERSION}.jar
ENV GRAPHQL_VERSION=3.5.0.4
ENV GRAPHQL_URI=https://github.com/neo4j-graphql/neo4j-graphql/releases/download/${GRAPHQL_VERSION}/neo4j-graphql-${GRAPHQL_VERSION}.jar
ADD --chown=neo4j:neo4j ${APOC_URI} plugins
ADD --chown=neo4j:neo4j ${GRAPH_ALGORITHMS_URI} plugins
ADD --chown=neo4j:neo4j ${GRAPHQL_URI} plugins
ENV NEO4J_dbms_unmanaged__extension__classes=org.neo4j.graphql=/graphql
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]

Let’s do a small test-drive to confirm that GraphQL works properly. Register one dummy type and fetch the data directly from Neo4j via GraphQL API endpoint using any fancy GraphQL client.

CALL graphql.idl("type Person { name: String! }");
CREATE (:Person { name: "me" });

If you use the neo4j/test credentials, do not forget to add the Authorization HTTP header:
{ "Authorization" : "Basic bmVvNGo6dGVzdA==" }
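The header value is just the Base64 encoding of "user:password". As a minimal sketch (the helper name basic_auth_header is made up for this illustration), it can be reproduced in Python like this:

```python
import base64


def basic_auth_header(user: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header (hypothetical helper)."""
    # Basic auth is the Base64 encoding of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# The development credentials used in the Dockerfiles of this article.
print(basic_auth_header("neo4j", "test"))
```

If you change the password in NEO4J_AUTH, regenerate the header the same way.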

Data Import

I got daily graph building experience during my One Month Graph Challenge. 30 times I successfully answered the question “Where do I get the data?” and now I can tell you the truth: data is the most critical part of indie projects. And we as creators have several options to get the data:

  1. Manual generation. Do not expect anything serious with this approach.
    Example: Map of Cities
  2. Random generation. It can fit for some dummy and simple use-cases.
    Example: Random generated Galaxy
  3. Parsing / Scraping. The amount of work can vary from small APOC function to big Python script. You might face some weird issues and heroically solve them.
    Example: Bands and Genres
  4. Use public resources like an API or file storage. A very promising option, but it usually requires structure changes, merging, or some other modifications before import.
    Example: Chess.com API

In this project, we will break through the complexity of option number four. No pain, no gain.

Database Schema

The Database Schema of this project is a copy of Max De Marzi’s Database Schema from his Flight Search article. I highly recommend reading his blog from time to time: it is a great example of high-level expertise, full of creative ideas and interesting topics. Let’s look at the nodes and relationships.

Max De Marzi original content. Flight Search POC Database Schema.

For the answer to the question “Why is it like that?” you’d better read his blog post. For now, we will not discuss Data Modeling in depth (it is a topic for the second part of my series). Rather, I want to focus your attention on the technical solution for importing massive data that should fit the Database Schema.

Initial Data

The flight domain is a popular area, and I found a wonderful web resource that provides enough initial data: openflights.org.

Useful data from openflights includes these three .csv files:

~6162 airlines.csv
{ AirlineId, Name, Alias, IATA, ICAO, Callsign, Country, Active }
~7699 airports.csv
{ AirportId, Name, City, Country, IATA, ICAO, Latitude, Longitude, Altitude, TZ_UTCHoursOffset, DST, TZ_Olson, Type, Source }
~67664 routes.csv
{ Airline, AirlineId, Source_Airport, Source_AirportId, Destination_Airport, Destination_AirportId, Codeshare, Stops, Equipment }

Data Discovery

With this data, we can already build Airport nodes and the relationships between them. Airport nodes almost never change: their number does not depend on time, and they are rarely added to or removed from the Database. Routes between them are also extremely stable. If we have at least one flight between 2 Airports, then we have a FLIES_TO relationship between them.
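To make the FLIES_TO rule concrete, here is a minimal Python sketch. It assumes routes arrive as (airline, source, destination) tuples and that FLIES_TO is directed; the function name and the sample rows (including the placeholder airline code "XX") are invented for illustration and are not taken from the real generator.

```python
def flies_to_pairs(routes):
    """Collapse routes into unique directed (source, destination) pairs."""
    seen = set()
    pairs = []
    for _airline, source, destination in routes:
        key = (source, destination)
        if key not in seen:  # one FLIES_TO per pair, however many airlines fly it
            seen.add(key)
            pairs.append(key)
    return pairs


# Made-up sample rows: two airlines on BKK -> SVO, one on the way back.
sample_routes = [("SU", "BKK", "SVO"), ("XX", "BKK", "SVO"), ("SU", "SVO", "BKK")]
print(flies_to_pairs(sample_routes))
```

Three routes collapse into two relationships here, because both airlines share the same BKK to SVO pair.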

The file routes.csv contains information about the Airline and the 2 connected Airports, which means we can build a Flight node from it. How do we utilize Airline nodes? Well, we can add an OPERATED_BY relationship between a Flight node and an Airline node.

Airports and Airlines are stable information that changes rarely. There is also very active data in the Database that has to be created for every date: AirportDay nodes and Flight nodes. Every Airport is connected to an AirportDay node for each date. An AirportDay is connected to all of its outgoing and incoming Flight nodes. By the direction of the *_FLIGHT relationship, we know where the Flight is going.

For example, if we have 2 connected Airports and 1 Flight operated in 2 days, we need to insert 2 AirportDay nodes for each Airport and 1 Flight for each pair of AirportDay nodes.

Once you understand the nature of the data, you can plan how to build all the necessary nodes and relationships.

We need to have some operating days in the Database, so let's assume that all existing flights are operated every day. For example, a one-way directed route between 2 Airports for 30 days gives us the following nodes and relationships:

1 Airline => 1 Airline node
2 Airports => 2 Airport nodes (BKK, SVO)
1 Route => 1 FLIES_TO relationship (BKK --> SVO)
2 Airports * 30 days => 60 AirportDay nodes
60 AirportDays => 60 HAS_DAY relationships
1 Route * 30 days => 30 Flight nodes
30 Flights => 60 BKK_FLIGHT relationships (in and out directed)
1 Airline * 30 Flights => 30 OPERATED_BY relationships
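The arithmetic above generalizes to any number of airports, routes and days. The following Python sketch is only an illustration under the same assumptions as the example (every route is operated daily by one airline, and each route is a distinct directed airport pair); the function name is hypothetical.

```python
def import_counts(airlines, airports, routes, days):
    """Estimate node and relationship counts for a daily-operated schedule."""
    airport_days = airports * days  # one AirportDay per airport per date
    flights = routes * days         # every route is flown once a day
    return {
        "Airline nodes": airlines,
        "Airport nodes": airports,
        "FLIES_TO relationships": routes,
        "AirportDay nodes": airport_days,
        "HAS_DAY relationships": airport_days,
        "Flight nodes": flights,
        "*_FLIGHT relationships": 2 * flights,  # one in and one out per flight
        "OPERATED_BY relationships": flights,
    }


# The BKK -> SVO example: 1 airline, 2 airports, 1 route, 30 days.
for name, count in import_counts(1, 2, 1, 30).items():
    print(f"{count:>4} {name}")
```

Real data will deviate from this estimate, for example when several airlines share one airport pair.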

Now we can say that the “business logic” for Flights Database generation is defined. We don’t really need all this math, but we need to know how the data-generation console application will work. Before coding a generator application, though, we need to understand how the output files should be formatted. Files produced by the generator should be ready to use with the neo4j-admin import tool.

How to use the neo4j-admin import tool

Here are a few words about the tool, so you don’t need to click on the documentation link just yet. The import tool consumes .csv files with a specific structure and imports thousands of nodes and relationships into the graph in just seconds.

Command-line interface

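# --database: database to import into
# --mode: input file format
# --nodes / --relationships: files to import nodes and relationships from
# --ignore-missing-nodes: skip relationships that point to missing nodes
neo4j-admin import \
  --database=flights.db \
  --mode=csv \
  --nodes=... \
  --relationships=... \
  --ignore-missing-nodes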

You can choose between the single-file mode and multi-file mode to import your data.

Single-file mode

Nodes

airlines.csv
code:ID,name:STRING,country:STRING
SU,Aeroflot Russian Airlines,Russia

Command-line interface
--nodes:Airline="airlines.csv"

All nodes are marked with the Airline label. :ID marks the column that uniquely identifies each node, and this identifier is used across the whole import process, for example to attach relationships to nodes. It is not the internal ID of your future nodes or relationships. It is important to ensure that no two nodes have the same ID during import.
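A generator for such a single-file node import can be very small. Below is a hedged Python sketch (not the C# generator from this project; the function name is invented) that renders the header plus data rows and refuses duplicate IDs, since the import relies on them being unique.

```python
import csv
import io


def airlines_csv(airlines):
    """Render (code, name, country) tuples as a single-file node import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code:ID", "name:STRING", "country:STRING"])
    seen = set()
    for code, name, country in airlines:
        if code in seen:  # the import tool needs unique node IDs
            raise ValueError(f"duplicate airline ID: {code}")
        seen.add(code)
        writer.writerow([code, name, country])
    return buffer.getvalue()


print(airlines_csv([("SU", "Aeroflot Russian Airlines", "Russia")]))
```

Using the csv module instead of string concatenation means names that contain commas or quotes get escaped correctly.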

Relationships

fliesTo.csv
:START_ID,distance:INT,:END_ID
BKK,7111,SVO

Command-line interface
--relationships:FLIES_TO="fliesTo.csv"

All relationships have the FLIES_TO type. :START_ID is the ID of the node where the relationship starts, while :END_ID is the ID of the node where it ends. This snippet shows an example of BKK → SVO. Relationship properties go in the middle; here we have only one, distance.

Multi-file mode

Nodes

flights_header.csv
flightNumber:ID,departs:DATETIME,duration:STRING,distance:INT,price:INT

flights_data_20191201.csv
flights_data_20191202.csv
flights_data_20191203.csv and more
SU_BKK_SVO_20191201,175200.000+0700,P09H09M,7111,16554

Command-line interface
--nodes:Flight="flights_header.csv,flights_data_.*"

The header file contains only the column declarations, while all the data is placed in one or more separate data files. This approach is very useful when you have a lot of data to import, because it keeps each file within a reasonable size.
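The per-day file naming is easy to script. As an illustrative Python sketch (the function names are hypothetical; only the file name pattern and the --nodes argument shape come from the example above):

```python
from datetime import date, timedelta


def data_file_names(prefix, start, days):
    """List per-day data file names such as flights_data_20191201.csv."""
    names = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        names.append(f"{prefix}_data_{day.strftime('%Y%m%d')}.csv")
    return names


def nodes_argument(label, prefix):
    """Build the --nodes argument: header file first, then a data file pattern."""
    return f'--nodes:{label}="{prefix}_header.csv,{prefix}_data_.*"'


print(data_file_names("flights", date(2019, 12, 1), 3))
print(nodes_argument("Flight", "flights"))
```

Because the command-line argument matches data files by pattern, adding more days does not require changing the import command.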

Relationships

hasDay_header.csv
:START_ID,:END_ID

hasDay_data_20191201.csv
hasDay_data_20191202.csv
hasDay_data_20191203.csv and more
BKK,BKK_20191201

Command-line interface
--relationships:HAS_DAY="hasDay_header.csv,hasDay_data_.*"

Example with Relationship Type declaration inside Data file

If you want to have different Labels for your nodes or Types for your relationships, you can try another technique. You can define a node Label or relationship Type along with other properties in each row of your data to be imported.

inFlight_header.csv
:START_ID,:END_ID,:TYPE

inFlight_data_20191201.csv
inFlight_data_20191202.csv
inFlight_data_20191203.csv and more
BKK_20191201,SU_BKK_SVO_20191201,BKK_FLIGHT
BLQ_20191201,SU_BLQ_SVO_20191201,BLQ_FLIGHT

Command-line interface
--relationships="inFlight_header.csv,inFlight_data_.*"

The same is valid for nodes with Labels, using :LABEL statement instead of :TYPE.
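To show how such rows could be produced, here is a small Python sketch. It assumes the ID conventions visible in the sample rows (AIRPORT_YYYYMMDD for AirportDay, AIRLINE_SOURCE_DESTINATION_YYYYMMDD for Flight) and that the relationship type is derived from the source airport code; the function name is invented.

```python
from datetime import date


def in_flight_row(airline, source, destination, day):
    """Build one inFlight data row: AirportDay ID, Flight ID, relationship type."""
    stamp = day.strftime("%Y%m%d")
    airport_day_id = f"{source}_{stamp}"
    flight_id = f"{airline}_{source}_{destination}_{stamp}"
    relationship_type = f"{source}_FLIGHT"  # written into the :TYPE column
    return f"{airport_day_id},{flight_id},{relationship_type}"


print(in_flight_row("SU", "BKK", "SVO", date(2019, 12, 1)))
print(in_flight_row("SU", "BLQ", "SVO", date(2019, 12, 1)))
```

Since the type lives in the data row, one header file covers every airport-specific relationship type.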

If you want to know more about other settings and commands then check out the documentation.

Import Overview

Graph Schema to import

Generating the Flights Database files is not a trivial task. We need to generate 9 separate chunks of import data: Airport, AirportDay, Flight, Airline, FLIES_TO, HAS_DAY, OPERATED_BY and *_FLIGHT in both directions. And to remind you, some of that data depends on dates, so it is also necessary to prepare separate header and data files for each date. Flight price and flight duration are both heuristic guesses based on the distance between airports. You can imagine how strongly I exhaled when I finally wrote the generator application!
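The actual heuristics of the generator are not shown in this article, so the following Python sketch is only one possible approach. It assumes a great-circle (haversine) distance, a made-up average cruise speed and a made-up fixed overhead, and it formats the duration like the P09H09M sample value above. A price could be scaled from the distance in a similar way.

```python
import math

EARTH_RADIUS_KM = 6371.0
CRUISE_SPEED_KMH = 800.0  # hypothetical average speed, not from the article
OVERHEAD_MINUTES = 30     # hypothetical taxi/takeoff/landing allowance


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def format_duration(total_minutes):
    """Format minutes in the same shape as the sample value P09H09M."""
    hours, minutes = divmod(total_minutes, 60)
    return f"P{hours:02d}H{minutes:02d}M"


def estimate_duration(distance):
    """Guess a flight duration from the distance alone."""
    minutes = round(distance / CRUISE_SPEED_KMH * 60) + OVERHEAD_MINUTES
    return format_duration(minutes)


# Approximate coordinates of BKK and SVO.
bkk_svo = distance_km(13.69, 100.75, 55.97, 37.41)
print(round(bkk_svo), estimate_duration(bkk_svo))
```

The computed distance lands close to the 7111 value used in the examples, which suggests a great-circle estimate is a reasonable stand-in.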

I think the generator application code and the application itself are optional topics, so they are not discussed in this article. If you ever need to write an import file generator yourself, choose any programming language you know and simply write one.

Full dockerized example of Flights Database Import script

Flights Docker Image

Congrats! You are at the final base: how to build a Docker Image with imported Database inside.

The idea behind this task is a super simple story:

  1. Copy all .csv files into docker image
  2. Copy the import script into docker image
  3. Set up the execution of the import script at the container's launch
  4. Select the new database as an active one

Dockerfile

FROM vladbatushkov/neo4j-apoc-algo-graphql:latest
COPY import/*.csv import/
COPY import.sh import.sh
ENV EXTENSION_SCRIPT=import.sh
ENV NEO4J_dbms_active__database=flights.db
CMD [ "neo4j" ]

import.sh

EXTENSION_SCRIPT allows us to define an import script that will be executed before Neo4j starts.

docker run -p 7474:7474 -p 7687:7687 -v c:/neo4j/data:/data -v c:/neo4j/logs:/logs vladbatushkov/neo4j-flights:dev

Volume params are optional. They just help you run several containers on the same database data, avoiding repeated execution of the import script.

Both Docker Images from this article are available from my Docker Hub page.

I bet you are interested to see how the flow of the import works! How fast is the import? How many nodes and relationships will there be in the Flights Database for one year ahead? What is the approximate size of the database? Will all of these things actually work?! Well, let’s try a small benchmark test and look at the numbers.

Import Benchmarking

My docker environment is going to be the same for every import.

Available resources:
Total machine memory: 9.73 GB
Free machine memory: 8.83 GB
Max heap memory : 2.16 GB
Processors: 4
Configured max memory: 6.81 GB
High-IO: false

I want to compare a One-day import (January 1, 2020) VS a One-month import (January 2020) VS a One-year import (2020).

For example, here you can see the contents of a folder with one-day data ready for import. All files with a name pattern like *_data_YYYYMMDD are going to be created for every import day.

One-Day Import Results

IMPORT DONE in 3s 943ms.
Imported:
74 057 nodes
226 113 relationships
386 217 properties
Size: 19.53 MiB
Query result based on One-Day Import
MATCH (a:Airport)-->(ad1:AirportDay)--(f:Flight)--(ad2:AirportDay)<--(b:Airport)
MATCH (f)-->(al:Airline)
WHERE a.city = "Bangkok" AND b.city = "Moscow"
RETURN *

One-Month Import Results

IMPORT DONE in 30s 167ms.
Imported:
2 101 007 nodes
5 977 023 relationships
9 861 087 properties
Size: 479.99 MiB
Query result based on One-Month Import

One-Year Import Results

IMPORT DONE in 5m 39s 912ms.
Imported:
24 735 282 nodes
70 195 518 relationships
115 663 802 properties
Size: 5.49 GiB
Query result based on One-Year Import
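To compare the three runs, it helps to turn the numbers above into throughput. This short Python sketch only does arithmetic on the reported figures; the function name is made up.

```python
def entities_per_second(nodes, relationships, seconds):
    """Imported nodes plus relationships per second of import time."""
    return (nodes + relationships) / seconds


# (nodes, relationships, import time in seconds) as reported above.
runs = {
    "one day": (74_057, 226_113, 3.943),
    "one month": (2_101_007, 5_977_023, 30.167),
    "one year": (24_735_282, 70_195_518, 5 * 60 + 39.912),
}

for name, (nodes, relationships, seconds) in runs.items():
    rate = entities_per_second(nodes, relationships, seconds)
    print(f"{name}: {rate:,.0f} nodes and relationships per second")
```

The one-day run comes out noticeably slower per entity than the other two, which may simply reflect fixed startup costs weighing more on a tiny import.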

End of Part 1

Thanks for reading!

In the next article of this series, I plan to focus on querying the Neo4j Database and go deep into the details of how to write a Cypher query that searches for flights. Stay in touch and clap-clap-clap.
