Flights Search Application with Neo4j — Dockerizing (Part 1)
How to build Neo4j Docker Image with Database import using neo4j-admin import tool
Target
In this series of articles, I will share my experience of building a simple web application that you can use to search for flights. The application was built for the Neo4j Workshop in Bangkok on November 5, 2019.
The general idea is to have a page with a search box where you can define a place of departure and a destination, choose a date and voila: a list of available flights appears below. Our needs are limited to a one-way flight, with no class options and no other fancy options. A brutal one-way flight for the MVP. Solving this task means completing smaller sub-tasks one by one.
The Flights Search Application is a perfect use-case for the GRANDstack framework, and the “how to do it” will be covered in the third part of this series.
Flights Search context is based on the graph-oriented problem: traverse through all possible routes to find all matched paths. We will cover Domain Model exploration, Cypher querying, and basic query performance improvements in future articles.
In this article, I will share an example of how to build Neo4j Docker Image with a Flights Database. Data Import will be achieved by using the neo4j-admin import tool.
If you are planning to build an amazing application using the Neo4j technology stack, I hope this series of articles is going to be very useful to you.
Plan
The more complex and graceful the plan, the more likely it will fail. (Murphy’s Law)
Docker Image
One of the most critical features for effective development is easy setup and fast development loop: build, run, test. This is why Docker was chosen as an approach to build and run all system modules.
Database Schema
I do not plan to reinvent the wheel in the Flights area, which is why I decided to follow Max De Marzi’s article about Flights Search data modeling and reuse his Database Schema.
Data
Initial data of Airlines, Airports, and Routes can be downloaded from openflights.org — a resource of worldwide flight information.
Data Import
The Database Schema requires many more entities aside from what the openflights.org website provides. Additionally, to simulate a search, the application will need extra data, for example future flights one month ahead. Running tons of LOAD CSV statements does not seem to be the best strategy for such an import.
When the import data is very big, it is better to use the neo4j-admin import tool, which performs an offline data import at great speed. Once the data is imported, you can start the database and create indexes. The Data Import can be part of the Docker Image build.
One More Application
Using the neo4j-admin import tool requires us to prepare specially formatted .csv files, so I will create a file generator application. It will prepare all nodes and relationships in a .csv file format that the neo4j-admin import tool can consume. I will use a .NET Core console app written in C#, but you can go with any other programming language.
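To illustrate what such a generator has to do, here is a minimal Python sketch (my real application was written in C#; the raw row layout follows openflights airlines.dat, and the filtering rules are assumptions for the example) that maps raw airline rows into the import format used later in this article:

```python
import csv
import io

def to_import_rows(raw_rows):
    """Map raw openflights airline rows to the neo4j-admin import format.

    Each raw row is (AirlineId, Name, Alias, IATA, ICAO, Callsign, Country, Active);
    we keep only active airlines with an IATA code, which becomes the :ID column.
    """
    out = [["code:ID", "name:STRING", "country:STRING"]]  # header row
    for row in raw_rows:
        _, name, _, iata, _, _, country, active = row
        if active == "Y" and iata and iata != r"\N":
            out.append([iata, name, country])
    return out

# Hypothetical sample row in the openflights airlines.dat layout
raw = [("130", "Aeroflot Russian Airlines", r"\N", "SU", "AFL", "AEROFLOT", "Russia", "Y")]
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(to_import_rows(raw))
print(buf.getvalue())
```

The output is exactly the two-line airlines.csv shown in the single-file import example below.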
One More Docker Image
With data in the database, I can write a query to achieve our main goal: finding a list of flights. I know for sure that we’ll want to have APOC on board to facilitate the application.
So I thought: why not prepare an “all-plugins-installed” Docker Image first, and then build the Flights Docker Image on top of it? This custom Neo4j Docker Image can also be reused in any future development.
Design
The picture below illustrates how all the pieces of this puzzle are related to each other. So, in the end, we can build a Docker Image with Flights Database data inside.
It looked like a challenge from the beginning. But, maybe you already know — I love challenges. Let’s talk about the implementation step by step and see what we can learn from this lesson.
Customized Neo4j Docker Image
For a powerful version of the Docker Image, it is better to build on top of the latest Official Neo4j Docker Image (I decided to use the last stable 3.5.x version). Now let’s add the first useful plugin: Awesome Procedures On Cypher (APOC). APOC contains tons of must-have procedures that are very useful for many kinds of queries.
Plugin installation is actually nothing more than downloading a particular .jar file into the /plugins folder plus some minimal configuration. The available release versions of the plugin can be found on GitHub.
Dockerfile
FROM neo4j:3.5.12
ENV APOC_VERSION=3.5.0.5
ENV APOC_URI=https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/${APOC_VERSION}/apoc-${APOC_VERSION}-all.jar
ADD --chown=neo4j:neo4j ${APOC_URI} plugins
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]
Alternatively, the ENV NEO4J_AUTH=none statement removes authorization entirely. You should not normally do this, but it is ok for the development phase on a local machine. It can be handy.
docker build . -t=vladbatushkov/neo4j-apoc:dev
Replace my name with your docker hub username (if you plan to push it to the public hub) and enjoy your first Custom Neo4j Image.
docker run -it --rm -p 7474:7474 -p 7687:7687 vladbatushkov/neo4j-apoc:dev
For some useful Neo4j Docker basics, you can read Developer Guide.
Go to http://localhost:7474/ and check that APOC is available. For example, you can list all APOC procedures or call any function:
CALL apoc.help("apoc");
CALL apoc.coll.sum([1,2,3]);
There is another way to achieve the same result, and it is much simpler: the APOC plugin is ready to use via the NEO4JLABS_PLUGINS environment option.
Dockerfile
FROM neo4j:3.5.12
ENV NEO4JLABS_PLUGINS='["apoc"]'
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]
It seems my idea of building a customized Neo4j Docker Image is not a big deal: any plugin can be installed with a single line. Here is the list of plugins ready to use:
Dockerfile
FROM neo4j:3.5.12
ENV NEO4JLABS_PLUGINS='["apoc", "graph-algorithms", "graphql"]'
ENV NEO4J_dbms_unmanaged__extension__classes=org.neo4j.graphql=/graphql
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]
Magic, isn’t it? A super powerful and simple Docker Image with three plugins:
docker build . -t=vladbatushkov/neo4j-apoc-algo-graphql:dev
In case something goes wrong and the out-of-the-box plugins do not work as expected, you can always do all the necessary things the “old-fashioned” way. Just know that there is no magic.
Dockerfile
FROM neo4j:3.5.12
ENV APOC_VERSION=3.5.0.5
ENV APOC_URI=https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/${APOC_VERSION}/apoc-${APOC_VERSION}-all.jar
ENV ALGO_VERSION=3.5.4.0
ENV GRAPH_ALGORITHMS_URI=https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases/download/${ALGO_VERSION}/graph-algorithms-algo-${ALGO_VERSION}.jar
ENV GRAPHQL_VERSION=3.5.0.4
ENV GRAPHQL_URI=https://github.com/neo4j-graphql/neo4j-graphql/releases/download/${GRAPHQL_VERSION}/neo4j-graphql-${GRAPHQL_VERSION}.jar
ADD --chown=neo4j:neo4j ${APOC_URI} plugins
ADD --chown=neo4j:neo4j ${GRAPH_ALGORITHMS_URI} plugins
ADD --chown=neo4j:neo4j ${GRAPHQL_URI} plugins
ENV NEO4J_dbms_unmanaged__extension__classes=org.neo4j.graphql=/graphql
ENV NEO4J_AUTH=neo4j/test
CMD [ "neo4j" ]
Let’s do a small test drive to confirm that GraphQL works properly: register one dummy type and fetch the data directly from Neo4j via the GraphQL API endpoint using any fancy GraphQL client.
CALL graphql.idl("type Person { name: String! }");
CREATE (:Person { name: "me" });
If you use the neo4j/test credentials, then do not forget to add the Authorization HTTP header: { "Authorization": "Basic bmVvNGo6dGVzdA==" }
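If you wonder where that token comes from, the Basic auth header value is just the user:password pair, Base64-encoded. A quick Python check:

```python
import base64

# The Basic auth header value is base64("user:password")
credentials = "neo4j:test"
token = base64.b64encode(credentials.encode("ascii")).decode("ascii")
header = {"Authorization": f"Basic {token}"}
print(header)  # {'Authorization': 'Basic bmVvNGo6dGVzdA=='}
```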
Data Import
I got daily graph-building experience during my One Month Graph Challenge. 30 times I successfully answered the question “Where do I get the data?”, and now I can tell you the truth: data is the most critical part of indie projects. We as creators have several options to get the data:
- Manual generation. Do not expect anything serious with this approach. Example: Map of Cities
- Random generation. It can fit some dummy and simple use-cases. Example: Randomly generated Galaxy
- Parsing / Scraping. The amount of work can vary from a small APOC function to a big Python script. You might face some weird issues and heroically solve them. Example: Bands and Genres
- Public resources such as an API or file storage. A very promising option, but it usually requires structure changes, merges, or other modifications before import. Example: Chess.com API
In this project, we will break through the complexity of option number four. No pain, no gain.
Database Schema
The Database Schema of this project is a copy of Max De Marzi’s Database Schema from his Flight Search article. I highly recommend reading his blog from time to time; it is a great example of high-level expertise, full of creative ideas and interesting topics. Let’s look at the nodes and relationships.
For the answer to the question “Why is it like that?”, you’d better read his blog post. For now, we will not discuss Data Modeling in depth (it is the topic of the second part of this series). Rather, I want to focus your attention on the technical solution for importing massive amounts of data that fit the Database Schema.
Initial Data
The flight domain is a popular area, and I found a wonderful web resource that provides a sufficient amount of initial data: openflights.org.
Useful data from openflights includes these three .csv files:
airlines.csv (~6162 rows)
{ AirlineId, Name, Alias, IATA, ICAO, Callsign, Country, Active }
airports.csv (~7699 rows)
{ AirportId, Name, City, Country, IATA, ICAO, Latitude, Longitude, Altitude, TZ_UTCHoursOffset, DST, TZ_Olson, Type, Source }
routes.csv (~67664 rows)
{ Airline, AirlineId, Source_Airport, Source_AirportId, Destination_Airport, Destination_AirportId, Codeshare, Stops, Equipment }
Data Discovery
With this data, we can already build Airport nodes and the relationships between them. Airport nodes almost never change: their number does not depend on time, and they are rarely added to or removed from the Database. The routes between them are also extremely stable. If we have at least one flight between 2 Airports, then we have a FLIES_TO relationship between them.
The file routes.csv contains information about the Airline and the 2 connected Airports, which means we can build a Flight node from it. How do we utilize Airline nodes? Well, we can add an OPERATED_BY relationship between the Flight node and the Airline node.
Airports and Airlines are stable information; they change rarely. There is also very active data in the Database, whose representation must be created for every date: AirportDay nodes and Flight nodes. Every Airport is connected to an AirportDay node for each date. An AirportDay is connected to all of its outgoing and incoming Flight nodes. By the direction of the *_FLIGHT relationship, we know where the Flight is going.
For example, if we have 2 connected Airports and 1 Flight operated on 2 days, we need to insert 2 AirportDay nodes for each Airport and 1 Flight node for each pair of AirportDay nodes.
By understanding the nature of the data, you can plan how to build all the necessary nodes and relationships.
We need to have some operating days in the Database, so let’s make the assumption that every existing flight operates every day. For example, a one-way route between 2 Airports over 30 days gives us the following nodes and relationships:
1 Airline => 1 Airline node
2 Airports => 2 Airport nodes (BKK, SVO)
1 Route => 1 FLIES_TO relationship (BKK --> SVO)
2 Airports * 30 days => 60 AirportDay nodes
60 AirportDays => 60 HAS_DAY relationships
1 Route * 30 days => 30 Flight nodes
30 Flights => 60 *_FLIGHT relationships (one incoming and one outgoing per Flight)
1 Airline * 30 Flights => 30 OPERATED_BY relationships
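The math above can be sketched as a small Python helper (under the same “every flight operates every day” assumption made in the text):

```python
def import_counts(airlines, airports, routes, days):
    """Estimate node/relationship counts for the Flights Database,
    assuming every route is flown every day."""
    nodes = {
        "Airline": airlines,
        "Airport": airports,
        "AirportDay": airports * days,   # one per airport per date
        "Flight": routes * days,         # one per route per date
    }
    rels = {
        "FLIES_TO": routes,
        "HAS_DAY": airports * days,
        "*_FLIGHT": 2 * routes * days,   # one incoming and one outgoing per Flight
        "OPERATED_BY": routes * days,
    }
    return nodes, rels

nodes, rels = import_counts(airlines=1, airports=2, routes=1, days=30)
print(nodes)  # {'Airline': 1, 'Airport': 2, 'AirportDay': 60, 'Flight': 30}
print(rels)   # {'FLIES_TO': 1, 'HAS_DAY': 60, '*_FLIGHT': 60, 'OPERATED_BY': 30}
```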
Now we can say that the “business logic” of the Flights Database generation is defined. We don’t really need all this math, but we do need to know how the data-generation console application will work. Before coding the generator application, we need to understand how the output files should be formatted: the files produced by the generator should be ready to use with the neo4j-admin import tool.
How to use the neo4j-admin import tool
Here are a few words about the tool, so you don’t need to click through to the documentation just yet. The import tool consumes .csv files with a specific structure and imports thousands of nodes and relationships into the graph in just seconds.
Command-line interface
neo4j-admin import \
  --database=flights.db \      # database to import into
  --mode=csv \                 # file format
  --nodes=... \                # files to import nodes from
  --relationships=... \        # files to import relationships from
  --ignore-missing-nodes       # additional settings
You can choose between the single-file mode and multi-file mode to import your data.
Single-file mode
Nodes
airlines.csv

code:ID,name:STRING,country:STRING
SU,Aeroflot Russian Airlines,Russia

Command-line interface

--nodes:Airline="airlines.csv"
All nodes are marked with the Airline label. :ID is a unique identifier used across the whole import process; it is not the ID of your future nodes or relationships. It is important to ensure that no two nodes share the same ID, and no two relationships share the same ID, during the import.
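Because a duplicate ID fails the import, a quick sanity check over the generated files can save a long re-run. A minimal Python sketch (the duplicate_ids helper and the sample data are illustrative, not part of the real generator):

```python
import csv
import io
from collections import Counter

def duplicate_ids(csv_text, id_column=0):
    """Return IDs appearing more than once in an import file (header row skipped)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    counts = Counter(row[id_column] for row in rows[1:])
    return [value for value, n in counts.items() if n > 1]

sample = (
    "code:ID,name:STRING,country:STRING\n"
    "SU,Aeroflot Russian Airlines,Russia\n"
    "SU,Some Duplicate,Russia\n"
)
print(duplicate_ids(sample))  # ['SU']
```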
Relationships
fliesTo.csv

:START_ID,distance:INT,:END_ID
BKK,7111,SVO

Command-line interface

--relationships:FLIES_TO="fliesTo.csv"
All relationships have the FLIES_TO type. :START_ID is the node ID of the outgoing node, while :END_ID is the node ID of the incoming node. This snippet shows an example of BKK → SVO. All relationship properties go in the middle; here we have only one, distance.
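The distance property can be derived from the airport coordinates in airports.csv with the classic haversine formula. A small Python sketch (the coordinates are approximate, and the article does not state exactly how its distances were computed):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Approximate coordinates, as found in the openflights airports data
bkk = (13.681, 100.747)   # Bangkok Suvarnabhumi
svo = (55.973, 37.415)    # Moscow Sheremetyevo
print(round(haversine_km(*bkk, *svo)))  # close to the 7111 used in the snippet above
```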
Multi-file mode
Nodes
flights_header.csv

flightNumber:ID,departs:DATETIME,duration:STRING,distance:INT,price:INT

flights_data_20191201.csv
flights_data_20191202.csv
flights_data_20191203.csv and more

SU_BKK_SVO_20191201,175200.000+0700,P09H09M,7111,16554

Command-line interface

--nodes:Flight="flights_header.csv,flights_data_.*"
The header file contains only the declaration of columns, while all the data is placed in separate data files (or in one data file). This approach is very useful when you have a lot of data to import and it makes sense to keep the files within reasonable size limits.
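The header-plus-data-files layout can be produced with a few lines of code. A Python sketch following the naming pattern above (write_flight_files and the dummy row values are illustrative, not the real generator):

```python
import csv
from datetime import date, timedelta
from pathlib import Path

def write_flight_files(out_dir, start, days):
    """Write one header file plus one data file per day, matching the
    'flights_header.csv' + 'flights_data_YYYYMMDD.csv' pattern.
    The flight rows written here are dummy placeholders."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    header = ["flightNumber:ID", "departs:DATETIME", "duration:STRING", "distance:INT", "price:INT"]
    with open(out / "flights_header.csv", "w", newline="") as f:
        csv.writer(f).writerow(header)
    for i in range(days):
        day = (start + timedelta(days=i)).strftime("%Y%m%d")
        with open(out / f"flights_data_{day}.csv", "w", newline="") as f:
            csv.writer(f).writerow([f"SU_BKK_SVO_{day}", "175200.000+0700", "P09H09M", 7111, 16554])

write_flight_files("import", date(2019, 12, 1), days=3)
print(sorted(p.name for p in Path("import").glob("flights_*")))
```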
Relationships
hasDay_header.csv

:START_ID,:END_ID

hasDay_data_20191201.csv
hasDay_data_20191202.csv
hasDay_data_20191203.csv and more

BKK,BKK_20191201

Command-line interface

--relationships:HAS_DAY="hasDay_header.csv,hasDay_data_.*"
Example with Relationship Type declaration inside Data file
If you want to have different Labels for your nodes or Types for your relationships, you can try another technique. You can define a node Label or relationship Type along with other properties in each row of your data to be imported.
inFlight_header.csv

:START_ID,:END_ID,:TYPE

inFlight_data_20191201.csv
inFlight_data_20191202.csv
inFlight_data_20191203.csv and more

BKK_20191201,SU_BKK_SVO_20191201,BKK_FLIGHT
BLQ_20191201,SU_BLQ_SVO_20191201,BLQ_FLIGHT

Command-line interface

--relationships="inFlight_header.csv,inFlight_data_.*"
The same is valid for nodes and their Labels: use the :LABEL column instead of :TYPE.
If you want to know more about other settings and commands then check out the documentation.
Import Overview
The Flights Database file generation is not a trivial task. We will need to generate 9 separate chunks of import data: Airport, AirportDay, Flight, and Airline nodes; FLIES_TO, HAS_DAY, and OPERATED_BY relationships; and *_FLIGHT relationships in both directions. And to remind you, some of that data depends on dates, so separate header and data files must be prepared for each date. Flight price and flight duration are both heuristic guesses based on the distances between airports. You can imagine how deeply I exhaled when I finally finished writing the generator application!
I consider the generator application code and the application itself optional topics, so they are not discussed in this article. If you find yourself needing to write an import-file generator, choose any programming language you know and simply write one.
Flights Docker Image
Congrats! You have reached the final stage: building a Docker Image with the imported Database inside.
The idea behind this task is super simple:
- Copy all .csv files into docker image
- Copy the import script into docker image
- Set up the execution of the import script at the container's launch
- Select the new database as an active one
Dockerfile
FROM vladbatushkov/neo4j-apoc-algo-graphql:latest
COPY import/*.csv import/
COPY import.sh import.sh
ENV EXTENSION_SCRIPT=import.sh
ENV NEO4J_dbms_active__database=flights.db
CMD [ "neo4j" ]
import.sh
The EXTENSION_SCRIPT environment variable allows us to define an import script that will be executed before Neo4j starts.
docker run -p 7474:7474 -p 7687:7687 -v c:/neo4j/data:/data -v c:/neo4j/logs:/logs vladbatushkov/neo4j-flights:dev
The volume params are optional. They just help you run several containers over the same database data, avoiding repeated execution of the import script.
Both Docker Images from this article are available on my Docker Hub page.
I bet you are interested to see how the import flow works! How fast is the import? How many nodes and relationships will there be in the Flights Database for one year ahead? What is the approximate size of the database? Will all of these things actually work?! Well, let’s run a small benchmark and look at the numbers.
Import Benchmarking
My docker environment is going to be the same for every import.
Available resources:
Total machine memory: 9.73 GB
Free machine memory: 8.83 GB
Max heap memory : 2.16 GB
Processors: 4
Configured max memory: 6.81 GB
High-IO: false
I want to compare One-day import (1 of January 2020) VS One-month import (January 2020) VS One-year import (2020).
For example, here you can see the contents of a folder with one-day data ready for import. Files with a name pattern like *_data_YYYYMMDD are created for every imported day.
One-Day Import Results
IMPORT DONE in 3s 943ms.
Imported:
74 057 nodes
226 113 relationships
386 217 properties
Size: 19.53 MiB
MATCH (a:Airport)-->(ad1:AirportDay)--(f:Flight)--(ad2:AirportDay)<--(b:Airport)
MATCH (f)-->(al:Airline)
WHERE a.city = "Bangkok" AND b.city = "Moscow"
RETURN *
One-Month Import Results
IMPORT DONE in 30s 167ms.
Imported:
2 101 007 nodes
5 977 023 relationships
9 861 087 properties
Size: 479.99 MiB
One-Year Import Results
IMPORT DONE in 5m 39s 912ms.
Imported:
24 735 282 nodes
70 195 518 relationships
115 663 802 properties
Size: 5.49 GiB
End of Part 1
Thanks for reading!
The next article of this series will be dedicated to querying the Neo4j Database, going deep into the details of how to write a Cypher query that searches for flights. Stay in touch and clap-clap-clap.