Flights Search Application with Neo4j —Using Cypher queries and APOC Custom Procedures (Part 2)

How to write your own APOC Custom Procedures & set it up in Docker using Cypher-shell

Vlad Batushkov
Neo4j Developer Blog
14 min readJan 15, 2020

--

Part 1: Dokerizing

Target

In this series of articles, I will share my experience of building a simple web application that you can use to search for flights. The application was built for Neo4j Workshop in Bangkok, November 5, 2019.

The general idea is to have a page with a search box where you can define a place of departure and destination, choose the date and voila — a list of available flights would appear below. Our needs are limited by only a one-way flight, no class options, and no other fancy options. A brutal one-way flight for MVP. Solving this task includes doing one by one smaller sub-tasks.

The Flights Search Application is a perfect use-case for building using GRANDstack framework and the “how to do it” will be covered in the third part of this series (next part).

In this article, I will show you how to create your own APOC custom procedure (similar to Microsoft SQL Server (MSSQL) stored procedures) to query for Flights, which takes into account real-life travel requirements as much as possible. Also, I will explain how to setup procedure and index scripts inside a Docker Image using Cypher-shell.

Remark: If you are not interested in Flights Search query details at all, but would like to see how to pack Docker Image using Cypher-shell — you can skip to the last section: “Setting up Cypher-shell in Docker”.

Database Schema

In the previous article, I explained how to build a Neo4j Docker Image with a Flights Database using the neo4j-admin import tool. To look at the Flights Database, simply pull my docker image and locally run the container. The image contains imported data for January 2020.

The Flights Database is based on Max De Marzi’s Flight Search schema. Max implemented his search as User Defined Procedure, written in Java. I would like to build the same functionality in Cypher.

Max De Marzi original content. Flight Search Proof of Concept Database Schema.

Database Improvements

Since the date of publication of the first part of this series, I made several critical changes to my graph. The Flight node now contains local date-time of departure and arrival. For example, these properties in Flight Node from San Diego, the USA to Tokyo, Japan:

This also means that for a flight that departs “today” and arrives “tomorrow” the nodes are now connected to two different AirportDay nodes representing different days (see the picture below). I also fixed the relationship [*_FLIGHT], naming it with the Airport of arrival, not with the departure Airport.

Example of Sub-graph with data from Moscow SVO to Bangkok BKK

I am also thinking about segregating City nodes, maybe you can help me with this feature.

Query Strategy

Flights Search is a graph-oriented problem with a graph-oriented solution. Finding flights requires a traverse through all possible paths of the graph to find those, those match criteria.

Initial results can look like this: a single direct flight with an appropriate departure date from Airport A to Airport B.

Example of Sub-graph with data from Bangkok BKK to Moscow SVO

Max De Marzi’s strategy to solve this problem is to split the search query into two phases. First phase — find all paths between Airports in the “upper part” of the schema.

The upper part of Schema

The second phase — using Airports go to the “lower part” of the schema and find all the related flights.

The lower part of Schema

Sounds easy, doesn’t it? But don’t rush to plan your evening just yet. Reality is cruel.

Upper Part

Airport FLIES_TO Relationships example

It’s not a flight search query, it’s just a query to get data for the picture.

Let’s analyze a small example; flights in Germany. In the picture above several cities are connected, thus potentially giving us many options to fly from Berlin-Tegel Airport to Munich Airport:

TXL (Berlin) → MUC (Munich)

TXL (Berlin) → STR (Stuttgart) → MUC (Munich)

TXL (Berlin) → STR (Stuttgart) → DRS (Dresden) → MUC (Munich)

TXL (Berlin) → STR (Stuttgart) → BRE (Bremen) → MUC (Munich)

Lower Part

Path discovery is the heart of the query. Let’s go deeper and find all possible flights between each pair of Airports for all our paths.

Flights between some German Airports for Jan 1, 2020

It’s not a Flight Search query, just a query to get data for the picture.

Now you can start to imagine the whole picture. There are six direct flights from Berlin-Tegel Airport to Munich Airport, and other flight options:

TXL (Berlin) → 6 flights → MUC (Munich)

TXL (Berlin) → 6 → STR (Stuttgart) → 3 → MUC (Munich)

TXL (Berlin) → 6 → STR (Stuttgart) → 2 → DRS (Dresden) → 2 → MUC (Munich)

TXL (Berlin) → 6 → STR (Stuttgart) → 2 → BRE (Bremen) → 2 → MUC (Munich)

Knowing all the different flights available, creating all combinations of these would give us all the possible results. For example, TXL → 6 → STR → 2 → BRE → 2 → MUN can give us 24 possible combination flights to book. For all four Airports routes described above, in total we have 6 + 18 + 24 + 24 = 72 possible flights route results.

But all these results are just “possible”. The next filter conditions (we can call them requirements) make flights searches a bit closer to the real solution.

Flights Requirements

Departure After Arrival

When we have flights from place A to place B and then from B to C, we should exclude flights that leave from the next leg before we’ve arrived to catch it. For example, we should exclude flight B to flight C if flight A has not arrived yet to B. We can also consider having some between flight Airport time buffer (e.g. an hour) as a comfortable time to eat a burger. We may even need more time before the next flight. Anyway, this custom Airport transfer time-window can be an interesting feature for our service.

Next-Day Arrival

Usually, international flights cross several time zones. Time-aware search is our must-have feature. In the trip from A to C via B, when we fly from place A to place B we can reach B on the “next” day. That means a departure from place B to C should involve a search for matched flights on that “next” date.

For example, you are flight from Buenos Aires, Ezeiza International Airport to John F. Kennedy International Airport, New York.

Next day Arrival EZE → JFK. All values was generated by my console app, using random & heuristic assumptions.

Boundary Of Dates

Bigger complication come when we consider arrivals around midnight. For example,if you arrive at 22:00 , you have 2 hours on “this” day and the time of the upcoming “next” day. The search period should consider both “today“ and “next” day flights.

Remark: This requirement is still not covered by my Flights Search Query. You can help me to develop this feature.

Irrational Transfer

When we are looking for all possible paths between place A to place C via place B, we should exclude place B, if B’s distance is irrationally far. It would be weird to suggest a user buy a ticket on a flight from Paris to London with a transfer in Moscow.

I decided to compare total flight distances with +5% of the direct distance to handle these irrational transfers. Of course we could use something much smarter to calculate the threshold.

Same City

One more location-dependent filter-out condition to consider: paths should not include Airports from the same city. This can happen for trips with 2 Airport transfers. For example, Airport A (Berlin) → Airport B (Stuttgart) → Airport C (Berlin) → Airport D (Munich). For sure, we should seek to exclude such a route from the result list.

Flights Search Query

Time to write a query. First of all, let’s define our input parameters. The minimal required parameters for a one-way trip are the two Cities and Departure date.

UI powered by agoda.com/flights

Path-finding

Path-finding is a natural and easy task for Neo4j. With the “City From” and “City To” parameters we can already find all necessary paths. This part of the query does not depend on the departure date, but it will be used soon.

As you can see, I have set a maximum number of hops to 3 possible [:FLIES_TO] relationships between Airport nodes. This means I have limited my solution to 2 possible Airport transfers in between my departure and destination locations. Realistically speaking, I am most likely to fly with just 1 Airport transfer, but of course 2 Airport transfers is a very reasonable scenario that needs to be covered.

The variable path keeps a chain of nodes and relationships based on our pattern, with only Airport nodes connected with [:FLIES_TO] relationships. The resulting paths are a combination of direct flights, flights with 1 Airport transfer and flights with 2 Airport transfers.

By doing nodes(path) we can select just the nodes from these paths, called routes:

A query for flights from Moscow to New York returns 7549 possible routes. Too much to be true, right? But it is a truth with only one condition: the list includes a lot of “not valid” results from the point of good sense. Do not blame the query, it’s just doing what we asked of it. Let’s apply some requirements and filter out the “not valid” outputs.

Irrational Transfer

We have city’s geographical location in the initial data, together with standalone City nodes. To calculate an “approximate” City location we will use the average location of all city’s Airports.

This is another Database Schema improvement to implement, but for now, I will add a distance computation to my query.

Same City

Applying the “Same city” business requirement is quite easy with the help of APOC. Counting the city name duplicates in the list should be equal to 0 (no duplicates). Both requirements added to the previous search query.

This query gives us much more realistic result of just 78 paths that we need to process in the next step. As you can see, a lack of meaningful requirements leads to totally wrong results.

Index

We already touched on property-based filtering, so I recommend setting all required indexes to improve query performance. I expect to have the parameter of a departure date, so I need an index for AirportDay filtering by the code property.

Direct Flights

Direct flights are flights that only have 2 Airport nodes in the path. Nothing complex here. From the list of all paths, we will return these direct flights with the simple (AirportDay → Flight) pattern.

This is a temporary query, that contains a path with direct flights.

Transfer Airports

Now using the same approach we will find a collection of Flights for routes involving 3 to 4 Airports. The WHERE condition should support the requirement “Departure after Arrival”.

I also want to be able to choose the transfer time-window, that I mentioned before. When we arrive to some “transfer” Airport we can decide how long we’d like to remain there until our departure. So the query input parameters would look like this:

Plus the final step to combine all available flight routes into one result list. You can append the Flights Search query with PROFILE statement to see the execution plan.

An important note, all three matches are OPTIONAL as we don’t know whether all direct flights exist or not. This is the same situation for more than one-hop transfers too.

Query Performance Lesson

Everything seems to be straightforward with my query, but it does take around 16 seconds to find results.

The root cause of this slower performance is by the unexpected NodeByLabelScan. Here is a snapshot of the Execution Plan (generated from PROFILE) for the part with the longest pattern of 2 Airport transfers (before the fix).

NodeByLabelScan

I already have an Index for AirportDay(code), so why isn’t NodeIndexSeek being used?

The remedy and explanation for this behavior was found in the old good article “Tuning Your Cypher: Tips & Tricks for More Effective Queries” written by Petra Selmer & Mark Needham, Feb 10, 2016.

In graph databases, Indexes are only used to find the starting point for queries.

This explains everything. I need to tell my query to use indexes for the intermediary nodes.

Using the indexes, it can find the results just in 4 seconds using one month (Jan 2020) of data containing 1 237 759 nodes and 3 376 990 relationships.

If you have more ideas on how to improve the performance of my query — feedback is welcomed. Here is a print screen of my query execution plan:

Setting up Cypher-shell in Docker

Having a query is wonderful, but it is not a full solution yet. The last question is: how can one use this query monster inside my Application? Inline hard-code in application code-base? Maybe. But can we do better?

If you are aware of Relational Databases and MSSQL, you will be aware of the concept of Stored Procedures. This database level solution has been widely used for decades in the relational databases world. It looks like a handy solution to me. Let’s create a custom procedure using the APOC library.

APOC Custom Procedure

Wrapping Cypher queries as a procedure is a really easy thing to . The documentation provides a very clear explanation of how to do it. All you need is to define a name, description, input and output parameters for your query.

Here is a link to the gist for those of you who would like to see the full version. The procedure is created under a custom namespace. Now you can easily CALL it.

YIELD will iterate using output alias that was declared — the result, in my case.

Cypher-shell in Docker

In the first part of this blog post series I explained how to import data into your Neo4j Docker Image. The import process is executed before the Neo4j service starts. However, when we are setting up procedures and indexes, they are initiated after the Neo4j service has started.

Setup your custom procedures and indexes right after Neo4j has started

I will use the solution for such a non-trivial bash-scripting task created by Shaun Elliott. Following his approach, I will set up three things for my neo4j-flights image: a procedure and 2 indexes.

Dockerfile

getFlights.cql
I prefer to use .cql (Cypher query language) extension, but it also can be .cypher

wrapper.sh

setup.sh

How to use Cypher-shell

The simplest way to use Cypher-shell is to run inline execution, as I did for the index creation.

If your query is not a one-liner, or you want to process a bunch of queries, you will probably want to keep the queries as separate files. Is it possible to ask Cypher-shell to run a query from a file? — Yes. But we need to read the file contents first and then send it to Cypher-shell. It’s going to be something like this:

Be sure that each query ends with a semicolon. For example in indexes.cql

Finally, let’s start Docker Container and open Neo4j Browser. Our getFlights procedure is in place and working as expected. By the way, you can easily test it using Neo4j HTTP API.

End of Part 2

Thanks for reading!

In the last article of this series, I will share with you how to build a Search for Flights Web App using the GRANDstack Framework. Clap-clap-clap for every flight with transfer in your life.

Photo by Juhasz Imre from Pexels

--

--

Vlad Batushkov
Neo4j Developer Blog

Engineering Manager @ Agoda. Neo4j Ninja. Articles brewed on modern tech, hops and indie rock’n’roll.