Yet Another Awesome Geo-coder — Open Sourced Spatial Search

Mohammed Ayub
11 min read · Jun 26, 2019


What if you collect very large amounts of spatial information, such as millions of addresses annually, that you need for geographic analysis, but you are skeptical of using commercial geocoders (like Google, HERE, ESRI, etc.) or discouraged by the slow speed of many open-source options like the PostGIS TIGER geocoder? There's a good, if still imperfect, alternative to these options in the works called “Pelias”. Initially started by some amazing people at Mapzen before that company folded, Pelias is now part of the Linux Foundation.

Geocoding, at its root, is searching through known addresses or geographies (‘shapefiles’, in geeky terms) to find the best match (a latitude/longitude) for a given input address. Pelias makes this possible by indexing globally shared spatial data from various sources into a database called Elasticsearch, which basically allows for spatial search on steroids!!
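
To make that concrete, here is roughly what a geocoding request and response look like once a Pelias instance is up (we will build exactly this setup below; the fields listed are illustrative, trimmed from a typical response):

curl -G 'http://localhost:4000/v1/search' --data-urlencode 'text=1901 Main St, Portland, OR'
# returns a GeoJSON FeatureCollection; each feature carries
#   geometry.coordinates = [longitude, latitude]
#   properties such as name, label, confidence and the source the match came from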

You might have come across previous Medium posts from the Mapzen team about setting up a Pelias instance and the progress the project has made since its inception in early 2015 (or before). If you're wondering why write another post, there are a couple of reasons. First, the entire platform can now be run on top of Docker containers, with much simpler setup steps than before. Second, the Docker approach raises a host of new questions, and this post will hopefully address those issues and provide some solutions. It's a bit of a lengthy read, but it will be worth your time.

The easiest (and recommended) way to set up Pelias currently is through Docker containers. In early 2017, Pelias broke its API up into different micro-services spread across multiple GitHub repositories, with separate data import pipelines, which makes it look like a daunting task to understand and set up. With the new Docker setup you don't have to worry about most of the gory details of setting up each micro-service individually, or about running into serious issues while downloading data for specific regions of interest.

Getting the prerequisites out of the way !!!

System Requirements:
1. Spin up the EC2 instance —
Since Pelias supports data from multiple countries, the specs here will vary depending on how much data you need to download and which region you are interested in (at least 8 GB of RAM and 300 GB of storage recommended).
I used this AMI (ami-0653e888ec96eab9b), Ubuntu 16.04 / SSD Volume Type, on an m5a.xlarge instance with 300 GB of storage. (Scale up/down as necessary.)
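
If you prefer the AWS CLI over the console, a minimal sketch of launching such an instance looks like this; the key pair and security group ID are placeholders, so substitute your own:

# key pair and security group ID below are placeholders
aws ec2 run-instances \
  --image-id ami-0653e888ec96eab9b \
  --instance-type m5a.xlarge \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0 \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=300,VolumeType=gp2}' \
  --count 1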

(Optional) Creating a Cloud9 Environment for development — You don't necessarily need this part. As an alternative, you can just SSH in from any available client (such as PuTTY).

Install Node.js on the instance you created so Cloud9 can connect to it — https://tecadmin.net/install-latest-nodejs-npm-on-ubuntu/.

Install NodeJS on Instance

curl -sL https://deb.nodesource.com/setup_10.x | sudo -E bash -
sudo apt-get install nodejs
node -v
npm -v
which node
# Copy this path for later (usually it's /usr/bin/node)
# Optionally create a working directory for Cloud9 (default is /home/ubuntu if not set)
sudo mkdir cloud9

In your AWS Management Console, search for Cloud9

  • Select Create Environment
  • Enter name and description
  • Select Connect to remote server (SSH) option.
  • Enter Username and Host Name of instance created in previous step.
  • Copy the public SSH key into the Ubuntu instance (see the snippet just after this list).
  • Add the NodeJS binary path which we copied earlier.
  • Click Next and Continue. It should open a new Cloud9 tab, which will ask you to install a few requirements before you can start using it. Installing the Cloud9 IDE on your environment will take some time.
  • When it prompts “Installation Completed” Click Finish.
    You should be presented now with the Welcome Tab.
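
For the “copy the public SSH key” step above, the key that Cloud9 shows you just needs to be appended to authorized_keys on the instance, roughly like this (the key string is a placeholder):

# On the Ubuntu instance
echo "ssh-rsa AAAA...your-cloud9-key..." >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys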

2. Install docker and docker-compose
(There is always a DigitalOcean blog for this kind of thing, but the steps below worked for me.)

Install Docker and Docker Compose

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
apt-cache policy docker-ce
sudo apt-get install -y docker-ce
sudo systemctl status docker
sudo docker info
sudo docker run hello-world
sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version

--- This brings us to the end of the prerequisites for Pelias ---

Running the Automated Build…

Now we will step through the automated build provided by the Pelias team. Instead of running all the commands at once, we will run the download, prepare, and import steps separately for each data source to see what is happening at each step. You can also run the download, prepare, and import steps for all data sources at once using the pelias download all, pelias prepare all, and pelias import all commands.

We will be setting this up for the Portland Metro area; however, the steps remain the same for the other example projects provided. Only the DATA_DIR in the .env file and the data source attributes in pelias.json change accordingly.

Create directories for the code and data files

sudo mkdir /code /data
sudo chown 1000:1000 /code /data

Clone the repository

cd /code
git clone https://github.com/pelias/docker.git
cd docker
sudo ln -s "$(pwd)/pelias" /usr/local/bin/pelias
which pelias

Every time you run a Pelias CLI command, it looks for a .env file and checks that a couple of variables are set. You will see this file in each of the example project folders, so be sure to navigate to the respective folder before running Pelias CLI commands.

# Change working directory
cd projects/portland-metro
# Configure environment
sed -i '/DATA_DIR/d' .env
echo 'DATA_DIR=/data' >> .env
# Check all the environment variables are set correctly
sudo pelias system env

The above commands use the helper script the team has created to manage all of the Pelias Docker images (its subcommands mirror the familiar docker/docker-compose style).

Step 1. Pull down all the required images

Pull all Pelias Images

sudo docker image ls 
sudo pelias compose pull

(Run sudo docker image ls again after the pull to compare the image list before and after.)

As you can see, it has pulled down images for the following:
Data sources — openstreetmap, whosonfirst, openaddresses, polylines, transit
CSV importer (e.g. to import Geonames CSV files) — csv-importer
Elasticsearch database — elasticsearch
Performance and accuracy settings for the Pelias index — schema (from the code, it looks like it contains tuning parameters for accuracy, such as how to traverse the geographic hierarchy)
Fuzzy test suite used in the project's CI to exercise a Pelias instance — fuzzy-tester
Point-in-polygon service based on WhosOnFirst data — pip-service
Location approximation module (for non-exact address matches) — interpolation
Natural-language parser that detects geographic/administrative input text — placeholder
Address pre-processing module — libpostal-service
API module that serves HTTP requests such as search, reverse, and autocomplete — api

Note on pelias.json → All downloads and imports of data into Elasticsearch are driven by this JSON file, and you should see it in every project folder. The images downloaded in the previous step read their configuration from it: index settings for Elasticsearch, API endpoint URLs, data source URLs and the paths where data is stored, and so on.
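
For orientation, here is a heavily trimmed sketch of what a project's pelias.json roughly looks like. The key layout follows the portland-metro example, but treat the values as illustrative and defer to the actual file in the repository:

{
  "esclient": { "hosts": [{ "host": "elasticsearch" }] },
  "api": {
    "services": {
      "placeholder": { "url": "http://placeholder:4100" },
      "pip": { "url": "http://pip:4200" },
      "interpolation": { "url": "http://interpolation:4300" }
    }
  },
  "imports": {
    "whosonfirst": { "datapath": "/data/whosonfirst", "importPlace": [ 85688513, 85688623 ] },
    "openaddresses": { "datapath": "/data/openaddresses", "files": [ "us/or/portland_metro.csv" ] },
    "openstreetmap": { "datapath": "/data/openstreetmap", "import": [{ "filename": "extract.osm.pbf" }] }
  }
}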

Step 2: Start the Elasticsearch database service (sequence is important)

Start the Elasticsearch Service

sudo pelias elastic start
sudo pelias compose ps
sudo pelias elastic wait
sudo pelias elastic create

(The elastic start command brings up the Elasticsearch service; elastic create creates the pelias index that documents will be imported into.)

Step 3: Download the required data from each source (sequence is important)

Download WhosOnFirst DATA

sudo pelias download wof
Config for WOF Download in pelias.json

Looking at the JSON file, you can see we are downloading two places with IDs ‘85688513’ and ‘85688623’. You can use this Spelunker tool to find more place IDs to download data for: use the search bar at the top right, or just append ‘/id/<numeric_id>’ to the URL to get a place's details. All of the data is stored under ‘/data/whosonfirst’. See all available JSON options here. The download creates SQLite databases to store the data. (Optional) You can install sqlite3 and query the databases to verify what was downloaded.

(optional) Install sqlite3

sudo apt install sqlite3
Querying the WOF Downloaded Data
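
For a quick sanity check, something like the following works; the database filename is an assumption (list the directory first to see what actually landed there), and spr is the standard places response table that WhosOnFirst SQLite bundles ship with:

ls /data/whosonfirst/sqlite
sqlite3 /data/whosonfirst/sqlite/whosonfirst-data-latest.db "SELECT id, name, placetype FROM spr LIMIT 5;"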

Download OpenAddresses DATA

sudo pelias download oa
Config for OA Downloads in pelias.json

You can see we are downloading 7 CSV files and storing them under ‘/data/openaddresses’. Right now it is a bit of a painful process to figure out the right CSVs for different regions; thankfully, the team has already made CSVs available for custom regions. See the full CSV file listing here, and all possible configurations here.
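
A quick peek at one of the downloaded files shows the standard OpenAddresses schema; the exact path depends on which CSVs your pelias.json lists, so the one below is just an example:

head -2 /data/openaddresses/us/or/portland_metro.csv
# the header row is LON,LAT,NUMBER,STREET,UNIT,CITY,DISTRICT,REGION,POSTCODE,ID,HASH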

To see how OpenAddresses records are enriched with WhosOnFirst data (region, neighborhood, continent, etc.) through the admin lookup service, read about it here.


Download Open Street Map (OSM) DATA

sudo pelias download osm
Config for OSM Downloads in pelias.json

There are two sources for OSM data: the OSM Apache server or the EU server. As you can see, this config takes ‘.pbf’ files. If you run into issues downloading huge .pbf files (like the full-planet build or the North America build), either increase the hardware specs or download state-wise .pbf files instead. See all possible configurations here.

To see how OpenStreetMap records are enriched with WhosOnFirst data (region, neighborhood, continent, etc.) through the admin lookup service, read about it here.


Download TIGER Data

sudo pelias download tiger
Config for TIGER Downloads in pelias.json

TIGER data is used by the interpolation service, mainly for the US. As specified, we are downloading 7 US county files, which contain the address block ranges used for interpolation. [Note: if no state values are specified, address feature files for the entire US are downloaded.] The download fetches the zip files and extracts them as shapefiles into separate folders.

Download Transit Data

sudo pelias download transit

It downloads common landmarks, stops, and street intersections. It's just a bunch of CSV files made available from different sources.

Step 4: Prepare the downloaded data to be ingested into Elasticsearch (sequence is important)

Prepare Polyline data

sudo pelias prepare polylines

This checks for at least one .pbf file in ‘/data/openstreetmap’ and extracts the road network from it into polyline format. If you face issues preparing huge PBF files into polyline format, use Valhalla to do so, as recommended.
See all available configurations here.
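
To confirm the prepare step produced something, check the extract; the path and filename below assume the project's default polylines config (DATA_DIR=/data and a file named extract.0sv), so adjust if yours differs:

ls -lh /data/polylines
wc -l /data/polylines/extract.0sv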

To see how the polyline records are enriched with WhosOnFirst data (region, neighborhood, continent, etc.) through the admin lookup service, read about it here.

You might see info messages like ‘found 0 refs for way xxxxxx’. These are not error messages and you can safely ignore them.

Prepare Interpolation

sudo pelias prepare interpolation

This prepares the data from the polyline extract, OA, OSM, and TIGER into two SQLite databases: one called street.db and the other address.db. Read more about the workflow here.
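
You can peek at what the prepare step created without guessing any table names by letting sqlite3 list them (the paths assume DATA_DIR=/data as configured earlier):

sqlite3 /data/interpolation/street.db '.tables'
sqlite3 /data/interpolation/address.db '.tables'
# then count rows in whichever table .tables reports, if you want a fuller check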

Prepare Placeholder

sudo pelias prepare placeholder

The project's documentation describes it best:

“Placeholder supports searching on and retrieving tokens in different languages and also offers support for synonyms and abbreviations.
The engine includes a rudimentary language detection algorithm which attempts to detect right-to-left languages and languages which write their addresses in major-to-minor format. It will then reverse the tokens to re-order them in to minor-to-major ordering.”

It extracts all of the GeoJSON documents present in the WhosOnFirst data folders into a wof.extract file and then loads them as records into a SQLite database.
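
Again, a quick look at the output directory is enough to confirm the step worked (the path assumes DATA_DIR=/data; the exact filenames may vary):

ls -lh /data/placeholder
# expect the wof.extract file plus the SQLite store that the placeholder service reads at runtime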

Step 5: Start the import into Elasticsearch (sequence is important)

Import WOF Data

sudo pelias import wof

You might get some connection errors from the Elasticsearch bulk indexer; it will retry the failed records and finish the import.

Import OpenAddresses Data

sudo pelias import oa

As you can see, it uses ‘wof-admin-lookup’ and ‘wof-pip-service:master’ in the background.

Import OpenStreetMap Data

sudo pelias import osm

Import Polyline Data

sudo pelias import polylines

Import Transit Data

sudo pelias import transit
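
Before moving on, it is worth confirming that the documents actually landed in Elasticsearch. The compose setup exposes Elasticsearch on port 9200 and the index is named pelias, so something like this should work (if the port isn't exposed on your setup, run the same curl from inside the elasticsearch container):

curl -s 'http://localhost:9200/_cat/indices?v'
curl -s 'http://localhost:9200/pelias/_count?pretty'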

Step 6: Spin up the API services

The endpoints for each API service are configured in pelias.json.

sudo pelias compose up
sudo pelias compose ps

(The compose ps command lists the running Pelias containers, similar to docker ps output.)

Step 7: Run the tests (optional)

About 90% of the tests run fine, so we are okay. Please see the test strategy used by the team here and here.

Step 8: Running some examples to test the APIs locally
To learn more about which endpoint to use for different tasks, refer to this page.

Here are some examples I tried:

Search Endpoint

curl -G 'http://localhost:4000/v1/search' --data-urlencode "text=1901 Main St"
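
The autocomplete endpoint (mentioned earlier in the list of API operations) follows the same pattern, returning matches for partial input as you type:

curl -G 'http://localhost:4000/v1/autocomplete' --data-urlencode "text=1901 Main"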

Reverse Geo-Coding Endpoint

curl -G 'http://localhost:4000/v1/reverse' --data-urlencode "point.lat=45.433467" --data-urlencode "point.lon=-122.650095"

Point-in-Polygon Service Endpoint

curl http://localhost:4200/-122.650095/45.533467

Hope this was helpful. Happy geocoding!

Feel free to post any comments, questions, concerns, and emotional outbursts.

(Coming Up Next …) Running Batch Geo-coding using Pelias
