Easy Access to All Points of Interest Data
Scraping OpenStreetMap and exploring POI in Cloudant and Jupyter Notebooks
When working with data, the format of the raw data is not always user-friendly. For instance, the format could be one large binary file, or the data could be spread across hundreds of text files. An easy way to solve this problem is to convert the data and store it in a database.
As an example of how to make working with data simpler, Raj Singh and I converted all the Points of Interest data from the global OpenStreetMap (OSM) project to GeoJSON files, which we then stored and are periodically updating in IBM Cloudant, a database service based on Apache CouchDB™. The data is now easily accessible through an API, which you can try for free.
OpenStreetMap is built by a community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.
Read along to learn how we built it and how you can use the data. (Note: Should you reproduce all the work described below, you will likely incur costs for Cloudant.)
OpenStreetMap Data
The first step is to download the most recent data for each continent. We used Geofabrik, which extracts, selects, and processes free geo data from OpenStreetMap. The examples that follow use data from Europe, but for complete global coverage, all steps need to be repeated for each continent.
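For instance, the Europe extract can be fetched with a few lines of Python; this is just a sketch, and the same file can equally well be downloaded manually from the Geofabrik download page (it is a multi-gigabyte file, so expect it to take a while).
import urllib.request

# Fetch the latest OSM extract for Europe from Geofabrik (a very large file)
url = 'https://download.geofabrik.de/europe-latest.osm.pbf'
urllib.request.urlretrieve(url, 'europe-latest.osm.pbf')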
Converting the Data
The second step is to extract the Points of Interest (POI) from this large file. We used Osmosis, which is a command-line Java application for processing OSM data. You can easily install it on a Mac with brew. We used it to extract all the POI data based on a selection of features:
osmosis --read-pbf europe-latest.osm.pbf \
--tf accept-nodes \
aerialway=station \
aeroway=aerodrome,helipad,heliport \
amenity=* building=school,university craft=* emergency=* \
highway=bus_stop,rest_area,services \
historic=* leisure=* office=* \
public_transport=stop_position,stop_area railway=station \
shop=* tourism=* \
--tf reject-ways --tf reject-relations \
--write-xml Europe.nodes.osm
The file Europe.nodes.osm contains all POI in Europe, but also some data that we do not need. A handy tool for scrubbing OSM data is osmconvert, which can drop the unneeded data from the file:
osmconvert Europe.nodes.osm --drop-ways --drop-author --drop-relations --drop-versions -o=Europe.poi.osm
The third step is to convert the POI data to the GeoJSON format. A good tool for this job is ogr2ogr, which is part of the GDAL library and can be installed with brew install gdal. Note that we are only interested in points, so only the POI point data is added to the GeoJSON file Europe.poi.json:
ogr2ogr -f GeoJSON Europe.poi.json Europe.poi.osm points
Uploading Data to Cloudant
Each of the POI objects from the large GeoJSON file needs to be stored in a separate document in the database. To upload them to Cloudant we used couchimport, which does exactly that (and more).
IBM Cloudant is a NoSQL database that you can try out for free after signing up for a Bluemix account. Cloudant has a perpetually free tier, but please check Cloudant pricing if you anticipate heavier long-term use. For example, the POI data for the whole world took up 5.26 GB of storage!
export COUCH_TRANSFORM=./osm_poi_transform.js
export COUCH_URL='https://username:password@opendata.cloudant.com'
cat Europe.poi.json | couchimport --db poi-db --type json --jsonpath 'features.*'
These commands upload all POI features to a database called poi-db. The file osm_poi_transform.js contains extra information to use the osm_id as the document id and to format the keywords.
Keeping the data up to date is done by a weekly Python script that downloads the OSM change file and uses the tools above to create a GeoJSON file with all new or updated POI. As the change file contains both new and updated POI, the Cloudant Python library is used instead of couchimport. With this library, an existing POI record can be replaced, or a new one added, which is how our POI API service keeps the database current.
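A minimal sketch of that upsert step with the python-cloudant Document context manager is shown below; the credentials, the change-file name, and the exact document layout are assumptions based on the GeoJSON features stored above.
import json
from cloudant.client import Cloudant
from cloudant.document import Document

# Credentials are placeholders; the database name matches the upload above
client = Cloudant('username', 'password',
                  url='https://username.cloudant.com', connect=True)
db = client['poi-db']

# Hypothetical change file with new or updated POI as GeoJSON features
with open('Europe.changes.poi.json') as f:
    changes = json.load(f)

for feature in changes['features']:
    # Use the OSM id as the document id, as the couchimport transform does above
    doc_id = str(feature['properties']['osm_id'])
    # The context manager fetches the document if it already exists and
    # saves it (creating it if necessary) when the block exits
    with Document(db, doc_id) as doc:
        doc['type'] = 'Feature'
        doc['geometry'] = feature['geometry']
        doc['properties'] = feature['properties']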
Easy Access to the Data
Now that the database is ready, it is time to look at the data inside it. You can visualize GeoJSON inside the Cloudant dashboard, or by using the Cloudant APIs. To be able to use Cloudant’s geospatial functionalities, a design document with a geospatial index function needs to be added as in the screenshot below.
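The same index can also be created programmatically. Here is a sketch with the python-cloudant library, where the credentials are placeholders and the design document name (geodd) and index name (geoidx) are just examples:
from cloudant.client import Cloudant

# Credentials are placeholders
client = Cloudant('username', 'password',
                  url='https://username.cloudant.com', connect=True)
db = client['poi-db']

# A design document with a geospatial index function over the GeoJSON geometry
db.create_document({
    '_id': '_design/geodd',
    'st_indexes': {
        'geoidx': {
            'index': ('function(doc) { '
                      'if (doc.geometry && doc.geometry.coordinates) { '
                      'st_index(doc.geometry); } }')
        }
    }
})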
After the index has been built (processing can take a while for a large database), you can explore the data in the dashboard by, for instance, drawing a box on a map as below. Interacting with the map in the dashboard will also give you the corresponding API call for this query. It’s a convenient feature for further extending your query with some hints from the getting started example and Cloudant Geospatial documentation.
Analyse the Data in a Python Notebook
Another way to access and analyse the data is in a Python notebook. The examples below are designed so that you can easily copy and paste them into a Jupyter Notebook. You can run your notebook locally or in the cloud. We ran ours in the cloud using the IBM Data Science Experience (DSX) platform, which you can try out for free.
With the pandas and PixieDust packages, you can use the URL from the Cloudant dashboard above to start exploring. The code below loads a JSON file with the 200 POIs from the map above into a pandas DataFrame. To also load the properties of the POI data, add &include_docs=true to the URL.
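For example, a sketch along these lines, where the account, database, design document, and bounding box in the URL are example values you would replace with the query URL copied from your own dashboard:
import pandas as pd

# Example geo query URL; copy yours from the Cloudant dashboard and
# append &include_docs=true to also fetch the full documents
url = ('https://opendata.cloudant.com/poi-db/_design/geodd/_geo/geoidx'
       '?bbox=-0.25,51.45,0.05,51.55&limit=200&include_docs=true')

# The response is a JSON object with a bookmark and a list of rows
df = pd.read_json(url)
df.head()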
The DataFrame needs some cleaning up, as all the variables are combined into one column: rows. Extracting the fields you are interested in can be done with a lambda function, which is included below. It uses the function try_field for each row, which checks whether a field exists and, if it does, writes the value to a new column. This code example only checks a few fields, but there are many more, as you can see in the features selected with osmosis above. After adding the new columns, the original columns bookmark and rows can be dropped.
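A sketch of that clean-up, assuming each element of rows holds the fetched document under a doc key with the GeoJSON properties and geometry inside (the list of fields is just a small sample):
def try_field(row, field):
    """Return the value of a field from a POI document, or None if it is missing."""
    try:
        return row['doc']['properties'][field]
    except (KeyError, TypeError):
        return None

# Pull a few interesting tags out of the nested documents into their own columns
for field in ['name', 'shop', 'amenity', 'public_transport']:
    df[field] = df['rows'].apply(lambda row: try_field(row, field))

# Latitude and longitude come from the GeoJSON geometry of each document
df['lon'] = df['rows'].apply(lambda row: row['doc']['geometry']['coordinates'][0])
df['lat'] = df['rows'].apply(lambda row: row['doc']['geometry']['coordinates'][1])

# The original nested columns are no longer needed
df = df.drop(['bookmark', 'rows'], axis=1)
df.head()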
Create a Map with PixieDust
PixieDust is a great Python package for quickly visualizing your data in Jupyter Notebooks. The formatted data above can be plotted on a map with the following code. First, you'll need to add an extra column to specify which points are a shop, public transport, or an amenity. Then you can make a map simply by using the display() command and selecting a map from the menu.
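A sketch of that step, using the columns created above (the category and count column names are examples):
import pixiedust

# One category column for colouring the points on the map
df['category'] = 'other'
df.loc[df['amenity'].notnull(), 'category'] = 'amenity'
df.loc[df['public_transport'].notnull(), 'category'] = 'public transport'
df.loc[df['shop'].notnull(), 'category'] = 'shop'

# Numeric columns that can be used as the value to plot
df['shops'] = df['shop'].notnull().astype(int)
df['amenities'] = df['amenity'].notnull().astype(int)

# PixieDust adds an interactive chart menu to the output;
# pick Map and the mapbox renderer in the Options dialog
display(df)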
PixieDust has two map renderers. To visualize your POI data, you'll need to choose mapbox. (Currently, the Google Maps renderer in PixieDust only uses simple location data, like country codes, and not latitude & longitude.) As such, you'll need a Mapbox access token, which you can get for free by signing up for an account. Enter it in your visualization's Options dialog, like so:
You’ll want to specify latitude and longitude as your keys, with a numeric value like shops or amenities as your value.
Use the Points of Interest API
You can also try this analysis using our POI API. Connecting to it is a little simpler and cleaner than loading the data directly from Cloudant, and you can grab more data in one call.
Try it out by replacing the corresponding notebook cells with the following snippets:
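For example, something like the sketch below, where the host name, path, and query parameters are placeholders to be replaced with the values from the POI API documentation:
import pandas as pd
import requests

# Placeholder endpoint and parameters; substitute those from the POI API docs
POI_API_URL = 'https://<poi-api-host>/poi'
params = {
    'bbox': '-0.25,51.45,0.05,51.55',  # west,south,east,north (example values)
    'limit': 2000,
}
features = requests.get(POI_API_URL, params=params).json()['features']

# Flatten the GeoJSON features into a DataFrame, ready for PixieDust
df = pd.DataFrame([f['properties'] for f in features])
df['lon'] = [f['geometry']['coordinates'][0] for f in features]
df['lat'] = [f['geometry']['coordinates'][1] for f in features]
df.head()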
Some Final Thoughts
As you might have noticed, there is no password needed to access this data set. As the OSM data is open data, we are keeping this POI database open as well. Feel free to have a play with the data. We would love to hear what you are building!
If you enjoyed this article, please ♡ it to recommend it to other Medium readers.