Analyzing Geospatial Data in Apache Spark

Rachit Arora
IBM Data Science in Practice
6 min read · Mar 24, 2021

Geospatial data plays a crucial role in forecasting, spatial analytics, and reporting, especially in the fields of logistics and finance. There are also more and more use cases where technologies like artificial intelligence are used in tandem with geospatial applications. The purpose of this blog is to walk through a scenario that uses the geospatial algorithms available with Apache Spark.

Yellow taxi (image courtesy of Clay Banks)

Apache Spark is an open source, scalable, massively parallel, in-memory execution engine. With it, you can build analytics applications that use prebuilt machine-learning and graph-analysis algorithms written specifically to execute in parallel and in memory. Spark also supports interactive SQL query processing and real-time streaming analytics, and you can write applications in programming languages such as Java, Python, R, and Scala.

Value adds in IBM Analytics Engine: because it is built on open source Spark, Spark SQL, Spark Streaming, Spark MLlib, and GraphX are all part of what's available. Additional algorithms include time series and geospatial algorithms. Analytics Engine also supports Parquet columnar encryption and data skipping, and works with open source cluster managers such as YARN, Mesos, the Spark standalone scheduler, and Kubernetes.

The Spark offering in IBM Cloud Pak for Data is also known as Analytics Engine powered by Apache Spark / IBM Watson Studio. Spark environments offer additional algorithms built specially for analytics in parallel and in memory, and some of these additional algorithms in Analytics Engine work specifically within the geospatial domain. These algorithms were built in collaboration with IBM Research and can be used to solve many analytical use cases.

You can use the geospatio-temporal library to expand your data science analysis to include location analytics by gathering, manipulating and displaying imagery, GPS, satellite photography and historical data.

You can use the geospatio-temporal library in the Spark environments available in Cloud Pak for Data.

Use case

The data that we will use is made available by the New York City Taxi and Limousine Commission (NYC Taxicab Trips Data). It consists of taxicab trip pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. We will use data from the first week of December 2015 for the analysis in this notebook. The NYC Yellow Taxicab data for December 2015 can be downloaded from here, and the NYC Point of Interest data can be downloaded from here. Once the two datasets are available on IBM Cloud Object Storage, they can be used with this notebook; please provide the dataset URLs as input parameters when running the cell. To demonstrate the Geospatial APIs with fewer compute resources, you can limit the input data size. To keep only the first week of December 2015 from the downloaded file, run the following shell command:

grep 2015-12-0[1-7] yellow_tripdata_2015-12.csv > yellow_tripdata_2015-12_week1.csv

Example code used in this story can be downloaded from here.

Geospatial Algorithms in Analytics Engine: Getting started

Before you can start using the library in an application, you must register the STContext to access the spatio-temporal (st) functions.

1. Register STContext:
from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)
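# Register the geospatial (ST_*) SQL functions with the Spark SQL session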
spark._jvm.org.apache.spark.sql.types.SqlGeometry.registerAll(spark._jsparkSession)

This shows STContext creation in a notebook

2. Create the Spark data frame for the input data:

This shows loading the data into pandas, then creating a Spark dataframe from it and adding a schema

PS: You can also create the Spark dataframe directly; there is no need to create it from a pandas dataframe first, as shown in the sketch below.
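As an illustration, here is a minimal sketch of reading the CSV directly into a Spark dataframe (the file path is an assumption; adjust it to wherever your copy of the data lives):

# A rough sketch: read the downloaded NYC Yellow Taxi CSV straight into Spark.
# (The grep-filtered week-one file has no header row, so either point at the
# original download as done here, or supply the column names yourself.)
taxi_df = (
    spark.read
         .option("header", "true")       # first row holds the column names
         .option("inferSchema", "true")  # let Spark infer the column types
         .csv("yellow_tripdata_2015-12.csv")
)
taxi_df.printSchema()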

3. Create a view named "yellowtaxi" to run queries against:

Query:

SELECT *, ST_Point(pickup_longitude, pickup_latitude) as pickuplocation from yellowtaxi
A screenshot of a Jupyter notebook running the query against the yellowtaxi view
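Putting steps 2 and 3 together, here is a minimal sketch (assuming the dataframe from the previous sketch is named taxi_df; column names follow the NYC TLC schema):

# Expose the dataframe as a temporary SQL view named "yellowtaxi"
taxi_df.createOrReplaceTempView("yellowtaxi")

# Build a point geometry from the pickup coordinates for each trip
pickup_points = spark.sql(
    "SELECT *, ST_Point(pickup_longitude, pickup_latitude) AS pickuplocation FROM yellowtaxi"
)
pickup_points.select("pickuplocation").show(5, truncate=False)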

Technique used in the following scenarios

We are leveraging geohash encoding and decoding here to achieve aggregation. Geometries (for example, pickup and drop-off locations) are encoded into geohashes, which can then be used for aggregation; the geohashes can later be decoded back into geometries to get meaningful locations (for example, the center of a hotspot). The granularity of the aggregation is controlled by the geohash depth. You can find more details on the geospatial SQL functions used in these queries here.
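To make the pattern concrete, here is a minimal sketch of the encode / aggregate / decode round trip using the same SQL functions that appear in the queries below (filters and extra columns from the full queries are omitted; the second argument to ST_ContainingGeohash controls the geohash granularity):

# Encode each pickup point into a geohash, aggregate by geohash, then decode
# the geohash back to a representative point (its bounding-box center).
hotspots = spark.sql("""
    SELECT ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lon,
           ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat,
           count(*) AS cnt
    FROM (SELECT cast(ST_ContainingGeohash(ST_Point(pickup_longitude, pickup_latitude), 300) AS string) AS geohash
          FROM yellowtaxi)
    GROUP BY geohash
    ORDER BY cnt DESC
""")
hotspots.show(5)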

Geospatial Algorithms in Analytics Engine: Simple analytics

Scenario: We want to find the top pickup and drop-off points for yellow taxis in New York City from our data.

1. Query for top pickup points:

SELECT avg(PASSENGER_COUNT), avg(TRIP_DISTANCE), avg(FARE_AMOUNT), avg(TIP_AMOUNT),
       ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lon,
       ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat,
       count(*) AS COUNT
FROM (SELECT PASSENGER_COUNT, TRIP_DISTANCE, FARE_AMOUNT, TIP_AMOUNT,
             cast(ST_ContainingGeohash(ST_Point(PICKUP_LONGITUDE, PICKUP_LATITUDE), 300) AS string) AS geohash
      FROM yellowtaxi
      WHERE PICKUP_LONGITUDE < -60 AND PICKUP_LATITUDE > 30
        AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE AND DROPOFF_LATITUDE != PICKUP_LATITUDE)
GROUP BY geohash
ORDER BY COUNT DESC
LIMIT 50
Notebook code for creating the Spark dataframe of top pickup points

2. Display the top pickup points on a map

Map of New York City showing the top pickup points, plotted from the Spark dataframe created with the geospatial functions in Spark SQL
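As a rough sketch of how such a plot might be produced (folium and the variable names here are assumptions, not part of the original notebook):

import folium

# Pull the small result set (50 rows) into pandas for plotting;
# top_pickup_query is assumed to hold the SQL string shown above.
top_pickups = spark.sql(top_pickup_query).toPandas()

# Center the map on Manhattan and add one circle marker per hotspot
nyc_map = folium.Map(location=[40.75, -73.98], zoom_start=12)
for _, row in top_pickups.iterrows():
    folium.CircleMarker(
        location=[row["lat"], row["lon"]],
        radius=5,
        popup="trips: {}".format(int(row["COUNT"])),
        fill=True,
    ).add_to(nyc_map)
nyc_map  # rendered inline in a Jupyter notebook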

3. Query for top drop-off points

SELECT avg(PASSENGER_COUNT), avg(TRIP_DISTANCE), avg(FARE_AMOUNT), avg(TIP_AMOUNT),
       ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lon,
       ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat,
       count(*) AS COUNT
FROM (SELECT PASSENGER_COUNT, TRIP_DISTANCE, FARE_AMOUNT, TIP_AMOUNT,
             cast(ST_ContainingGeohash(ST_Point(DROPOFF_LONGITUDE, DROPOFF_LATITUDE), 300) AS string) AS geohash
      FROM yellowtaxi
      WHERE DROPOFF_LONGITUDE < -60 AND DROPOFF_LATITUDE > 30
        AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE AND DROPOFF_LATITUDE != PICKUP_LATITUDE)
GROUP BY geohash
ORDER BY COUNT DESC
LIMIT 50
This shows creation of the Spark dataframe of top drop-off points using geospatial functions in Spark SQL

4. Display the top drop-off points on a map

Map of New York City showing the top drop-off points, plotted from the Spark dataframe created with the geospatial functions in Spark SQL

Geospatial Algorithms in Analytics Engine: Advanced analytics

Scenario: Suppose that, as a taxi company, you want to analyze which spots are popular on weekends, or at which hours weekday demand for taxis peaks, so that you can plan the number of taxis required.

1. Query to get the demand for taxis on an hourly basis on weekdays:

SELECT hour(tpep_dropoff_datetime), count(*)
FROM yellowtaxi
WHERE cast(tpep_dropoff_datetime AS date) BETWEEN cast('2015-12-01' AS date) AND cast('2015-12-04' AS date)
  AND hour(tpep_dropoff_datetime) BETWEEN 6 AND 19
  AND DROPOFF_LONGITUDE < 0 AND DROPOFF_LATITUDE > 0
  AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE AND DROPOFF_LATITUDE != PICKUP_LATITUDE
GROUP BY hour(tpep_dropoff_datetime)
This shows creation of the weekday-demand dataframe with Spark SQL

2. Display the demand using a graph

Bar graph of hourly weekday demand for taxis, peaking at 7 PM
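A rough sketch of how such a bar chart might be produced from the query result (the variable names and the use of pandas plotting are assumptions):

import matplotlib.pyplot as plt

# weekday_demand_query is assumed to hold the SQL string shown above
demand = spark.sql(weekday_demand_query).toPandas()
demand.columns = ["hour", "trips"]

# Plot trips per hour of day as a bar chart
ax = demand.sort_values("hour").plot(x="hour", y="trips", kind="bar", legend=False)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Number of trips")
plt.show()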

3. Query to get the top weekend spots

SELECT ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lon,
       ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat,
       count(*) AS count
FROM (SELECT cast(ST_ContainingGeohash(ST_Point(PICKUP_LONGITUDE, PICKUP_LATITUDE), 100) AS string) AS geohash
      FROM yellowtaxi
      WHERE cast(tpep_dropoff_datetime AS date) BETWEEN cast('2015-12-01' AS date) AND cast('2015-12-04' AS date)
        AND (hour(tpep_pickup_datetime) > 20 OR hour(tpep_pickup_datetime) < 6)
        AND PICKUP_LONGITUDE < 0 AND PICKUP_LATITUDE > 0
        AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE AND DROPOFF_LATITUDE != PICKUP_LATITUDE)
GROUP BY geohash
ORDER BY count DESC
LIMIT 50
This is a Jupyter notebook snapshot showing creation of the Spark dataframe for popular weekend spots

4. Display the top weekend spots on a map

Map of New York City showing the top weekend spots, plotted from the Spark dataframe created with the geospatial functions in Spark SQL

Summary

This story and the examples shown are meant only as a primer on using geospatial functions with Analytics Engine. You can go through our other examples and the other value adds supported in Analytics Engine here. We encourage you to give Analytics Engine a try and to discover new ways to query geospatial data.

This blog was written in collaboration with Raghu, Linsong Chu, Mudhakar, and Pranita from IBM Research, and with intern Rachna Jain from IBM India Software Labs.

Please reach out to me via LinkedIn if you have any feedback or comments.
