Analyzing Geospatial Data in Apache Spark
Geospatial data plays a crucial role in data forecasting, spatial analytics, and reporting, especially in the fields of logistics and finance. Additionally, there is an increasing number of use cases where technologies such as artificial intelligence are used in tandem with geospatial applications. The purpose of this blog is to share a scenario in which you use geospatial algorithms in Apache Spark.
Apache Spark is an open source, scalable, massively parallel, in-memory execution engine. With it, you can build analytics applications that include prebuilt machine-learning and graph-analysis algorithms written specifically to execute in parallel and in memory. It also supports interactive SQL query processing and real-time streaming analytics, and you can write analytics applications in programming languages such as Java, Python, R, and Scala.
The Spark offering in IBM Cloud Pak for Data is also known as Analytics Engine powered by Apache Spark/IBM Watson Studio. Spark environments offer additional algorithms built specifically for analytics in parallel and in memory. Some of these additional algorithms in the Analytics Engine are for working specifically within the geospatial domain. These algorithms were built in collaboration with IBM Research and can be used to solve many analytical use cases.
You can use the geospatio-temporal library to expand your data science analysis to include location analytics by gathering, manipulating, and displaying imagery, GPS data, satellite photography, and historical data.
You can use the geospatio-temporal library in Cloud Pak for Data in the following environments:
- Run Spark jobs on your Cloud Pak for Data cluster by using the Spark jobs REST APIs of Analytics Engine powered by Apache Spark. See Getting started with the library.
- Run notebooks in Spark environments in Watson Studio. See Geospatio-temporal library for notebooks.
Use case
The data that we will use is made available by the New York City Taxi and Limousine Commission (NYC Taxicab Trips Data). It consists of taxicab trip pick-up and drop-off dates and times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. We will use data from the first week of December 2015 for analysis in this notebook. The NYC Yellow Taxicab data for December 2015 can be downloaded from here. The NYC Point of Interest data can be downloaded from here. Once the two datasets are available on IBM Cloud Object Storage, they can be used with this notebook; provide the dataset URLs as input parameters when running the cell. To demonstrate the geospatial APIs while using less compute, you can limit the input data size. To extract only the first week of December 2015 from the downloaded file, run the following shell command:
grep "2015-12-0[1-7]" yellow_tripdata_2015-12.csv > yellow_tripdata_2015-12_week1.csv
Example code used in this story can be downloaded from here.
Geospatial Algorithms in Analytics Engine: Getting Started
Before you can start using the library in an application, you must register STContext to access the st functions.
1. Register STContext:
from pyst import STContext
# Register STContext, which is the main entry point
stc = STContext(spark.sparkContext._gateway)
spark._jvm.org.apache.spark.sql.types.SqlGeometry.registerAll(spark._jsparkSession)
2. Create the Spark data frame for the input data:
Note: you can create the Spark data frame directly; there is no need to create it from a Pandas dataframe.
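As a sketch of this step, the helper below reads the trimmed taxi CSV into a Spark DataFrame. The file path is an assumption; substitute the location of your uploaded file on IBM Cloud Object Storage or the local filesystem.

```python
# Sketch: load the trimmed NYC taxi CSV into a Spark DataFrame.
# The default path is an assumption -- point it at your own copy.
def load_taxi_data(spark, path="yellow_tripdata_2015-12_week1.csv"):
    """Read the taxi CSV with a header row and inferred column types."""
    return (spark.read
                 .option("header", "true")       # first line holds column names
                 .option("inferSchema", "true")  # derive numeric/timestamp types
                 .csv(path))

# df = load_taxi_data(spark)   # `spark` is the active SparkSession
```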
3. Create a view named "yellowtaxi" to run queries against. Query:
SELECT *, ST_Point(pickup_longitude, pickup_latitude) as pickuplocation from yellowtaxi
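Steps 3 and the query above can be sketched as follows, assuming `df` is the taxi DataFrame from step 2 and `spark` is the session with the geospatial functions registered earlier.

```python
# Sketch: register the "yellowtaxi" view and attach a pickup-location
# geometry column using the ST_Point function shown above.
PICKUP_POINT_QUERY = (
    "SELECT *, ST_Point(pickup_longitude, pickup_latitude) AS pickuplocation "
    "FROM yellowtaxi"
)

def with_pickup_points(spark, df):
    """Expose the DataFrame as a view, then add a point geometry per trip."""
    df.createOrReplaceTempView("yellowtaxi")
    return spark.sql(PICKUP_POINT_QUERY)
```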
Technique used in the following scenarios
We are leveraging geohash encoding and decoding here to achieve the purpose of aggregation. Geometries (e.g., pickup and drop-off locations) are encoded into geohashes, which can be used for aggregation; the geohashes can later be decoded back into geometries to get meaningful locations (e.g., the center of a hotspot). The granularity of the aggregation is controlled by the geohash depth. You can find more details on the geospatial SQL functions used in the above query here.
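The encode-aggregate-decode pattern can be illustrated with a toy example. The sketch below uses a plain lat/lon rounding grid as a stand-in for real geohashing; the library's ST_ContainingGeohash and ST_GeohashDecode do this properly at a chosen depth.

```python
# Toy illustration of encode -> aggregate -> decode, with a rounding grid
# standing in for geohashing. Points in the same cell aggregate together;
# decoding a cell back to its center yields a meaningful hotspot location.
from collections import Counter

def cell_of(lat, lon, size=0.01):
    """'Encode': snap a point to its containing grid cell."""
    return (round(lat // size), round(lon // size))

def cell_center(cell, size=0.01):
    """'Decode': recover a representative point (the cell center)."""
    return ((cell[0] + 0.5) * size, (cell[1] + 0.5) * size)

# Two nearby Flatiron pickups plus one near the Statue of Liberty.
pickups = [(40.7411, -73.9897), (40.7413, -73.9891), (40.6892, -74.0445)]
counts = Counter(cell_of(lat, lon) for lat, lon in pickups)
hotspot_cell, n = counts.most_common(1)[0]   # the busiest cell (2 pickups)
hotspot = cell_center(hotspot_cell)          # decode back to a location
```

Making the cell size smaller (a deeper geohash) gives finer-grained hotspots at the cost of sparser counts per cell.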
Geospatial Algorithms in Analytics Engine: Simple Analytics
Scenario: We want to get the top pickup and drop points for Yellow Taxi in New York City from our data.
1. Query for top pickup points:
SELECT avg(PASSENGER_COUNT), avg(TRIP_DISTANCE), avg(FARE_AMOUNT), avg(TIP_AMOUNT),
ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lon,
ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat, count(*) AS COUNT
FROM (SELECT PASSENGER_COUNT, TRIP_DISTANCE, FARE_AMOUNT, TIP_AMOUNT,
cast(ST_ContainingGeohash(ST_Point(PICKUP_LONGITUDE, PICKUP_LATITUDE), 300) AS string) AS geohash
FROM yellowtaxi WHERE PICKUP_LONGITUDE < -60 AND PICKUP_LATITUDE > 30 AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE
AND DROPOFF_LATITUDE != PICKUP_LATITUDE) GROUP BY geohash ORDER BY COUNT DESC LIMIT 50
2. Display the top pickup points on a map
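One way to render the aggregated hotspots is with folium, which is assumed to be available in the notebook environment. In this sketch, `rows` stands for the collected result of the query above, with `lat`, `lon`, and `COUNT` fields per row.

```python
# Sketch: plot hotspot rows on an interactive map with folium (assumed
# available). `rows` is the collected top-pickup-points query result.
def hotspot_map(rows, center=(40.75, -73.97)):
    import folium  # imported lazily so the sketch stays optional

    m = folium.Map(location=center, zoom_start=12)
    for r in rows:
        folium.CircleMarker(
            location=(r["lat"], r["lon"]),
            radius=max(3, r["COUNT"] ** 0.5),  # scale marker by trip count
            popup=f"{r['COUNT']} pickups",
        ).add_to(m)
    return m
```

The same helper works for the drop-point and weekend-spot results below, since all three queries emit the same lon/lat/count shape.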
3. Query for top drop points:
SELECT avg(PASSENGER_COUNT), avg(TRIP_DISTANCE), avg(FARE_AMOUNT), avg(TIP_AMOUNT),
ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lon,
ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat,count(*) AS COUNT
FROM (SELECT PASSENGER_COUNT,TRIP_DISTANCE,FARE_AMOUNT, TIP_AMOUNT,
cast(ST_ContainingGeohash(ST_Point(DROPOFF_LONGITUDE, DROPOFF_LATITUDE), 300) AS string) AS geohash
FROM yellowtaxi WHERE DROPOFF_LONGITUDE < -60 AND DROPOFF_LATITUDE > 30 AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE
AND DROPOFF_LATITUDE != PICKUP_LATITUDE) GROUP BY geohash ORDER BY COUNT DESC LIMIT 50
4. Display the top drop points on a map
Geospatial Algorithms in Analytics Engine: Advanced Analytics
Scenario: Let’s say that as a taxi company, you want to analyze which spots are popular on weekends, or when demand for taxis peaks on weekdays, so that you can plan the number of required taxis based on the analysis.
1. Query to get the demand for taxis on an hourly basis on a weekday:
select hour(tpep_dropoff_datetime), count(*) from yellowtaxi
where cast(tpep_dropoff_datetime as date) BETWEEN cast('2015-12-01' as date) and cast('2015-12-04' as date)
and hour(tpep_dropoff_datetime) between 6 and 19 AND DROPOFF_LONGITUDE < 0 AND DROPOFF_LATITUDE > 0
AND DROPOFF_LONGITUDE != PICKUP_LONGITUDE AND DROPOFF_LATITUDE != PICKUP_LATITUDE
group by hour(tpep_dropoff_datetime)
2. Display the demand using a graph
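A simple bar chart is enough to show the hourly pattern. This sketch assumes matplotlib is available in the notebook, and that `rows` is the collected (hour, count) result of the query above.

```python
# Sketch: bar-chart the hourly demand with matplotlib (assumed available).
# `rows` is a list of (hour, count) pairs from the query result.
def plot_hourly_demand(rows):
    import matplotlib.pyplot as plt  # imported lazily; optional dependency

    hours = [h for h, _ in rows]
    counts = [c for _, c in rows]
    fig, ax = plt.subplots()
    ax.bar(hours, counts)
    ax.set_xlabel("Hour of day (drop-off)")
    ax.set_ylabel("Number of trips")
    ax.set_title("Weekday taxi demand by hour")
    return fig
```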
3. Query to get the top weekend spots:
select ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) as lon,
ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) as lat, count(*) as count
from (select cast(ST_ContainingGeohash(ST_Point(PICKUP_LONGITUDE, PICKUP_LATITUDE), 100) as string)
as geohash from yellowtaxi where cast(tpep_dropoff_datetime as date) BETWEEN cast('2015-12-01' as date) and cast('2015-12-04' as date)
and (hour(tpep_pickup_datetime) > 20 or hour(tpep_pickup_datetime) < 6) AND PICKUP_LONGITUDE < 0 AND PICKUP_LATITUDE > 0 AND
DROPOFF_LONGITUDE != PICKUP_LONGITUDE AND DROPOFF_LATITUDE != PICKUP_LATITUDE) GROUP BY geohash ORDER BY count
DESC LIMIT 50
4. Display the top weekend spots on a Map
Summary
This story and the examples shown are meant only as a primer on using geospatial functions with Analytics Engine. You can explore our other examples and the other value-added features supported in Analytics Engine here. We encourage you to give Analytics Engine a try and to discover new ways to query geospatial data.
This blog was written in collaboration with colleagues from IBM Research (Raghu, Linsong Chu, Mudhakar, and Pranita) and intern Rachna Jain from IBM India Software Labs.
Please reach out to me via LinkedIn if you have any feedback or comments.