Leveraging Geospatial Framework in Big Data Analytics

One of the efforts that can support Sustainable Development Goals through data management

Published in ZebraX, Mar 11, 2022

There are many ways industries can support the Sustainable Development Goals (SDGs) set by the United Nations (UN) as part of the 2030 Agenda. One of them is to use Geographic Information Systems (GIS) in big data analytics. GIS can support the SDGs through data management, analysis and understanding, increased collaboration, sustainable design and planning, and sound decision making.

The need for a big data environment in geospatial analytics

Today, the volume of geospatial data is growing faster than ever. Geospatial analysis usually calls for specialized tools that analyze spatial data to generate insights. Many users rely on classic geographic information systems (GIS), specialized software systems for storing, managing, analyzing, and visualizing geospatial data. Classic GIS, however, was not designed for data at this scale; it was designed primarily for data that fits on a single machine.

One of the main reasons we have so much geospatial data today, with more generated every day, is the spread of mobile devices and geolocation sensors. They are everywhere, from our personal devices to industrial equipment and vehicles, and all of them generate geospatial (spatio-temporal) data.

This huge amount of geospatial data holds valuable information that can be used in many kinds of applications: industry, environment, retail, and many more. Below are examples of geospatial applications that are already widely implemented:

Climate modeling and analysis
Urban planning
Consumer insights
Route optimization
Insurance and Fraud analysis

The rapid development of distributed, parallel computing on commodity hardware or in the cloud gives access to more computing power that is reliable, scalable, easy to launch, and cheaper. Big data platforms, with all of their potential, now pave the way for geospatial big data analytics.

Meet Apache Sedona

The Apache Software Foundation incubator runs a project to extend the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. This platform is called Apache Sedona (formerly GeoSpark). Apache Sedona was created and architected by Dr. Mohamed Sarwat, an expert on database systems and spatial analytics who is an assistant professor of computer science at Arizona State University.

Apache Sedona is a cluster computing system and general-purpose engine for processing large-scale spatial data. It provides a set of Spatial Resilient Distributed Datasets (SRDDs) and DataFrames that can efficiently load, process, and analyze large-scale spatial data across a cluster of machines, and it offers Scala, SQL, Java, and Python APIs for programming analytics pipelines on top of geospatial data. Using Sedona, we can scale our geospatial work across many machines and run it in the cloud, such as on Amazon EMR, Google Cloud Dataproc, or Databricks, or on an on-premises Hadoop cluster.

A major advantage of Apache Sedona is its ability to load data into memory and split the in-memory copies into many equally sized partitions across the worker nodes of a cluster, while still preserving the spatial proximity of the data. This is done by grouping spatial objects based on their spatial proximity, which leads to fast query speeds.
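To make the partitioning idea concrete, here is a minimal plain-Python sketch that splits points at the median along alternating axes, so partitions stay roughly equal in size while nearby points stay together. It is only an illustration of the principle, not Sedona's implementation; the function name `kd_partition` and the sample coordinates are made up for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def kd_partition(points: List[Point], depth: int, axis: int = 0) -> List[List[Point]]:
    """Split points into at most 2**depth partitions of near-equal size.

    Each split sorts along one axis and cuts at the median, so points
    that are close together tend to land in the same partition.
    """
    if depth == 0 or len(points) <= 1:
        return [points]
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    next_axis = 1 - axis  # alternate between longitude and latitude
    return (kd_partition(ordered[:mid], depth - 1, next_axis)
            + kd_partition(ordered[mid:], depth - 1, next_axis))


# Hypothetical vessel positions: two clusters that are far apart.
positions = [
    (106.8, -6.1), (106.9, -6.0), (106.7, -6.2), (106.85, -6.05),
    (116.8, -1.2), (116.9, -1.3), (116.7, -1.1), (116.95, -1.25),
]
partitions = kd_partition(positions, depth=2)
for i, part in enumerate(partitions):
    print(i, part)
```

With this sample, every partition holds two points and no partition mixes the two clusters, which is the property that lets a spatial query skip partitions far from the area of interest.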

Supported Data Sources

Apache Sedona can load data from various sources and formats such as CSV, TSV, WKT, WKB, GeoJSON, Shapefile, PostGIS, spatial Parquet, and NetCDF/HDF. Spatial RDDs can hold the following types of spatial data: Point, Multi-Point, Polygon, Multi-Polygon, Line String, Multi-Line String, GeometryCollection, and Circle.

Application in Shipping Industry

For maritime operations, geolocation plays an important role. It is important in marine navigation for the ship’s officer to know the vessel’s position while in the open sea and also in congested harbors and waterways. While at sea, accurate position, speed, and heading are needed to ensure the vessel reaches its destination in the safest, most economical, and timely fashion that conditions will permit.

Tracking devices such as GPS allow organizations to record the time and place of any event. Once organizations can link their assets to a location, they can monitor those assets, generate insights, and build analytics models from the available data. Geospatial analysis uses this data to build maps, graphs, and statistics that make complex data more understandable.

In this article, we will look into two use cases for maritime operations: geofencing and vessel activity prediction. The objective of geofencing is to raise an alert when a vessel exits a virtual boundary set up around a geographical location. The second use case aims to help ship officers fill in timesheets by predicting the vessel’s activity every 15 minutes, so that the timesheet is generated automatically and the ship officer can focus on more important tasks.
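As a rough illustration of the geofencing objective, the sketch below checks whether a position lies inside a polygon using the standard ray-casting test and reports the fixes at which a vessel leaves it. The fence, the track, and the function names are hypothetical, and coordinates are treated as planar, which is only reasonable for small areas.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def inside_fence(point: Point, fence: List[Point]) -> bool:
    """Ray-casting test: True if point lies inside the polygon `fence`."""
    x, y = point
    inside = False
    n = len(fence)
    for i in range(n):
        x1, y1 = fence[i]
        x2, y2 = fence[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def exit_alerts(track: List[Point], fence: List[Point]) -> List[int]:
    """Indices of fixes where the vessel moves from inside to outside."""
    alerts = []
    was_inside = None
    for i, fix in enumerate(track):
        now_inside = inside_fence(fix, fence)
        if was_inside and not now_inside:
            alerts.append(i)
        was_inside = now_inside
    return alerts


# Hypothetical rectangular fence and a track that leaves it once.
fence = [(106.0, -7.0), (108.0, -7.0), (108.0, -5.0), (106.0, -5.0)]
track = [(107.0, -6.0), (107.5, -6.0), (108.5, -6.0), (109.0, -6.0), (107.9, -6.0)]
print(exit_alerts(track, fence))  # prints [2]
```

Only the transition from inside to outside triggers an alert, so a vessel that stays outside does not keep re-firing the alarm on every fix.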

Data visualization

The most interesting part of geospatial analysis is that it is highly visual. So, let’s make some visualizations to get a better understanding of the data. The picture below shows the routes assigned to the asset. In this example we show 3 routes: “Kalimantan 1 — Jawa 1”, “Sumatra 1 — Jawa 2”, and “Sumatra 1 — Jawa 3”. Normally, all vessels should follow these routes unless conditions do not allow them to.


From the distribution of the data points, it is apparent that the port areas are denser than any other area. This is because vessels at a port are loading or unloading, which takes some time to finish. As a result, we get more data points around the port areas, with near-zero movement.

Another insight from the map is that we have outliers around the South Sumatra area. Our vessels are not assigned to that area, so these points possibly resulted from a glitch in our devices.

Geofencing

Deviation Measurement

To build an alert system, we need to specify the trigger that activates it. In our case, the deviation distance from the original route is a good metric, although deviation from the route is sometimes inevitable in special cases such as extreme weather conditions.
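One way to measure the deviation distance is the shortest distance from the vessel position to any leg of the planned route. The sketch below is a minimal version under stated assumptions: it projects coordinates onto a flat local plane (adequate for distances of tens of kilometers, not for ocean-scale legs), and the route, position, and function names are hypothetical rather than taken from our system.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees
EARTH_RADIUS_M = 6_371_000.0


def to_local_xy(p: Point, origin: Point) -> Tuple[float, float]:
    """Project lon/lat to meters on a flat plane centered at `origin`."""
    lon, lat = map(math.radians, p)
    lon0, lat0 = map(math.radians, origin)
    x = (lon - lon0) * math.cos(lat0) * EARTH_RADIUS_M
    y = (lat - lat0) * EARTH_RADIUS_M
    return x, y


def distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Distance in meters from p to the segment a-b."""
    # Use p as the origin (0, 0) of the local plane.
    ax, ay = to_local_xy(a, p)
    bx, by = to_local_xy(b, p)
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(ax, ay)
    # Fraction along the segment of the closest point, clamped to [0, 1].
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / seg_len_sq))
    return math.hypot(ax + t * dx, ay + t * dy)


def route_deviation(p: Point, route: List[Point]) -> float:
    """Smallest distance in meters from p to any leg of the route."""
    return min(distance_to_segment(p, a, b) for a, b in zip(route, route[1:]))


# Hypothetical route along the equator and a vessel 0.1 degrees north of it.
route = [(110.0, 0.0), (112.0, 0.0)]
vessel = (111.0, 0.1)
print(round(route_deviation(vessel, route)))
```

For the sample above the result is roughly 11 km, which matches 0.1 degrees of latitude. An alert rule can then compare this value with a threshold agreed with the stakeholders.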

The picture below shows the original routes overlaid with the vessel positions, color-coded by deviation distance.

Of the 3 routes, the most interesting finding is on the “Kalimantan 1 — Jawa 1” route. The map shows large and frequent deviations by our vessels. The deviation is quite consistent and forms a pattern that resembles a different route. The deviation statistics for this route are:

Route name: “Kalimantan 1 — Jawa 1”
Average deviation (meter): 24760.914
Standard deviation (meter): 24840.5

Given the result, it is possible that vessels traveling from “Kalimantan 1” to “Jawa 1” are following a new route. This is a good finding. We can bring this information to the shipping company and tell them that their current system has possibly not been updated with the new route.

Incorporating Time for Deviation measurement

As stated earlier, deviation is sometimes inevitable in special cases such as bad weather. If a captain decides to deviate from the original route, we may allow it as long as the ship returns once the weather improves. One simple way to accommodate this is to add a time window to the alert system. We can set the time window to 30 minutes, 1 hour, or even more, as long as the stakeholders agree on the number. Within this time window the alarm stays off; once the window has elapsed, the alarm notifies as usual.
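The time-window rule can be sketched as a small state machine: remember when the vessel first went off route, reset that clock whenever it comes back, and fire only when the deviation has lasted longer than the agreed window. The threshold, the window length, and the sample readings below are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Observation = Tuple[datetime, float]  # (timestamp, deviation in meters)


def first_alarm(observations: List[Observation],
                max_deviation_m: float,
                grace: timedelta) -> Optional[datetime]:
    """Return the time the alarm first fires, or None if it never does.

    The alarm fires once the vessel has stayed beyond `max_deviation_m`
    continuously for longer than `grace`.
    """
    off_route_since: Optional[datetime] = None
    for ts, deviation in observations:
        if deviation <= max_deviation_m:
            off_route_since = None  # back on route, reset the clock
            continue
        if off_route_since is None:
            off_route_since = ts
        if ts - off_route_since > grace:
            return ts
    return None


# Hypothetical readings every 15 minutes, starting at midnight.
start = datetime(2022, 1, 1, 0, 0)
deviations = [100, 6000, 7000, 200, 8000, 9000, 9500, 9800]
readings = [(start + timedelta(minutes=15 * i), d) for i, d in enumerate(deviations)]

print(first_alarm(readings, max_deviation_m=5000, grace=timedelta(minutes=30)))
```

In this sample the first excursion lasts only 15 minutes and is forgiven; the second one starts at 01:00 and triggers the alarm at 01:45, the first reading more than 30 minutes later.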

Ship Activity Prediction

The objective is clear and the solution seems simple. If we have the vessel’s location, then we can probably tell which activity is being carried out, since activities are basically bound to a geolocation. Loading and unloading, for example, rarely happen in the middle of the ocean; they happen at the port.

We have 36 different activities, and here is the scatter plot of vessel locations and their corresponding activities. We expected groups of activities tied to specific locations to emerge. But no, the plot shows no pattern at all. It looks like a complete mess, and it is impossible to derive rules from it, whether by clustering or by human interpretation of the visualization.

Modeling approach

The case is a bit more complex than we thought, so we built a predictive model instead of deriving rules. A supervised technique is suitable here since we have all the labels. Below are the general steps that we followed in the modeling process.

We used 28 features, including time-series features (highlighted with a yellow box below) that capture the vessel’s movement over the last 2 hours.

With all the requirements in place, we built a predictive model using the ZebraX advanced analytics platform (called ZX AnalytiX). There was no need to hard-code the pipeline; we just clicked and specified the algorithm, hyperparameters, feature engineering, and filtering process that we wanted.

Here is a screenshot of the experiment report of our modeling process using the Random Forest algorithm. The F1 score of our model, 0.7, is quite good for a first trial. From the graph, there seems to be no indication of overfitting.
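As an illustration of what time-series features over a 2-hour window can look like, the sketch below derives lagged speeds and a few summary statistics from the samples preceding each prediction. The actual 28 features are not listed in this article, so the feature names, the use of speed, and the 15-minute sampling interval are assumptions made for this example.

```python
from statistics import mean
from typing import Dict, List

# 8 samples x 15 minutes = 2 hours (the sampling interval is an assumption).
WINDOW = 8


def movement_features(speeds: List[float], i: int) -> Dict[str, float]:
    """Features for sample i built from the WINDOW samples before it."""
    if i < WINDOW:
        raise ValueError("not enough history for a 2-hour window")
    window = speeds[i - WINDOW:i]
    # speed_lag_1 is the most recent sample, speed_lag_8 the oldest.
    features = {f"speed_lag_{k}": window[-k] for k in range(1, WINDOW + 1)}
    features["speed_mean_2h"] = mean(window)
    features["speed_max_2h"] = max(window)
    features["speed_change_2h"] = window[-1] - window[0]
    return features


# Hypothetical speeds in knots: a vessel slowing down as it approaches a port.
speeds = [10.0, 10.0, 9.0, 8.0, 6.0, 4.0, 2.0, 1.0, 0.0, 0.0]
feats = movement_features(speeds, i=8)
print(feats["speed_mean_2h"], feats["speed_change_2h"])  # prints 6.25 -9.0
```

Features like these give the model a sense of trend (slowing down, speeding up, stationary) that a single position cannot provide.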
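For readers unfamiliar with the metric, the sketch below computes the F1 score per class and averages it across classes (macro averaging). The report does not state which averaging scheme produced the 0.7, so macro averaging is an assumption here, and the three activity labels are hypothetical stand-ins for the 36 real ones.

```python
from typing import Dict, List


def f1_per_class(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 score for every label that appears in y_true or y_pred."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels for six 15-minute intervals.
y_true = ["sailing", "sailing", "loading", "loading", "anchoring", "anchoring"]
y_pred = ["sailing", "loading", "loading", "loading", "anchoring", "sailing"]
print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))  # prints 0.656
```

With many activity classes of uneven frequency, looking at the per-class scores alongside the average helps reveal which activities the model confuses.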

Summary

For geofencing, this article shows that we can complement an existing tracking system with additional logic that alerts us when a vessel exits our virtual boundary. The logic is also context-aware: external information is analyzed before the user is alerted, which makes the alarm smart.

The other application shown in this article is an automated reporting system that helps ship officers produce more consistent and accurate timesheets. The time-consuming tasks of creating a report are handled by a program, so the officers can focus on more important tasks.

👋 Thanks for reading.

Salam…

Reference

https://sedona.apache.org/

https://www.gislounge.com/geospatial-data-and-sustainable-development-goals/

Author

Gilang Samudra K. (Sr. Data Scientist)
Puteri Aulia Indrasti (Data Engineer)
Daniel Beltsazar M. (Data Scientist Intern)