Automatic Identification Systems: Transformation of Data using Pandas 2.0

Oct 15, 2023

Because it integrates Apache’s PyArrow, Pandas 2.0 is a significant performance upgrade over the Pandas 1.x releases. For example, strings can be handled ten times faster (Raschka, 2023). In this article we look at the implications of this change when transforming automatic identification system (AIS) data.

Introduction

In 2023, more than 100,000 commercial vessels over 100 gross tons ply the world’s waterways (UNCTAD, 2023). Most of these vessels are equipped with AIS transmitters that send packets of data at intervals as short as two seconds (IMO, 2015). The result is a daily worldwide accumulation of 38 gigabytes of automatic identification system (AIS) data (Spire, 2023).

Managing this information is an exercise in big data. Typically, AIS data is collected over radio signals using a National Marine Electronics Association (NMEA) data standard. The NMEA-structured data is then converted and stored in a repository, where it is accessible as extensible markup language (XML) or a similar structured object.

At this point, transforming the data into something manageable can be costly in terms of processing power. The inclusion of PyArrow in Pandas 2.0 has altered the landscape somewhat. A precise understanding of benchmarking as it relates to PyArrow can be found in the notebook Benchmarking the new Pandas PyArrow Backend (see References).

Method

Step 1: The Application Programming Interface Call

When making an API call, data is generally delivered as extensible markup language (XML), JavaScript Object Notation (JSON), or comma-separated values (CSV). JSON is better suited to nested data, while XML is more complex and slower to parse.

AIS data is not complicated. Thus, CSV is likely the quickest output format for handling within Python.

On the author’s local machine, reading a 5 MB CSV file was six times faster than reading an XML file, and three times faster than reading a JSON file. This benchmarking was performed in a casual way and is in no way representative of a proper profiling effort.
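A casual comparison of this kind can be sketched as follows. This is not the author's benchmark: the sample rows, file names, and helper functions are hypothetical, and the timings will vary by machine and file size. The XML step uses the standard-library "etree" parser so that lxml is not required.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd


def time_reader(reader, path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to reader(path)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    return best


def casual_benchmark(df, workdir):
    """Write df as CSV, JSON and XML inside workdir, then time reading each file back."""
    workdir = Path(workdir)
    csv_path = workdir / "sample.csv"
    json_path = workdir / "sample.json"
    xml_path = workdir / "sample.xml"

    df.to_csv(csv_path, index=False)
    df.to_json(json_path, orient="records")
    df.to_xml(xml_path, index=False, parser="etree")

    return {
        "csv": time_reader(pd.read_csv, csv_path),
        "json": time_reader(lambda p: pd.read_json(p, orient="records"), json_path),
        "xml": time_reader(lambda p: pd.read_xml(p, parser="etree"), xml_path),
    }


# Hypothetical sample rows; a real comparison would use an actual AIS extract.
n_rows = 1000
sample = pd.DataFrame({
    "mmsi": [200000000 + i for i in range(n_rows)],
    "latitude": [51.0 + i * 0.001 for i in range(n_rows)],
    "longitude": [4.0 + i * 0.001 for i in range(n_rows)],
    "sog": [10.0 + (i % 50) * 0.1 for i in range(n_rows)],
})

with tempfile.TemporaryDirectory() as tmp:
    timings = casual_benchmark(sample, tmp)

for fmt, seconds in timings.items():
    print(f"{fmt}: {seconds * 1000:.2f} ms")
```

Taking the best of a few repeats reduces noise from disk caching, but it remains a rough indication rather than a profile.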

Step 2: Convert the CSV File to a Pandas Dataframe

First, import Pandas and ensure that you are using Pandas 2.0 and not prior versions.

import pandas as pd

print(pd.__version__)
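If the check should be automatic rather than visual, a small helper can compare the major version number. The function name here is hypothetical, and the sketch assumes the usual "major.minor.patch" version string.

```python
import pandas as pd


def is_pandas_2_or_later(version: str) -> bool:
    """Return True when the major component of a version string is 2 or higher."""
    return int(version.split(".")[0]) >= 2


if is_pandas_2_or_later(pd.__version__):
    print(f"pandas {pd.__version__}: the dtype_backend option is available")
else:
    print(f"pandas {pd.__version__}: upgrade to 2.0 or later before continuing")
```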

Then, convert the API’s CSV file to Feather. According to GPT 4.0:

Pandas Feather is a file format used for efficiently storing and exchanging data between different Python programs, particularly those that use the Pandas library for data manipulation. It allows for faster read and write operations compared to formats like CSV or JSON. You can use the Feather format to save Pandas DataFrames to disk and then load them back into your Python code quickly.

Thus:

import pandas as pd

# Read the CSV file
df = pd.read_csv('AIS6Min-2023-10-08.csv')

# Save it as a Feather file
df.to_feather('AIS6Min-2023-10-08.feather')

At this point we can make full use of Apache’s PyArrow.

Step 3: Minimize Unused Data Types

AIS data types are many. Some common examples:

mmsi             int64
tstamp          object
latitude       float64
longitude      float64
cog            float64
sog            float64
heading        float64
navstat        float64
imo              int64
name            object
vessel_type      int64
draught        float64
dest            object
eta             object
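A listing like the one above is what df.dtypes prints. To see which columns are actually costly, the per-column memory footprint can be measured as well. The sketch below uses a few hypothetical rows and a hypothetical helper name; it is an illustration, not a measurement of real AIS data.

```python
import pandas as pd

# Hypothetical rows; the object dtype is set explicitly to mirror the listing above.
sample = pd.DataFrame({
    "mmsi": pd.Series([123456789, 987654321, 555555555], dtype="int64"),
    "sog": pd.Series([12.3, 0.0, 7.8], dtype="float64"),
    "name": pd.Series(["VESSEL A", "VESSEL B", "VESSEL C"], dtype=object),
    "dest": pd.Series(["ROTTERDAM", "HAMBURG", "ANTWERP"], dtype=object),
})


def memory_by_column(df):
    """Per-column memory in bytes, largest first, counting the strings held by object columns."""
    return df.memory_usage(index=False, deep=True).sort_values(ascending=False)


print(sample.dtypes)
print(memory_by_column(sample))
```

The deep=True argument matters here: without it, object columns are counted only by their pointers and look as cheap as numeric columns.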

I recommend when first handling large amounts of AIS data that the feather file be filtered to only represent:

MMSI (int64)

timedate (object)

latitude (float64)

longitude (float64)

speed over ground (float64)

Timedate (the tstamp column) is stored as an object, which is a costly data type, but it is a must for geospatial analysis. This data can be expanded in many interesting ways, which I may tackle in future articles.
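One way to reduce the cost of the timestamp column is to parse it into a proper datetime type. The sketch below assumes the tstamp values are text in a consistent "YYYY-MM-DD HH:MM:SS" layout expressed in UTC; the actual format delivered by a given AIS provider may differ, and the helper name and sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the last timestamp is deliberately malformed.
sample = pd.DataFrame({
    "mmsi": [123456789, 123456789, 987654321],
    "tstamp": pd.Series(
        ["2023-10-08 00:00:00", "2023-10-08 00:06:00", "not a time"],
        dtype=object,
    ),
})


def parse_timestamps(df, column="tstamp"):
    """Return a copy of df with `column` parsed to timezone-aware UTC datetimes.

    Values that cannot be parsed become NaT instead of raising an error.
    """
    out = df.copy()
    out[column] = pd.to_datetime(out[column], utc=True, errors="coerce")
    return out


parsed = parse_timestamps(sample)
print(parsed.dtypes)
print(parsed["tstamp"].iloc[1] - parsed["tstamp"].iloc[0])
```

Once parsed, differences between consecutive positions of the same vessel become ordinary timedelta arithmetic.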

I personally maintain a utilities package, aisutils, which can then be used to drop the columns with costly data types:

df = aisutils.filter_headers(df, ['mmsi', 'tstamp', 'latitude', 'longitude', 'sog'])  # filter feather
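The aisutils package is the author's own and is not shown in this article. A hypothetical stand-in for filter_headers, assuming it simply keeps the named columns, might look like this:

```python
import pandas as pd


def filter_headers(df, headers):
    """Return a copy of df restricted to the columns in `headers`, in that order.

    Hypothetical stand-in for aisutils.filter_headers; the real helper may differ.
    """
    missing = [h for h in headers if h not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df.loc[:, list(headers)].copy()


# Hypothetical rows with more columns than are needed.
sample = pd.DataFrame({
    "mmsi": [123456789, 987654321],
    "tstamp": ["2023-10-08 00:00:00", "2023-10-08 00:06:00"],
    "latitude": [51.95, 52.10],
    "longitude": [4.05, 3.90],
    "sog": [12.3, 0.0],
    "name": ["VESSEL A", "VESSEL B"],
    "dest": ["ROTTERDAM", "HAMBURG"],
})

keep = ["mmsi", "tstamp", "latitude", "longitude", "sog"]
filtered = filter_headers(sample, keep)
print(filtered.columns.tolist())
```

An alternative that avoids loading the unwanted columns at all is to pass the same list to the columns argument of pd.read_feather.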

At this point the data is transformed and filtered, and ready for processing.

Conclusion

Apache’s PyArrow, integrated into Pandas 2.0 in 2023, has significantly altered the landscape of AIS processing due, in part, to its ability to handle costly data types. The package also speeds up float64, the data type that Pandas, through NumPy, was built to handle.

Quickly processing AIS data has industry implications. Some examples include faster decision-making, improved customer experience, and enhanced analytics. In future articles, we will explore explicitly how these broad aims might be accomplished in a value-driven manner.

References

(n.d.). OpenAI. Retrieved October 15, 2023, from https://openai.com/

AIS data API (XML / JSON / CSV Webservice). (n.d.). AISHub. Retrieved October 15, 2023, from https://www.aishub.net/api

Benchmarking the new Pandas PyArrow Backend. (n.d.). GitHub. Retrieved October 15, 2023, from https://github.com/rasbt/machine-learning-notes/blob/main/benchmark/pandas-pyarrow/pandas2-pyarrow.ipynb

PyArrow — Apache Arrow Python bindings — Apache Arrow v13.0.0. (n.d.). Apache Arrow. Retrieved October 15, 2023, from https://arrow.apache.org/docs/python/index.html

Raschka, S. (2023, March 4). Pandas 2.0. Retrieved October 15, 2023, from https://twitter.com/rasbt/status/1632090412117532672