Innovation & Analytics Lab Data Science

Andrew Howe
Wood Mackenzie
Jul 25, 2022

I’m a Data Scientist in the Innovation & Analytics Lab at Wood Mackenzie. What this means is that I get to spend my days working on an amazing variety of fascinating and impactful projects. I’d like to share details about one of them, but first, you may be asking what a data scientist is.

The definition I use (though I did not create it) is that a data scientist is someone who brings to the table:

  • skills in computational and technological tools
  • knowledge of mathematics and statistics
  • substantive domain expertise

Venn of Data Science

The data scientist applies this triad of skills to perform statistical analysis and mathematical modeling of data to solve a problem or answer a question. While there are likely many variations on this definition, they will tend to be broadly the same.

This may sound vague, but that is because data scientists work on a tremendous array of challenges in nearly every industry:

  • electric utilities: predicting real-time power demand
  • marketing: optimizing ad placement
  • oil & gas: predicting how best to fracture stimulate a well
  • media: building recommendation engines
  • pharmaceuticals: identifying promising medicinal compounds

I joined Wood Mackenzie in 2018 as a founding member of the Innovation & Analytics Lab. Having since been promoted to Director, I guide the analytics activities of our growing team of data scientists, set our annual L&D program, and lead our hiring, in addition to working on my own projects. In every project I learn something new: domain knowledge, technologies, and/or new ways of doing things. During my time with Wood Mackenzie, I’ve added skills in the following technologies:

  • geospatial data analysis and processing
  • data pipeline orchestration with Apache Airflow
  • containerization with Docker
  • cluster computing with Apache Spark
  • interactive data visualization with Plotly
  • simple GUI development with Dash

We are an R&D team, acting as collaborative consultants to any business unit that has an impactful problem that may be solved by analytics. We generally work on proofs of concept (PoCs). Potential projects are evaluated against a set of quantitative and qualitative criteria. To accept a project that passes our criteria, we require the business unit to commit a subject-matter expert (SME) to work on the project on at least a part-time basis, and we also require a plan to hand over and implement successful project results.

Now that I’ve cleared up what a data scientist does, and briefly described how my team works, let’s discuss one of my most interesting recent projects. I’ll start by defining the challenge, relevant available data, and the hypothesis driving the solution.

Port Congestion PoC Introduction

Challenge: Shipping ports around the world are becoming congested (see e.g. here, here, and here). As larger vessels carrying more shipping containers become more common, they spend more time in berths, and the queues of incoming vessels waiting at the ports grow longer. This causes delays in supply chains and probably also has negative environmental impacts. This PoC focused on the Los Angeles and Long Beach ports, together one of the busiest container port areas in the USA. The plot here shows average monthly waiting & berthing durations for these ports combined.

In-box Durations

Data: Historical real-time vessel positioning AIS (automatic identification system) data for all vessels worldwide, along with polygon shapes to identify port berths (shown here).

Los Angeles / Long Beach Ports

Hypothesis: Using the specified data, I should be able to develop a Monte Carlo simulation-based model to realistically simulate and forecast shipping port activity. In backtesting, some measure of central tendency of simulated berthing durations should be close to actual durations.

In this, one of my most interesting recent projects, I analyzed real-time vessel location data to simulate activity and forecast congestion in shipping ports. The project, completed in two months, was exciting for several reasons: the subject domain was completely new to me, I developed familiarity with several new technologies, and the problem is particularly timely given the global maritime supply chain blockages. I built a set of Apache Airflow pipelines to obtain real-time vessel location data from a set of web APIs, store the data in a PostgreSQL database, and perform extensive data processing, all running on a single AWS EC2 instance.
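
To give a flavour of the acquisition side, here is a minimal sketch of what such an Airflow pipeline could look like. The endpoint URL, connection ID, table, and column names are illustrative assumptions rather than the actual implementation, and the real pipelines include many more cleansing and processing tasks.

```python
# Sketch of a data acquisition DAG; all names below are hypothetical.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

API_URL = "https://example.com/vessel-positions"  # placeholder, not the real endpoint


def ingest_positions(**context):
    """Poll the positions API and write raw AIS records to PostgreSQL."""
    records = requests.get(API_URL, timeout=60).json()
    hook = PostgresHook(postgres_conn_id="vessels_db")
    rows = [(r["mmsi"], r["timestamp"], r["lon"], r["lat"]) for r in records]
    hook.insert_rows(
        table="raw_ais_positions",
        rows=rows,
        target_fields=["mmsi", "recorded_at", "lon", "lat"],
    )


with DAG(
    dag_id="ais_ingestion",
    start_date=datetime(2020, 1, 1),
    schedule_interval=timedelta(minutes=15),  # poll the APIs every 15 minutes
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_positions", python_callable=ingest_positions)
```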

Data Processing

The first substantial challenge was to effectively and efficiently compute historical port calls (when a vessel goes to a berth) and berthing durations. This was a challenge for several reasons: there is a lot of data, it’s quite noisy, and some berths are very close to each other. The data acquisition pipeline polls the APIs every 15 minutes, truncates each datapoint to half-hour increments, and takes the latest truncated positions. Several other data cleansing steps follow.

Evaluating thousands of geospatial coordinates for spatial inclusion in several berth polygons is a big task, and could not be done in Python without extracting and iterating over data from the database. Using the principle of “keep as much computation on a database server as possible”, I developed a second pipeline that uses the PostGIS extension to efficiently compute when vessels are in berths. Because some berths are very close, a vessel headed to berth A could sail through the polygon of berth B, making it seem that the vessel berthed in both sequentially. The pipeline goes through a secondary phase of evaluating berth assignments to clean up these and related issues. After all this, it is relatively straightforward to compute berthing durations for each port call.
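
As a rough illustration of the approach (not the actual pipeline code), the core in-berth check can be pushed to the database with a PostGIS spatial join. The table and column names below are hypothetical.

```python
# Sketch of the in-berth computation, assuming hypothetical ais_positions and
# berth_polygons tables; the spatial work stays on the database server.
import psycopg2

IN_BERTH_SQL = """
    SELECT p.mmsi,
           b.berth_id,
           p.recorded_at
    FROM   ais_positions AS p
    JOIN   berth_polygons AS b
           -- PostGIS point-in-polygon test, executed entirely in the database
           ON ST_Contains(b.geom, ST_SetSRID(ST_MakePoint(p.lon, p.lat), 4326))
    ORDER  BY p.mmsi, p.recorded_at;
"""

with psycopg2.connect("dbname=vessels") as conn, conn.cursor() as cur:
    cur.execute(IN_BERTH_SQL)
    in_berth_rows = cur.fetchall()  # (mmsi, berth_id, recorded_at) tuples
```

Grouping consecutive in-berth timestamps per vessel and berth then yields the port calls and their berthing durations.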

Port Activity Simulation and Forecasting — “Simulcasting”

The port simulcasting algorithm starts from a specified timestamp and determines the vessels in berth, waiting at the ports, and currently steaming toward the ports. The recent history of all vessel berth visits and berthing durations is then gathered. This data is used for two purposes. First, the algorithm identifies best-fitting probability distributions for the berthing durations. Second, since I didn’t know vessel berth assignments a priori, the algorithm predicts each vessel’s destination berth as whichever berth it has visited most frequently in recent history.
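
A minimal sketch of these two steps might look like the following, assuming a pandas DataFrame `history` of recent port calls with columns mmsi, berth_id, and duration_hours; the candidate distributions and the goodness-of-fit criterion (here a Kolmogorov–Smirnov statistic) are illustrative choices, not necessarily those used in the PoC.

```python
# Sketch: fit berthing-duration distributions and predict destination berths.
import pandas as pd
from scipy import stats

CANDIDATES = [stats.lognorm, stats.gamma, stats.weibull_min]  # illustrative


def best_fit(durations: pd.Series):
    """Fit each candidate distribution and keep the one with the lowest KS statistic."""
    best = None
    for dist in CANDIDATES:
        params = dist.fit(durations)
        ks_stat, _ = stats.kstest(durations, dist.name, args=params)
        if best is None or ks_stat < best[0]:
            best = (ks_stat, dist, params)
    return best[1], best[2]  # (distribution family, fitted parameters)


def predict_berth(history: pd.DataFrame, mmsi: int) -> str:
    """Predict a vessel's destination berth as its most frequently visited berth."""
    visits = history.loc[history["mmsi"] == mmsi, "berth_id"]
    return visits.mode().iloc[0]
```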

Port activity is then simulated using a set of FIFO (first-in, first-out) queues. There is a queue for each berth, and a separate queue for each port; the latter handles vessels for which I couldn’t predict a destination berth. When a vessel enters a berth, I simulate its berthing duration by drawing a random sample from the appropriate best-fitting distribution. The port activity simulation is replicated many times, and the empirical distributions of waiting & berthing durations are computed. Results for a sample vessel are shown here.

Sample Results

In this example, the vessel was expected to wait about 270 hours, after which berthing could take between almost 3 and about 11 days.
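
To make the queueing mechanics concrete, here is a highly simplified sketch of one replication for a single berth’s queue, reusing the hypothetical best_fit() output from above and assuming every queued vessel is already waiting at the start timestamp; the real algorithm simulates all berth and port queues together, including vessels still steaming toward the ports.

```python
# Sketch of one simulation replication for a single berth's FIFO queue.
import numpy as np

rng = np.random.default_rng()


def simulate_berth(queue, fitted, start_time=0.0):
    """Serve queued vessels in FIFO order, sampling each berthing duration."""
    dist, params = fitted                      # from best_fit() above
    waiting, berthing = {}, {}
    berth_free_at = start_time
    for vessel in queue:                       # FIFO: first arrived, first served
        waiting[vessel] = berth_free_at - start_time
        berthing[vessel] = float(dist.rvs(*params, random_state=rng))
        berth_free_at += berthing[vessel]      # berth stays occupied until then
    return waiting, berthing


# Replicating many times gives the empirical duration distributions, e.g.:
# replications = [simulate_berth(["vessel_a", "vessel_b"], fitted) for _ in range(500)]
```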

Results Evaluation

When the port simulcasting algorithm is run with a timestamp in the past, I can evaluate its accuracy by comparing the waiting & berthing duration empirical distributions against the actual observed durations. To do so for a specified execution timestamp, I take the following steps (sketched in code after the list):

  • match simulated and actual port calls (a difficult fuzzy match)
  • compute median simulated berthing durations (by vessel)
  • compute errors between median and actual berthing durations
  • compute empirical distributions of errors
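
A minimal sketch of the last three steps, assuming the fuzzy matching has already produced a DataFrame `matched` with one row per (vessel, replication) and illustrative column names:

```python
# Sketch of the per-run error computation; column names are hypothetical.
import pandas as pd


def berthing_errors(matched: pd.DataFrame) -> pd.Series:
    """Median simulated berthing duration per vessel minus its actual duration."""
    median_sim = matched.groupby("vessel")["sim_berthing_hours"].median()
    actual = matched.groupby("vessel")["actual_berthing_hours"].first()
    return median_sim - actual  # negative values mean the simulation underestimated


# berthing_errors(matched).quantile([0.05, 0.5, 0.95]) then summarises the
# empirical error distribution for one execution timestamp.
```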

With data since the start of 2020, I ran 500 replications of the algorithm as of the first of each month between March 2020 and December 2021. As can be seen above, out of these 22 runs, only once did the median berthing duration error exceed a day (berthing durations tended to be underestimated).

Epilogue

As I stated, this was a very interesting project for three reasons: the subject domain was completely new to me, I developed familiarity with several new technologies, and the problem is particularly timely. With the extensive database of vessel AIS data available from our VesselTracker product, Wood Mackenzie is uniquely positioned to help shipping companies solve important problems around optimizing their supply chain logistics, given port constraints. Several major carriers have expressed strong interest in a potential product built on this PoC, and have even offered to partner with Wood Mackenzie on this challenge.

Wood Mackenzie’s data organization is always looking for qualified individuals with a variety of relevant skillsets. Indeed, the Innovation & Analytics Lab is currently hiring for several roles at various levels of seniority. If this project sounds like something you’d like to spend your days doing, feel free to reach out to us.
