An Introduction to Data Science at project44

Milad Davaloo
Published in project44 TechBlog
Jul 5, 2022

Supply chain is a truly global industry, and it has been in the limelight often over the last couple of years. A number of major events — Covid-19 and its downstream effects (including the congestion at ports on the US West Coast), the Ever Given getting stuck in the Suez Canal, and Russia’s invasion of Ukraine, to name a few — have impacted our lives in many ways, including the way goods move around the world. And yet, even with this increased attention on global supply chains, it’s still very difficult to make sense of these interconnected systems and how they might behave in the future. And that is precisely why I was attracted to this industry in the first place.

Working as a data scientist at project44 is truly exciting. We are in a unique position: we have a direct line of sight into how the supply chains of the world operate. Every day, I am amazed at the range of interesting problems we come across and the amount of value we can unlock for our customers and the ecosystem as a whole. No other company has the breadth and depth of data that we have — we can see the global supply chain in motion.

Let me ground my excitement in an example and tell you a little bit more about the data we get to work with at p44.

Imagine a hypothetical footwear company, XYZ Kicks, that wants to track shipments of its newest sneaker release from a manufacturing plant in Southeast Asia to one of its distribution centers in Memphis. For now, we can ignore how the raw material got to the manufacturing plant in the first place, or what happens to those shoes once they reach the warehouse in Memphis. (Although we track those movements too!)

These newly manufactured shoes would be stuffed into a container at the plant. That container would be loaded onto a truck and taken to a nearby port, where it would be loaded onto a container vessel. It would then travel thousands of miles across the ocean to a port on the US West Coast, get loaded onto a rail car, travel across the country to a rail yard outside Memphis, be transferred onto a truck, and finally make it to the intended warehouse in Memphis. How would we track this container? What does this data look like? Can we provide a reliable ETA for the container as it moves through the different stages of its journey to the final warehouse?

Figure: the path taken by a container from a plant in Vietnam to a DC in Memphis

Let’s step back for a moment. When we bring on XYZ Kicks as a new customer, we first set up integrations to take in their shipment data from their contracted ocean, air, rail, and truckload carriers. This includes container event data, which tells us everything that happens to every single container, and shipment metadata, which gives us information such as pricing, the parties involved, and the commodities being shipped, to name just a few. Independently, we also ingest live vessel position data (collected via satellites and terrestrial stations), giving us real-time coverage of all vessels around the world. We take in vessel schedules, which tell us which rotations vessels belong to and which ports they plan to visit. And we take in more supply-chain-specific data as well.
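
To make this a little more concrete, here is a rough sketch, in Python, of what a normalized container event and its shipment metadata might look like on our side. The field names and types are purely illustrative, not our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ContainerEvent:
    """One normalized event in a container's journey (illustrative fields)."""
    container_id: str        # e.g. "MSCU1234567"
    event_code: str          # e.g. "GATE_OUT", "VESSEL_LOADED", "RAIL_DEPARTED"
    location: str            # a UN/LOCODE such as "VNSGN" or "USLGB"
    event_time: datetime     # when the event actually happened
    reported_time: datetime  # when the carrier told us about it
    source: str              # which integration reported it

@dataclass
class ShipmentMetadata:
    """Slow-changing facts about the shipment (illustrative fields)."""
    shipment_id: str
    shipper: str             # e.g. "XYZ Kicks"
    origin: str
    destination: str
    commodity: str
    containers: list[str]
```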

We are simultaneously connected to, and ingesting data from, hundreds of sources at any given moment. Each of these sources presents its own unique challenges. If you think there is a set of agreed-upon standards for communicating supply chain information, think again. Even where standards exist, they are treated more as “pseudo-standards”: each data provider has its own nuances and adds a unique touch to the way it operates and communicates with us and other parties.

The challenge is to take in all this out-of-order, duplicated, and conflicting data, with bad timestamps or completely missing events, and make sense of what is actually happening to every shipment in a holistic and scalable way. That way, when a customer wants to know where their goods are and when they will reach the next major milestone, we have a coherent story to tell.
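
Continuing the illustrative ContainerEvent sketch from above, here is a deliberately naive version of that “coherent story” step: keep only the most recently reported version of each event, and order what survives by when it actually happened. Our production pipelines are far more involved, but the shape of the problem is the same.

```python
def build_timeline(events: list[ContainerEvent]) -> list[ContainerEvent]:
    """Collapse messy raw events into one coherent story per container.

    Naive policy: for each (event_code, location) pair, keep the version
    reported most recently, then sort the survivors by event_time.
    Real pipelines must also handle clock skew, conflicting sources,
    and events that never arrive at all.
    """
    latest: dict[tuple[str, str], ContainerEvent] = {}
    for ev in sorted(events, key=lambda e: e.reported_time):
        latest[(ev.event_code, ev.location)] = ev  # later reports win
    return sorted(latest.values(), key=lambda e: e.event_time)
```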

This is a big challenge for each source in isolation, and it gets even more complex when you begin to blend these different sources of data together, because they do not always agree with each other. For example, a stream of container event data from an ocean carrier may tell you that a vessel carrying a container will arrive at its destination port in two days. At the same time, the latest update to that vessel’s schedule shows an extra port call before it gets there. And the vessel’s satellite positions might suggest it is only about a day away from the port. Which one is to be trusted? What should our arrival prediction be? And how should we relay all this complex information and logic to our customers in an easily understandable way? These problems are critical for our customers, and they are extremely difficult to solve. They have plagued the industry for quite some time, and getting them right requires a marriage of deep industry and deep technical expertise.
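
To illustrate the kind of arbitration involved, here is a toy precedence rule over the three signals from the example above. The function and the policy itself are hypothetical, not our production logic; real conflict resolution weighs far more context than this.

```python
from datetime import datetime
from typing import Optional

def resolve_arrival(carrier_eta: Optional[datetime],
                    schedule_eta: Optional[datetime],
                    ais_eta: Optional[datetime]) -> tuple[Optional[datetime], str]:
    """Toy arbitration between conflicting arrival signals.

    One simplistic policy: trust an estimate derived from live vessel
    positions most, fall back to the published schedule, and use the
    carrier's own ETA only as a last resort.
    """
    for eta, source in ((ais_eta, "vessel positions"),
                        (schedule_eta, "schedule"),
                        (carrier_eta, "carrier")):
        if eta is not None:
            return eta, source
    return None, "no signal"
```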

Putting data quality aside for a moment, each of these data sets presents other unique challenges as well. For example, with the vessel position data, we are dealing with billions of rows of geospatial data, growing at a rapid pace every hour of every day. Ingesting this data efficiently, detecting and removing outliers, and down-sampling it without loss of information are just a few of the initial steps we are involved in as data scientists.
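
As a flavor of that cleaning work, here is a minimal sketch of two of those steps: rejecting physically impossible jumps between position reports, then thinning the remainder to at most one ping per time window. The thresholds are made up for illustration.

```python
import math

def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in nautical miles."""
    r_nm = 3440.065  # Earth radius in nautical miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r_nm * math.asin(math.sqrt(a))

def clean_track(pings, max_knots=40.0, min_gap_s=300):
    """Filter and down-sample a vessel track.

    pings: list of (timestamp_s, lat, lon) tuples, sorted by time.
    Drops pings that imply an implausible speed relative to the last
    kept ping, then keeps at most one ping per min_gap_s window.
    """
    kept = []
    for t, lat, lon in pings:
        if kept:
            t0, lat0, lon0 = kept[-1]
            dt_h = (t - t0) / 3600.0
            if dt_h <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_nm(lat0, lon0, lat, lon) / dt_h > max_knots:
                continue  # implausible jump: likely a bad position fix
            if t - t0 < min_gap_s:
                continue  # down-sample: one ping per window is enough
        kept.append((t, lat, lon))
    return kept
```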

With vessel schedules, we receive daily updates on each vessel’s planned port calls. Reconciling what has happened in the past with what is planned for the future, while accounting for every change in schedule, is not as straightforward as it seems. And it gets even more complex to build a single timeline for every vessel when we receive multiple schedules for the same vessel from different sources and those schedules do not match: different sources often publish different sets of stops, and even when the locations match, the arrival and departure times can be missing or completely different.
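
A naive version of that reconciliation might fold every published schedule for a vessel into a single timeline by letting later publications win, port by port, as in the sketch below. The data shapes are assumptions, and real reconciliation also has to handle repeated, reordered, and dropped stops.

```python
from datetime import datetime

def merge_schedules(schedules):
    """Fold several published schedules for one vessel into one timeline.

    schedules: list of (published_at, port_calls) pairs, where each port
    call is a dict like {"port": "USLAX", "eta": datetime or None,
    "etd": datetime or None}. Naive rule: for each port, trust the call
    from the most recently published schedule.
    """
    best = {}
    for published_at, calls in sorted(schedules, key=lambda s: s[0]):
        for call in calls:
            best[call["port"]] = call  # later publications overwrite earlier
    # Order the merged calls by ETA, pushing unknown ETAs to the end
    return sorted(best.values(), key=lambda c: c["eta"] or datetime.max)
```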

And of course, one of the main products we own on the Data Science team at p44 is predicting when containers will be delivered to their final destinations. We maintain several models and continue to iterate on them as we ingest more data and learn more about how best to model the supply chain industry.
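
As a sketch of what such a model could look like in miniature, here is a gradient-boosted regressor trained on a few hypothetical features (distance remaining, current speed, typical port dwell, schedule delta) with entirely made-up numbers. Our actual models, features, and training data are far richer than this.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features per shipment: distance remaining (nm),
# current speed (kn), median dwell at next port (h), schedule delta (h).
# All values below are fabricated purely for illustration.
X_train = np.array([
    [5200.0, 18.5, 36.0,  4.0],
    [ 800.0, 16.0, 48.0, 12.0],
    [2400.0, 20.0, 24.0, -2.0],
])
y_train = np.array([310.0, 95.0, 150.0])  # hours until final delivery

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# Predict the hours remaining for a container currently mid-Pacific
print(model.predict([[3100.0, 19.0, 30.0, 1.5]]))
```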

This, of course, is just a glimpse into the many types of problems we get to tackle at p44.

As data scientists at p44, we are in the unique position of working with the most comprehensive supply chain datasets ever put together and building models that paint a singular view of how this incredibly complex industry operates. This, to me, is one of the main reasons the experience is so rewarding: not only do we get to unlock a tremendous amount of value for our customers, but we get to do it by solving really interesting and challenging problems that no one has solved before. And while we have made great strides and built many great products so far, we feel we are only 1% of the way there.
