How Windward Leverages lakeFS For Resilient Data Ingestion

Implementing CI/CD-inspired workflows built atop lakeFS operations prevents inconsistent data and brings increased reliability to our analytics platform.

Lior Resisi
Data rocks

--

Photo by Cameron Venti on Unsplash

At Windward, our Maritime Artificial Intelligence Analytics (MAIA™️) platform delivers predictive intelligence on global maritime conditions to hundreds of businesses. Customers across different industries — like oil & energy, commodity trading, financial institutions, and more — leverage our platform to optimize operations and mitigate risk at sea.

Thanks to big data processing, the MAIA™️ platform can aggregate over 30 unique sources, consisting of billions of data points. This includes both proprietary and open-source data, including AIS, GIS layers, weather conditions, satellite images, and nautical charts. This information is processed to serve as inputs to proprietary algorithms that accurately determine vessel identity, vessel location, cargo visibility, voyage patterns, and more.

Underpinning the platform are scalable, resilient data pipelines that incorporate and process the data sources mentioned above. If these pipelines were to fail or show incorrect numbers, the fallout could cost our customers and their business millions of dollars. Therefore, our pipelines need to be fault-tolerant. To understand how we build Windward’s pipelines in such a way, let’s dive a bit deeper.

Workflow Before lakeFS

The first step for data entering our platform is to land in S3, separated by hourly partitions according to when it was received. This is often, but not always, the same as when the actual events occurred.

In the case of late-arriving events, it is necessary to move the data to the correct partition according to when it actually took place. To do this, we run a separate process that looks at the events contained within a file and if needed, move them to the correct partition.

Sounds simple, right? Unfortunately, it is not so simple to get the type of transactional guarantees we would like for this copy-and-delete operation on an object store like S3. For example, how do you recover when the copy operation succeeds but then the delete of the original file fails? And even more concerning, how to prevent a downstream job from reading the ingested dataset right after a copy but before the delete?

Achieving Isolation and Atomicity with lakeFS

Luckily we learned about lakeFS and how it can provide isolation and transactional guarantees for operations over an object store.

After deploying lakeFS in our data environment and creating a repository, the process for data ingestion now looks like the diagram below.

By utilizing lakeFS branches and commits, we can guarantee new data gets moved to the correct partition without interfering with downstream consumers of the data.

lakeFS Exports For Incremental Adoption

While leveraging a lakeFS repository and its related operations works great for many of our datasets, we didn’t want to be dependent on repositories for every dataset. Or more accurately, we wanted the option for some datasets to be referenced by their normal S3 prefix.

To get the best of both worlds, we made use of the lakeFS export operations as a final step in our jobs. This allows for copying all data from a given lakeFS commit to a designated S3 path. Other applications down the stack can read directly from the exported S3 location, without the need to be familiar with lakeFS.

Wrapping Up

Since introducing lakeFS to our production data environment, we’ve enjoyed the benefits of atomic and isolated operations in our data pipelines. This has allowed us to spend more time improving other aspects of our data platform, and less time dealing with the fallout from race conditions and partially failed operations.

Further Reading

Learn more about Windward and the Maritime Artificial Intelligence Analytics Platform on Windward Blog.

Learn more about lakeFS and how it’s transforming modern data lakes on the lakeFS website.

--

--