Creating a Spark Streaming ETL pipeline with Delta Lake at Gousto

This is how we reduced our data latency from two hours to 15 seconds with Spark Streaming.

André Sionek
Gousto Engineering & Data

--

We used to get files from the software that controls Gousto’s factory once a day via an SFTP server: several CSV files containing atomic data for each box that went through the production line on the previous day, such as the timestamps of when each box entered and exited the line. This data was used by Gousto's operations teams to measure our Supply Chain performance and detect issues on the production lines.

We had an ingestion pipeline composed of a Lambda Function that moved files from the SFTP server to our Data Lake in S3, plus a job on EMR triggered by Airflow. The whole pipeline ingested the CSVs, applied some simple transformations, saved the tables as Parquet and exposed the data to users with Redshift Spectrum.

Our old pipeline to publish factory data to end users. Gousto is a food company, so we call our data lake layers Raw, Cooked and Served. Diagram by the author.
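To give a feel for the batch job described above, here is a minimal sketch of a Spark read-transform-write step of that kind. It assumes PySpark on EMR; the S3 paths, column names and the derived "time on line" metric are hypothetical illustrations, not Gousto's actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("factory-batch-etl").getOrCreate()

# Read the previous day's CSV drops from the raw layer (hypothetical path)
raw_boxes = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://data-lake/raw/factory/boxes/")
)

# Simple transformations: parse timestamps and derive how long each box
# spent on the production line (illustrative column names)
cooked_boxes = (
    raw_boxes
    .withColumn("line_entry_ts", F.to_timestamp("line_entry_ts"))
    .withColumn("line_exit_ts", F.to_timestamp("line_exit_ts"))
    .withColumn("production_date", F.to_date("line_entry_ts"))
    .withColumn(
        "seconds_on_line",
        F.col("line_exit_ts").cast("long") - F.col("line_entry_ts").cast("long"),
    )
)

# Save the table as Parquet in the cooked layer; Redshift Spectrum then
# exposes it to end users through an external table
(
    cooked_boxes.write
    .mode("overwrite")
    .partitionBy("production_date")
    .parquet("s3://data-lake/cooked/factory/boxes/")
)
```

Run as a nightly batch, a job shaped like this is exactly what ends up taking hours once the volume of daily CSVs grows.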

The EMR job alone was taking about two hours to complete. But since it was processed overnight, fresh data was available every morning for reporting.

Then everything changed.

Files started arriving in 15-minute chunks instead of once a day. This was an old request from operations that was finally being fulfilled by the factory software. Of course, we immediately got a request to…
