This is how we reduced our data latency from two hours to 15 seconds with Spark Streaming.

We used to get files from the software that controls Gousto’s factory once a day via an SFTP server: several CSV files containing atomic data for each box that went through the production line on the previous day, such as the timestamps of when a box enters and exits the line. This data was used by Gousto's operations teams to measure our Supply Chain performance and detect issues on production lines.

We had an ingestion pipeline composed of a Lambda function that moved files from the SFTP server to our data lake in S3, plus a job triggered by Airflow on EMR. The whole pipeline ingested the CSVs, applied some simple transformations, saved the tables as Parquet and exposed the data to users via Redshift Spectrum. …
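To make the "simple transformations" step concrete, here is a minimal, purely illustrative sketch in plain Python. The column names (`box_id`, `line_entry_ts`, `line_exit_ts`) are assumptions, not the real schema, and the production job ran as Spark on EMR writing Parquet rather than plain Python:

```python
import csv
import io
from datetime import datetime

def box_line_durations(csv_text):
    """Compute how long each box spent on the production line.

    Illustrates the kind of simple transformation the EMR job applied.
    Column names (box_id, line_entry_ts, line_exit_ts) are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        entry = datetime.fromisoformat(row["line_entry_ts"])
        exit_ = datetime.fromisoformat(row["line_exit_ts"])
        rows.append({
            "box_id": row["box_id"],
            "line_entry_ts": entry.isoformat(),
            "duration_seconds": (exit_ - entry).total_seconds(),
        })
    return rows

sample = """box_id,line_entry_ts,line_exit_ts
B001,2020-01-01T08:00:00,2020-01-01T08:12:30
B002,2020-01-01T08:01:00,2020-01-01T08:10:00
"""
print(box_line_durations(sample))
```

In the batch world this ran once a day; the move to Spark Streaming meant applying the same logic continuously as files arrived.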

Deploying Airflow on AWS is quite a challenge for those who don’t have DevOps experience, that is, almost everyone who works in data.

I created a repo to deploy Airflow on AWS following software engineering best practices. You can go straight there if you don’t feel like reading this post. But I do describe some things you might find useful here.

Zaanse Schans, Zaandijk, Netherlands | Photo by Wim van ‘t Einde on Unsplash

Run a docker-compose command and voilà, you have Airflow running in your local environment, and you are ready to develop some DAGs. After some time you have your DAGs (and Airflow) prepared for deployment to a production environment. Then you start searching for instructions on how to deploy Airflow on AWS. Here’s what you’ll probably find:

  • No instructions in the Airflow documentation.
  • Some posts, like this one, teach you how to deploy on AWS ECS. It's quite an interesting approach. The problem is that the whole tutorial is based on creating resources by pointing and clicking in the AWS console. Trust me: you don’t want to go that route for deploying to production. Just imagine the nightmare of creating three different environments (dev, staging and production) and having to repeat the process three times. Now imagine updating the environments and keeping them in sync. Picture how easily you could spend a whole week fixing a bug caused by a resource that was deleted by mistake. …
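The dev/staging/production problem above is exactly what infrastructure-as-code solves: you define the stack once and instantiate it per environment, so the environments cannot drift apart. Here is a toy sketch of the idea in plain Python; all resource names and values are made up, and a real deployment would use CloudFormation, Terraform or the AWS CDK:

```python
# One shared template, kept in sync across environments by construction.
BASE_STACK = {
    "webserver_instance_type": "t3.medium",
    "scheduler_instance_type": "t3.medium",
    "min_workers": 1,
}

# Only the differences are declared per environment.
ENV_OVERRIDES = {
    "dev": {},
    "staging": {},
    "production": {"webserver_instance_type": "t3.large", "min_workers": 3},
}

def render_stack(env):
    """Merge the shared template with per-environment overrides."""
    return {**BASE_STACK, **ENV_OVERRIDES[env], "environment": env}

for env in ("dev", "staging", "production"):
    print(env, render_stack(env))
```

Updating all three environments is now a code change plus a redeploy, instead of repeating console clicks three times and hoping nothing was missed.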

Sometimes we want to generate fake events to test our pipelines and dashboards. Random events don’t do the job. That’s why I built this Python package.

You want to implement a new streaming pipeline at your workplace and need to show your managers a proof of concept (POC). This POC should let you demonstrate some of the functionality, in this case generating real-time metrics. However, there is a limitation: you can’t use production data in a POC. How do you solve that?

If your answer was to generate fake events, you are right: that is probably the best solution. …
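The point is that useful fake events aren't purely random: they need a fixed schema, monotonic timestamps and reproducibility so your metrics look plausible. Here is a minimal sketch of that idea; the schema and field names are invented for illustration and are not the actual package's API:

```python
import json
import random
from datetime import datetime, timedelta

def generate_fake_orders(n, start, seed=42):
    """Generate fake order events with a realistic shape.

    Purely illustrative: event types and fields are made up. The seed
    makes the stream reproducible, which matters for testing dashboards.
    """
    rng = random.Random(seed)
    events = []
    ts = start
    for i in range(n):
        # Timestamps advance monotonically, like a real event stream.
        ts += timedelta(seconds=rng.randint(1, 30))
        events.append({
            "event_id": f"evt-{i:05d}",
            "event_type": rng.choice(["order_created", "order_packed"]),
            "timestamp": ts.isoformat(),
            "basket_value": round(rng.uniform(20.0, 60.0), 2),
        })
    return events

events = generate_fake_orders(3, datetime(2020, 1, 1))
print(json.dumps(events, indent=2))
```

Feed events like these into the POC pipeline and the downstream metrics behave as they would with production data, without ever touching it.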


André Sionek

A little bit of each: data engineer and scientist, entrepreneur, physicist, writer and designer.
