We used to receive files once a day, via an SFTP server, from the software that controls Gousto’s factory: several CSV files containing atomic data for each box that went through the production line on the previous day, such as the timestamps of when a box enters and exits the line. This data was used by Gousto’s operations teams to measure our Supply Chain performance and detect issues on the production lines.
We had an ingestion pipeline composed of a Lambda function that moved files from the SFTP server to our Data Lake in S3, plus a job triggered by Airflow on EMR. The whole pipeline ingested the CSVs, applied some simple transformations, saved the tables as Parquet and exposed the data to users through Redshift Spectrum. …
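To make that concrete, the EMR step boiled down to something like the following PySpark sketch. The S3 paths, column names and derived fields here are illustrative assumptions, not the real production schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("factory-boxes-ingestion").getOrCreate()

# Hypothetical S3 locations; the real bucket and prefixes differ.
RAW_PATH = "s3://data-lake/raw/factory/boxes/"
CURATED_PATH = "s3://data-lake/curated/factory/boxes/"

# Read the previous day's CSV drop (copied from the SFTP server by the Lambda).
raw = spark.read.csv(RAW_PATH, header=True, inferSchema=True)

# Simple transformations: parse timestamps and derive a couple of useful columns.
boxes = (
    raw
    .withColumn("line_entry_ts", F.to_timestamp("line_entry_ts"))
    .withColumn("line_exit_ts", F.to_timestamp("line_exit_ts"))
    .withColumn(
        "seconds_on_line",
        F.unix_timestamp("line_exit_ts") - F.unix_timestamp("line_entry_ts"),
    )
    .withColumn("production_date", F.to_date("line_entry_ts"))
)

# Save as Parquet, partitioned by production date, ready for Redshift Spectrum.
boxes.write.mode("overwrite").partitionBy("production_date").parquet(CURATED_PATH)
```

Redshift Spectrum then queries the Parquet output through an external table, so nothing has to be loaded into the Redshift cluster itself.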
I created a repo to deploy Airflow on AWS following software engineering best practices. You can go straight there if you don’t feel like reading this post, but I do describe some things here that you might find useful.
Run a docker-compose command and voilà, you have Airflow running in your local environment, ready for you to develop some DAGs. After some time you have your DAGs (and Airflow) prepared for deployment to a production environment. Then you start searching for instructions on how to deploy Airflow on AWS. Here’s what you’ll probably find:
You want to implement a new streaming pipeline at your workplace and need to show your managers a Proof of Concept. The POC should allow you to demonstrate some of the functionality, in this case generating real-time metrics. However, there is a limitation: you can’t use production data in a POC. How do you solve that?
If your answer was to generate fake events, then you are right: it will probably be the best solution. …
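As a minimal sketch of that idea, a small producer script can push randomly generated events into a Kinesis stream and stand in for the real source. The stream name and event schema below are invented purely for illustration:

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

import boto3

# Hypothetical stream name and event schema, used only for the POC.
STREAM_NAME = "poc-orders-stream"
kinesis = boto3.client("kinesis")


def fake_event() -> dict:
    """Build a random order-like event instead of using production data."""
    return {
        "event_id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "order_value": round(random.uniform(20.0, 80.0), 2),
        "recipe_count": random.randint(2, 4),
    }


if __name__ == "__main__":
    while True:
        event = fake_event()
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=event["event_id"],
        )
        time.sleep(0.5)  # roughly two events per second is enough for a demo
```

Pointing the POC pipeline at a stream fed like this lets you demonstrate the real-time metrics end to end without touching production data.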