TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Data Engineering — Basics of Apache Airflow — Build Your First Pipeline

9 min read · Jun 20, 2019


If you’re in tech, chances are you have data to manage.

Data grows fast, and it gets more complex and harder to manage as your company scales. Management wants to extract insights from the data they have, but they lack the technical skills to do so. So they hire you, the friendly neighbourhood data scientist/engineer/analyst, or whatever title you go by nowadays (it does not matter, you'll be doing all the work anyway), to do exactly that.

You soon realise that in order to provide insights, you need some kind of visualisation to explain your findings and monitor them over time. For this data to stay up to date, you need to extract, transform, and load it into your preferred database from multiple data sources at a fixed interval (hourly, daily, weekly, monthly).
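The extract-transform-load cycle described above can be sketched in a few lines. This is a minimal, hypothetical example (the source data, table name, and threshold are all made up for illustration) that uses an in-memory SQLite database as the "preferred database"; in practice, the extract step would hit an API or a production data store:

```python
import sqlite3

def extract():
    # Hypothetical source: in practice this would be an API call or a
    # query against one of your data sources.
    return [("2019-06-01", 120), ("2019-06-02", 95)]

def transform(rows):
    # Example transformation: keep only days at or above a threshold.
    return [(day, count) for day, count in rows if count >= 100]

def load(rows, conn):
    # Write the transformed rows into the target database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_counts (day TEXT, count INTEGER)"
    )
    conn.executemany("INSERT INTO daily_counts VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*) FROM daily_counts").fetchone()[0])  # → 1
```

A scheduler such as Airflow would run exactly this kind of pipeline on the fixed interval you choose, instead of you triggering it by hand.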

Companies have workflow management systems for tasks like these, which let us create and schedule jobs internally. Everyone has their own preference, but I would say many of the older WMSs tend to be inefficient and hard to scale, lack a good UI and a retry strategy, and produce terrible logs that make troubleshooting and debugging a nightmare. It causes unnecessary…
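To make the "retry strategy" point concrete, here is a small sketch of what a workflow manager does for you automatically: re-run a failing task a fixed number of times before giving up. The function and task names here are hypothetical, written for illustration only (in Airflow you would get this behaviour just by setting `retries` on a task):

```python
import time

def run_with_retries(task, retries=3, delay=0.01):
    """Re-run a failing task up to `retries` times -- the kind of retry
    policy a workflow management system applies for you."""
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise  # out of retries: surface the failure
            time.sleep(delay)  # a real scheduler would back off here

# A deliberately flaky task that fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("source unavailable")
    return "rows"

print(run_with_retries(flaky_extract))  # succeeds on the third attempt
```

Hand-rolling this for every job is exactly the kind of boilerplate a good WMS eliminates, alongside scheduling, logging, and a UI for monitoring runs.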




Written by Nicholas Leong

Data Engineer — Crunching data and writing about it so you don’t get headaches. 1M+ reads on Medium. https://www.linkedin.com/in/nickefy/