In the previous tutorial, we didn’t delve into the concept of Executors in Airflow. In Airflow, Executors define a mechanism by which task instances get run.

Repo link: https://github.com/mrafayaleem/etl-series

When you spun up an Airflow cluster using helm in the previous tutorial, you might have noticed a scheduler and a worker in your docker ps output. These two components are related to how a CeleryExecutor works in Airflow. We are not going to talk in detail about this type of Executor in this blog. For more details, see this link.

The other type that I would want to touch upon…


In this post, I will cover steps to setup production-like Airflow scheduler, worker and a webserver on a local Kubernetes cluster. Later on, I will use the same K8s cluster to schedule ETL tasks using Airflow.

Quickstart a Kubernetes single-node cluster

If you are on Mac, it’s just a matter of a ticking a checkbox and restarting the docker app as follows. This will enable a single-node K8s cluster on your local.

Once you have K8s running, install the K8s dashboard by running the following:

# Install dashboard
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0/aio/deploy/recommended.yaml
# Find list of secrets
kubectl get secrets
# Copy the token value…

This series is meant to cover a broad range of topics that involve setting up a production grade ETL pipeline. I have broken it down into chapters and more of them will be added as we move along this series.

I am a Mac user so everything that is written is in context of software and tools installed on OS X. You should be able to find and install equivalent versions of those for your own setup.

Feel free to leave any feedback on my Twitter handle or through my website.

Chapter 1 - Orchestration basics: Setting up Airflow in a local Kubernetes cluster using helm

Chapter 2 - Introduction to KubernetesExecutor and KubernetesPodOperator


In this post, we will aim at having a deeper understanding of Random Forest by taking OHLC (open-high-low-close) data for Tesla stock for 5 year period. We will keep unrelated discussion at a minimum and try to focus on everything that is related to our understanding of Random Forest for this post. It would be helpful if you are familiar with concepts such as mean squared error and R-Squared statistics but it’s not necessary to understand this post.

We will experiment with few features in our dataset to predict the adjusted closing price of Tesla stock for a particular day…


Imagine you have to rewrite an existing web service to move to a new payment gateway (PSP or payment service provider) due to various business use cases. Your first thought might be to replace old with the new one in its entirety and roll that out. That is a naive approach especially when you are working with payment gateways which have their own SLAs, agreements with acquiring banks, risk and fraud detection softwares, etc. which makes this process riskier in terms of conversions, revenue, customer retention and eventually business. …


Imagine you have to rewrite an existing web service to move to a new payment gateway (PSP or payment service provider) due to various business use cases. Your first thought might be to replace old with the new one in its entirety and roll that out. That is a naive approach especially when you are working with payment gateways which have their own SLAs, agreements with acquiring banks, risk and fraud detection softwares, etc. which makes this process riskier in terms of conversions, revenue, customer retention and eventually business. …


At dubizzle OLX, we recently reached a roadblock on a project where we had to change schema on around 130 tables in our production database with master/slave replication. The simplest possible solution that could have been was to recreate same tables with new schemas, copy the data and update our application to point to the new ones. This couldn’t have been made possible without a significant downtime for the whole website which wasn’t acceptable. So, we leveraged Percona Toolkit for MySQL to do the job and we would like to share our learnings in this post.

What is Percona Toolkit for MySQL?

Percona Toolkit is a…


Imagine you have to rewrite an existing web service to move to a new payment gateway (PSP or payment service provider) due to various business use cases. Your first thought might be to replace old with the new one in its entirety and roll that out. That is a naive approach especially when you are working with payment gateways which have their own SLAs, agreements with acquiring banks, risk and fraud detection softwares, etc. which makes this process riskier in terms of conversions, revenue, customer retention and eventually business. …


31 Mar 2016

v1.0

Note: This post is open to suggestions that can help achieve fairer results with these benchmarks. It is a versioned post and I would be incrementing it if anything related to these benchmarks changes. Changes can be tracked here as well.

In my previous post, I wrote about how we can interface Jython with Kafka 0.8.x and use Java consumer clients directly with Python code. As a followup to that, I got curious about what would be the performance difference between Java, Jython and Python clients. …


19 Mar 2016

Note: A complementing repository for this post can be found at: https://github.com/mrafayaleem/kafka-jython

With the release of Kafka 0.9.0, the consumer API was redesigned to remove the dependency between consumer and Zookeeper. Prior to the 0.9.0 release, Kafka consumer was dependent on Zookeeper for storing its offsets and the complex rebalancing logic was built right into the “high level” consumer. This was causing a couple of issues which are discussed here. Due to the issues involving complex rebalance algorithm within the consumer, some Kafka clients had no support for “coordinated consumption”. Essentially, it means that without “coordinated consumption”…

Rafay Aleem

Data Engineer at PointClickCare. Based in Toronto. Music aficionado who likes playing guitar and is an Eric Clapton fan.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store