Data Pipelines with Python (draft)

Let’s build an end-to-end data pipeline with:

  • Python
  • AWS
  • Celery
  • Dask
  • Luigi and Airflow
  • Hadoop
  • Apache Spark
  • Django
  • Property-based testing with Hypothesis (see the sketch just after this list)

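Since property-based testing may be less familiar than the other tools on this list, here is a minimal sketch using the Hypothesis library. The round-trip property and the test name are illustrative assumptions, not code from the repo:

```python
import json

from hypothesis import given
from hypothesis import strategies as st


# Property: serializing a record to JSON and parsing it back
# should return the original record unchanged.
@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(record):
    assert json.loads(json.dumps(record)) == record
```

Run it with pytest; Hypothesis generates many random records and shrinks any failing case to a minimal counterexample.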
I’ll also point to some useful resources, like the 12-factor app methodology and Docker containers. This will be a relatively intermediate post (beginner level for data engineers). The GitHub repo with all the code can be found here:

“My First 5 Minutes on a Server” gives a pretty good start on getting the basics of security right when setting up a new server.


Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
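To make that concrete, here is a minimal Celery sketch. The Redis broker URL and the `add` task are assumptions for illustration, not part of the pipeline code:

```python
# tasks.py
from celery import Celery

# Assumes a Redis broker running locally; swap in RabbitMQ or SQS as needed.
app = Celery("pipeline", broker="redis://localhost:6379/0")


@app.task
def add(x, y):
    # A trivial task; in a real pipeline this would be an ETL step.
    return x + y
```

Start a worker with `celery -A tasks worker --loglevel=info`, then enqueue work from any process with `add.delay(2, 3)`; the call returns immediately while the worker executes the task asynchronously.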

Parallel Programming with Python (O’Reilly)

Resources

A few supplemental links and useful resources: