Ultimate Data Science Workflow - With Docker
Containerising Your Workflow
“Simplicity is the ultimate sophistication” — Leonardo da Vinci
This article aims to compile all the necessary data science technologies into a single stack and deploy them together, in unison, with the help of Docker.
Table of Contents :
- Stage 1 : Intro to Docker and Workflow
- Stage 2 : Technologies used
- Stage 3 : Docker Compose Code
What do I mean by ‘WorkFlow’?
A workflow refers to all the processes and sub-processes that are executed while working on, say, a big project. In Data Science, this workflow is the combination of Data Cleaning, EDA, Model Development and Deployment.
Why Docker?
Simply put, Docker is containerisation software that helps package software into standardised units for development, shipment and deployment.
Developing nowadays requires so much more than writing code. Multiple languages, frameworks and architectures, and the interfacing between them, become complex very quickly. Docker simplifies this workflow, while giving developers the freedom to use their choice of tools, application stacks and deployment environments for each project.
Technologies used in this Data Science Stack :
All of the technologies listed below, along with some supporting tech, have been selected because they are open-source and free.
The What, Why and When.
Jupyter
What :
Jupyter provides developers with Notebooks: open-source web applications that let them create and share documents containing live code, equations and narrative text.
Why :
The Jupyter Project is backed by millions of developers, is the go-to software for almost all data scientists, and is the industry norm.
When :
All Data Science projects start by importing, examining and manipulating data. Jupyter helps here by providing an interface for data cleaning, transformation and visualisation.
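To give a first taste of the containerised setup we build towards later, here is a minimal docker-compose sketch for a Jupyter service. The image, port and volume path are assumptions based on the official Jupyter Docker Stacks, not a prescription; swap them for whatever suits your project.

```yaml
# docker-compose.yml (excerpt): a minimal Jupyter service sketch
services:
  jupyter:
    image: jupyter/datascience-notebook   # official Jupyter Docker Stacks image
    ports:
      - "8888:8888"                       # Jupyter's default port
    volumes:
      - ./notebooks:/home/jovyan/work     # persist notebooks on the host
```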
PostgreSQL
What :
PostgreSQL is an open-source object-relational database system that uses the SQL language to store and scale the most complicated data workloads.
Why :
PostgreSQL comes with many features that help developers build applications, protect data integrity and create fault-tolerant environments. It is also free and open source.
When :
After manipulating, cleaning and visualising the dataset, all of this work needs to be stored somewhere. That is where Postgres comes in: with its fluent Python integration and a web interface in pgAdmin, it removes the complexity of storing and managing datasets.
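As a hedged sketch, a Postgres service plus the pgAdmin web interface could be wired up in docker-compose like this. The credentials and port mappings below are placeholders of my choosing, so change them before use.

```yaml
# docker-compose.yml (excerpt): Postgres with the pgAdmin web interface
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: admin                 # placeholder credentials: change these
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: datasets
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data   # keep data across container restarts
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com
      PGADMIN_DEFAULT_PASSWORD: secret
    ports:
      - "5050:80"                          # pgAdmin's web UI on localhost:5050
volumes:
  pg_data:
```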
Apache Airflow
What :
Airflow is a platform created to help developers schedule, monitor and manage workflows.
Why :
Airflow has a modular architecture, its pipelines are configured as Python code, and it is extensible, i.e. it allows the creation of custom operators and executors.
When :
Teams tend to work on sub-tasks individually, which causes a rise in the need for automation and monitoring. This is where Airflow shines, giving teams the management tools to keep their workflows running smoothly.
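For a taste of how Airflow could slot into the stack, here is a single-service sketch using Airflow's standalone development mode. This is an assumption for local experimentation only; the official production setup is considerably more involved, with a separate scheduler, webserver and metadata database.

```yaml
# docker-compose.yml (excerpt): Airflow in dev-only standalone mode
services:
  airflow:
    image: apache/airflow:2.7.0
    command: standalone            # runs webserver + scheduler in one container
    ports:
      - "8080:8080"                # Airflow's web UI
    volumes:
      - ./dags:/opt/airflow/dags   # your pipelines, written as Python DAGs
```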
MinIO
What :
MinIO is a high-performance, software-defined object storage suite for machine learning and analytics. It focuses on the web-scale model while keeping users’ needs in mind.
Why :
MinIO is open source. This means that MinIO’s customers are free from lock-in, free to inspect, free to innovate, free to modify and free to redistribute. MinIO is also fluently compatible with Amazon S3, which makes it an excellent cloud-native solution.
When :
When building machine learning models it is necessary to store, re-develop and improve them from time to time. This is where MinIO comes in: it gives developers a platform and a minimalistic interface to save and experiment with their models.
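A MinIO service is refreshingly small to define. The sketch below uses the official minio/minio image; the root credentials are placeholders that you should override.

```yaml
# docker-compose.yml (excerpt): MinIO object storage with its web console
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin       # placeholder credentials: change these
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"                     # S3-compatible API
      - "9001:9001"                     # web console
    volumes:
      - minio_data:/data
volumes:
  minio_data:
```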
Apache Superset
What :
Apache Superset is a modern, enterprise-ready business intelligence web application that gives teams the necessary tools to present their insights to customers.
Why :
Superset provides users with a rich variety of visualisation tools, dashboards and enterprise-ready authentication. It is open-source, free, and integrates with SQL RDBMSs with the help of SQLAlchemy.
When :
After the data has been cleaned and the models have been developed, the insights need to be presented to clients. With the help of Superset’s visualisations and semantic layer, users can breeze through these needs.
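Superset can also be sketched as a compose service, with a caveat: the snippet below is an assumption based on the official apache/superset image and its quickstart, and a first run still needs a one-off initialisation (superset db upgrade, superset fab create-admin, superset init) as described in the Superset docs.

```yaml
# docker-compose.yml (excerpt): a bare-bones Superset service
services:
  superset:
    image: apache/superset
    environment:
      SUPERSET_SECRET_KEY: change_me   # required by recent Superset images
    ports:
      - "8088:8088"                    # Superset's default web port
```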
Docker to the rescue!
While all these tools are awesome on their own, we need a way to make them work together on a single platform.
This is where Docker comes in: by creating containers for each of these tools and letting them communicate with each other seamlessly, we can create our Ultimate Data Science Workflow.
Here is a glimpse of how all of this is made possible :
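The full compose file lives in the repository linked below; as a condensed sketch of its shape (image tags, credentials and options are illustrative assumptions, with each service covered in more detail above):

```yaml
# docker-compose.yml: a condensed sketch of the whole stack
version: "3.8"
services:
  jupyter:
    image: jupyter/datascience-notebook
    ports: ["8888:8888"]
  postgres:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD: secret          # placeholder: change before use
  airflow:
    image: apache/airflow:2.7.0
    command: standalone                  # dev-only mode; see the Airflow section
    ports: ["8080:8080"]
    depends_on: [postgres]
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports: ["9000:9000", "9001:9001"]
  superset:
    image: apache/superset
    environment:
      SUPERSET_SECRET_KEY: change_me
    ports: ["8088:8088"]
    depends_on: [postgres]
```

Because Compose places every service on a shared default network, the containers can reach each other by service name: notebooks can write to postgres:5432, Airflow tasks can push models to minio:9000, and Superset can read from Postgres for its dashboards.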
To get started with your own workflow container, head over to my GitHub for detailed code and instructions.