Custom Data Science Workspace using Docker
As a beginner in Data Engineering, I wanted to experiment with Spark and Hadoop. Since both of them run without much hassle on Unix-based OSes (Linux, macOS), it seemed like a good idea to install Ubuntu on my laptop so I could dual boot between Linux and Windows.
But that's a lot of work. Restarting your laptop to do some programming and restarting it again for everything else is not very efficient. (I need Windows because I do some editing in my free time, and those programs only run on Windows.)
I wanted something that would let me use all the good parts of Linux while retaining the flexibility of Windows for my other tasks.
Answer: Docker Containers
Using a Docker container to run Hadoop and Spark seemed pretty neat, since I had read about the Hortonworks Sandbox (now owned by Cloudera), which is available as a Docker image and lets you use almost all the big data tools out of the box with no additional setup.
The Catch: Size
Everything felt okay until I started to pull the Sandbox Docker image. I was wondering why it was taking so long to download, and then I noticed that it is ~22 GB in size!
22 GB. That's really huge, especially for a guy who just wants to do some work with Spark and Hadoop. So I dropped the idea of using the Hortonworks Sandbox Docker image.
Final Solution: DIY Sandbox
Since I am an engineer, I naturally like to put things together and make them work toward a bigger purpose. I fired up Docker and pulled the latest Ubuntu image as the base for my CDW (Custom Data Science Workspace).
Then I did some research on compatible versions of Hadoop, Spark, and the JDK, and zeroed in on Hadoop 3.2.0 and Spark 3.0.0 with JDK 8. These versions are known to work together, so there is very little chance of running into compatibility issues later.
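For anyone curious about what goes into such an image, the build boils down to roughly the following steps. This is only an illustrative sketch, not my exact Dockerfile; the mirror URLs and the /home install location are examples.
# install JDK 8 on the Ubuntu base image
apt-get update && apt-get install -y openjdk-8-jdk wget
# download Hadoop 3.2.0 and Spark 3.0.0 (built for Hadoop 3.2) and unpack them under /home
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
tar -xzf hadoop-3.2.0.tar.gz -C /home
tar -xzf spark-3.0.0-bin-hadoop3.2.tgz -C /home
# then point JAVA_HOME, HADOOP_HOME and SPARK_HOME at the right directories and add their bin folders to PATH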
Two hours later, my Docker image with Spark and Hadoop was ready to use!
I made sure that everything and every command works out of the box, and that the end user does not have to configure anything.
Instructions for using my Docker image:
Make sure you have at least 8 GB of RAM on your system and that Docker Desktop is set to use 4 GB of it. My sandbox is barely 2 GB in size, so don't worry about storage and download issues ;-)
Launch Docker and run the following commands:
docker pull chandanshastri/cs_ds_v1.21
docker run --hostname csds --name cs_ds -it --entrypoint /bin/bash chandanshastri/cs_ds_v1.21:latest
This will create a container from my image, and that's it. There are no more steps involved!
To start working inside the container, just run these commands:
docker start cs_ds
docker exec -it cs_ds /bin/bash
You will get a bash shell inside the container. Change your directory to /home and run 'ls'; you will see the Spark and Hadoop directories.
You can run sh csds.sh to start Hadoop DFS and YARN. It will also fire up the Thrift Server on port 10000, so you can access Apache Hive using Beeline.
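For example, once the script has finished, something like the following should confirm that the daemons are up and that the Thrift Server is reachable. (The Beeline user name here is just a guess; use whatever the prompt asks for, with the password mentioned below.)
# list the running Hadoop/Spark Java daemons
jps
# browse the root of HDFS
hdfs dfs -ls /
# connect Beeline to the Thrift Server on port 10000
beeline -u jdbc:hive2://localhost:10000 -n root -p 321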
To use Spark, you can use either spark-shell (Scala) or PySpark, which I have configured to launch inside a Jupyter Notebook for ease of use and matplotlib visualizations. If you want to get serious, you can also write Spark programs and run them with the spark-submit command.
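As a quick smoke test for spark-submit, you can run the SparkPi example that ships with Spark (assuming SPARK_HOME points at the Spark directory under /home; otherwise, use the full path to the jar):
# run the bundled SparkPi example
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 100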
As a bonus, I have configured Apache Airflow, which, again, runs out of the box. MySQL, Postgres, the Apache web server, PHP, and TensorFlow (yes!) are also ready to use with no additional configuration.
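If you want to poke at Airflow, the usual commands should be all you need. (These are the Airflow 1.x commands; if the image ships Airflow 2.x, the first one is airflow db init instead.)
# initialise the Airflow metadata database
airflow initdb
# start the web UI and the scheduler in separate terminals (or background them with &)
airflow webserver -p 8080
airflow scheduler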
If anything asks for a password, just enter '321'.
Everything just works.
And it tastes best with Visual Studio Code’s remote development tools.
Note: You can also run my custom Hadoop and Spark distributions from /home natively on Windows. Just copy them from the container to a local path. You will need to set the environment variables and edit the *-site.xml files so that the Hadoop NameNode and DataNode directories follow Windows-style paths (see the sketch at the end of this note). I have added a custom cmd file in Spark's bin directory for launching the Thrift Server on Windows, since the default script was running into issues there.
But you will miss out on many features of the Hadoop ecosystem that are currently supported only on Linux.
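For reference, pulling the distributions out of the container is just a docker cp away. The directory names and Windows paths below are placeholders; use whatever ls /home shows inside the container and whichever local folder you prefer.
# copy the Hadoop and Spark directories from the container to the Windows host (paths are illustrative)
docker cp cs_ds:/home/hadoop-3.2.0 C:\bigdata\hadoop-3.2.0
docker cp cs_ds:/home/spark-3.0.0-bin-hadoop3.2 C:\bigdata\spark-3.0.0-bin-hadoop3.2
# then set HADOOP_HOME, SPARK_HOME and JAVA_HOME to these locations and add their bin folders to PATH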