A data exploration portable environment for the data curious

Gonçalo Valente
Published in Marionete
5 min read · Mar 3, 2021

The data available for COVID-19 inspired me to create a quick and easy solution to crunch some numbers.

The story behind the developments presented here begins with my suspicion over statements regarding COVID-19 data. Interpretations presented in the news and on social media, like “schools don’t contribute to virus propagation”, really raised my eyebrows and left me wanting to quickly crunch some data and figure out whether there was any truth to those kinds of statements. Because I am far from being a data scientist, I just wanted a quick and easy setup, and improvised exactly that.

Additionally, when discussing this article with a colleague, I learned that something like what I present below is actually in demand: apparently, I solved this guy’s dream without even knowing 😊.

Proposed solution

In a past project working as a DevOps engineer, I had to set up virtual machines (VMs) for data scientists to use for what is, to other people, dark magic. They were using Jupyter notebooks, which should not come as a surprise to anyone with even the slightest background in these topics. I am now putting this setup to personal use with some modifications, leveraging two things that a DevOps engineer loves: containers and a bit of automation. It allows me to quickly spin up an environment for crunching some data with just a tiny bit of Python knowledge.

In addition to DevOps work, I sometimes write some code, mostly for PoCs and that sort of thing, and have consequently become a growing fan of containerised development environments. The proposed solution therefore leverages a Docker container running Jupyter Lab, which is

a web-based interactive development environment for Jupyter notebooks, code, and data.

There’s also some automation in place, using shell scripts, to ease the burden of saving your work following version control standards. Despite these being really simple commands and operations to someone used to working with git or other version control systems, they can really make things easier for those who are not (in the References section, you can find a getting-started-with-git tutorial).

The setup that I will present has been developed on an Apple MacBook Pro running macOS Big Sur (Version 11.1), but should be compatible with any Unix-based operating system.

As for the tooling that you are required to install on your machine, this setup relies only on git and Docker.
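If you want to confirm that both are ready to go, a quick sanity check from a terminal looks like this (these are standard commands, nothing specific to this setup):

git --version     # prints the installed git version
docker --version  # prints the Docker client version
docker info       # errors out if the Docker daemon is not running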

Setup and Getting started

The first step is to clone the repository that contains the setup and work for your data exploration environment. In my case, I am cloning my data-curious GitHub repository. This tutorial is presented as if I were using it; if you want to replicate this environment, you will need to fork my repository or replicate its structure, also copying the automation scripts:

git clone https://github.com/gnvalente92/data-curious.git

That’s it: you have everything you need to start crunching some data!
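For context, here is roughly the repository structure this tutorial assumes; the paths are taken from the steps below, but your fork may of course differ:

data-curious/
├── scripts/
│   ├── start.sh            # spins up the Jupyter Lab container
│   └── save-progress.sh    # commits and pushes your notebooks
├── work/                   # your notebooks live here, mounted into the container
│   └── covid19-pt.ipynb    # sample notebook referenced below
└── .gitignore              # ignores Jupyter's internal files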

Usage

The following bullet list, with its code snippets and screenshots, pretty much summarises how to use this little data science environment.

  • To start the Jupyter Lab instance mentioned in the previous sections, simply run the following command (from the root of your repository):
./scripts/start.sh <[OPTIONAL]BRANCH_NAME>
  • Depending on whether a branch with the selected name already exists in the remote repository, the script will either create it or check it out in your current working directory (a sketch of this logic is shown after this list). The branch name is selected as follows:

If you don’t pass in the optional argument, the name will be in the form of feature/work-<CURRENT-DATE>;

If you pass the optional argument, the branch name will be what you provide, assuming it’s a valid branch name.

  • It will also spin up a Docker container running Jupyter Lab and mount your repository’s work/ directory as Jupyter’s working directory, which means your notebooks will be shown in the navigation bar and will be saved to git when you finish your data exploration.
  • After running the script, the console output will show the address of the Jupyter Lab instance; just click that address to access the user interface (UI).
  • Using Jupyter notebooks is quite intuitive, so I am not going to go into detail; I am no expert, and you can find everything in other articles and tutorials (see the References section for the official documentation). Here’s a screenshot of a sample notebook I created: it imports, displays and plots a few lines of COVID data using pandas (a data science Python library). This notebook can be found in the data-curious repository under work/covid19-pt.ipynb:
  • The .gitignore file in this repository will automatically ignore the internal files generated by the Jupyter notebook and only store your actual work.
  • When you are finished with the data exploration and have all your desired notebooks saved, just hit File/Shut Down in the notebook’s UI to shut it down.
  • The final step is to upload all your work to the git repository which, making use of the automated scripts, is just a matter of running the following command:
./scripts/save-progress.sh <[OPTIONAL]"COMMIT-MESSAGE">
  • Running this will store your work in your git repository, either using a custom commit message, if you provide the optional parameter in the command above, or the default Automated commit message (sketches of both scripts follow this list).
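To make the behaviour described above concrete, here is a minimal sketch of what a script like start.sh can do. This is an illustration under assumptions, not the exact contents of the repository’s script; in particular, the jupyter/base-notebook image is just an example from the official Jupyter Docker stacks:

#!/usr/bin/env bash
set -e

# Use the branch name passed as the first argument,
# or default to feature/work-<CURRENT-DATE>
BRANCH="${1:-feature/work-$(date +%Y-%m-%d)}"

# If the branch already exists on the remote, check it out;
# otherwise create it locally
if git ls-remote --exit-code --heads origin "$BRANCH" > /dev/null 2>&1; then
  git fetch origin "$BRANCH"
  git checkout "$BRANCH"
else
  git checkout -b "$BRANCH"
fi

# Spin up Jupyter Lab in Docker, mounting the repository's work/
# directory as the notebook working directory
docker run --rm -p 8888:8888 \
  -v "$(pwd)/work:/home/jovyan/work" \
  jupyter/base-notebook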
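Along the same lines, a minimal sketch of what save-progress.sh can look like, assuming the default commit message mentioned above:

#!/usr/bin/env bash
set -e

# Use the commit message passed as the first argument, or a default one
MESSAGE="${1:-Automated commit message}"

# Stage the notebooks, commit them and push the current branch
git add work/
git commit -m "$MESSAGE"
git push --set-upstream origin "$(git rev-parse --abbrev-ref HEAD)"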

Next steps

If you think there’s something fundamentally wrong here, or if you have ideas on how to improve this little data exploration environment, feel free to contact me, or adapt it and raise a pull request (PR) in the data-curious repository; you'll have my full attention.

Additionally, you can just tweak the behaviour of the shell script on your own, but I’m keen on getting feedback, so get in touch!

References


Aerospace engineer turned risk advisory consultant turned Big Data engineer turned DataOps engineer. Confused? :) https://www.linkedin.com/in/goncalovalente