Simple Python Environments For Data Science 🐍

Python environments have long been one of the most confusing and annoying parts of the Python workflow for data scientists. The lack of a good explanation for environments was recently brought to my attention by this tweet:

While I’m no Jake VanderPlas, I am someone who has always struggled with Python environments, and I definitely see the need for a clear and concise explanation of them. Here you will find a simple guide explaining what virtual environments are and how to get them to work.

Goals

  1. Explain wtf a virtual environment is ❓
  2. Explain how to create and use virtual environments 😆
  3. Get a jupyter notebook up and running inside a virtual environment 😎

WTF is a virtual environment

Python virtual environments allow you to wade through the shitshow that is installing Python packages. There are two very common Python package managers that most people in data science use: pip and conda.

pip: a Python tool for installing packages from the Python Package Index (PyPI)

conda: the package manager for the Anaconda Python distribution, which has quickly become the de facto standard for data science due to its user-friendliness (disclaimer: I use Anaconda as my Python distribution)

It can be simple to install packages with these tools (sometimes), but we have all gotten some awful and confusing pip error. There is also the problem of two packages depending on different versions of the same package, or of packages that don’t even work with the version of Python you are using.

This is where virtual environments come in.

Virtual environments allow you to create separate Python installations, each with the packages specific to the project you are working on. When you create a new virtual environment, you specify the version of Python and the versions of the packages you need, which prevents those awful import errors we all hate.
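
To make that less abstract, here is a minimal sketch that builds a throwaway environment with Python’s standard-library venv module. Neither conda nor Pipenv is involved, and the name demo_env is made up for illustration. The only point is that an environment is a directory with its own python executable and its own place to put packages.

```python
import tempfile
import venv
from pathlib import Path


def create_demo_environment(target: Path) -> Path:
    """Create a bare virtual environment at `target` and return its config file."""
    # with_pip=False keeps the sketch fast and offline; a real environment
    # would normally include pip so packages can be installed into it.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo_env"  # hypothetical environment name
    config = create_demo_environment(env_dir)
    env_exists = config.is_file()
    # Top-level layout: a directory for the interpreter, a directory for
    # installed packages, and the pyvenv.cfg marker file.
    top_level = sorted(p.name for p in env_dir.iterdir())
    print(top_level)
```

The tools covered below do the same kind of thing with a friendlier interface, and conda additionally manages the Python version itself.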

Creating and Using Virtual Environments

So now that we know about virtual environments and feel they are marginally important, let’s look at two tools to create and manage them.

Anaconda

Anaconda has a very simple structure for creating environments.

$ conda create --name <environment_name> python=<version_of_python>

This command will create a new environment inside the Anaconda directory, at ~/anaconda/envs/[environment name]/. When creating an Anaconda environment you can include package names as arguments:

$ conda create --name test_env python=3 numpy pandas scikit-learn

This will create a new environment called test_env with the packages numpy, pandas, and scikit-learn already installed. To use the shiny new environment that you just created, you simply run:

mac/linux: $ source activate [environment name]

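windows: $ activate [environment name]

On conda 4.4 and later, conda activate [environment name] works on every platform and is the recommended form.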

Once you are inside the conda environment you can conda install any packages you need. If a package isn’t available through conda channels, you can also pip install it like you normally would.

To view the installed packages in your current environment you can run conda list which will print out the packages.
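
If you would rather check from inside Python (in a script or a notebook cell, say), a rough equivalent is sketched below using the standard library’s importlib.metadata. The helper name installed_packages is my own, and the output depends entirely on whichever environment runs it.

```python
from importlib import metadata


def installed_packages() -> dict:
    """Map each installed distribution name to its version string."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with broken or missing metadata
            packages[name] = version
    return packages


packages = installed_packages()
# Print the first few entries, sorted case-insensitively by name.
for name in sorted(packages, key=str.lower)[:10]:
    print(f"{name}=={packages[name]}")
```

This only sees Python packages visible to the running interpreter, so non-Python packages that conda list reports will not show up here.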

How do I know what environment I’m in

To view all of your environments, run the command conda info -e, which returns a list of your environments with a star next to the current one. Activating an environment also changes your prompt so that the environment’s name appears at the beginning.

Here I am activating my conda environment `dl` and running conda info
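
The same question can be answered from inside Python, which is handy when a notebook is already running. This is a small sketch: describe_environment is a name I made up, and it relies on the CONDA_DEFAULT_ENV and VIRTUAL_ENV environment variables, which are set by conda activation and by virtualenv-style activation respectively. Either one is simply None when that kind of environment is not active.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect the clues that identify the active Python environment."""
    return {
        "executable": sys.executable,  # path of the running interpreter
        "prefix": sys.prefix,          # root directory of the environment
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "virtual_env": os.environ.get("VIRTUAL_ENV"),
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the executable path does not point inside the environment you expected, the environment probably was not activated before Python was launched.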

Pipenv

If you have animosity towards authoritarian open source, or just want to work with straight PyPI packages, there is a new option in pure Python. Pipenv is the new cool kid on the block for managing Python virtual environments, courtesy of Kenneth Reitz. Unlike Anaconda, which is geared toward scientific computing, Pipenv is built with Python development in mind, specifically networking. This means that it doesn’t come with any built-in features, but it works incredibly well, is fairly simple to use, and has a really pretty CLI. Pipenv is built to be a replacement for pip and virtualenv that gives you a simpler workflow for creating environments in Python.

Creating an environment is very simple:

$ pipenv install [package names]

That’s it! This creates a virtual environment and installs the packages that were specified. One thing to note here: unlike creating an environment in Anaconda, there is no name to set. The environment just takes the name of the directory it is created in (Pipenv appends a short hash to it).

To activate the environment, simply run pipenv shell inside the project directory and voilà, you are in your shiny new Pipenv. To install new packages into your environment, simply run pipenv install [package name] from the project directory.

Pipenv will store package info in a file called a Pipfile that will look something like this:

$ pipenv install numpy pandas scikit-learn
$ cat Pipfile
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[packages]
numpy = "*"
pandas = "*"
scikit-learn = "*"

[dev-packages]
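
Since a Pipfile is written in TOML, it can also be read programmatically. The sketch below assumes Python 3.11 or newer, where tomllib ships with the standard library (on older versions it falls back to the tomli backport, which would have to be installed separately). It parses the same content shown above from a string.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # assumption: the tomli backport is installed

# Same content as the Pipfile shown above.
PIPFILE_TEXT = '''
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[packages]
numpy = "*"
pandas = "*"
scikit-learn = "*"

[dev-packages]
'''

pipfile = tomllib.loads(PIPFILE_TEXT)
requested = sorted(pipfile["packages"])
print(requested)                     # ['numpy', 'pandas', 'scikit-learn']
print(pipfile["source"][0]["name"])  # pypi
```

The "*" next to each package means that no particular version was requested.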

You can also view package information by running pip list from inside an active environment.

Installing and Running Jupyter

Now that we have a few options to create and manage virtual environments we will demonstrate creating and using the environments to get a jupyter notebook installed and ready for data science action.

Jupyter with Anaconda

By running three commands we can get a jupyter notebook running in an environment that is ready for action.

# create the environment
$ conda create -n jupyter_env python=3 jupyter
# activate the environment (mac/linux version)
$ source activate jupyter_env
# launch a notebook server in our env
$ jupyter notebook

That is all! Super simple, super concise, and it ‘just works’, which will save you time for all the super fun data munging you need to do before you can run models or make pretty data pictures. Now any time you want to work in a Jupyter notebook you simply activate the environment and launch it in your project of choice.

Jupyter with Pipenv

Similar to anaconda, we can create a Pipenv fairly simply.

# create a project directory
$ mkdir jupyter_project
# change into the project directory
$ cd jupyter_project
# create your pipenv
$ pipenv install jupyter
# activate the environment
$ pipenv shell
# launch a notebook server in our env
$ jupyter notebook

That is just two more steps than with Anaconda, and it’s all we need to get up and running with Pipenv. The nice thing about Pipenv is that we can install it using pip.

Note - The biggest drawback of Pipenv is that the CLI doesn’t allow global environment access, which means you can’t just run $ pipenv shell from anywhere; it only works from inside the project folder. However, all the environments live in the directory ~/.local/share/virtualenvs/, so to activate one from another directory all you need to do is run:

# list the virtual environments
$ ls ~/.local/share/virtualenvs/
# activate the environment from afar
$ source ~/.local/share/virtualenvs/[environment_name]/bin/activate
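
The lookup step can be scripted too. Below is a minimal sketch that scans a directory for folders containing a bin/activate script. The function name list_pipenv_environments and the folder names in the demo are hypothetical, and the demo runs against a temporary directory so it doesn’t depend on what is installed on your machine.

```python
import tempfile
from pathlib import Path


def list_pipenv_environments(root: Path) -> list:
    """Return names of folders under `root` that contain bin/activate."""
    if not root.is_dir():
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if (child / "bin" / "activate").is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Fake one environment (hypothetical name) and one unrelated folder.
    (root / "jupyter_project-a1B2c3" / "bin").mkdir(parents=True)
    (root / "jupyter_project-a1B2c3" / "bin" / "activate").touch()
    (root / "stray_folder").mkdir()
    found = list_pipenv_environments(root)
    print(found)  # ['jupyter_project-a1B2c3']
```

Pointing it at Path.home() / ".local" / "share" / "virtualenvs" would list the real environments, assuming Pipenv uses that location on your system.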

Future posts 🔮

Now that we’ve covered the basics of setting up simple environments for data science, there are some other important topics to cover. One of them is sharing your environments with others so that they can reproduce your code and interact with your research. Another important tool to cover is the actual jupyter notebook itself. There are many underutilized features of Jupyter notebooks that make machine learning projects much simpler.