How to set up virtual environments for model development: a beginner-friendly guide

Anyigor Tobias
6 min read · Jul 21, 2023



Motivation:

I struggled the first time I tried to deploy and interact with my model through Streamlit. I kept encountering errors such as “ModuleNotFoundError” and scikit-learn’s “InconsistentVersionWarning.” The latter occurred because I attempted to unpickle a model with a different scikit-learn version from the one it was pickled with. As a data scientist simply trying to perform tasks similar to those of machine learning engineers, I found those bugs a significant challenge. This article aims to help you avoid such issues, and other related challenges, from day one.

Guidelines on how to set up environments for model development.

Here are the assumptions I made while writing this article:
1. I assume you already have knowledge about Anaconda, Jupyter Notebook, VS Code, and Python, and that you have successfully installed them on your machine.
2. I assume you are familiar with and can access and use the command-line interface (CLI). If you are not familiar with it, don’t worry; you will pick up a few lessons from this article to help you get started.

Step One: Create a folder for your project.

Creating a dedicated folder for your project is indeed crucial. Your project folder or directory should encompass all the relevant files, including datasets, pickled models, Jupyter Notebooks, and Python scripts for deployment. It is not advisable to have project-related files scattered across your system. While you can create such directories using your file manager or other methods, I recommend using the command-line interface (CLI).

Assuming you are using a Windows laptop with Anaconda installed, follow these steps to create a new folder named “medium_blog” on your desktop directory in OneDrive:

  1. Open the Anaconda prompt (shell) or any other terminal of your choice, such as the PowerShell CLI.
  2. To navigate to the desktop directory in OneDrive, use the “cd” command. Type the following and press Enter:
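A typical path, assuming OneDrive’s default Desktop location (adjust it if your setup differs):

```shell
cd C:\Users\<YourUsername>\OneDrive\Desktop
```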

Replace <YourUsername> with your actual Windows username.

Now, use the “mkdir” command to create a new folder for your project. In this case, we’ll name the folder “medium_blog”. Type the following and press Enter.
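Assuming you are now in the Desktop directory:

```shell
mkdir medium_blog
```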

By following these steps, you will have successfully created a new folder named “medium_blog” on your desktop directory in OneDrive. This folder is an ideal place to store all the files related to your data science project, ensuring a neat and organized workspace.

Step Two: Create a virtual environment for your project.

There are two popular ways to create a project environment: you can make use of pipenv, a Python library for creating virtual environments, or conda.

I will attach a link to an article on how to create virtual environments with conda here.

How to create a virtual environment using pipenv

  1. Navigate to your project directory through the CLI.

Using the CLI, navigate to the directory you created earlier.
Note that your project directory does not have to be inside the desktop directory; any convenient location works.

2. Install pipenv using pip
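This is the standard way to install pipenv with pip; depending on your setup you may need `pip3` or `python -m pip` instead:

```shell
pip install pipenv
```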

3. Specify the Python version you would like the environment to use (do this before activating, so the environment is created with the right interpreter):

pipenv --python x.x
#here x.x is the Python version, for example: pipenv --python 3.7

4. Activate the new environment with this command

pipenv shell

How to install required packages or dependencies using pipenv

After creating your virtual environment, the next step is to install the required packages, such as pandas, NumPy, Plotly, and seaborn. There are two approaches to installing these packages: install all the needed packages with one command, or install them one after another. Here’s how to go about it:

pipenv install pandas numpy seaborn plotly

You can also choose to specify the exact version of the packages you wish to install.

pipenv install pandas==0.20.3

After the package installation, a file called “Pipfile” will be created in your project directory. To generate another file known as “Pipfile.lock,” simply enter the following command in your command-line interface (CLI):

pipenv lock

Running this command will create the “Pipfile.lock” file, which contains a snapshot of the exact versions of all the packages installed in your virtual environment. This file ensures that anyone else who works on your project using pipenv will get the same package versions, preventing potential compatibility issues. Keeping the “Pipfile.lock” under version control (e.g., Git) is recommended to ensure consistency across different environments and collaborators.

Following these steps will ensure that your project is reproducible.

How to create a Python kernel for your data science project.

We are still in the project directory in the CLI. Here is how you can create a kernel for your model development.

  1. Install the ipykernel package. This package allows you to create a kernel for your pipenv environment. Installing it with pipenv (rather than plain pip) also records it in your Pipfile.
pipenv install ipykernel

2. Create the Jupyter kernel using the code below

python -m ipykernel install --user --name=my-pipenv-kernel
#replace my-pipenv-kernel with the name you prefer. For example, "medium_codes"

3. Open your Jupyter Notebook.
For Anaconda users, launch Jupyter and navigate to the project directory. Then you will create a new Jupyter Notebook in the directory.
4. Click on the Kernel option at the top of the notebook

5. Click on the Change kernel option. This will display all available kernels. Select the newly created kernel.

Now you can start importing your packages into your notebook.
You can also install other packages directly from the notebook or through your CLI.
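For example, from a notebook cell you can run `%pip install <package>`, or from the CLI inside your project directory (scikit-learn here is just an illustrative package name):

```shell
pipenv install scikit-learn
```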

How to make use of your kernel during deployment.

When you have completed the model training process, the next step is to pickle your model. However, the focus of this article is not to provide a detailed guide on how to pickle the model, as I assume you are already familiar with this process.

Moving forward, let’s switch to using Visual Studio Code (VS Code) for the model deployment. Personally, I find VS Code to be a preferred option for this purpose due to its capabilities and ease of use.

  1. Open VS Code and navigate to your project folder.
  2. Open your Jupyter Notebook.
  3. Click on the Select Kernel option and choose the kernel associated with your virtual environment for this project.

This is necessary if you want to make certain changes and, most likely, pickle a new model. The kernel will make available the same Python interpreter used during model training.

4. Create Python script(s) for model deployment.

This could be a single script or several. You can use Streamlit or Gradio, or build a web app from scratch with HTML and CSS.
It all depends on your project. Since I assume you are a data scientist trying to do a little machine-learning work, I recommend Streamlit. Simple and easy.

5. Go to your CLI and navigate to your project folder.

6. Activate your virtual environment

pipenv shell

7. Run your scripts from the shell.
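For example, if your deployment script is a Streamlit app saved as app.py (a hypothetical file name), you would run:

```shell
streamlit run app.py
```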

The advantages

It is strongly advised to carry out data science projects in a virtual environment. Doing so helps you avoid problems during model deployment. Moreover, when you upload your project to GitHub (Git and GitHub are all about version control), anyone you share it with can easily clone your repository (similar to downloading all its contents) and run your project with exactly the same package versions you built it with. This is what it means to build a reproducible project.

Coming soon

How to create your virtual environment using conda

Version control for data scientists: a beginner-friendly guide


Anyigor Tobias

Machine Learning Engineer | AI Advocate | Technical Writer