How to structure a machine learning project

Banjoko Judah · Published in Analytics Vidhya · Nov 30, 2021 · 5 min read

How I structure mine


Introduction

Having a structured directory to put your files into is very useful and can speed up your machine learning project. While there is no one way of structuring a machine learning project (or any project, really), I will share how I usually organize mine.

Overview

You can check the complete structure below, after which I will go through what each directory holds.

├── data/
├── notebooks/
├── saves/
│   ├── data/
│   ├── models/
│   └── samples/
├── tests/
└── utils/

/data

Here, I store the raw, unprocessed, original copies of any datasets in my project, whether I downloaded them from Kaggle, built them with a web scraper, or put them together manually.

The files in this folder are not modified once I put them here; you can think of it as a data lake. This way, I always have the original copies of my data and can reproduce or redo any step as often as I like.

/notebooks

While notebooks are not ideal for production, they are great for running experiments, doing analyses, giving presentations, and so on. It is nice to keep them in a separate folder so they are easier to track, especially when you are trying several things at once.

/saves

You are very likely to want to save some things in your machine learning project so that you do not have to start from scratch every time. Processed data (at various stages) and your models are at the top of this list. I usually create separate sub-folders (/data, /models) for storing them at the start of the project.

There are also times when I am working on a generative model, like a model that generates music. For this, I use a /samples sub-folder to save the music samples that my model generates.
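As a small sketch, all three sub-folders can be created in one go at the start of a project (assuming the snippet runs from the project root):

import os

# Create the save sub-folders; exist_ok makes re-running safe
for sub in ("data", "models", "samples"):
    os.makedirs(os.path.join("saves", sub), exist_ok=True)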

/tests

Did I hear someone grumble?

I bet you did not see this coming, but tests are great to have, even in machine learning. The truth is we do perform tests, but they are usually done manually, scattered all over the place, and only when we are not getting the desired results. You can make this far more organized by keeping them together in one place and being intentional about it.
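To make this concrete, here is a minimal pytest-style sketch; normalize and utils/process.py are hypothetical placeholders for whatever processing code your project has (the function itself is sketched in the /utils section below):

# tests/test_process.py
# A hypothetical sketch; assumes pytest runs from the project
# root so that the utils package is importable.
from utils.process import normalize

def test_normalize_output_range():
    result = normalize([0.0, 5.0, 10.0])
    assert min(result) >= 0.0
    assert max(result) <= 1.0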

/utils

Now to the final directory. Here I keep the Python scripts that are important to my project. Some call it /scripts, some /src; I call it /utils, and you can call it whatever you like.

Code that ends up here usually starts out scattered across multiple notebooks with countless repetitions. But as the project progresses, I want to reduce that repetition and manage my code better.

Eventually, I move it out of my notebooks, turn it into functions, and group it into files in this folder according to what it does. The code for processing data and for building, training, and evaluating models commonly ends up here.
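As a small, hypothetical illustration, a snippet that starts out copy-pasted between notebooks might end up as a reusable function like this (the same placeholder normalize that the test sketch above imports):

# utils/process.py
# A hypothetical sketch of notebook code turned into a function.

def normalize(values):
    """Scale a list of numbers into the [0, 1] range."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]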

Notebook Specials

Before ending, I want to share a few short notebook tips that you may find interesting.

Solving relative import error

With the current project structure, if you tried importing a module from the /utils directory into a notebook in the /notebooks directory, you would get an ImportError. This is because relative imports only work inside a package, and a notebook is not run as part of one.

# Generates an import error
from ..utils import models

The fix I usually use is to create a variable, BASE_DIR, that stores the project directory and add it to sys.path. This makes Python look through my project directory for whatever I want to import each time I run an import statement, so I can access any script in my project starting from the project directory.

import os
import sys

# For a notebook that sits one level deep in the project
BASE_DIR = os.path.abspath("../")
if BASE_DIR not in sys.path:
    sys.path = [BASE_DIR] + sys.path

# This works now
from utils import models

You can choose to use environment variables instead of a Python variable if that makes more sense.

Dealing with moving notebooks

Another thing I always do is write paths relative to BASE_DIR. That way, each time I move a notebook around (or duplicate one, which happens a lot!), all I need to change is the BASE_DIR variable, and everything works as expected.

torch.save(clean_data, BASE_DIR + "/saves/data/clean_data.pt")

Working with an updated script

Organizing your code into scripts (which is a great practice) adds some inconvenience. You want any update to your Python scripts to reflect instantly in your notebooks, but that does not happen; calling import on a module that has already been imported does nothing.

For this, the reload function from the importlib library comes in handy, so you do not have to restart your notebook each time a script is updated.

from importlib import reload
from utils import models
# Reloads a module that has been imported before.
reload(models)
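On a related note, Jupyter's built-in autoreload extension can take care of this automatically, re-importing updated modules before each cell runs:

# Run once, near the top of a notebook
%load_ext autoreload
%autoreload 2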

Working on Colab

Finally, if you are like me and love to keep a copy of your project on Colab, you might find this interesting. Mounting (connecting) your Google Drive to a notebook is necessary to access its files and storage. I usually put the code for doing this in the first code cell of my notebooks.

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

We also have to point BASE_DIR at the exact path on our Google Drive, since the file system is laid out a little differently on Colab. Thanks to our earlier work (making every path relative to BASE_DIR), we can then work just as we would locally.

# It may look like this
BASE_DIR = "/content/drive/MyDrive/MyCurrentProject/"

So, when working locally, I comment out the mounting part and change BASE_DIR. Then on Colab, I uncomment it and use the appropriate BASE_DIR. Awesome!
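If the commenting back and forth gets tedious, one alternative (my own sketch, not part of the workflow above) is to detect Colab at runtime and set BASE_DIR in a single cell that works in both places:

import os

try:
    # This import only succeeds on Colab
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    BASE_DIR = "/content/drive/MyDrive/MyCurrentProject/"
except ImportError:
    # Running locally: fall back to the relative path
    BASE_DIR = os.path.abspath("../")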

Conclusion

That is it for this article. You might find some of the directory names weird, and of course, you can always change them to whatever makes sense to you. The main point is to understand what each one holds and to use a name that says so.

If you found this article helpful, please remember to share, clap, and follow me on Medium.

Thanks for reading!
