Scalable Project Structure for Machine Learning Projects with PyTorch and PyTorch Lightning

Lefteris Charteros
6 min read · Jun 14, 2023


When embarking on a machine learning project, especially one that entails multiple components, it’s easy to get caught up in the excitement of model and architecture development and results prediction. However, one crucial aspect that often does not receive the attention it deserves is the structure of the project.

Just give me the code

If you’re someone who prefers to dive right into the code, I’ve got you covered! You can find the entire project template in the linked repository. Feel free to clone, fork, or download the repo and start your project. If you have any questions or if there’s something you’d like to discuss, just drop a comment — I’m always eager to help! Happy coding!

Why Project Structure Matters

A well-organized project structure is akin to a well-organized workspace. Just as a tidy desk can boost productivity and foster a better working environment, a clean and understandable project structure can significantly enhance the development process and the overall quality of the project.

Here’s why a good project structure is important:

  1. Enhanced Understandability: A project structured logically and intuitively is easier to understand, not only for others but also for your future self. Every directory and file in the structure has a clear purpose and contributes meaningfully to the overall project.
  2. Improved Maintainability: With a good project structure, modifying one part of the system has minimal impact on the others. This isolation reduces the risk of unintentional side effects, making the project easier to maintain and extend.
  3. Efficient Collaboration: In a team setting, a well-structured project is vital. It provides a clear roadmap for each team member, making it easier for them to understand the entirety of the project and contribute efficiently without the risk of interfering with each other’s work.
  4. Scalability: A well-planned structure can accommodate growth. As your project grows in complexity or size, a good structure will allow you to manage this growth effectively, ensuring that the increased complexity doesn’t turn into a liability.

Project Structure

Having discussed the importance of a good project structure, let’s now explore one way to set it up. Keep in mind that although this structure is generally a good starting point for any machine learning project, it is best suited to PyTorch and PyTorch Lightning and follows patterns that help separate concerns and keep the code manageable with these libraries.

The structure looks as follows:

.
├── .data
│   ├── processed
│   │   ├── test.csv
│   │   └── train.csv
│   └── raw
│       └── data.csv
├── .experiments
│   └── model1
│       ├── version_0
│       │   └── ...
│       └── ...
└── src
    └── ml
        ├── data
        │   ├── make_dataset.py
        │   └── preprocessing.py
        ├── datasets
        │   ├── dataset1
        │   │   ├── datamodule.py
        │   │   └── dataset.py
        │   └── ...
        ├── engines
        │   └── system.py
        ├── models
        │   ├── model1.py
        │   └── model2.py
        ├── scripts
        │   ├── predict.py
        │   ├── test.py
        │   └── train.py
        └── utils
            ├── constants.py
            └── helpers.py

Now, let’s walk through the directory structure:

.data: This directory is dedicated to storing all the data used in the project. It's further divided into two subdirectories:

  • raw: Contains the raw, untouched data exactly as it was collected or downloaded. Keeping the raw data separate is useful whenever you need to revert to the original data.
  • processed: Contains the data that has been processed and is ready to be used by the machine learning models. Usually this is the raw data split into train and test sets after preprocessing such as cleaning or feature engineering.

.experiments: This directory stores the results of different model training runs. Each model may have several versions, each corresponding to a unique training run with potentially different hyperparameters or data. Although PyTorch Lightning names this directory lightning_logs by default, renaming it makes its purpose clearer to anyone looking at the project for the first time.
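One way to get this layout is to point a Lightning logger at the custom directory; the sketch below uses TensorBoardLogger, and the experiment name "model1" is just an illustrative choice:

# Minimal sketch: write run logs to .experiments/model1 instead of the default lightning_logs
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir=".experiments", name="model1")
# Hand the logger to the Trainer, e.g. pl.Trainer(logger=logger, ...)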

src: This is the root directory for all the source code related to the project. Note that it also contains a subdirectory ml to further separate machine learning related code. This is especially beneficial when the machine learning module needs to integrate with another module (for instance, a backend), as it maintains a clean separation between different project modules, allowing team members to work on different parts simultaneously without interference.

ml/data: This directory holds scripts that handle data processing. It may include files like make_dataset.py (a script to download, filter, preprocess, and partition the raw data into training and test splits), and preprocessing.py (a script containing functions for data cleaning and preparation for modeling).
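As a rough illustration of what make_dataset.py might contain (the paths, cleaning steps, and split ratio below are assumptions, not part of the template):

# ml/data/make_dataset.py -- a hedged sketch: read the raw CSV, clean it, split it, save the splits
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

RAW_PATH = Path(".data/raw/data.csv")
PROCESSED_DIR = Path(".data/processed")


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder cleaning; replace with the project's real preprocessing logic
    return df.dropna().drop_duplicates()


def make_dataset(test_size: float = 0.2, seed: int = 42) -> None:
    df = preprocess(pd.read_csv(RAW_PATH))
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    train_df.to_csv(PROCESSED_DIR / "train.csv", index=False)
    test_df.to_csv(PROCESSED_DIR / "test.csv", index=False)


if __name__ == "__main__":
    make_dataset()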

ml/datasets: This directory contains scripts that define how to load and handle the data used by the models. It might also contain subdirectories such as dataset1 for additional separation in case of multiple datasets. Each directory should contain at least two files:

  • dataset.py: Contains a PyTorch Dataset, which allows for efficient and flexible data loading.
  • datamodule.py: Contains PyTorch Lightning’s LightningDataModule, which organizes the data loading and preparation steps and offers a clear, standardized interface for the data used in PyTorch Lightning systems. A minimal sketch of both files follows this list.
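Here is a minimal sketch of both files; the "target" column, the 90/10 validation split, the batch size, and the class names are assumptions for illustration:

import pandas as pd
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, Dataset, random_split


# ml/datasets/dataset1/dataset.py
class Dataset1(Dataset):
    """Loads one processed CSV split; assumes numeric features and a 'target' column."""

    def __init__(self, csv_path):
        df = pd.read_csv(csv_path)
        self.features = torch.tensor(df.drop(columns=["target"]).values, dtype=torch.float32)
        self.targets = torch.tensor(df["target"].values, dtype=torch.float32)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]


# ml/datasets/dataset1/datamodule.py
class Dataset1DataModule(pl.LightningDataModule):
    """Wraps the train/test CSVs from .data/processed into DataLoaders."""

    def __init__(self, data_dir=".data/processed", batch_size=32):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def setup(self, stage=None):
        full = Dataset1(f"{self.data_dir}/train.csv")
        n_val = int(0.1 * len(full))  # hold out 10% of the training split for validation
        self.train_set, self.val_set = random_split(full, [len(full) - n_val, n_val])
        self.test_set = Dataset1(f"{self.data_dir}/test.csv")

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size)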

ml/engines: This directory contains everything related to model training, validation, and testing, and may also hold related components such as optimizers and learning-rate schedulers. For instance, system.py should include a LightningModule that defines the training, validation, and testing steps.
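A bare-bones system.py could look like the following; the regression loss, the Adam optimizer, and the learning rate are placeholder choices:

# ml/engines/system.py -- a minimal LightningModule sketch wrapping any model architecture
import pytorch_lightning as pl
import torch
from torch import nn


class System(pl.LightningModule):
    def __init__(self, model: nn.Module, lr: float = 1e-3):
        super().__init__()
        self.model = model
        self.lr = lr
        self.criterion = nn.MSELoss()  # assumed regression task

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.criterion(self(x).squeeze(-1), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", self.criterion(self(x).squeeze(-1), y))

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", self.criterion(self(x).squeeze(-1), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)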

ml/models: This directory contains the scripts that define the different architectures of the models used in the project.
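For instance, model1.py might hold a simple feed-forward network; the layer sizes below are purely illustrative:

# ml/models/model1.py -- an illustrative architecture, not part of the original template
from torch import nn


class Model1(nn.Module):
    def __init__(self, in_features: int, hidden_dim: int = 64, out_features: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_features),
        )

    def forward(self, x):
        return self.net(x)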

ml/scripts: This directory contains scripts for running different parts of the project, like train.py for training a model, test.py for testing, and predict.py for using a trained model to make predictions. These scripts typically use PyTorch Lightning’s Trainer class, which connects the LightningModule with the LightningDataModule, runs specified callbacks such as EarlyStopping and ModelCheckpoint, and generally automates the entire machine learning loop.
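Putting the pieces together, train.py could look roughly like this; the imports assume the hypothetical class names from the sketches above, and the hyperparameters are placeholders:

# ml/scripts/train.py -- a hedged sketch of a training entry point
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

from ml.datasets.dataset1.datamodule import Dataset1DataModule
from ml.engines.system import System
from ml.models.model1 import Model1


def main():
    datamodule = Dataset1DataModule(batch_size=32)
    system = System(model=Model1(in_features=10))  # in_features assumed to match the data

    logger = TensorBoardLogger(save_dir=".experiments", name="model1")
    callbacks = [
        EarlyStopping(monitor="val_loss", patience=5),
        ModelCheckpoint(monitor="val_loss", save_top_k=1),
    ]

    trainer = pl.Trainer(max_epochs=50, logger=logger, callbacks=callbacks)
    trainer.fit(system, datamodule=datamodule)


if __name__ == "__main__":
    main()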

ml/utils: This directory contains helper functions, constants, and anything else that is used throughout the project. The general rule of thumb is that if an element is used across the project and doesn’t fit into the above directories, it should be placed in this directory.
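For example, constants.py might centralize paths and seeds so the rest of the code never hard-codes them; the names and values below are assumptions:

# ml/utils/constants.py -- illustrative project-wide constants
from pathlib import Path

DATA_DIR = Path(".data")
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
EXPERIMENTS_DIR = Path(".experiments")
SEED = 42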

This project structure offers several benefits. First of all, the clear separation of different project components into distinct directories facilitates scalability. As the project expands, additional datasets, models, and experiments can be seamlessly incorporated without affecting the existing structure. Moreover, this structure provides a clear semantic separation between various machine learning concepts (models, datasets, engine, etc.), creating a decoupled system that simplifies maintenance and updates. Lastly, the isolation of each model training run within the experiments directory streamlines the tracking of different model versions, thereby enhancing the efficiency of model management in production environments.

Conclusion

A well-structured project setup is not just a nice-to-have but a crucial aspect of any machine learning project. It not only streamlines workflows but also makes it easier for others to understand your work, promoting collaboration and efficiency. While the structure outlined above is particularly suited to projects using PyTorch and PyTorch Lightning, it offers a solid foundation that can be adapted to fit any project’s needs. It provides a clear separation of concerns and promotes a modular approach to project development.

Remember, a clean and well-organized project is easier to maintain, understand, and scale. Use this guide as your starting point, but keep in mind that there’s no one-size-fits-all solution. Don’t be afraid to tweak it to meet the unique needs of your project.

You can find this project template on my GitHub at this repository. If you have any questions or suggestions, I’d love to hear them! Please don’t hesitate to leave a comment — it’s always great to have a chat and discuss these topics further.
