Complete project structure for an end-to-end Machine Learning Pipeline project

The complete project structure followed in industry, which you should use in your next machine learning project to make it more readable and aligned with industry standards.

Krishan Walia
ILLUMINATION
6 min read · Dec 31, 2023


Maintaining a good project structure is as important as building the project itself. Gone are the days when simply shipping a project was the ultimate goal; as tech companies move towards agile product development, writing maintainable and scalable code has become essential.

Image by Author

Big tech companies have converged on a set of conventions they aim to follow in their codebases, a consensus backed by long hours of work from researchers, system managers, and product managers.

A codebase that makes good use of PEP 8 (the Python style guide), object-oriented programming concepts, and documentation is far more maintainable and scalable. Alongside these concepts, a well-established project structure is often overlooked, even though it matters just as much.

Project structure is something you should definitely keep an eye on. If you are working on a larger project, adopting a well-defined, industry-aligned structure is especially important: apart from making the codebase more acceptable to experts, it lets you manage future updates efficiently.

In this article, you will learn about the project structure to follow when starting a machine learning pipeline project, along with the purpose of each folder and file in it.

Project Structure

This section is going to state the basic structure that should be followed in a machine learning pipeline project.

The structure should be:

Image by Author

The image above shows the basic structure to follow when creating a machine learning pipeline project. Each of the folders is explained in the sections below.
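In plain text, the layout described in this article looks roughly like this (projectName is a placeholder for your package name):

```text
├── artifacts/
│   ├── data_ingestion/
│   ├── data_validation/
│   ├── data_transformation/
│   ├── model_trainer/
│   └── model_evaluation/
├── config/
│   └── config.yaml
├── research/
├── src/projectName/
│   ├── components/
│   ├── config/
│   ├── constants/
│   ├── entity/
│   ├── logging/
│   ├── pipeline/
│   └── utils/
├── app.py
├── main.py
├── params.yaml
├── requirements.txt
├── setup.py
└── template.py
```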

📂artifacts

As the name suggests, this folder contains all the data files and models that are referenced in the project. The idea behind the artifacts folder is to organize and store every element the project produces or consumes, such as the model, dataset, validation_status, model_metrics, etc., as the project requires.

📁artifacts/data_ingestion

This folder houses all the datasets that have been ingested from APIs or scraped with a web scraper.

📘artifacts/data_transformation

This folder houses the transformed data. It sometimes also contains the preprocessor in binary format, which is used to transform the data.

🕶️artifacts/data_validation

This is a reference folder containing the validation_status file, which states whether the data has been correctly ingested into the data_ingestion folder.

👟artifacts/model_trainer

Here, the model trained on the dataset is saved in .hdf5 or .pkl format. This allows the model to be retrieved when predictions are needed, instead of being recreated from scratch.

🧾artifacts/model_evaluation

This folder generally contains a metrics.csv for the model, which records the accuracy and other information about the trained model.

🗒️config/config.yaml

This file contains configuration such as the paths where different objects are to be saved, for example the data after ingestion and transformation. This way developers can avoid hardcoding paths in each file, and objects can be referenced consistently.
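A minimal sketch of what such a config.yaml might contain; every key and path here is illustrative, not a fixed standard:

```yaml
artifacts_root: artifacts

data_ingestion:
  root_dir: artifacts/data_ingestion
  source_URL: https://example.com/data.zip   # hypothetical source
  local_data_file: artifacts/data_ingestion/data.zip

data_transformation:
  root_dir: artifacts/data_transformation
  preprocessor_path: artifacts/data_transformation/preprocessor.pkl
```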

🧪research

This is generally the experimentation folder. It contains all the Jupyter notebooks, which come in handy when testing code and experimenting with the implementation of the program.

🗃️src/projectName/components

This folder plays a crucial role: the files in it hold the actual processing logic. A component is created for each stage of the pipeline, such as data_ingestion, data_transformation, model_trainer, etc., and the logic for each stage is written in the files that reside here.

📦src/projectName/config

This folder contains the configuration.py file, which holds the logic for accessing the content written in config.yaml.
For each stage of the pipeline, the corresponding configuration-fetching logic is defined in an umbrella class inside configuration.py, which resides in this folder.
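A minimal sketch of such an umbrella class. Two assumptions keep it self-contained: the parsed YAML is inlined as a plain dict (in a real project it would come from yaml.safe_load on config.yaml), and DataIngestionConfig is a hypothetical entity type of the kind defined in the entity folder.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path
    source_URL: str


class ConfigurationManager:
    """Umbrella class: one get_*_config method per pipeline stage."""

    def __init__(self, config: dict):
        # in a real project: config = yaml.safe_load(open(CONFIG_FILE_PATH))
        self.config = config

    def get_data_ingestion_config(self) -> DataIngestionConfig:
        section = self.config["data_ingestion"]
        return DataIngestionConfig(
            root_dir=Path(section["root_dir"]),
            source_URL=section["source_URL"],
        )


# usage: hand the parsed config to the manager, get a typed config back
raw = {"data_ingestion": {"root_dir": "artifacts/data_ingestion",
                          "source_URL": "https://example.com/data.zip"}}
cfg = ConfigurationManager(raw).get_data_ingestion_config()
```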

🚧src/projectName/constants

Generally, this folder contains the paths of files that are not going to change, such as params.yaml, config.yaml, etc.
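The constants package's __init__.py typically just pins these paths; the names below follow a common convention, not a fixed standard:

```python
from pathlib import Path

# paths that never change during a run; imported wherever they are needed
CONFIG_FILE_PATH = Path("config/config.yaml")
PARAMS_FILE_PATH = Path("params.yaml")
```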

📑src/projectName/entity

Different stages have different configurations in config.yaml, so different datatypes are needed to access the configs of the respective stages. All of these datatypes are defined in the __init__.py file of the entity folder.
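A sketch of one such entity, here a hypothetical ModelTrainerConfig; frozen dataclasses are a common choice because the config should not change once loaded:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelTrainerConfig:
    """Typed, read-only view over the model_trainer section of config.yaml."""
    root_dir: Path
    model_path: Path
    epochs: int


# usage: constructed by the ConfigurationManager from the parsed YAML
cfg = ModelTrainerConfig(
    root_dir=Path("artifacts/model_trainer"),
    model_path=Path("artifacts/model_trainer/model.pkl"),
    epochs=10,
)
```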

📖src/projectName/logging

Since a pipeline project involves referring to different files and objects across different stages, it becomes very hard to pinpoint where an error originates in the codebase.
To help us detect, understand, and rectify the cause of an error, we generally make use of logging. The logic for creating and saving logs resides in this folder.
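A minimal version of such a logging setup, writing every message both to a log file and to the console; the directory and format string are illustrative choices:

```python
import logging
import os
import sys

LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s %(module)s: %(message)s",
    handlers=[
        logging.FileHandler(os.path.join(LOG_DIR, "running_logs.log")),
        logging.StreamHandler(sys.stdout),
    ],
)

logger = logging.getLogger("projectNameLogger")
logger.info("Logging initialised")
```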

➡️src/projectName/pipeline

This folder defines one file per stage of the machine learning pipeline. The files in this folder call the respective components and configurations present across the project to do the processing.
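A sketch of one stage file; the DataIngestion component is a stand-in defined inline so the example is self-contained, whereas in the real project it would be imported from src/projectName/components:

```python
# stand-in component; in the real project this lives in src/projectName/components
class DataIngestion:
    def __init__(self, root_dir: str):
        self.root_dir = root_dir

    def download_data(self) -> str:
        # real logic would fetch the dataset from an API or URL
        return f"data downloaded to {self.root_dir}"


class DataIngestionTrainingPipeline:
    """Stage file: wires the configuration to the matching component."""

    def main(self) -> str:
        root_dir = "artifacts/data_ingestion"  # normally read via ConfigurationManager
        ingestion = DataIngestion(root_dir=root_dir)
        return ingestion.download_data()


result = DataIngestionTrainingPipeline().main()
```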

🪴src/projectName/utils

This is the most common folder, used in almost all code structures. It houses files containing helper functions and methods that can be used at various stages of the project, such as load_object (to load the contents of a file) and save_object (to save an object at a certain location in the project).
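One way such helpers are commonly written, using pickle for the binary round trip; the artifact path below is illustrative:

```python
import pickle
from pathlib import Path


def save_object(file_path: str, obj) -> None:
    """Serialise any picklable object to file_path, creating parent folders."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(file_path: str):
    """Load a previously saved object back into memory."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# usage: round-trip a stand-in "model" through artifacts/model_trainer
save_object("artifacts/model_trainer/model.pkl", {"weights": [0.1, 0.2]})
model = load_object("artifacts/model_trainer/model.pkl")
```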

📲app.py

app.py contains the APIs for calling the different functions of the pipeline project. It acts as the backend of the web app if the project is to be deployed in the future.

🚨main.py

This file controls the various pipelines in the pipeline folder: it runs all the stages in order and coordinates their outputs.
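A sketch of that control flow. The two stage classes are stubs defined inline so the example runs on its own; in the real project they would be imported from src/projectName/pipeline:

```python
import logging

logger = logging.getLogger(__name__)


# stand-in stage classes; the real ones live in src/projectName/pipeline
class DataIngestionTrainingPipeline:
    def main(self):
        return "ingested"


class ModelTrainerTrainingPipeline:
    def main(self):
        return "trained"


STAGES = [
    ("Data Ingestion", DataIngestionTrainingPipeline),
    ("Model Trainer", ModelTrainerTrainingPipeline),
]

results = []
for stage_name, stage_cls in STAGES:
    try:
        logger.info(">>> stage %s started <<<", stage_name)
        results.append(stage_cls().main())
        logger.info(">>> stage %s completed <<<", stage_name)
    except Exception:
        logger.exception("stage %s failed", stage_name)
        raise
```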

⚓params.yaml

While working on deep learning or machine learning projects, we come across various hyperparameters that can significantly impact the accuracy of our model. Instead of being hardcoded, these hyperparameters are kept in this file and referenced when the model is called.
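A small example of what this file might hold; the parameter names and values are illustrative and depend entirely on your model:

```yaml
# illustrative hyperparameters; tune per project
EPOCHS: 20
BATCH_SIZE: 32
LEARNING_RATE: 0.001
IMAGE_SIZE: [224, 224, 3]
```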

🚥requirements.txt

This text file lists all the packages necessary to run the project. In the pipeline project, requirements.txt typically also ends with an -e . line, which installs the project itself as an editable local package via setup.py; that package is what gets used when deploying the model.
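A typical requirements.txt for such a project might look like this; the listed packages are illustrative, the final line is the one that builds the project's own package:

```text
pandas
numpy
scikit-learn
PyYAML
# installs the src/ package itself via setup.py, in editable mode
-e .
```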

🚀setup.py

This file contains the packaging information for our project. It keeps all the metadata relating to the project, such as the version, the GitHub repository, the author's contact details, etc.

📔template.py

This is generally the first file created: running it creates all the folders and files at their desired locations. It contains standard code and logic that is reused across projects for project structuring.
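The core of a template.py usually looks like this; the list below is a hypothetical subset of the full layout, which you would extend with every path your project needs:

```python
from pathlib import Path

PROJECT_NAME = "projectName"  # placeholder package name

# hypothetical subset of the full layout; extend the list as needed
FILES = [
    f"src/{PROJECT_NAME}/components/__init__.py",
    f"src/{PROJECT_NAME}/config/configuration.py",
    f"src/{PROJECT_NAME}/constants/__init__.py",
    f"src/{PROJECT_NAME}/entity/__init__.py",
    f"src/{PROJECT_NAME}/pipeline/__init__.py",
    f"src/{PROJECT_NAME}/utils/__init__.py",
    "config/config.yaml",
    "params.yaml",
]

for filepath in FILES:
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    if not path.exists():
        path.touch()                                # create an empty file
```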

🎯Conclusion

This article gave insights into the importance of a good project structure in a machine learning pipeline project, explaining the various folders and files along with their use and significance.

🙏Today’s Quote

He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.
- Leonardo da Vinci

🤝Connect with the Author

Hey, join my newsletter for more such amazing tips and tricks about data science and software development in general.

THANK YOU!!

