Creating a machine learning operations folder for pipeline development and CI/CD

Adebayo Abdulganiyu Keji
4 min read · Jun 14, 2024

There is a high overlap of job responsibilities between the data engineer, the data scientist, and the machine learning engineer. The data engineer, as the name implies, ensures that the dataset for training ML models is available in a well-prescribed format. The data scientist, on the other hand, ensures this dataset is used to successfully train a machine learning model, and the machine learning engineer ensures that the value generated by the data engineer and data scientist is translated into the world of systems engineering and software products. There are many components involved in productionizing machine learning models, including data ingestion, data validation, data preparation, model training, model validation, model deployment, and model maintenance and monitoring.

Machine learning operations components

These are all the components needed to develop and train machine learning models for production. Imagine a situation where the model created needs to be used in the real world for business. For this, full-fledged automation is required, which aids reproducibility, model versioning, model registration, model deployment, and model monitoring and maintenance. A couple of tools are therefore required to integrate all of these components into pipelines, so the machine learning workflow can be automated easily.

Pipelines

Pipelining the development of the machine learning components is essential for producing the machine learning model. In this case, we have two pipelines: a training pipeline and an evaluation pipeline.

There are other files necessary for turning our machine learning project into a software product. These include tox.ini, pyproject.toml, setup.py, setup.cfg, requirements.txt, requirements_dev.txt, unit.py, integration.py, .gitignore, and a .github/workflows directory containing ci.yaml. All of the above-mentioned files need to be created in our workspace together with our pipelines and machine learning components. The image below shows the structure of the folder in the workspace.

Folders in workspace

The code snippet below (saved as template.py) will generate the folder structure seen above.

import os
from pathlib import Path

# Files (and, implicitly, their parent folders) that make up the project skeleton
list_of_files = [
    ".github/workflows/.gitkeep",
    "src/components/__init__.py",
    "src/components/data_ingestion.py",
    "src/components/data_transformation.py",
    "src/components/model_trainer.py",
    "src/components/model_evaluation.py",
    "src/pipeline/__init__.py",
    "src/pipeline/training_pipeline.py",
    "src/pipeline/prediction_pipeline.py",
    "src/utils/__init__.py",
    "src/utils/utils.py",
    "src/logger/logging.py",
    "src/exception/exception.py",
    "tests/unit/__init__.py",
    "tests/unit/unit.py",
    "tests/integration/integration.py",
    "pyproject.toml",
    "setup.py",
    "setup.cfg",
    "init_setup.sh",
    "requirements.txt",
    "requirements_dev.txt",
    "experiment/experiment.ipynb",
    "tox.ini",
]

for filepath in list_of_files:
    filepath = Path(filepath)
    file_dir, filename = os.path.split(filepath)
    # Create the parent directory if it does not already exist
    if file_dir != "":
        os.makedirs(file_dir, exist_ok=True)
    # Create the file only if it is missing or empty, so reruns are safe
    if (not os.path.exists(filepath)) or (os.path.getsize(filepath) == 0):
        with open(filepath, "w"):
            pass

Moving forward!!!

So, grab your seat and popcorn; it's about to get lit up.

The template.py file was used to generate the folders and files present in the workspace, as seen in the code above.

The setup.cfg file is an extension of the setup.py file; it holds static packaging metadata and configuration in a declarative format.

The tox.ini file is used for creating local virtual environments, installing the necessary libraries, and running programs inside them. In this case, it was used to run the unit tests and integration tests.
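As a rough illustration (the article does not show the actual file), a minimal tox.ini along these lines could drive the two test suites, assuming pytest is the runner and the paths from the template above:

[tox]
envlist = py38

[testenv]
deps =
    -r requirements_dev.txt
commands =
    pytest tests/unit/unit.py
    pytest tests/integration/integration.py

Note that the test files are passed to pytest explicitly, since names like unit.py do not match pytest's default test_*.py discovery pattern.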

The pyproject.toml file is also an extension of setup.py; it can be used in place of setup.cfg.

The .gitignore file is used to exclude files that are not intended to be pushed to GitHub. This is done by listing the file or directory name in the .gitignore file.

The requirements.txt file lists the libraries to be used in the machine learning operations project.

The requirements_dev.txt file lists the libraries used only in the development environment, such as testing and linting tools; by convention, it can also pull in the runtime dependencies with the line -r requirements.txt.

The unit.py file contains the test snippets for each unit of the project.

The integration.py file contains the test snippets for the integration of all the units of the project.
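As a minimal, hypothetical sketch of what a unit test in tests/unit/unit.py might look like (the add function here is a stand-in, not from the article), assuming pytest as the runner:

# tests/unit/unit.py
def add(a, b):
    # Stand-in for a helper that would normally be imported from src
    return a + b

def test_add():
    # A unit test checks one small piece of behaviour in isolation
    assert add(2, 3) == 5

An integration test in tests/integration/integration.py would follow the same pattern but exercise several components together, for example running data ingestion followed by data transformation on a small sample.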

The utils.py file contains the helper functions of the machine learning project.
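A hedged sketch of the kind of helper that often lives in src/utils/utils.py; save_object and load_object are illustrative names, not taken from the article:

import os
import pickle

def save_object(file_path, obj):
    # Persist any Python object (e.g. a fitted model or preprocessor) to disk
    dir_name = os.path.dirname(file_path)
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)

def load_object(file_path):
    # Load a previously saved object back into memory
    with open(file_path, "rb") as f:
        return pickle.load(f)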

The logging.py file contains the snippet for logging errors and code executions in the project.
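A minimal sketch of what src/logger/logging.py could contain, using only the standard library; the logs/ directory and message format are assumptions, not specified in the article:

import logging
import os
from datetime import datetime

# One timestamped log file per run, stored under logs/
LOG_FILE = f"{datetime.now().strftime('%m_%d_%Y_%H_%M_%S')}.log"
LOG_DIR = os.path.join(os.getcwd(), "logs")
os.makedirs(LOG_DIR, exist_ok=True)

logging.basicConfig(
    filename=os.path.join(LOG_DIR, LOG_FILE),
    format="[%(asctime)s] %(lineno)d %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)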

The exception.py file contains the custom exception snippets for the project.
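A hedged sketch of a custom exception for src/exception/exception.py; the class name and message format are illustrative:

import sys

class CustomException(Exception):
    # Wraps an exception with the file name and line number where it occurred
    def __init__(self, error):
        super().__init__(str(error))
        _, _, tb = sys.exc_info()
        if tb is not None:
            self.message = (
                f"Error in {tb.tb_frame.f_code.co_filename} "
                f"at line {tb.tb_lineno}: {error}"
            )
        else:
            self.message = str(error)

    def __str__(self):
        return self.message

# Usage: re-raise a caught exception with location information attached
# try:
#     1 / 0
# except Exception as e:
#     raise CustomException(e)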

init_setup.sh is a bash file that unifies all the instructions needed to set up the environment, rather than typing the instructions one by one. Remember, we are dealing with automation; hence, this comes in handy.
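A minimal sketch of what init_setup.sh might contain, assuming Python's built-in venv on a Unix-like shell (the article does not specify an environment manager):

#!/bin/bash
echo "Creating the virtual environment"
python -m venv venv
echo "Activating the virtual environment"
source venv/bin/activate
echo "Installing the development requirements"
pip install -r requirements_dev.txt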

Now we will move on to the project pipelines, which involve the development of a machine learning model. We have two pipelines, as stated earlier: the training pipeline and the evaluation pipeline. The training pipeline contains three components, namely data ingestion, data transformation (imputation, data balancing, encoding), and model training, while the evaluation pipeline contains the model evaluation component. Let me explain these files one after the other.

The data_ingestion.py file helps to load the dataset in the correct format, e.g. as a dataframe.
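A minimal sketch of what src/components/data_ingestion.py could look like, assuming pandas and a CSV source; the path, class, and method names are illustrative:

import pandas as pd
from sklearn.model_selection import train_test_split

class DataIngestion:
    def __init__(self, source_path="data/raw.csv"):  # hypothetical path
        self.source_path = source_path

    def initiate_data_ingestion(self):
        # Load the raw dataset into a dataframe
        df = pd.read_csv(self.source_path)
        # Hold out a test split for the evaluation pipeline
        train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
        return train_df, test_df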

The data_transformation.py file helps to transform the data, be it data balancing, encoding, or filling empty cells with values.
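A hedged sketch of src/components/data_transformation.py using scikit-learn; which columns are numeric or categorical depends on the dataset:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DataTransformation:
    def get_preprocessor(self, numeric_cols, categorical_cols):
        # Fill missing numeric values, then scale
        num_pipeline = Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ])
        # Fill missing categorical values, then one-hot encode
        cat_pipeline = Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ])
        return ColumnTransformer([
            ("num", num_pipeline, numeric_cols),
            ("cat", cat_pipeline, categorical_cols),
        ])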

The model_trainer.py file contains all the code for importing a conventional machine learning model, or the model architecture for a deep neural network.
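A minimal sketch of src/components/model_trainer.py for the conventional case; the choice of RandomForestRegressor is illustrative, not the article's model:

from sklearn.ensemble import RandomForestRegressor

class ModelTrainer:
    def initiate_model_training(self, X_train, y_train):
        # Fit a conventional ML model on the transformed training data
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X_train, y_train)
        return model

The training pipeline in src/pipeline/training_pipeline.py would then simply call these three components in order: ingestion, transformation, and training.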

However, this is not all, because there is a lot more to uncover when it comes to productionizing machine learning models. See you in the next blog post…

Au revoir!
