How to conduct an ML experiment

Pranav Jadhav
7 min read · Oct 8, 2023


If you've ever started a personal or professional ML experimentation project and gone looking for a primer or template to kickstart it, you will have noticed there isn't one. There is no hard and fast rule for this. So here is a compilation of do's and don'ts that I've encountered in my ML experimentation journey so far.

multiple experiment comparison (source)

Business and evaluation

Understanding the Business aspect of the project

As ML practitioners, we have a habit of diving directly into experimentation rather than asking the necessary questions first:

  • What are the business requirements?
  • What marks the experiment as successful?
  • Is there an existing solution, and if so, what are its shortcomings?
  • Who is the end user, and what is their perspective: a trained professional, an internal team member, or the general public?
  • What are the latency and throughput requirements once deployed?

Define final evaluation metrics and stick to them

Before kicking off, establish the right metrics to judge your experiments by, and understand how they tie to the business goals, both directly and indirectly. These metrics should remain consistent across all experiments in the project; that is the only way to draw parallel comparisons across experiments. If you end up changing metrics midway, generate the new metrics for the old experiments as well.
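
As a sketch of what "consistent metrics" can look like in code, here is a minimal shared evaluation helper, assuming a multi-class classification task and scikit-learn; the macro-averaged metrics are just an example, swap in whatever ties to your business case.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(y_true, y_pred) -> dict:
    """Single evaluation function shared by every experiment in the project."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "macro_precision": precision_score(y_true, y_pred, average="macro"),
        "macro_recall": recall_score(y_true, y_pred, average="macro"),
    }

print(evaluate([0, 1, 2, 1], [0, 2, 2, 1]))  # same function, every experiment
```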

Reproducibility

Reproducibility is a central goal when organizing an ML experimentation project. The objective is to ensure that experiments can be replicated in the future with the same configurations, yielding identical results. If, for instance, you forget to save the model weights or the exact configuration, rerunning the experiment may not produce the same results because of the randomness involved.

Containerize

Use Docker containers to isolate your modules and package versions from your system; this is an important step toward reproducibility. Start with a bare PyTorch or TensorFlow Docker image and build on it as you proceed with your experiments. At the end of your project, export a requirements.txt file for all your Python packages and also create a Dockerfile.

Seeding

When conducting an experiment, there is a lot of randomness involved, like the initialization of model weights, the random train-test split, or the random shuffling of train dataloaders during training. This randomness can affect your experiment's reproducibility. It is therefore best to set global random seeds in your main training script for the modules that involve randomness (random, numpy, torch/tensorflow) and store that seed value for that experiment. On another note, I suggest using dedicated random number generators rather than only global seeds, as sketched below.
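
Here is a minimal sketch of what that can look like with PyTorch and NumPy; the seed value is arbitrary and should be recorded alongside the experiment's config.

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    """Seed the common sources of randomness for a reproducible run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op if no GPU is available

SEED = 42  # arbitrary; store it in the experiment's config
seed_everything(SEED)

# Alternatively, prefer explicit generators over global state:
rng = np.random.default_rng(SEED)                 # local NumPy generator
loader_gen = torch.Generator().manual_seed(SEED)  # e.g. DataLoader(..., generator=loader_gen)
```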

Understanding the Importance of Data

EDA >> Modelling

As ML practitioners, we often prioritize model building over understanding the data itself. However, Exploratory Data Analysis (EDA) plays a pivotal role, and accounts for roughly half of the overall job. Overlooking certain details during EDA can significantly impact your project’s timeline, potentially setting you back by days or even weeks.

For instance, consider a scenario where you perform a randomized train/test split of your dataset without verifying the distribution of samples per class. Later on, you realize that some classes in your test dataset have no samples. This oversight becomes evident only after conducting several experiments, when you notice unfavorable macro-level metrics (such as F1 score, precision, and recall) highlighted in red. To rectify the issue, you're faced with the choice of either adjusting the split methodology or removing classes from the training set, necessitating a rerun of all previously conducted experiments.
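
As an illustration, a stratified split avoids that empty-class problem; the dataset.csv path and "label" column below are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")  # hypothetical labeled dataset with a "label" column

# Inspect samples per class first; rare classes show up here, not ten experiments later.
print(df["label"].value_counts())

# A stratified split keeps the class distribution similar in train and test,
# so no class ends up with zero test samples.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```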

Avoid Data Redundancy

When working with ML, you sometimes deal with GBs or TBs of data. You may experiment with different data splits or different data subsets, which can lead to multiple copies of the same information stored in more than one place. Instead, keep a single canonical copy and store lightweight references to it, such as file paths or row indices, in formats like a pandas DataFrame, CSV, txt, or JSON, as sketched below.
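
Continuing the stratified-split sketch above, one lightweight approach is to persist only the row indices of each split and rebuild the subsets from the single canonical file; the experiments/exp_001 layout is hypothetical.

```python
import json
import os

# Store row indices per split, not copies of the data.
os.makedirs("experiments/exp_001", exist_ok=True)
with open("experiments/exp_001/split.json", "w") as f:
    json.dump({"train": train_df.index.tolist(), "test": test_df.index.tolist()}, f)

# Any rerun rebuilds the exact same subsets from the one canonical dataset.
with open("experiments/exp_001/split.json") as f:
    split = json.load(f)
train_df, test_df = df.loc[split["train"]], df.loc[split["test"]]
```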

Avoid Boilerplate code

To enhance code organization and manage changes across experiments more effectively, it’s beneficial to identify and segregate areas of code that are prone to frequent modifications, such as model architectures, experimentation configurations, dataset iterators, and test/train functions. These should be placed within the specific experiment’s folder. Conversely, recognize the modules that are less likely to change and maintain them in a common location. For instance, common modules like loss functions, utility functions, metric functions, and main.py/train.py can be stored centrally. This approach minimizes code differences between experiments and simplifies tracking.

Configurations

You may have heard this in other areas of computer science, but it applies just as strongly to ML: remove all hard-coded values; every detail that might change between experiments goes in a config file. You can use any format you like, such as YAML, JSON, or a .py module. Create a unique config file for each experiment: it is easier and more intuitive to compare configs than scripts. Most ML experiment frameworks also track hyperparameters or configs per experiment, making it easier to search and compare across experiments.

config diff
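
For example, a per-experiment YAML config might look like the sketch below (paths, names, and values are hypothetical); loading it is one line with PyYAML.

```python
import yaml  # pip install pyyaml

# experiments/exp_001/config.yaml (hypothetical) might contain:
#
# experiment: exp_001
# seed: 42
# data:
#   csv_path: dataset.csv
#   test_size: 0.2
# model:
#   name: resnet18
#   lr: 0.001
# training:
#   epochs: 50
#   batch_size: 32

with open("experiments/exp_001/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["lr"])  # every tunable value lives in one diff-able file
```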

Loading Functions from configs directly

This has always been the heart of my experimentation: a small piece of code that loads functions and classes directly from the config file, supporting both static and dynamic parameters. The advantage is that ten lines of repetitive, boilerplate code can be reduced to one.

Gist for function loading from config
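
In case the original gist does not render here, this is a small sketch of the idea using importlib; the config keys ("target", "kwargs") are my own naming, not a fixed convention.

```python
import importlib

def load_from_config(spec: dict, **dynamic_kwargs):
    """Instantiate a function or class described by a config entry.

    Example spec (hypothetical): {"target": "torch.optim.Adam", "kwargs": {"lr": 0.001}}
    Static arguments come from the config; dynamic ones are passed at call time.
    """
    module_path, attr_name = spec["target"].rsplit(".", 1)
    obj = getattr(importlib.import_module(module_path), attr_name)
    return obj(**spec.get("kwargs", {}), **dynamic_kwargs)

# Usage: the optimizer class and its static hyperparameters live in the config,
# while the model parameters are supplied dynamically at runtime:
# optimizer = load_from_config(cfg["optimizer"], params=model.parameters())
```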

Automation is the way to go!

As we proceed with experimentation, we don't have a specific timeline in mind, and the number of experiments can grow substantially, making it hard to do everything manually. Instead, create helper scripts to automate menial tasks. Some examples:

  • Pushing model artifacts to S3 after every experiment
  • Generating metrics reports or visualizations after each experiment

Recognize all these small details at the very beginning so you can refine these tasks as you go.
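
As one example, here is a rough sketch of the artifact-upload task with boto3; the bucket name and folder layout are hypothetical, and it assumes AWS credentials are already configured.

```python
from pathlib import Path

import boto3  # assumes AWS credentials are already configured

def push_artifacts(exp_dir: str, bucket: str = "my-ml-artifacts") -> None:
    """Upload every file in an experiment folder to S3 (bucket name is hypothetical)."""
    s3 = boto3.client("s3")
    root = Path(exp_dir)
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{root.name}/{path.relative_to(root).as_posix()}"
            s3.upload_file(str(path), bucket, key)

# e.g. called at the end of every training run:
# push_artifacts("experiments/exp_001")
```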

Make git your best friend

As you progress you will change existing scripts and add or remove parameters from classes and functions; trust me, this always happens. Make a habit of running git add, commit, and push after every experiment. This tracks every tiny detail and makes it easy to revert if necessary.

Be Backward Compatible

As our experiments accumulate, we find bugs in older experiments, or the addition and removal of parameters in functions and classes leads to brittle code. Backward compatibility is important for reproducibility: after a bug fix, re-run the old experiments affected by the change and regenerate their metrics. Maybe the old experiments now perform well after the fix.
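
One simple pattern that helps: when a shared function or class gains a new parameter, give it a default value so configs and scripts from older experiments keep running unchanged. A hypothetical sketch:

```python
from torch.utils.data import DataLoader

def build_dataloader(dataset, batch_size=32, shuffle=True, num_workers=0):
    """Hypothetical helper shared across experiments.

    `num_workers` was added after the first experiments; giving it a default
    value keeps older configs (which never mention it) running unchanged,
    while newer experiments can override it.
    """
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      num_workers=num_workers)
```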

Tracking Experiments

As we go deeper into experimentation at a low level, we start missing patterns at a higher level. Having a framework or a spreadsheet for tracking multiple experiments' results shows us what we are missing. There are many third-party frameworks that let us track and visualize at a high level, like W&B, Aimstack, Neptune, and many more. Some of them are self-hosted, whereas others, such as W&B, host on their own servers, which may pose a risk for confidential or medical data. If you can't use any of these frameworks, good old spreadsheets provide the essential features along with collaboration.

experiment metrics comparison
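
For illustration, logging to W&B usually boils down to a few calls like the sketch below; the project name and logged values are placeholders, and it assumes you are logged in to W&B or a self-hosted server.

```python
import wandb  # assumes you are logged in to W&B or a self-hosted server

config = {"lr": 0.001, "epochs": 50, "batch_size": 32}  # normally read from the experiment's config file
run = wandb.init(project="my-ml-project", name="exp_001", config=config)

for epoch in range(config["epochs"]):
    train_loss, val_f1 = 0.0, 0.0  # placeholders for your real training/eval results
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_macro_f1": val_f1})

run.finish()
```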

Dry Run

Enhance the versatility of your primary experiment script by incorporating a ‘dry run’ feature. In essence, a dry run serves as a preliminary check to ensure that your script operates without errors and performs as expected prior to actual execution. During a dry run, time-consuming operations are minimized, and the focus is primarily on functional checks.

For instance, you can streamline the dry run by reducing the number of training epochs to just one to validate that the training loop functions correctly, instead of executing the full range of epochs, which may extend to hundreds or thousands. This accelerates the process and allows early detection of issues that would otherwise only surface in later stages of the experiment.
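
A simple way to wire this in is a command-line flag in the main training script; the epoch and batch caps below are arbitrary placeholders, and the dataloader stands in for your real one.

```python
import argparse

parser = argparse.ArgumentParser(description="Main training script")
parser.add_argument("--dry-run", action="store_true",
                    help="run one epoch on a handful of batches as a sanity check")
args = parser.parse_args()

# Real values would come from the experiment's config file.
epochs = 1 if args.dry_run else 100
max_batches = 5 if args.dry_run else None  # None means iterate the full dataloader

dataloader = []  # your real DataLoader goes here
for epoch in range(epochs):
    for step, batch in enumerate(dataloader):
        if max_batches is not None and step >= max_batches:
            break
        # ... forward pass, loss, backward pass, optimizer step ...
```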

Some More Tips!

  • Be iterative in your experimentation: make minor adjustments rather than drastic shifts. This lets you see what's beneficial and what's detrimental to your project.
  • Avoid using Jupyter notebooks, as it is difficult to see code differences across commits.
  • Create multiple evaluation metrics or visualizations during training at each epoch for more in-depth insights, for example, generating visualizations while training your GAN to see your model's improvement or deterioration.
  • If you are training for long periods on remote instances, rely on techniques that won't stop training when the connection breaks, like running Docker in the background, using the byobu terminal, or using the screen utility.
  • Maintain a consistent folder structure
folder structure template

Conclusions

This article was born out of shortcomings and improvements over my ML career. I continuously iterate on and improve my day-to-day processes to streamline them further and employ best practices. There is no single best way to carry out ML experimentation; everyone does it in their own way. This list will keep growing and will be polished over time.

If you found it helpful, leave a clap 👏🏻. Feel free to share your feedback and thoughts in the comments. You can also reach out to me via email.
