HOW TO TEST ML CODE?

Başak Tuğçe Eskili
Published in Marvelous MLOps
Oct 21, 2023

Just like how we check if small parts of our code work (unit testing) or if different parts work well together (integration testing) in software applications, we can write similar tests for our machine learning code base. This helps us make sure our ML code is correct, trustworthy, and robust.

ML models are affected by many things: the data they use, how the data is prepared, and the algorithms they are built on. When we test the code properly, we can catch mistakes early during development, making sure everything works as expected and the code stays healthy as the project grows.

In this article, I’d like to give you some ideas on when and how to write tests for ML code.

Installing pytest

A popular testing framework in Python is pytest. You can install it using the following command:

pip install pytest
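pytest discovers tests automatically: it collects files whose names start with test_ and runs the functions inside them whose names start with test_. A passing assert means a passing test. A minimal illustration (the file and function names below are placeholders, not part of the repository):

# tests/test_example.py
def add(a, b):
    return a + b


def test_add():
    # pytest reports this test as passed if the assertion holds
    assert add(2, 3) == 5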

More resources on pytest are provided at the end of the article.

Repository Overview

The code snippets are extracted from the movie-recommender repository, where a basic recommender algorithm is implemented.


movie-recommender/
├── data/                     # Contains the movie ratings dataset
│   └── ratings.csv
├── topn/
│   ├── preprocess/           # Handles data cleaning
│   │   └── preprocessor.py
│   ├── model/                # Recommender algorithm
│   │   └── recommender.py
│   ├── evaluation/           # Evaluation functions
│   │   └── evaluator.py
│   └── utils/
│       └── helper.py
├── tests/                    # Holds tests for the preprocessor, recommender, and evaluator modules
│   ├── test_preprocessor.py
│   ├── test_recommender.py
│   └── test_evaluator.py
├── setup.py
├── main.py                   # Main execution file
├── .gitignore
└── README.md

For testing, it is best to adopt Object-Oriented Programming (OOP) best practices and design your codebase in a modular manner. This approach will simplify the process of writing both unit and integration tests.
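A concrete example of why modularity matters: a function that reads its input from disk inside its own body is hard to unit test, while a function or class that receives a DataFrame as an argument can be exercised with in-memory mock data. A minimal sketch of the contrast (the names and path below are illustrative, not from the repository):

from pandas import DataFrame, read_csv


# Hard to test: the data source is hard-coded inside the function
def clean_ratings_from_disk():
    data = read_csv("data/ratings.csv")
    return data.drop_duplicates()


# Easy to test: the caller (or a test fixture) supplies the DataFrame
def clean_ratings(data: DataFrame) -> DataFrame:
    return data.drop_duplicates()

This is the pattern the DataPreprocessor class below follows: the data is passed into the constructor, so tests can inject whatever input they need.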

Write unit tests for preprocessing functions

Most ML codebases typically include preprocessing functions to prepare data for training, often involving modules for data cleaning and transformation. You can write unit tests for these functions using mock data (created within pytest fixture) or sample data (loaded from data files) to ensure they operate as intended.

DataPreprocessor class:

class DataPreprocessor:
    def __init__(self, data):
        self.data = data

    def clean_data(self):
        # Example: Drop duplicates and handle missing values
        self.data.drop_duplicates(inplace=True)
        self.data.dropna(subset=['userId', 'movieId', 'rating'], inplace=True)

    def transform_data(self):
        # Example: Convert userId and movieId to categorical codes
        self.data['userId'] = self.data['userId'].astype('category').cat.codes
        self.data['movieId'] = self.data['movieId'].astype('category').cat.codes

    def get_data(self):
        return self.data

Test code using mock data created in a pytest fixture:

import pytest
from pandas import DataFrame

from topn.preprocess.preprocessor import DataPreprocessor


# Unit tests for the DataPreprocessor class

# Creating mock data for testing
@pytest.fixture
def mock_raw_data():
    data = DataFrame({
        'userId': [1, 1, 2, 2, 3, 4, 5],
        'movieId': [101, 102, 101, 103, 102, None, 104],
        'rating': [5, 4, 3, 2, 1, 3, None]
    })
    return data


def test_clean_data(mock_raw_data):
    preprocessor = DataPreprocessor(mock_raw_data)
    preprocessor.clean_data()
    output = preprocessor.get_data()

    expected = DataFrame({
        'userId': [1, 1, 2, 2, 3],
        'movieId': [101.0, 102.0, 101.0, 103.0, 102.0],
        'rating': [5.0, 4.0, 3.0, 2.0, 1.0]
    })

    assert output.shape == (5, 3)
    assert output.equals(expected)


def test_transform_data(mock_raw_data):
    preprocessor = DataPreprocessor(mock_raw_data)
    preprocessor.transform_data()
    output = preprocessor.get_data()

    assert output['userId'].dtype == 'int8'
    assert output['movieId'].dtype == 'int8'

Testing the same code with sample data loaded from a file:

import pytest
from pandas import read_csv

from topn.preprocess.preprocessor import DataPreprocessor


# Unit tests for the DataPreprocessor class

# Loading sample data for testing
@pytest.fixture
def sample_raw_data():
    data = read_csv("tests/resources/ratings.csv")
    return data


def test_clean_data_2(sample_raw_data):
    preprocessor = DataPreprocessor(sample_raw_data)
    preprocessor.clean_data()
    assert sample_raw_data.shape == (20, 4)


def test_transform_data_2(sample_raw_data):
    preprocessor = DataPreprocessor(sample_raw_data)
    preprocessor.transform_data()
    assert sample_raw_data['userId'].dtype == 'int8'
    assert sample_raw_data['movieId'].dtype == 'int8'
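Since both variants define their own fixtures, you can move shared fixtures into a conftest.py file inside the tests/ folder; pytest picks it up automatically and makes the fixtures available to every test module without imports. A small sketch (this file is not necessarily part of the repository shown above):

# tests/conftest.py
import pytest
from pandas import read_csv


@pytest.fixture
def sample_raw_data():
    # Shared by all test modules under tests/
    return read_csv("tests/resources/ratings.csv")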

Write tests for recommender functions

We can write tests for the RecommenderSystem class to confirm that it provides accurate recommendations. Such tests are possible when the model is deterministic: given a specific training dataset and a specific prediction input, the output always matches a known expected result.

RecommenderSystem class:

from pandas import DataFrame
from sklearn.metrics.pairwise import cosine_similarity


class RecommenderSystem:
    def __init__(self, data):
        self.data = data
        self.user_movie_matrix = self._create_user_movie_matrix()
        self.movie_similarity_matrix = self._calculate_movie_similarity()

    def _create_user_movie_matrix(self):
        user_movie_matrix = self.data.pivot_table(
            index='userId', columns='movieId', values='rating', fill_value=0
        )
        return user_movie_matrix

    def _calculate_movie_similarity(self):
        movie_similarity = cosine_similarity(self.user_movie_matrix.T)
        movie_similarity_matrix = DataFrame(
            movie_similarity,
            index=self.user_movie_matrix.columns,
            columns=self.user_movie_matrix.columns,
        )
        return movie_similarity_matrix

    def get_top_n_recommendations(self, movie_id, n=5):
        similar_movies = self.movie_similarity_matrix[movie_id]
        recommended_movies = similar_movies.sort_values(ascending=False).head(n + 1).index.tolist()
        recommended_movies.remove(movie_id)
        return recommended_movies

Testing with mock data:

# Tests for the RecommenderSystem class
import pytest
import pandas as pd

from topn.model.recommender import RecommenderSystem


# Creating mock data for testing
@pytest.fixture
def mock_data():
    data = pd.DataFrame({
        'userId': [1, 1, 2, 2, 3],
        'movieId': [101, 102, 101, 103, 102],
        'rating': [5, 4, 3, 2, 1]
    })
    return data


# Integration test for RecommenderSystem with mock data
def test_recommender_system(mock_data):
    rec_sys = RecommenderSystem(mock_data)
    recommendations = rec_sys.get_top_n_recommendations(101, n=2)
    assert len(recommendations) == 2
    assert recommendations == [102, 103]

Testing with sample data:

# Tests for the RecommenderSystem class
import pytest
import pandas as pd

from topn.model.recommender import RecommenderSystem


# Loading sample data for testing
@pytest.fixture
def sample_data():
    data = pd.read_csv("tests/resources/processed_ratings.csv")
    return data


# Integration test for RecommenderSystem with sample data
def test_recommender_system_2(sample_data):
    rec_sys = RecommenderSystem(sample_data)
    recommendations = rec_sys.get_top_n_recommendations(3, n=5)
    assert len(recommendations) == 5
    assert recommendations == [0, 11, 18, 17, 16]

The motivation behind writing tests for a model class is to catch the model the moment it starts misbehaving. Of course, any change in hyperparameters, sample data, or algorithm logic will change the expected outputs in the test functions, so make sure to update your tests when such changes happen.
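One way to keep those expected outputs easy to update is to collect the input/output pairs in a single pytest.mark.parametrize decorator, so a change in the data or algorithm only requires editing one table of values. A sketch that reuses the mock_data fixture from above (the second expected row is derived from the same mock data, not taken from the repository):

import pytest


@pytest.mark.parametrize(
    "movie_id, n, expected",
    [
        (101, 2, [102, 103]),
        (102, 1, [101]),
    ],
)
def test_top_n_parametrized(mock_data, movie_id, n, expected):
    rec_sys = RecommenderSystem(mock_data)
    assert rec_sys.get_top_n_recommendations(movie_id, n=n) == expected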

Write integration tests for Evaluator class

Similar to the RecommenderSystem class, we can write integration tests for RecommenderEvaluator. This class computes evaluation metrics; by providing specific training data and corresponding ground-truth data points, we know in advance what metric values to expect and can write test functions that validate them.

RecommenderEvaluator:

class RecommenderEvaluator:
    def __init__(self, recommender, gold_data):
        self.recommender = recommender
        self.gold_data = gold_data

    def evaluate(self):
        true_positives = 0
        false_positives = 0
        false_negatives = 0

        for movie_id, similar_movies in self.gold_data.items():
            recommended_movies = self.recommender.get_top_n_recommendations(movie_id, n=len(similar_movies))
            for movie in recommended_movies:
                if movie in similar_movies:
                    true_positives += 1
                else:
                    false_positives += 1
            false_negatives += len([movie for movie in similar_movies if movie not in recommended_movies])

        # True negatives are not tracked, so "accuracy" here is TP / (TP + FP + FN)
        accuracy = true_positives / (true_positives + false_positives + false_negatives) if (true_positives + false_positives + false_negatives) > 0 else 0
        precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
        recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        evaluation_results = {
            "F1_score": round(f1_score, 2),
            "Accuracy": round(accuracy, 2)
        }

        return evaluation_results

Testing with sample data:

import pytest
from pandas import read_csv

from topn.evaluation.evaluator import RecommenderEvaluator
from topn.model.recommender import RecommenderSystem

# Integration tests for the RecommenderEvaluator class

# Creating mock gold data for testing
@pytest.fixture
def mock_gold_data():
    gold_data = {
        3: [0, 11, 18, 17, 16],
        11: [7, 1, 18, 17, 16],
        16: [11, 1, 12, 17, 15],
        17: [0, 3, 18, 16, 15]
    }
    return gold_data


# Loading sample data for testing
@pytest.fixture
def sample_raw_data():
    data = read_csv("tests/resources/processed_ratings.csv")
    return data


def test_evaluator(sample_raw_data, mock_gold_data):
    rec_sys = RecommenderSystem(sample_raw_data)
    evaluator = RecommenderEvaluator(rec_sys, mock_gold_data)
    evaluation_results = evaluator.evaluate()

    assert evaluation_results['F1_score'] == 0.80
    assert evaluation_results['Accuracy'] == 0.67
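These two asserted values are consistent with each other: because get_top_n_recommendations returns exactly as many movies as each gold list contains, every false positive is mirrored by a false negative. Assuming the sample data yields 16 correct recommendations out of the 4 × 5 = 20 gold items (inferred from the assertions, not recomputed from the data), the arithmetic works out as follows:

tp, fp, fn = 16, 4, 4                                 # FP == FN by construction
precision = tp / (tp + fp)                            # 0.80
recall = tp / (tp + fn)                               # 0.80
f1 = 2 * precision * recall / (precision + recall)    # 0.80
accuracy = tp / (tp + fp + fn)                        # 0.666... rounds to 0.67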

Run tests

Once you are inside the project directory, you can run pytest and specify the directory where your tests are located. In this case, our tests are in the tests/ folder. Run the following command:

pytest tests/

You can also execute specific test files:

pytest tests/test_preprocessor.py
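Or run a single test function within a file, adding -v for more verbose output:

pytest tests/test_preprocessor.py::test_clean_data -v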

Conclusion

The examples here are just a start; real projects will likely have many more functions to test. When you test different parts of your code and how they work together, you catch mistakes early. This keeps your code quality high and helps build robust machine-learning applications. Keep testing!
