Common Python Mistakes While Preparing Machine Learning Models

Ravish Kumar
Published in EnjoyAlgorithms
9 min read, Feb 16, 2024

In almost every Machine Learning and Deep Learning code, we notice a similar structure:

  • The code starts by importing the necessary libraries.
  • Then, we read the data files.
  • We analyze and preprocess the data we have read.
  • We split the processed data into train, validation, and test sets.
  • We create the model with an optimizer and a loss function.
  • Finally, we evaluate the model on various metrics.

While implementing this pipeline in Python, we encounter various kinds of errors. This article walks through the common mistakes at each stage and shows how to resolve them.

Common Mistakes at the stage of importing libraries

This is the first step in almost all ML projects: we import the required libraries so that we can use their functions inside our programs. For example, in the code below, we import NumPy for mathematical calculations, Pandas to read the file, and Matplotlib and Seaborn to plot the data features.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

1. ModuleNotFoundError

When we try to import a library that is not installed in the Python environment, it will produce a ModuleNotFoundError. For example, if Pandas is not already installed in a system, it will throw an error like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'

Resolution of this error:

To resolve a ModuleNotFoundError, we need to install the missing library in the Python environment that runs our code. For the above example, we can install Pandas with pip: pip3 install pandas. Occasionally, a library cannot be installed through pip and must be installed from source. In that case, we clone its git repository and build the module by running its setup files. Once the library is installed in the active environment, the error is resolved.
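Before installing anything, it can help to confirm which interpreter is running and which modules it can actually see. The snippet below is a minimal sketch using only the standard library; the helper name is_installed and the list of required modules are assumptions made for illustration.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical list of top-level modules this project needs.
required = ["json", "pandas", "numpy"]
missing = [name for name in required if not is_installed(name)]

# The interpreter path shows which environment pip should install into.
print("Interpreter:", sys.executable)
print("Missing modules:", missing)
```

If a module is reported missing even though it was installed, the install probably went into a different environment than the one printed above.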

2. ImportError

This error appears when we try to import a function that does not exist in the named library. For example, there is no function named "read_abc" in the Pandas library, so the import below raises an ImportError.

from pandas import read_abc

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'read_abc' from 'pandas' (/Users/ravishkumar/EA_venv/lib/python3.9/site-packages/pandas/__init__.py)

Resolution of this error:

To resolve an ImportError, we need to know which functions a library actually provides. NumPy and Pandas, for example, offer different functionality, so we cannot import a function defined in Pandas from NumPy. In most cases, though, the cause is a simple spelling mistake. So we need to verify that the function exists in that library and that its name is spelled correctly.

Sometimes the installation does not complete as expected, and the library is only partially installed. In that case, we first uninstall the library using pip, e.g., pip3 uninstall pandas, and then reinstall it from scratch with pip3 install pandas.
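Since misspelled names are the usual culprit, one way to check a name before importing it is to compare it against what the module exposes. The sketch below uses the standard-library json module as a stand-in so that it runs anywhere; the helper suggest_names is hypothetical, and the same idea applies to pandas or any other imported module.

```python
import difflib
import json


def suggest_names(module, wrong_name, limit=3):
    """Return public names in `module` that closely resemble `wrong_name`."""
    public = [name for name in dir(module) if not name.startswith("_")]
    return difflib.get_close_matches(wrong_name, public, n=limit)


# "lods" is a deliberate typo for "loads".
print(hasattr(json, "lods"))        # the misspelled name does not exist
print(suggest_names(json, "lods"))  # close matches found in the module
```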

Common Mistakes at the stage of reading data files

Once the required libraries and functions are imported, we use them to build the rest of the ML pipeline. The first requirement is the data, which may be structured or unstructured. Structured datasets are most commonly stored in CSV files, which we read into Python programs using the Pandas library like this:

import pandas as pd

df = pd.read_csv('iris.csv')

1. FileNotFoundError:

This error appears when we try to read a data file that is not present at the location we specify. For example, a relative path such as read_csv('iris.csv') is resolved against the current working directory, which is often, but not always, the directory that contains our Python script.

FileNotFoundError: [Errno 2] No such file or directory: 'iris.csv'

Resolution of this error:

To resolve FileNotFoundError, we need to ensure two things:

  • Is that file present in our system?
  • If that file is present, did we pass the correct path to our function? For example, if the file is on the Desktop, then give the full path like this: pd.read_csv('/home/ravish/Desktop/iris.csv')

Ensuring these points will resolve the error.
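Both checks can be automated before the file is handed to a reader. The following is a minimal sketch with pathlib; the helper resolve_data_file and its base_dir parameter are illustrative names, not part of Pandas.

```python
from pathlib import Path


def resolve_data_file(filename, base_dir=None):
    """Return an absolute path to `filename`, or raise a descriptive error."""
    # Relative paths are resolved against base_dir, or the working directory.
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / filename).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"{filename!r} not found; looked in {path.parent}")
    return path


print("Current working directory:", Path.cwd())
try:
    csv_path = resolve_data_file("iris.csv")
    print("Found:", csv_path)
except FileNotFoundError as err:
    print("Missing:", err)
```

Printing the working directory is often enough to explain why a relative path that works in one place fails in another.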

2. EmptyDataError:

A CSV file stores one record per line, with the values separated by commas ",". If the file we pass to read_csv is empty, so that there is no header row or data to parse, Pandas raises an EmptyDataError.

EmptyDataError: No columns to parse from file

Resolution of this error:

To resolve the EmptyDataError, we first confirm that the file actually contains comma-separated data by opening it in a text editor or any CSV viewer. If the file is too large to open in full, we can check its size or look at its first few lines. If it turns out to be empty, we re-fetch the file and check again whether it contains the required data.
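The same check can be done in code before calling read_csv. This sketch uses only the standard library and treats a file as empty when it has zero bytes or only blank lines; the helper name looks_empty is an assumption, and the two temporary files exist purely for the demonstration.

```python
import os
import tempfile


def looks_empty(path):
    """Return True if the file has zero bytes or only blank lines."""
    if os.path.getsize(path) == 0:
        return True
    with open(path, "r", encoding="utf-8") as handle:
        # all() stops at the first non-blank line, so large files stay cheap.
        return all(not line.strip() for line in handle)


with tempfile.TemporaryDirectory() as tmp:
    empty_csv = os.path.join(tmp, "empty.csv")
    good_csv = os.path.join(tmp, "good.csv")
    open(empty_csv, "w").close()  # create a zero-byte file
    with open(good_csv, "w", encoding="utf-8") as handle:
        handle.write("sepal_length,species\n5.1,setosa\n")

    print(looks_empty(empty_csv))
    print(looks_empty(good_csv))
```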

Common Mistakes at the stage of Analyzing and preprocessing data

Once the data is read, we need to process it to feed it to Machine Learning models for finding patterns. The data we read using the Pandas library gives a DataFrame object. To optimize the numerical computations on this, we change the DataFrame to numpy arrays like this:

import numpy as np
import pandas as pd

df = pd.read_csv(
    'https://raw.github.com/pandas-dev/'
    'pandas/main/pandas/tests/io/data/csv/iris.csv'
)

arr = np.array(df)
print(arr.shape)

## (150, 5)

The most common errors at this stage will occur when handling large datasets in dataframes or numpy arrays.

1. ValueError because of NaN or Missing Values.

Suppose we want to convert values that include NaN (Not a Number) to an integer type. Integers have no way to represent NaN, so the conversion throws an error.

Traceback (most recent call last)

ValueError: cannot convert float NaN to integer

Resolution of this error:

To solve this error, we first replace the NaN values in the data with a valid value of the desired data type and then perform the desired operation.

arr = np.nan_to_num(arr) ## Returns a copy in which all NaN values are replaced with zero
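The error message shown above is the one Python's built-in int() produces for a NaN float. The sketch below reproduces it with plain Python lists and then applies the same fix, replacing NaN before converting; the helper to_int_filling_nan and the sample readings are made up for illustration.

```python
import math


def to_int_filling_nan(values, fill=0):
    """Convert floats to ints, substituting `fill` wherever the value is NaN."""
    return [fill if math.isnan(v) else int(v) for v in values]


readings = [1.0, float("nan"), 3.0]

try:
    [int(v) for v in readings]  # fails on the NaN entry
except ValueError as err:
    print("ValueError:", err)

print(to_int_filling_nan(readings))  # [1, 0, 3]
```

Whether zero is a sensible replacement depends on the feature; a mean or median is often a better choice than a constant.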

2. TypeError:

This type of error is very common when working with data. It occurs when we perform an operation on values whose data types are incompatible, such as adding an integer to a string. For example,

X = 10
Y = 'a'
C = X + Y

#TypeError: unsupported operand type(s) for +: 'int' and 'str'

Resolution of this error:

Ensure that the two variables have compatible data types, converting one of them explicitly if necessary, and only then perform mathematical operations on them.
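When values arrive as strings, as they often do from CSV files, an explicit conversion makes the intent clear. Here is a small sketch; add_as_numbers is a hypothetical helper, and converting to float is an assumption about what the data should contain.

```python
def add_as_numbers(a, b):
    """Add two values after converting each to float; raise TypeError otherwise."""
    try:
        return float(a) + float(b)
    except (TypeError, ValueError) as err:
        raise TypeError(f"cannot add {a!r} and {b!r} as numbers") from err


print(add_as_numbers(10, "2.5"))  # 12.5

try:
    add_as_numbers(10, "a")  # 'a' is not numeric, so conversion fails
except TypeError as err:
    print("TypeError:", err)
```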

3. IndexError

This happens when we try to access an index that does not exist in a list or NumPy array. For example, if a NumPy array has 150 rows and we try to read the row at index 200, it will throw an error.

# IndexError: index 200 is out of bounds for axis 0 with size 150

Resolution of this error:

Ensure that any index we use to read from the array is always strictly less than the length of that array. Why less than the length and not equal to it? Because indexing starts from 0 in Python, an array of length n has valid indices 0 to n - 1.
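A bounds check before indexing makes this rule explicit. The following sketch uses a plain list of 150 items to match the error message above; get_or_default is a hypothetical helper, and rejecting negative indices is a deliberate simplification.

```python
def get_or_default(sequence, index, default=None):
    """Return sequence[index] if 0 <= index < len(sequence), else `default`."""
    if 0 <= index < len(sequence):
        return sequence[index]
    return default


rows = list(range(150))

print(rows[len(rows) - 1])        # last valid index is len - 1, prints 149
print(get_or_default(rows, 200))  # out of bounds, prints None

try:
    rows[200]
except IndexError as err:
    print("IndexError:", err)
```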

4. AttributeError

We know that everything in Python is an Object, and every object has some data type. Every type of object has some predefined attributes, but when we try to use an attribute that is not valid for a certain type of object, AttributeError occurs.


string = "Hello, world!"
string.reverse()

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [59], line 2
1 string = "Hello, world!"
----> 2 string.reverse()

AttributeError: 'str' object has no attribute 'reverse'

Resolution of this error:

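To resolve this error, we must first check that the object's type actually provides the attribute or method we are calling, for example by reading its documentation or inspecting the object with dir() or hasattr().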
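For the string example above, the check and two working alternatives can be sketched as follows. Lists do have a reverse() method while strings do not, which is a likely source of the confusion.

```python
text = "Hello, world!"

# Check for the attribute before relying on it.
print(hasattr(text, "reverse"))  # False: str has no reverse()
print(hasattr(text, "upper"))    # True

# Alternative 1: slicing with a step of -1 returns a reversed copy.
reversed_text = text[::-1]
print(reversed_text)

# Alternative 2: convert to a list, which does provide reverse().
letters = list(text)
letters.reverse()
print("".join(letters))
```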

Common Mistakes at the stage of splitting the dataset into Train, Validation, and Test sets

After data processing, we split the dataset into

  • Train set: Used to train the Machine Learning model.
  • Validation set (Val set): Used to fine-tune the hyperparameters by checking the performance of the trained model and tweaking the values of the hyperparameters.
  • Test set: Used to evaluate the performance of the final trained model after tuning the hyperparameters.

Developers mostly use the Scikit-learn library for this job, so let's see errors related to that and mistakes while writing this splitting from scratch.

Splitting using Scikit-learn:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

1. ValueError:

How we split the dataset depends on how much data is available. In general, the training portion should be the largest so that it captures the patterns present in the entire dataset. The train_test_split function takes test_size as an input parameter. When given as a float, it must lie strictly inside (0, 1), and it is the fraction of samples assigned to the test set. If we pass a value outside this range, the function throws a ValueError.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1.5, random_state=42
)

# ValueError: test_size=1.5 should be either
# positive and smaller than the number of samples 5 or a float
# in the (0, 1) range


X_train, X_test, y_train, y_test = train_test_split(
    X, test_size=0.33, random_state=42
)

## ValueError: not enough values to unpack (expected 4, got 2)

## train_test_split returns a train part and a test part for each array it
## receives. Here we pass only X, so it returns two values, not four.

Resolution of the ValueError:

Check every input argument of the function against its allowed range, and make sure the number of variables on the left-hand side matches the number of values the function returns. This will solve the ValueError.
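If test_size comes from a config file or user input, it can be validated up front. The sketch below mirrors the ranges quoted in the error message above; check_test_size is a hypothetical helper written for this article, not scikit-learn's own validation code.

```python
def check_test_size(test_size, n_samples):
    """Raise ValueError unless test_size is a float in (0, 1)
    or an int in [1, n_samples)."""
    is_int = isinstance(test_size, int) and not isinstance(test_size, bool)
    if isinstance(test_size, float) and 0.0 < test_size < 1.0:
        return
    if is_int and 0 < test_size < n_samples:
        return
    raise ValueError(
        f"test_size={test_size!r} must be a float in (0, 1) "
        f"or an int in [1, {n_samples})"
    )


check_test_size(0.33, n_samples=5)  # a valid fraction passes silently
check_test_size(2, n_samples=5)     # a valid absolute count passes too

try:
    check_test_size(1.5, n_samples=5)
except ValueError as err:
    print("ValueError:", err)
```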

Scratch Splitting:

X, y = np.arange(10).reshape((5, 2)), range(5)
tot_len_x = len(X)
test_percent = 0.33
train_percent = 1 - test_percent
split_idx = int((train_percent)*tot_len_x)  # integer index where the test set begins

X_train, Y_train = X[:split_idx], y[:split_idx]
X_test, Y_test = X[split_idx:], y[split_idx:]

A hand-written implementation like this is more error-prone, so we need to be careful while writing it.

2. TypeError

Please note that slicing a NumPy array requires integer indices. The product (train_percent)*tot_len_x is a float, so using it directly as a slice index produces a TypeError.

X_train, Y_train = X[:((train_percent)*tot_len_x)], y[:((train_percent)*tot_len_x)]

## TypeError: slice indices must be integers or None or have an __index__ method

Resolution of TypeError:

To solve the TypeError while splitting the dataset, wrap the multiplication in int(), as in int((train_percent)*tot_len_x), so that the slice indices are integers. Logical errors are still possible, such as setting train_percent greater than 1, which raises no exception but leaves the test set empty. So we must be cautious when slicing NumPy arrays to split a dataset.
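Wrapping the scratch split in a function lets us guard against both the TypeError and the logical errors in one place. This is a minimal sketch on plain Python lists; manual_split, its parameters, and the seeded shuffle are additions for illustration and are not part of the implementation above, which slices without shuffling.

```python
import random


def manual_split(X, y, test_fraction=0.33, seed=42):
    """Shuffle and split X and y into train and test parts."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} samples but y has {len(y)} labels")
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly inside (0, 1)")

    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    # int() guarantees an integer slice index.
    split_idx = int((1 - test_fraction) * len(indices))
    train_idx, test_idx = indices[:split_idx], indices[split_idx:]

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
y = [0, 1, 2, 3, 4]
X_train, X_test, y_train, y_test = manual_split(X, y)
print(len(X_train), len(X_test))  # 3 2
```

Indexing X and y with the same shuffled indices keeps every sample paired with its label.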

Common Mistakes at the stage of Model creation, choosing optimization and loss functions

There are many possibilities for encountering different errors while building ML models, choosing suitable optimization algorithms, and defining proper cost functions. But here, we will discuss only a few of them that are very common.

ValueError:

This error occurs when the training data and the labels contain different numbers of samples.

classifier.fit(X_train, y_train)

# ValueError: Found input variables with inconsistent numbers of samples: [112, 111]

Resolution of this error:

When we train a supervised learning model, the number of labels must equal the number of input samples, so that every input sample has a corresponding output.
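A one-line check before calling fit catches the mismatch early and with a clearer message. The sketch below uses placeholder lists sized to match the error above; check_same_length is a hypothetical helper.

```python
def check_same_length(X, y):
    """Return the sample count, or raise ValueError if X and y disagree."""
    n_x, n_y = len(X), len(y)
    if n_x != n_y:
        raise ValueError(f"X has {n_x} samples but y has {n_y} labels")
    return n_x


# Placeholder data: 112 feature rows but only 111 labels.
X_train = [[0.0, 0.0]] * 112
y_train = [0] * 111

try:
    check_same_length(X_train, y_train)
except ValueError as err:
    print("ValueError:", err)
```

When the counts differ, the cause is usually a filter or a dropna applied to the features but not to the labels, or the other way around.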

If we use frameworks and libraries for the algorithms, the chances of making Python mistakes at this stage are low. As long as we pass valid arguments to the functions a library defines, the model will train. However, the chances of making logical errors at this point are high.

For example, we might use binary cross-entropy as the loss function for a multi-class classification problem or a regression problem. This is a fundamental logical flaw, and it can be avoided only with a sound understanding of loss and cost functions and of the optimization algorithms used while building ML models.

Common Mistakes at the Stage of Model Evaluation

After training the Machine Learning model, we evaluate its performance on the test data, and based on the performance, we send the model for deployment. But during evaluation, we can encounter the following common errors:

1. ValueError

This is one of the most common errors, and it occurs when the Y_actual and Y_predicted arrays have different lengths. It often propagates from a mistake made earlier, for example while splitting the dataset.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

## ValueError: Found input variables with inconsistent numbers of samples: [38, 0]

Resolution of this error:

We need to ensure that the two variables (y_test, y_pred) have the same shape. If their lengths differ, the metric functions will always raise this error.

2. Empty Image

plt.plot(accuracy_score(y_test, y_pred))
plt.show()

The resulting plot is empty. This is a common mistake when we try to plot our ML model's accuracy on the test data.

Resolution of this error:

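Please note that this is a logical error rather than a Python exception, but it is a common one. Accuracy on the test set is a single number: the fraction of correct predictions out of all predictions. A line plot of a single point draws nothing visible, so we should print the value instead and keep plots for quantities that vary, such as accuracy per epoch.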
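To make the point concrete, here is a from-scratch sketch that computes accuracy and prints it as one number. The accuracy function below is a simplified stand-in written for this article, and the labels are made-up values.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where y_pred equals y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError(
            f"inconsistent numbers of samples: [{len(y_true)}, {len(y_pred)}]"
        )
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_test = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_test, y_pred)
print(f"Test accuracy: {score:.2%}")  # Test accuracy: 80.00%
```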

Other possible logical errors include evaluating Precision when recall is required, and vice versa, or other theoretical errors. However, discussing them is out of the scope of this blog.

Enjoy Learning!

