Navigating the Minefield: Strategies for using DataFrames with Irreplaceable Missing Values

Abhishek Mahajan
8 min read · Jan 13, 2024


What do you do when you want to run the KNN algorithm on a dataframe with missing values? That is the problem I tackle in this article. Couldn't I simply impute the missing values, or drop the rows/columns that contain them? I will explain why neither worked in this scenario. Follow along!

I encountered this problem on a dataset I was recently preprocessing as input to a KNN algorithm. The dataset had loads of NaN values, but they were not concentrated in a handful of columns; instead they were spread across all columns. I know this is a bit confusing to picture (I cannot use a snapshot of the actual data, since it is not open to the public), so let me generate a sample dataframe “df” representing that data:

The sample dataframe, named “df”
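Since I can't paste the original snapshot, here is a small synthetic stand-in built with pandas (the column names are invented purely for illustration; only the scattered-NaN pattern matters):

import numpy as np
import pandas as pd

# hypothetical stand-in for the real data: every column contains a NaN,
# and the NaNs are spread across rows rather than clustered in a few columns
df = pd.DataFrame({
    "age_range": [1.0, np.nan, 3.0, 2.0, np.nan],
    "income":    [np.nan, 54000.0, 61000.0, np.nan, 48000.0],
    "is_member": [0.0, 1.0, np.nan, 1.0, 0.0],
    "score":     [0.3, 0.5, np.nan, np.nan, 0.9],
})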

The Problem

If you look at the data closely, you will find a NaN representing a missing value in every column, but the NaNs are not concentrated in just a few columns. So if you try to simply drop the rows with NaN values using the dropna() function, you end up with an empty dataframe that obviously cannot be used:
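With the stand-in frame above, that looks like this:

print(df.dropna())
# Empty DataFrame
# Columns: [age_range, income, is_member, score]
# Index: []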

Now, the next thought that comes to everyone's mind: why not simply impute the missing values? Well, the original dataframe I worked with contained a lot of categorical columns with missing values. When I one-hot encoded those columns, I had to put NaN in all of the encoded columns whenever the value in the original column was NaN, and I couldn't impute the value either, since these were contextual columns like “Age range” that correlated with other values in the same row. I know it is a little difficult to visualize without the real data; the bottom line is that I could not simply impute the NaN values without creating data-integrity problems.
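To sketch the encoding issue on a hypothetical column: by default, a NaN row comes out of one-hot encoding as all zeros, which silently claims “not in any category” instead of “unknown”, so the NaN has to be written back across all of the encoded columns:

# hypothetical categorical column
cat = pd.Series(["18-25", np.nan, "26-35", "18-25"], name="age_range")

# one-hot encode; the NaN row becomes all zeros by default
encoded = pd.get_dummies(cat).astype(float)

# restore NaN across all encoded columns for the originally missing rows
encoded[cat.isna()] = np.nan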

Now, why exactly did this become such a huge problem? Because sklearn's implementation of the KNN algorithm uses the L2 (Euclidean) distance by default to compute distances between data points, and guess what? That implementation is not compatible with dataframes containing NaN or missing values. If you try to run sklearn's KNN on a dataset with missing data, you will be greeted with a beautiful error message:
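A minimal reproduction (the exact wording of the error varies between sklearn versions, but it is a ValueError complaining about NaN in the input):

from sklearn.neighbors import KNeighborsClassifier

X = df.to_numpy()    # still contains NaN
y = [0, 1, 0, 1, 0]  # hypothetical labels

KNeighborsClassifier(n_neighbors=3).fit(X, y)
# raises something like: ValueError: Input X contains NaN.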

What was immortus98 (aka me) to do? I wondered for a while before concluding that I needed to implement the KNN algorithm myself with a custom distance metric, similar in spirit to the L2 Euclidean distance but not identical to it. In particular, I needed to group similar rows together (we can't just ignore NaN values, since in our case they carry context), give weight to the non-NaN similarities within rows, and deprioritize NaNs while still keeping them in the picture. Sounds complex, right? Let me break it down into a set of simple tests the metric must pass, from which we can design the metric itself (a bottom-up approach!).

The Criteria for the new distance calculation metric

The new distance metric needs to satisfy the following criteria:

  1. If two samples are identical, the distance between them should be zero.
  2. As the extent of difference between two samples increases, the distance should increase.
  3. As the extent of similarity between two samples increases, the distance should decrease.
  4. If, in a pair of samples, one or both have a NaN value for a given feature, the similarity or difference of that feature is unknown. The metric should compute a smaller distance for a pair of samples with many similarities (even if there is some small difference) than for a pair of samples with mostly unknown similarity.

Still confused? Let us look at some examples using two numpy arrays, a and b, where we compare the single test sample a with every row of b:

1. First criterion: if two samples are identical, the distance between them should be zero. So the distance between the two numpy arrays below must be zero:

a = np.array([[0, 1, 0, 1, 0, 0.3]])  # A0 - test sample
b = np.array([[0, 1, 0, 1, 0, 0.3]])  # B0 - same as A0, should have 0 distance

2. Second criterion: as the extent of difference between two samples increases, the distance should increase.

These should have increasing distance:

a = np.array([[0, 1, 0, 1, 0, 0.3]])  # A0 - test sample
b = np.array([[0, 1, 0, 1, 0, 0.3],   # B0 - same as A0, should have 0 distance
              [0, 1, 0, 1, 0, 0.5],   # B1 - one small difference, should have a larger distance than B0
              [0, 1, 0, 1, 0, 1  ],   # B2 - more difference, should have a larger distance than B1
              [0, 0, 0, 1, 0, 0  ],   # B3 - even more difference
              [1, 0, 1, 0, 1, 0  ]])  # B4 - the most difference

These should have decreasing distance:

a = np.array([[0, 1, 0, 1, 0, 1]])            # A0 - test sample
b = np.array([[1, 0, 1, 0, 1, 0],             # B0 - completely different, should have a large distance
              [1, 0, 1, 0, 1, np.nan],        # B1 - less difference than B0, should have a smaller distance
              [1, 0, 1, 0, np.nan, np.nan]])  # B2 - even less difference than B1, should have a smaller distance

3. Third criterion: as the extent of similarity between two samples increases, the distance should decrease.

These should have increasing distance:

a = np.array([[0, 1, 0, 1, 0, 0.3]])                         # A0 - test sample
b = np.array([[0, 1, 0, 1, 0, 0.3],                          # B0 - same as A0, should have 0 distance
              [0, 1, 0, 1, 0, np.nan],                       # B1 - less similarity than B0, should have a larger distance
              [0, 1, 0, 1, np.nan, np.nan],                  # B2 - even less similarity, should have a larger distance
              [0, np.nan, np.nan, np.nan, np.nan, np.nan]])  # B3 - least similarity, should have the largest distance

4. Fourth criterion: if, in a pair of samples, one or both have a NaN value for a given feature, the similarity or difference of that feature is unknown. The metric should compute a smaller distance for a pair of samples with many similarities (even if there is some small difference) than for a pair of samples with mostly unknown similarity.

These should have increasing distance:

a = np.array([[0, np.nan, 0, 1, np.nan, 0.3]])               # A0 - test sample
b = np.array([[0, np.nan, 0, 1, 0, 0.5],                     # B0 - three similar features, one small difference
              [0, np.nan, np.nan, np.nan, np.nan, np.nan]])  # B1 - much less similarity than B0, should have a larger distance

The Solution

Now, using all these criteria, I was able to build the distance/similarity metric as follows:

def custom_distance(a, b):
    # absolute element-by-element difference between a and each row of b
    dif = np.abs(np.subtract(a, b))
    # similarity mask: 1 where the difference is 0,
    # denoting fully similar items at that index
    sim_mask = np.where(dif == 0, 1, 0)
    # NaN mask: 1 where the difference is NaN,
    # denoting a NaN at that index in one or both arrays
    nan_mask = np.where(np.isnan(dif), 1, 0)
    # dissimilarity mask: 1 where the difference is greater than 0,
    # denoting dissimilar items (of any degree of dissimilarity) at that index
    # (NaN comparisons evaluate to False, so they are excluded here)
    dif_mask = np.where(dif > 0, 1, 0)
    # number of similar items, dissimilar items and NaN comparisons per row
    n_sim = np.sum(sim_mask, axis=1)
    n_nan = np.sum(nan_mask, axis=1)
    n_dif = np.sum(dif_mask, axis=1)

    # avoid division-by-zero errors by clamping 0 counts to 1,
    # making 1 the minimum value of each count;
    # this barely changes the result's direction since we only take fractions
    n_sim[n_sim == 0] = 1
    n_dif[n_dif == 0] = 1
    n_nan[n_nan == 0] = 1
    # base distance: NaN-ignoring sum of absolute differences
    # (note: a Manhattan/L1-style sum, not a true L2 Euclidean distance)
    euc = np.nansum(dif, axis=1)
    # actual distance: a combination of the base distance,
    # the fraction of dissimilar to similar items and
    # the fraction of NaN to non-NaN comparisons
    dist = euc + 0.25 * (n_dif / n_sim) + 0.75 * (n_nan / (n_sim + n_dif))
    # force the distance to 0 if ALL values of the two arrays are similar,
    # since the fractions above would otherwise make it non-zero
    dist = np.where(n_sim == b.shape[1], 0, dist)

    # return the custom distances between a and every row of b
    return dist

Explaining the algorithm:

  1. Take the absolute value of the element-by-element difference between the two input arrays and store it in an array.
  2. Generate three masks from that array: a similarity mask, a dissimilarity mask and a NaN mask.
  3. Compute the number of similar items, dissimilar items and NaN comparisons from the masks generated in step 2.
  4. Clamp zero counts to 1 to avoid division-by-zero errors, since we take fractions of these counts in the upcoming steps.
  5. Compute a base distance as the NaN-ignoring sum of absolute differences (using np.nansum). Note that this is a Manhattan (L1) style sum rather than a true Euclidean distance.
  6. Compute the custom distance as a combination of the base distance, the fraction of dissimilar to similar items, and the fraction of NaN to non-NaN comparisons.
  7. Force the distance to 0 if ALL values of the two arrays are similar, because the fractions used in the distance would otherwise return a non-zero value, which is undesirable for a fully identical pair (we need 0 here).
  8. Return the distances.
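As a quick spot check, here is the metric run on the fourth criterion's arrays (the two distances should come out in increasing order):

a = np.array([[0, np.nan, 0, 1, np.nan, 0.3]])
b = np.array([[0, np.nan, 0, 1, 0, 0.5],
              [0, np.nan, np.nan, np.nan, np.nan, np.nan]])

print(custom_distance(a, b))  # B0 gets a smaller distance than B1, as required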

I ran the metric against all the criteria above, and it returned the desired ordering in every case. Now we can write a basic KNN classifier on top of it:

def knn_predict(train_data, train_labels, test_instance, k):
    # calculate distances between test_instance and every row of train_data
    # (custom_distance already compares one sample against all rows of b)
    distances = custom_distance(test_instance.reshape(1, -1), train_data)

    # get indices of the k nearest neighbors
    indices = np.argsort(distances)[:k]

    # get labels of the k nearest neighbors
    neighbor_labels = train_labels[indices]

    # return the most common label among the neighbors
    return np.bincount(neighbor_labels).argmax()
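A quick hypothetical usage, with made-up training rows (integer class labels, NaNs scattered through the features):

train_data = np.array([[0, 1, 0, 1, 0, 0.3],
                       [1, 0, 1, 0, 1, 0.9],
                       [0, 1, np.nan, 1, 0, 0.4]])
train_labels = np.array([0, 1, 0])

test_instance = np.array([0, 1, 0, 1, np.nan, 0.35])
print(knn_predict(train_data, train_labels, test_instance, k=2))  # predicts 0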

NOTE: this is not the exact KNN code I used; the problem I was working on required a more heavily modified KNN algorithm, which is out of the scope of this article. I may cover it in more detail in a future article.

Voila! Now you know one more way to handle irreplaceable missing values in a dataframe. It certainly isn't something you will need in every project, but every once in a while you will run into a problem where this method, or a variation of it, is useful, or at least worth considering.

I'd love to hear your thoughts, feedback or criticism, so please leave a comment below!
