Imputing Missing Well Log Data Values with Simple Statistics and KNN Imputer

Nahari Rasif
5 min read · Apr 2, 2022


Real-world datasets are frequently incomplete, with certain values missing, and well log data is no exception: gaps arise from tool errors, poor borehole conditions, and other acquisition problems. Because machine learning algorithms cannot handle incomplete datasets directly, log data needs preprocessing before a model can be built. Missing value imputation is the main solution to this kind of problem, and it works even on complex series like well log data. In today’s article, we are going to discuss data imputation techniques and how to use them on well log data.

Well log display showing missing values at certain depths. Image by author.

What is data imputation?

Missing value imputation is the process of estimating discrete or continuous values to replace the missing ones using a statistical or machine learning technique. The sample size stays the same, although simple imputation tends to reduce the variance of the dataset. It can be done with simple statistics such as the mean, median, and mode, or with a machine learning algorithm like K-Nearest Neighbors (KNN).

Why data imputation?

Without first addressing the issue of incomplete datasets, many data mining and machine learning approaches cannot be successfully used to develop models for forecasting trends and answering problems. The easiest approach to an incomplete dataset is listwise deletion, which removes every observation with one or more missing values. This method is only appropriate when the dataset has a minimal amount of missing data, say a missing rate below 10–15% for the entire dataset. Many datasets, however, have higher missing rates, or contain observations that are too important or too rare to discard. In those cases, missing value imputation is the right solution, and it applies to well log data as well.

Then, how can we apply imputation to our well log data?

In this article, we are going to use simple statistics (mean, median, and mode) and a machine learning algorithm (KNN) to impute missing values in our well log data. Before doing that, we first import the libraries needed for the imputation process and load the well log data that contains the missing values.
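A minimal sketch of this setup step. The file name `well1.csv` is hypothetical, and since the original data is not available here, a tiny synthetic log stands in so the snippet runs on its own; in practice you would load your own export.

```python
import numpy as np
import pandas as pd

# In practice, load your own well log export, e.g.:
# well1 = pd.read_csv("well1.csv")   # hypothetical file name

# Synthetic stand-in: SP is complete, GR and RHOB have gaps
rng = np.random.default_rng(0)
well1 = pd.DataFrame({
    "DEPTH": np.arange(1000.0, 1010.0, 1.0),
    "SP":    rng.normal(50, 5, 10),    # complete log
    "GR":    rng.normal(75, 10, 10),
    "RHOB":  rng.normal(2.4, 0.1, 10),
})
well1.loc[[2, 5, 7], "GR"] = np.nan    # simulate missing GR readings
well1.loc[[3], "RHOB"] = np.nan
```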

Once we have imported the libraries and loaded the well log data, we identify which parameters contain missing values. We can do that with a pandas function.

well1.isna().sum()

In my data, the only complete log is the SP log, while the rest have missing values. We also see how many values are missing in each parameter. Now, we can proceed to the imputation techniques.

1. Simple Statistics (Mean, Median, Mode)

To impute missing values in log data using simple statistics, we first pick the parameter we want to impute. For example, I will impute the GR log, since it has the most missing values of all the parameters. We can do this with a single pandas call.

Mean
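A sketch of mean imputation with pandas, using a short stand-in series; in practice `well1['GR']` comes from your loaded log.

```python
import numpy as np
import pandas as pd

# Stand-in GR log with two gaps
well1 = pd.DataFrame({"GR": [70.0, np.nan, 80.0, 75.0, np.nan]})

# Replace missing GR readings with the column mean (75.0 here)
well1["GR"] = well1["GR"].fillna(well1["GR"].mean())
```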

Median
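The median version is identical except for the statistic, which makes it more robust to outliers in the log. Again, the series is a stand-in for the real GR log.

```python
import numpy as np
import pandas as pd

# Stand-in GR log with two gaps
well1 = pd.DataFrame({"GR": [70.0, np.nan, 80.0, 76.0, np.nan]})

# Replace missing GR readings with the column median (76.0 here)
well1["GR"] = well1["GR"].fillna(well1["GR"].median())
```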

Mode

The mode method differs slightly from the mean and median methods. First, we need to find the most frequent value in the GR log, which we can do with a pandas function.

well1['GR'].value_counts()

As shown, the most frequent value in the GR log is 76.8067, which occurs at 23 data points. We can use this value to fill the missing data in the GR log.

well1['GR'].fillna(76.8067, inplace=True)

After we have imputed the missing values using the mean, median, or mode method, we can verify the completeness of our data again with the same pandas check.

2. KNN (K-Nearest Neighbors) Imputer

KNN Imputer is a multivariate imputation tool (it draws on two or more variables at once) that uses the K-Nearest Neighbors (KNN) approach to fill in missing values. Each missing value is replaced by the mean value, weighted or unweighted, of the n_neighbors nearest neighbors found in the training set.

Two samples are considered close if the features that neither is missing are close. If a sample is missing more than one feature, its neighbors may differ for each imputed feature. If fewer than n_neighbors neighbors have a defined distance to a sample, the training set average for that feature is used during imputation. By default, the imputer uses a Euclidean distance metric to find the neighbors.

To use the KNN imputer, we need to split the well log data into training and test datasets.

In my well log data case, the SP log becomes y_train and y_test because it is the only complete log; it is therefore excluded from the X_train and X_test datasets. We will use only X_train and X_test for this imputation technique. The test size is 0.2 and the random state is 42, so 80% of the data goes to training and the shuffling is reproducible across repeated runs.
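The split described above can be sketched as follows. The synthetic frame here stands in for the loaded log, with SP as the only complete column.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loaded log; SP is the only complete column
rng = np.random.default_rng(42)
well1 = pd.DataFrame({
    "SP":   rng.normal(50, 5, 100),
    "GR":   rng.normal(75, 10, 100),
    "RHOB": rng.normal(2.4, 0.1, 100),
})
well1.loc[rng.choice(100, 20, replace=False), "GR"] = np.nan

X = well1.drop(columns="SP")   # columns that contain missing values
y = well1["SP"]                # the complete SP log, kept aside
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```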

After we split the dataset, we can move on to the KNN imputation of the training and test datasets. In this example, we keep the parameters at their defaults, with n_neighbors set to 5, so each missing value is replaced by the mean of its 5 nearest neighbors measured by Euclidean distance.
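A sketch of this step with scikit-learn's `KNNImputer`, again on a small synthetic frame standing in for the real X_train; the imputer is fitted on the training set and would then be reused to transform the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Small stand-in training frame with gaps in GR
rng = np.random.default_rng(42)
X_train = pd.DataFrame({
    "GR":   rng.normal(75, 10, 80),
    "RHOB": rng.normal(2.4, 0.1, 80),
})
X_train.loc[rng.choice(80, 10, replace=False), "GR"] = np.nan

# Default settings: 5 neighbors, Euclidean distance
imputer = KNNImputer(n_neighbors=5)
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index,
)
# For the test set, reuse the fitted imputer: imputer.transform(X_test)
```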

Looking at the results, we get two dataframes: the training set with 80% of the entire dataset, and the test set with the remaining 20%.

So, how do we combine these two dataframes into one? A couple of pandas functions put them back together into a single, complete log.
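One way to do this, assuming the split preserved the original row index (as `train_test_split` does), is to concatenate the imputed pieces and sort by that index to restore depth order. The small frames below are hypothetical imputed outputs.

```python
import pandas as pd

# Hypothetical imputed pieces; their indices are the original row positions
X_train_imp = pd.DataFrame({"GR": [75.0, 80.0]}, index=[0, 3])
X_test_imp  = pd.DataFrame({"GR": [72.0, 78.0]}, index=[1, 2])

# Stack the two frames and restore the original depth order via the index
well1_full = pd.concat([X_train_imp, X_test_imp]).sort_index()
```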

Finally, we can check whether the well log data still has any empty values, using the same one-line check we used for the simple statistical imputation.

well1.isnull().sum()

As shown above, no parameter in well 1 has empty log values anymore. This means the KNN imputation has been carried out and every gap has been filled with the mean of its K nearest neighbors.

Conclusion

There are various ways to deal with missing data. Deleting every observation that has missing values can waste valuable data and reduce the variability of your dataset, especially for well log data with its high level of complexity. Imputation is therefore a great solution for dealing with missing values, whether using simple statistics or a machine learning algorithm like KNN.

References

Lin, Wei-Chao, et al. (2022). ‘Deep learning for missing value imputation of continuous data and the effect of data discretization’, Knowledge-Based Systems, 239. doi: https://doi.org/10.1016/j.knosys.2021.108079

