Linear Data Imputation

Matias Morant
Knowledge Graph Digest
7 min read · Apr 12, 2021

We present a method for filling in missing values in data (data imputation).

This method is a generalization of linear regression.

Try it now

You are just a pip install away:

pip install linear-imputation

Or check the source and tutorial here
Or try it now on KgBase
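
A minimal usage sketch is below; the impute entry point is an assumption about the package’s interface, so check the linked source and tutorial for the actual API.

import numpy as np
import pandas as pd
from linear_imputation import impute  # hypothetical entry point; see the tutorial for the real name

# A tiny table with one missing value
data = pd.DataFrame({'x': [0, 1, 30], 'y': [0, 1, np.nan]})

# Fill the NaN using the mean and correlation of the observed data
completed = impute(data)
print(completed)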

Comparison to other methods

1- Listwise deletion: ‘if some data is missing, throw it all away’. (why work if we can be lazy?)

2- Hot-deck: ‘fill the missing values with a random value picked from the same column’ (Hey, maybe we are lucky)

3- Cold-deck: ‘fill the missing values with the values of a similar point, from some other good, complete dataset you have’ (You think I have a lot of things, right?)

4- Mean substitution: ‘complete the missing values with the mean of each column’ (because correlation isn’t a thing)

Our method uses all the data present. It fills in missing values using both the mean and the correlation of your data.

Comparison to Linear Regression

Data imputation is a more general problem than the one linear regression was designed to solve. Our method is a generalization of linear regression.

For linear regression you usually have complete training input and output, which you use to fit a model. Then you apply this model to some complete test input, to get an estimation:

If there are missing values in the training or test data, you can’t apply linear regression directly; you first have to deal with the missing data somehow:

The method we present is robust and expects as input a single matrix with missing values:

This is a more general problem: the original linear regression problem can be represented as filling in a matrix whose values are all missing in the bottom-right block:
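
To make this concrete, here is a plain-NumPy sketch (made-up numbers, not part of the package) of that single matrix: the training rows carry both inputs and outputs, while the test rows have their entire output column missing.

import numpy as np

# Made-up training data: inputs X_train with known outputs y_train
X_train = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])
y_train = np.array([[3.0], [2.5], [4.5]])

# Test inputs whose outputs we want to estimate
X_test = np.array([[4.0, 1.0], [5.0, 2.0]])
y_test = np.full((2, 1), np.nan)  # the bottom-right block: all values missing

# One single matrix with missing values, as the imputation method expects
M = np.vstack([np.hstack([X_train, y_train]), np.hstack([X_test, y_test])])
print(M)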

Think of distributions

To make this subject more intuitive, you should think of estimating a distribution, instead of fitting a line.

See the data below: A and B are real country data, C is made up.

For A, a line is a good model; for B, less so; for C, not at all. If you run least-squares regression on C, you will still get some line, but which line you get will depend a lot on how you represent your data (for example, which property is on the X axis and which is on the Y axis).

Instead, it’s better to think of your data as samples from an underlying distribution, and to try to estimate this distribution (lighter areas represent more likely values):

Note that you can recover a line from the distribution: just take its main diagonal.

Estimating the distribution

Key insights:

  • we can easily estimate the mean μ and covariance σ² of the distribution, despite missing data
  • μ and σ² (likely) won’t have missing values
  • If σ² has missing values, it is reasonable to fill them with 0
  • If μ has missing values, there are input columns with no data at all; you should discard those columns.

Example

Compute the mean μ and covariance σ² for this data:

Solution

The mean μ is

The covariance matrix σ² is

Note that we are left with μ and σ² without missing values, despite the missing input data.
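
As a sketch of how this can be done in practice (plain pandas with made-up numbers, not the package itself): nan-aware column means and pairwise-complete covariances give you μ and σ² even when individual cells are missing.

import numpy as np
import pandas as pd

# Made-up table with missing cells (NaN)
data = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [2.0, np.nan, 6.0, 8.0],
})

mu = data.mean()       # column means, computed ignoring NaNs
sigma2 = data.cov()    # covariances, computed from pairwise-complete rows

# If a covariance entry could not be computed at all (no overlapping rows), fill it with 0
sigma2 = sigma2.fillna(0)

print(mu)
print(sigma2)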

Once we have μ and σ², we assume these to be the parameters of the distribution that generates our data. We assume the PDF of this distribution to be

where x is one of our data points/rows and g is an arbitrary function (for example, you can pick g such that f is a Gaussian; which g you pick is irrelevant).
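
One common way to write a density of this shape, as a sketch under the assumptions above, is f(x) = g((x − μ)ᵀ (σ²)⁻¹ (x − μ)); picking g(t) proportional to exp(−t/2), together with the right normalizing constant, recovers the multivariate Gaussian.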

The final result

We might have a point/row x with some of its components missing. Let’s say it has m values missing and p values present.

We would like to fill these m missing values with the most likely values:

find the missing values that maximize f(x)

Solution

To maximize f(x) we need to minimize

We can reorder the components of x such that all missing values are grouped together, and this won’t change the value of f(x) (we have to reorder the components of μ and σ² accordingly).

We reorder and split x, μ and σ² this way:

where:

  • xₘ values are all missing and xₚ values are all present
  • μₘ is the mean corresponding to missing components and μₚ is the mean corresponding to present components
  • A has dimensions m×m, B has dimensions m×p and C has dimensions p×p

We want to maximize f(x) with respect to xₘ.

We rewrite our objective function as:

Taking the derivative of the above expression with respect to xₘ and setting equal to 0 yields:

Solving this equation yields the most likely missing values:
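
For readers who want to see the result in code, here is a minimal NumPy sketch (not the package’s own implementation). It assumes A, B and C are the blocks of the covariance matrix σ² itself; under that assumption the most likely missing values are given by the standard conditional-mean expression xₘ = μₘ + B C⁻¹ (xₚ − μₚ).

import numpy as np

def impute_row(x, mu, cov):
    # Fill the NaN entries of one row x, given the mean vector mu and covariance matrix cov.
    # Assumes A, B, C (as named above) are blocks of the covariance, not of its inverse.
    x = np.asarray(x, dtype=float).copy()
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    missing = np.isnan(x)
    present = ~missing
    if not missing.any():
        return x
    B = cov[np.ix_(missing, present)]   # cross-covariance, m x p
    C = cov[np.ix_(present, present)]   # covariance of the present components, p x p
    # Most likely missing values: mu_m + B C^{-1} (x_p - mu_p)
    x[missing] = mu[missing] + B @ np.linalg.solve(C, x[present] - mu[present])
    return x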

Example

Given this small dataset with a missing y value in the 3rd row:

Data imputation performed by KgBase

Why are we getting 0.53 as an estimation?

Seeing the first 2 rows, you would expect to get 30 as an estimation. Implicitly, you are doing this:

1- observe 1st row → (0, 0)

2- observe 2nd row → (1, 1)

3- build a model with observed rows → y = x

4- apply model to 3rd row → if x=30 then y=30

Note that in this process you fail to include the 3rd row in the model. How do you know that y = x is still a good model after observing the 3rd row? What does the 3rd row tell you about (your ignorance of) the relationship between x and y?

Instead, you should do this:

1- observe 1st row → (0, 0)

2- observe 2nd row → (1, 1)

3- observe 3rd row → (30, Missing)

4- build a model with observed rows → some_model

5- apply model to 3rd row → some_estimation

The key is that all rows influence your estimated distribution (in particular the 3rd row, which produces the difference between what you were expecting and the actual estimation).

So why are you getting roughly 0.5 as an estimation? Here are several different wordings of the answer:

Explanation 1

The more data is missing, the greater your ignorance, and the more conservative the estimation becomes (estimations tend towards the mean).

In this example, the mean is:

x = (0 + 1 + 30)/3 ≈ 10

y = (0 + 1)/2 = 0.5
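
You can check this estimation mechanically with the conditional-mean formula from the derivation above (a pandas sketch, not the package’s own code; its exact normalization may differ slightly):

import numpy as np
import pandas as pd

data = pd.DataFrame({'x': [0.0, 1.0, 30.0], 'y': [0.0, 1.0, np.nan]})
mu = data.mean()     # x ≈ 10.3, y = 0.5
cov = data.cov()     # pairwise-complete covariance

# Estimated missing y of the 3rd row: mu_y + cov(y, x) / var(x) * (30 - mu_x)
y_est = mu['y'] + cov.loc['y', 'x'] / cov.loc['x', 'x'] * (30 - mu['x'])
print(y_est)         # ≈ 0.53: barely above the mean 0.5, as explained above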

Explanation 2

After seeing the first two rows, it seems like the values will be small single digits and highly correlated; your estimated distribution looks like this:

Using this previous distribution, you would estimate that if x=30 then y=30.

However, after seeing the 3rd row, you realize that there’s much more variance in the distribution (x = 30). The y value of the 3rd row would be the most informative one for estimating the correlation (the other 2 rows don’t matter so much because they are so close together), but it’s missing, so you remain ignorant about the correlation. Your new estimated distribution becomes:

The 3rd row appears as a line because you only know its x coordinate. The most likely value along this line is y = 0.5, the mean.

Explanation 3

This GIF animation shows how the estimated distribution changes as you move the x value of the 3rd row from 1 to 15. Note how you lose confidence in the correlation as the 3rd row becomes more of an outlier.

Conclusion

Linear Data Imputation is a robust generalization of linear regression that you can start using right now, with our Python module or in KgBase.
