Understanding Kalman Filters with Python

James Teow
May 4, 2018


Today, I finished a chapter from Udacity’s Artificial Intelligence for Robotics. One of the topics covered was the Kalman Filter, an algorithm used to produce estimates that tend to be more accurate than those based on a single measurement alone.

While Thrun’s course has been helpful, I found myself still unable to articulate how Kalman Filters work or why they are useful, so I decided to watch a series of lectures by Professor Biezen. I highly recommend going through them, as most of this writing is based on his lectures.

Why Kalman filters?

Imagine we are making a self-driving car and we are trying to localize its position in an environment. The sensors of the car can detect cars, pedestrians, and cyclists. Knowing the location of these objects can help the car make judgements, preventing collisions. But on top of knowing the location of the objects, the car needs to predict their future locations so that it can plan what to do ahead of time. For example, if it were to detect a child running towards the road, it should expect the child not to stop. The Kalman filter can help with this problem, as it is used to assist in tracking and estimation of the state of a system.

The car has sensors that determine the positions of objects, as well as a model that predicts their future positions. In the real world, predictive models and sensors aren’t perfect. There is always some uncertainty. For example, the weather can affect incoming sensor data, so the car can’t completely trust the information. With Kalman filters, we can mitigate this uncertainty by combining what the sensors measure with what the model predicts, arriving at a distribution we can be more confident in.

Examining Errors

Usage of the Kalman filter acknowledges that some error and noise are implicit in both the prediction and the measurement of the variables we’re interested in tracking.

The filter does not assume all errors are Gaussian, but as cited from the Wikipedia description, the “filter yields the exact conditional probability estimate in the special case that all errors are Gaussian.” Recall that a Gaussian is characterized by two parameters: the mean (mu) and the width of the Gaussian, namely the variance (sigma squared).

The Gaussian density can be written as f(x) = c * exp( -(x - mu)^2 / (2 * sigma^2) ), where c = 1 / sqrt(2 * pi * sigma^2) is the constant that appears before the exponential, and the exponent contains the quadratic difference between the query point x and the mean mu.

Instead of representing the distribution as a histogram, the task in Kalman filters is to maintain a mu and sigma squared as the best estimate of the location of the object we’re trying to find. In the formula above, I summarized the normalizing constant as c because what’s most important is noticing the relationship between the mean and the variance.

Notice that if x equals the mean of the distribution, the difference between the mean and the input is 0. We get e raised to the power of 0, which equals 1, so the density is at its peak.

Sample Gaussian distribution

We prefer a narrow Gaussian as it would have less variance, indicating more confidence in the data.

Sample Gaussian distribution with a narrow variance
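As a quick illustration of those two points, here is a small snippet that evaluates the Gaussian density at its mean for a wide and a narrow variance:

import numpy as np

def gaussian(x, mu, sigma2):
    # 1D Gaussian density: the normalizing constant times the exponential term.
    c = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    return c * np.exp(-((x - mu) ** 2) / (2.0 * sigma2))

# At x == mu the exponent is 0, exp(0) == 1, and the density equals the constant c.
print(gaussian(10.0, 10.0, 4.0))   # wide Gaussian (variance 4)   -> about 0.199
print(gaussian(10.0, 10.0, 1.0))   # narrow Gaussian (variance 1) -> about 0.399, a taller peak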

The Kalman filter represents all distributions by Gaussians and iterates over two different steps: measurement updates and motion updates. A measurement update multiplies the prior by the measurement’s Gaussian (a Bayesian product of beliefs), while a motion update performs a convolution.

Measurement Update

Measurement updates use Bayes Rule. Imagine we’ve localized another vehicle and have a prior distribution with a very high variance (large uncertainty). Suppose we then get a measurement that tells us something about that vehicle with a smaller variance. If we create a new Gaussian by combining the two, its mean will be somewhere in between the two distributions, with a higher peak and a narrower variance than the prior.

It might be surprising that the resulting Gaussian is peakier than either component Gaussian, but it makes intuitive sense: by combining both, we’ve gained more information than either Gaussian carries in isolation.

In the following equations, nu and r squared are, respectively, the mean and variance of the newly observed data. We can use them to update the prior mean and variance.

The prior mean is updated with a weighted sum of the old means, where each mean is weighted by the variance of the other Gaussian; the result is normalized by the sum of the two variances.
The update of the variance uses the two previous variances, and always produces a smaller variance than either.

It’s interesting to note that if we had two Gaussian distributions with the same variance but different means, the resulting Gaussian would be centered exactly in the middle, with a variance that’s half of the original. We can see this by calculating the resulting variance, which would be sigma squared over 2.

Measurement Update step of mean and variance for a one dimension Kalman filter.
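In code, that one-dimensional measurement update looks roughly like the following sketch (using the mu/sigma-squared prior and the nu/r-squared measurement from above):

def measurement_update(mu, sigma2, nu, r2):
    # Combine the prior (mu, sigma2) with a new measurement (nu, r2).
    new_mu = (r2 * mu + sigma2 * nu) / (sigma2 + r2)   # each mean weighted by the other variance
    new_sigma2 = 1.0 / (1.0 / sigma2 + 1.0 / r2)       # always smaller than either variance
    return new_mu, new_sigma2

# Two Gaussians with equal variance: the new mean sits halfway between them
# and the variance is cut in half.
print(measurement_update(10.0, 4.0, 12.0, 4.0))        # -> (11.0, 2.0)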

Prediction / Motion Update

For the prior, the car is believed to start in some position. But when it moves, its position changes. The motion itself has its own Gaussian distribution and uncertainty. We arrive at a prediction that adds the motion command to the mean, with increased uncertainty over the initial uncertainty. We accumulate more uncertainty as we change position.

The mean is updated by taking the previous mean and adding the motion command, indicated by the variable u.
The variance is updated by taking the old variance and adding onto it the variance of the motion Gaussian.

Notice that the variance update will always result in an increased variance.

Prediction Update of a 1D Kalman Filter
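The one-dimensional motion (prediction) update is even simpler; a sketch, with u as the motion command and r2 as the variance of the motion Gaussian:

def motion_update(mu, sigma2, u, r2):
    # Shift the belief by the motion command u and grow the uncertainty by the motion variance r2.
    return mu + u, sigma2 + r2

# Moving by 5 with a noisy motion model: the mean shifts, the variance only grows.
print(motion_update(11.0, 2.0, 5.0, 3.0))   # -> (16.0, 5.0)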

Designing a Kalman Filter

Flowchart of a Kalman Filter Matrix process, inspired by Prof. Biezen’s flowchart

The Kalman filter is a uni-modal, recursive estimator. Only the estimated state from the previous time step and the current measurement are required to make a prediction for the current state.

The flowchart above shows the recursive process. From the beginning, the filter keeps track of a State variable that contains the current values of the quantities being tracked (e.g. position, velocity), while the Process variable contains the predictive error of those quantities.

Each time step, the state transition matrix projects the state and the process matrix forward based on the current position and velocity, estimating a new position and velocity as well as a new covariance.

From there, the Kalman Gain is calculated once the observed data comes in. The update step uses the Kalman Gain in conjunction with the previous estimate and the newly observed data to move the state variable toward a belief that’s somewhere between the prediction and the measurement. The process covariance is also updated based on the Kalman Gain. These updated values are then used for the next round of predictions.
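For reference, these steps correspond to the standard textbook Kalman filter equations, written here with the matrix names used later in this post (A is the state transition matrix, B the control matrix, H the observation matrix, Q and R the process and measurement noise covariances, z the new observation):

Predict:  x_pred = A x + B u
          P_pred = A P A^T + Q
Gain:     K = P_pred H^T (H P_pred H^T + R)^(-1)
Update:   x = x_pred + K (z - H x_pred)
          P = (I - K H) P_pred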

Kalman Gain

The Kalman gain is used to determine how much of the new measurements to use to update the new estimate.

To calculate the gain, we need two things. One, we need the error in the estimate (or the original error). Every time we calculate the error in the estimate, we use that information to update the Kalman Gain. Two, we need the error in the data/measurement, because as data continually comes into the estimate, we need to determine how it affects the gain.

What the gain does is weigh the relative importance of the error in the estimate versus the error in the measurement. If the error in the estimate is smaller, we put more emphasis on the estimate; if the error in the data is smaller, we put more emphasis on the data. How much each one feeds the overall calculation depends on how much we can trust the prediction and the data, which we judge by their errors.

The Kalman Gain is calculated by comparing the error in the estimate to the combined error of the estimate and the measurement

The value of the Kalman Gain will be somewhere between 0 and 1. It is then used to update the value of the current estimate. In the equation below, x represents the estimate and K is the Kalman Gain, which is multiplied (acting like a weight) by the difference between the measurement (p) and the previous estimate.

The Kalman gain is used to adjust the update to the state.

If the Kalman Gain is close to 1, it means the measurements are accurate but the estimates are unstable. We can infer this by looking at the Kalman Gain ratio: if the error of the measurement is small (contributing practically nothing to the denominator E_est + E_meas), then K will be close to 1. When the error of the measurement is small, future predictions will be strongly updated based on new input data.

However, if the Kalman Gain is small, then the error in the measurement is large relative to the error in the estimate: the estimates are more stable and the measurements are inaccurate. As a result, any difference between new data and the prediction will have a smaller effect on the eventual update.

With each iteration, the model makes estimates closer to the true value, resulting in a smaller Kalman Gain. New data could have a lot of uncertainty, and we don’t want it to throw off the predictive model.

Updating estimates weighted by Kalman gain

To recalculate the error in the estimate, we simply need to multiply the error of the measurement with the error of the previous estimate, and divide it by the sum of the errors.

The equation is sometimes written as follows:

Updating estimation errors

Here E is the error of the estimate, and the previous error is multiplied by (1 − K), the complement of the Kalman Gain. If the Kalman Gain is large, the error in the measurement is small, so new data can quickly pull the model toward the true value, which subsequently reduces the error in the estimate. However, if the Kalman Gain is very small, the error in the measurement must be very large, so updates happen slowly.

The Kalman Gain ultimately drives the speed in which the estimated value is zeroed in on the true value.
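A small numerical sketch of that behaviour (the starting estimate, the errors, and the measurements below are made-up values):

est, e_est = 68.0, 2.0                    # initial estimate and its error (made up)
e_meas = 4.0                              # error in each measurement (made up)
measurements = [75.0, 71.0, 70.0, 74.0]   # made-up incoming data

for z in measurements:
    K = e_est / (e_est + e_meas)          # Kalman Gain: always between 0 and 1
    est = est + K * (z - est)             # pull the estimate toward the measurement
    e_est = (1 - K) * e_est               # the error in the estimate shrinks every iteration
    print(round(K, 3), round(est, 2), round(e_est, 3))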

State Matrix

The state matrix records the state of the object being tracked. It could be another car on the road or a plane in the air.

State Matrix Prediction

The condition of the state is updated based upon the previous condition of the state plus the control variable matrix u, with w representing noise that may have accumulated through the process.

The state matrix can represent multi-dimensions multiple ways, but the ordering will affect how the A matrix is constructed.

The state matrix can contain information such as the position (x and y) and the velocity (x dot and y dot) of an object.

Updating the position involves determining the displacement of the object given the acceleration and velocity.
First part of updating matrix X.

Matrix A times X projects the current position and velocity forward to the next time step (delta t). A time step is taken, and the velocity multiplied by that time step is added onto the previous position to update the position of the object; the velocity itself remains the same. The velocity may in fact have changed over the time step due to acceleration (the control variable matrix), so if there was acceleration, this calculation alone isn’t complete. That part of the update is applied by matrix B times u.

Second part of updating matrix X.

The B matrix mimics the kinematics equations, in which the acceleration is multiplied by the time step (and by one half the time step squared for position). When matrix B is multiplied by the control variable (in this case, acceleration) and added to AX, it contributes the change in position and velocity due to acceleration.
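A sketch of that prediction step with numpy; the time step, acceleration, and state values here are placeholders rather than any particular problem’s numbers:

import numpy as np

dt = 1.0                            # time step delta t (placeholder value)
A = np.array([[1.0, dt],
              [0.0, 1.0]])          # position += velocity * dt; velocity unchanged
B = np.array([[0.5 * dt ** 2],
              [dt]])                # kinematics terms that apply the acceleration
X = np.array([[4000.0],             # position (placeholder)
              [280.0]])             # velocity (placeholder)
u = 2.0                             # acceleration, the control variable (placeholder)

X_pred = A.dot(X) + B * u           # A*X moves the state forward, B*u adds the effect of acceleration
print(X_pred)                       # -> [[4281.], [282.]]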

Observation Data

Part of the Kalman filter process is combining observation data with the state matrix containing the most recent prediction.

Observation function.

The observation is equal to matrix C times the state variables plus measurement noise. The shape and entries of matrix C depend on the number of variables we want to observe. If we want to observe all the variables in the state, then C would largely just be an identity matrix. If, however, some of the state variables aren’t being observed, C can be shaped so as to discard them.

For example, if the state contains four entries (x, y, x velocity, y velocity) but we only want to observe two (x and y), matrix C can be shaped to provide us only with the information that we want to retain (in this case, a 2 x 4 matrix with ones along the main diagonal).

Z represents measurement noise.
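For example, a sketch of an observation matrix that keeps only the two position entries out of a four-entry state (the numbers are placeholders):

import numpy as np

X = np.array([[10.0],   # x position
              [20.0],   # y position
              [3.0],    # x velocity
              [4.0]])   # y velocity

C = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # 2 x 4: keep only the two position entries

Z = np.zeros((2, 1))                   # measurement noise term (zero here for clarity)
Y = C.dot(X) + Z
print(Y)                               # -> [[10.], [20.]]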

State Covariance and Measurement Covariance Matrix

The state covariance matrix represents the error in the estimate. The state matrices are estimations of the subsequent states, and since estimations carry noise, errors, and uncertainties, a process noise covariance matrix Q is added, which adjusts the state covariance matrix. It is used to help the Kalman Gain place emphasis on either the predicted value or the measured value.

Corresponding State Covariance Matrix

If a variance is set to 0, it means we have essentially complete certainty in the corresponding quantity. So, for example, if the state covariance of the x position is 0, it means high confidence in the prediction of the x position.

The measurement covariance matrix (R) is the error of the measurement. Whenever a measurement is taken for the object that is being tracked, it doesn’t mean that the measurement is exact, as there could be some error in the way the object is tracked.

Where P is the state covariance matrix, Q is the process noise covariance matrix
Kalman Gain as a matrix

Recall from earlier, if the measurement errors are small relative to the prediction errors, we want to put more trust in them (hence, the Kalman Gain will be closer to 1). Otherwise, if the measurement errors are larger than the prediction errors, the Kalman gain will put less emphasis on the difference between the prediction and the measurement.

If P is 0, then measurement updates are ignored: it would imply the predicted values are exact, so the observations would be disregarded. It is highly unlikely that the model would have such exactitude, so one of the benefits of Q is that it prevents P from collapsing to zero.

The A and H matrices are largely present to help format the matrices.

Covariance is the average of (x_i − x_bar)(y_i − y_bar) over the data points, where x_bar is the mean of x and y_bar is the mean of y

Covariance is a measurement of the joint variability of two random variables. If the variables tend to show similar behavior (e.g. the higher values of one variable correspond with the higher values of the other random variable), the covariance is positive. However, if the higher values of one variable correspond with the lesser values of the other variable, the covariance is negative. The sign is important because it indicates the tendency in the linear relationship between the variables.

The following is a simple 2x2 covariance matrix, where the variances of x and y are along the main diagonal, and the covariances occupy the off-diagonal entries. The variance of x could represent the variance in the velocity, while y could represent the variance in the position.

Example 2 x 2 Covariance Matrix
Example 3 x 3 Covariance Matrix

The standard deviation (the square root of the variance), taken both plus and minus around the mean, gives us the range of likely values within the distribution. It tells us how far from the mean we can expect a measurement to fall, with nearly all values landing within a few standard deviations of the mean.

To practice computing a covariance matrix, I produced the following code, which solves the sample problem provided in this lecture, where Professor Michel van Biezen gives an example of analyzing the variance and covariance of students’ scores. The rows of the input data represent sets of scores from students, with each column grouped by subject.

The dot product between the ones matrix and the A matrix results in a 5 x 3 matrix where each column contains the total of all marks accumulated for each subject; dividing by the number of scores per subject essentially provides the mean grade for each subject. The mean is then subtracted from the A matrix, producing the deviation matrix.

The dot product of the deviation matrix’s transpose with the deviation matrix then produces the covariance matrix.
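A minimal sketch of that computation, using a stand-in score matrix rather than the lecture’s actual data (the resulting output below comes from the real problem):

import numpy as np

# Rows are students, columns are subjects (Math, Physics, English). Stand-in scores.
A = np.array([[90.0, 80.0, 40.0],
              [60.0, 50.0, 70.0],
              [30.0, 20.0, 90.0],
              [80.0, 70.0, 50.0],
              [40.0, 30.0, 80.0]])

n = A.shape[0]
ones = np.ones((n, n))
means = ones.dot(A) / n                   # every row holds the per-subject mean grade
deviation = A - means                     # deviation of each score from its subject mean
cov = deviation.T.dot(deviation) / n      # population covariance matrix (dividing by n)
print(cov)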

# Resulting Output
# Math, Physics, English
[[ 3600.  2400. -1200.]
 [ 2400.  2000. -1400.]
 [-1200. -1400.  1400.]]

Additionally, he provided another example working through how to create a covariance matrix for a state estimate.
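A sketch of that construction, assuming estimation errors of 0.5 for position and 0.2 for velocity (the square roots of the diagonal entries in the result below):

import numpy as np

errors = np.array([[0.5],     # assumed error in the position estimate
                   [0.2]])    # assumed error in the velocity estimate
P = errors.dot(errors.T)      # outer product: variances on the diagonal, covariances off it
print(P)                      # matches the resulting matrix shown below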

# Resulting State Covariance Matrix:
[[0.25 0.1 ]
[0.1 0.04]]

If the off-diagonal terms were zero, it would indicate that the estimation error contributing to one variable is independent of the other variable. In such a case, no adjustments are made to the estimates of one variable due to the estimation error of the other. When this matrix is multiplied with others, such as when updating the state and the Kalman Gain, the covariances won’t influence the process.

Putting it all together by tracking an aircraft with a Kalman Filter

For the final problem, Professor Biezen provided the scenario of trying to determine the position and velocity of an aircraft. For the purposes of simplicity, the problem involves only the x position and x velocity. He also zeroed out the off-diagonal values in covariance matrices, which I have also done in my code.

Kalman Filter State Matrix:
[[5127.05898493]
[ 288.55147059]]

First, I initialized the State matrix with values he provided. I also initialized P as the Estimation Covariance matrix, with error terms that correspond to the variance of the x position and x velocity, specific to the estimates.

Then, for each observation that was provided, I iterate through a series of processes to update the state matrix with values provided by the Kalman filter.

First, I make a prediction of where the plane will be in the next time step. This prediction is based simply on the previous position and velocity, with an acceleration parameter that adjusts the velocity. The A matrix in the prediction2d function updates the previous state based on the time that has elapsed. The B matrix translates the acceleration into an adjustment to the position and velocity; the formula one half times acceleration times time squared is used to find the distance covered due to acceleration.

Then, I updated the Process / Estimation Covariance matrix to the next time step, predicting it forward. The operation adds a time step to the matrix, subsequently updating the variance of the distance error. The A matrix is similar to the one used in predicting the State matrix values. Since Professor Biezen eliminated the off-diagonal values, I did the same.

Calculating the Kalman Gain involved calculating the covariance matrix for the observation errors and comparing it with the process covariance matrix. The problem probably involves more matrix transposing than is strictly required, but I believe it’s meant to make the formula invariant to different matrix sizes. There is no division in matrix operations, so to find the ratio I used the dot product with the inverse of what would otherwise be the denominator.

Notice that in matrix form, the Kalman Gain is a matrix of the same dimension as the inputs, and along the diagonal are weights that adjust the observed position and velocity. The lower the weights, the less the model trusts the observations compared to the predictions.

With the newly calculated Kalman gain, I used it to weigh the difference between the observation data with the prediction, which was then used to update the state matrix. I also used the Kalman gain to update the process covariance matrix.

I did that for as many iterations as there were observations, and got a result that is somewhere between the observation data and the predictive model, which is exactly what one should expect from using the Kalman filter. My exact results were slightly different from Professor Biezen’s, but note that he did commit a number of errors in his calculations and did not do a full run through of all the observations, so I’m confident my calculations were more accurate.
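Putting those steps together, a simplified sketch of the loop described above might look like the following; the initial state, acceleration, time step, and error terms are placeholder values rather than the lecture’s exact numbers, and (as in the lecture) the off-diagonal covariance terms are zeroed out after each prediction:

import numpy as np

dt = 1.0                                  # time step (placeholder)
a = 2.0                                   # acceleration control input (placeholder)

A = np.array([[1.0, dt],
              [0.0, 1.0]])                # state transition: position += velocity * dt
B = np.array([[0.5 * dt ** 2],
              [dt]])                      # effect of acceleration on position and velocity
H = np.identity(2)                        # we observe position and velocity directly

X = np.array([[4000.0],                   # initial x position (placeholder)
              [280.0]])                   # initial x velocity (placeholder)
P = np.diag([20.0 ** 2, 5.0 ** 2])        # estimation covariance from placeholder errors
Q = np.zeros((2, 2))                      # process noise ignored here for simplicity
R = np.diag([25.0 ** 2, 6.0 ** 2])        # measurement covariance from placeholder errors

observations = [[4260.0, 282.0],          # [position, velocity] per time step (placeholders)
                [4550.0, 285.0],
                [4860.0, 286.0],
                [5110.0, 290.0]]

for obs in observations:
    # Prediction: project the state and the estimation covariance forward one time step.
    X = A.dot(X) + B * a
    P = A.dot(P).dot(A.T) + Q
    P = np.diag(np.diag(P))               # zero out the off-diagonal terms

    # Kalman Gain: weigh the estimation errors against the measurement errors.
    S = H.dot(P).dot(H.T) + R
    K = P.dot(H.T).dot(np.linalg.inv(S))

    # Update: move the state toward the observation by the gain-weighted difference,
    # and shrink the estimation covariance accordingly.
    Y = np.array(obs).reshape(2, 1)
    X = X + K.dot(Y - H.dot(X))
    P = (np.identity(2) - K.dot(H)).dot(P)

print(X)    # final estimate: somewhere between the observations and the predictions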
