LOESS

Smoothing data using local regression

João Paulo Figueira
Towards Data Science
11 min read · May 24, 2019


If you are sampling data generated from a physical phenomenon, you will get noise. Noise can be added to the signal by the sensor measuring it, or it can be inherent to the stochasticity of the process that generates the data. I recently had to handle one such noisy data stream generated by a vehicle engine and needed to figure out a way to filter out the noise. Due to the physical nature of the signal generation process, the sampling frequency was not constant, thereby precluding any frequency-based noise filtering technique. I needed to find a way to filter out the noise and recreate the signal for further processing.

In order to recover the signal from the measured noise, we must start by making a few assumptions about how the noise is generated. In statistical terms, this means that we must assume some distribution for the noise, a mathematical description of how it is generated. The most common assumptions involve random noise that is generated according to a Gaussian distribution, an additive model where the noise is added to the signal, and an error term that is independent of x, like so:

Additive noise: y = f(x) + ε, with ε ~ N(0, σ²)

Incidentally, this is the very same noise model that is generally assumed by the linear regression model. Here the noise model looks like this¹:

The linear regression noise model: Y = β₀ + β₁X + ε

In a sense, by fitting a model to your data, you are trying to recover the underlying signal from the measured noise that is realized by the data variance.

So, should I use linear regression to smooth the signal? If the signal itself can be modeled by a linear function, possibly with non-linear terms, then the answer would be yes. But in this particular scenario, I would be handling a highly non-linear signal that reflected the daily operation of a distribution vehicle: substantial periods of flatness interspersed with variable-length periods of pure madness. Handling the whole signal with a single linear regression model was out of the question. But…

What if, instead of tackling the whole signal with the same model, I used different models of the same type to smooth out small, localized areas of the signal? And why not take this idea a bit further and consider a specialized model for each point we need to smooth?

Think Local

Looking at my bag of tricks, I found an old friend: LOESS — locally weighted running line smoother². This is a non-parametric smoother, although it uses linear regression at its core. As with any smoother, the idea of this algorithm is to recover the inherent signal from a noisy sample.

So how does LOESS work? Let’s start with a noisy signal like the one below.

Noisy signal

This is a synthetically generated sine wave with added Gaussian noise. The sine wave is drawn in red while the noisy samples are displayed as blue dots. To simulate an irregularly sampled signal, the x values were randomly sampled from a uniform distribution and scaled appropriately. The corresponding y values were calculated using a sine function with added Gaussian noise.
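For reference, a synthetic dataset along these lines can be generated with a few lines of NumPy. This is only a sketch: the range, sample size, and noise level below are my own assumptions, not necessarily the values used for the chart.

import numpy as np

rng = np.random.default_rng(42)

n = 100
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, n))   # irregularly sampled x values
y = np.sin(x) + rng.normal(scale=0.3, size=n)   # sine wave plus Gaussian noise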

So how do we get from the blue dots to an approximation of the red line? First of all, think of the red line as an ordered sequence of equally spaced x values spanning the sampled range. For each of these values, select an appropriate neighborhood of sampled points and use them as the training set for a linear regression problem. With the resulting model, estimate the new value for your point. Let us explore this idea in a bit more detail.

The Idea Behind LOESS

This algorithm estimates the latent function in a point-wise fashion. For each value of x, we estimate the value of f(x) by using its neighboring sampled (known) values. This is quite similar to a KNN algorithm, where k, the window size, is a tunable parameter and, in this particular case, will determine the smoothness of the resulting estimate. In a sense, k is your bias vs. variance knob. Large values of k will result in higher bias and lower values will induce higher variance.

The first step is to collect the value of x for which we want to estimate y. Let’s call these x’ and y’. By feeding the LOESS algorithm with x’, and using the sampled x and y values, we will obtain an estimate of y’. In this sense, LOESS is a non-parametric algorithm that must use the entire dataset for estimation.

Now that we have x’, we must find its k nearest neighbors using a simple Euclidean distance. Let’s call the resulting ordered set D.

The next step converts the set D of k distances into an ordered set W containing weights that will later be used in the linear regression process. These weights are calculated using a specialized weight function that assigns importance to each of the k neighbors according to its distance to x’.

The Weight Function

Distance weights are calculated using the tri-cubic function:

Tri-cubic weighting function: w(x) = (1 − |x|³)³ for |x| < 1, and 0 otherwise

This function looks like a hat and has positive values only between -1 and 1. Outside of this interval, the function is zero. Here is what the function looks like:

Graph of the tri-cubic weight function

As this function only has positive results for -1 < x < 1, we must normalize the distance by dividing it by the maximum value observed in D. More concretely,

Weighting function: w(x, x’) = (1 − (d(x, x’) / max(D))³)³

Here, we denote d(x, x’) as the distance between x, one of the k nearest neighbors, and x’. The effect of normalization is that larger distances will be associated with lower weights. At the very extreme, the point corresponding to the maximum distance will have a weight of zero, and the point at zero distance will have the highest possible weight — one. That is how the “locality” effect is achieved, by assigning higher importance to the training data that is closest to where we want the prediction to be calculated.

As a side note, you may find that this function has a striking similarity to the tri-cubic kernel function. The difference in scale (70/81) between these functions relates to the requirement that a kernel function must integrate to one over its domain, while here that requirement is relaxed.
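To make the locality effect concrete, here is a small worked example with assumed distances: for D = {0.5, 1.0, 2.0}, dividing by max(D) = 2 gives normalized distances of 0.25, 0.5 and 1.0, which yield weights of (1 − 0.25³)³ ≈ 0.954, (1 − 0.5³)³ ≈ 0.670 and (1 − 1³)³ = 0. The closest point dominates the local fit, while the farthest one is ignored entirely.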

We are now ready to calculate the estimate using a simple weighted linear regression that is trained with the x values from D, and the corresponding y values.

Linear Regression

Linear regression is the bread-and-butter of supervised machine learning methods. Odds are, you started your ML journey learning the innards of this method, probably trying to figure out the sale price for households in Portland, given their physical features. Or maybe it was something else entirely, but you know the drill, don’t you?

It so happens that a specialized version of linear regression, weighted linear regression, is at the heart of LOESS. For every point that we set out to estimate (x’), the LOESS algorithm must set up a linear regression model that will calculate the corresponding output (y’), using the k nearest neighbors of x’ and a set of weights that rates their importance.

The local linear regression usually models low-degree polynomials: a line or a quadratic.

The first-degree regression equation: y = β₀ + β₁x
The second-degree regression equation: y = β₀ + β₁x + β₂x²

Weighted linear regression is a known problem and is abundantly documented online. Due to the typical low dimensionality of the problems that will be tackled, we will resort to the closed-form normal equations for parameter estimation. In the unweighted case, these equations are:

Normal equations for linear regression: β = (XᵀX)⁻¹ Xᵀy

where β is the vector of linear parameters and X is the matrix containing all the x observations, arranged like so:

The X matrix, with one row per observation: [1, xᵢ₁, xᵢ₂, …, xᵢₙ] for i = 1, …, m

Concretely, this matrix models a sample with n dimensions and m observations. Note that I am including the intercept term in the matrix through the first column. For the case when we are modeling a second-degree polynomial, this matrix is actually:

The second-degree X matrix, with rows [1, xᵢ, xᵢ²] for i = 1, …, m
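As an illustration, here is one way such a design matrix could be built with NumPy for the one-dimensional case. The helper name is mine, not taken from the article’s code.

import numpy as np

def design_matrix(x, degree=1):
    # Leading column of ones for the intercept, then x, x**2, ..., x**degree.
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)

# design_matrix(np.array([1.0, 2.0, 3.0]), degree=2) ->
# [[1., 1., 1.],
#  [1., 2., 4.],
#  [1., 3., 9.]]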

Once we have the beta vector, new values of y can be calculated using the following equation: ŷ = Xβ

Extending this concept to using weights is actually quite simple and the normal equation just needs an extra term:

Weighted normal equation: β = (XᵀWX)⁻¹ XᵀWy

Here, the weight matrix W has all the calculated weights on the diagonal, with all other elements set to zero. Unfortunately, as you will see in the implemented Python code, the matrix approach can be a bit slow. If you stick to the first-degree model, an alternative approach can be taken using simpler math:

β₁ = (Σᵢ wᵢ · Σᵢ wᵢxᵢyᵢ − Σᵢ wᵢxᵢ · Σᵢ wᵢyᵢ) / (Σᵢ wᵢ · Σᵢ wᵢxᵢ² − (Σᵢ wᵢxᵢ)²)
β₀ = (Σᵢ wᵢyᵢ − β₁ Σᵢ wᵢxᵢ) / Σᵢ wᵢ, with all sums running over i = 1, …, d

It looks complex, but the implementation is far simpler through the use of inner products of vectors to eliminate the explicit sums. Notation note: d stands for the number of items in D, which is actually k.
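As a sketch of both approaches, here is how the weighted normal equation and the first-degree shortcut might look in NumPy. The function names and signatures are my own assumptions; the actual repository code may differ.

import numpy as np

def fit_weighted_matrix(x, y, w, degree=1):
    # "Matrix mode": solve (X^T W X) beta = X^T W y for any polynomial degree.
    X = np.vander(x, degree + 1, increasing=True)
    XtW = X.T * w                                # same as X.T @ np.diag(w), but cheaper
    return np.linalg.solve(XtW @ X, XtW @ y)     # beta[0] is the intercept

def fit_weighted_line(x, y, w):
    # "Statistical mode": closed-form weighted least squares for a straight line,
    # written with dot products instead of explicit sums.
    sw, swx, swy = w.sum(), w @ x, w @ y
    swxy, swxx = w @ (x * y), w @ (x * x)
    b1 = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    b0 = (swy - b1 * swx) / sw
    return b0, b1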

On to the code.

Python Libraries

You can find an implementation of this smoother in the StatsModels Python package. By reading through the method documentation, you see that the lowess function returns an array with the same length as the two input arrays (x and y). This means that only the observed values are smoothed, so if you need any other values in between, you will have to interpolate them somehow. But it does not have to be this way.
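For illustration, calling the StatsModels smoother looks roughly like this (check the documentation of your installed release for the exact signature):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)

# frac is the fraction of the data used for each local fit.
smoothed = sm.nonparametric.lowess(y, x, frac=0.3)
# Result is an (n, 2) array: column 0 holds the sorted x values,
# column 1 the smoothed y values at those same locations only.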

Python Implementation

For this article, I developed a new implementation based on NumPy that leverages its vectorization features, and the code can be found in this GitHub repository. The code was developed with vectorization in mind and there is only one loop in the function that determines the indexes of the closest values. Let us step through the code and see how it works.

The tri-cubic weighting function is fully vectorized and it processes arrays of x values. First, the output array y is created with the same dimensions as the input array x. Next, an indexing array is created to enforce the function’s domain and finally, the function itself is calculated. Note that the indexing array is used on both the input and output arrays. My first approach was to vectorize the code using Numba, but then I realized that this approach had the same performance, and did away with the unnecessary compilation.
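A minimal sketch of such a vectorized tri-cubic function might look like this (the name and exact shape are assumptions on my part, not the repository’s code):

import numpy as np

def tricubic(x):
    # Vectorized tri-cubic weight: (1 - |x|^3)^3 inside (-1, 1), zero outside.
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)                  # output with the same dimensions as the input
    idx = np.abs(x) < 1.0                 # indexing array that enforces the domain
    y[idx] = (1.0 - np.abs(x[idx]) ** 3) ** 3
    return y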

Upon initialization, both input arrays must be normalized to avoid problems of loss of significance (aka, catastrophic cancellation). This is done quite simply with a rescaling to the interval between zero and one.
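Something as simple as the following min-max rescaling would do, keeping the scaling constants so that new query points can be mapped the same way (again a sketch, not the repository’s exact code):

import numpy as np

def normalize(values):
    # Rescale an array to the [0, 1] interval and return the scaling constants.
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    return (values - v_min) / (v_max - v_min), v_min, v_max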

Weights are calculated from the array of distances with the help of an indexing array that contains the indexes of the minimal-distance window.

This indexing array is calculated in the next function:

In order to calculate the range with the minimum total distance to x’, we start by determining the index of the minimum distance within the distances array. Knowing that the indexes must be consecutive, we can use this initial index as the root of a growing list of indexes. The tests at the top of the function just handle the edge cases when the minimum index is at the extremes of the distances array. The following loop grows the list of indices, starting from the index of the minimal distance, adding items left and right as needed and keeping the list naturally sorted, inserting to the left and appending to the right. Note that the number of loops is limited to k-1.
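Under the same naming assumptions, that windowing logic might be sketched as follows:

import numpy as np

def min_distance_window(distances, k):
    # Indices of the k consecutive points with the smallest distances to x',
    # grown outwards from the single closest point.
    n = len(distances)
    i_min = int(np.argmin(distances))
    if i_min == 0:                        # edge case: closest point at the left end
        return list(range(k))
    if i_min == n - 1:                    # edge case: closest point at the right end
        return list(range(n - k, n))
    window = [i_min]
    while len(window) < k:                # runs at most k - 1 times
        left, right = window[0] - 1, window[-1] + 1
        if left < 0 or (right < n and distances[right] < distances[left]):
            window.append(right)          # append to the right
        else:
            window.insert(0, left)        # insert to the left, keeping the list sorted
    return window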

Now, we get to the heart of the code. The function that estimates f(x) can be used in two modes: matrix or statistical. In matrix mode, you can specify a polynomial degree but will have lower performance. The statistical code is faster but only models lines.

The function starts by normalizing the input x value and calculating its distance to all the training values. The array of distances has the same dimension as the training data. Next, the minimum-distance range is found and the corresponding weights calculated. Note that the array of weights has k (the window size) items. Finally, the regression is trained and the estimated value for f(x) is calculated using either of the methods described above. Please note that if you want to use a polynomial regression, the code will use “matrix mode”.
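Putting the pieces together, a simplified version of such an estimation function, reusing the hypothetical helpers sketched above and skipping the input normalization step for brevity, could look like this:

import numpy as np

def loess_estimate(x_query, x_train, y_train, k=10, degree=1, use_matrix=False):
    # Estimate f(x_query) with a weighted regression over the k nearest training points.
    distances = np.abs(x_train - x_query)
    idx = min_distance_window(distances, k)
    xd, yd, dd = x_train[idx], y_train[idx], distances[idx]
    weights = tricubic(dd / dd.max())              # normalized distances -> weights
    if use_matrix or degree > 1:
        beta = fit_weighted_matrix(xd, yd, weights, degree=degree)
        return np.polyval(beta[::-1], x_query)     # beta is ordered from the intercept up
    b0, b1 = fit_weighted_line(xd, yd, weights)
    return b0 + b1 * x_query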

Finally, here’s a sample of how to use the code (data values are taken from NIST):
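The NIST values used in the article are not reproduced here; the sketch below uses made-up data only to show a calling pattern, based on the hypothetical loess_estimate function above rather than the repository’s actual API.

import numpy as np

# Made-up stand-in data (the original sample uses values taken from NIST).
x_train = np.linspace(0.0, 10.0, 21)
y_train = np.sin(x_train) + np.random.default_rng(1).normal(scale=0.2, size=21)

# Estimate the signal on a denser, regular grid, not restricted to the training x values.
x_new = np.linspace(0.0, 10.0, 200)
y_new = np.array([loess_estimate(x, x_train, y_train, k=7) for x in x_new])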

Please note that you can provide values of x other than the ones in the training data.

Here’s an example of a smoothing function on the same data as the first chart’s:

Interpolated function in black

You can play with this chart using the companion notebook in the GitHub repo.

Conclusion

We have reviewed the rationale for using the LOESS local regression model and revealed how it works. We have also developed and presented a Python implementation that heavily uses the NumPy library and its vectorization feature. Please download the code from the GitHub repository and let me know your thoughts in the comments.

Thank you for your time!

There is an update to this article that rewrites the code in Rust.

Notes

  1. I found this definition in [1]. The author does not mention the “LOWESS” term.

References

[1] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.

[2] Alpaydın, E. (2014). Introduction to Machine Learning (3rd ed.). Cambridge, MA: The MIT Press.

[3] Starmer, J. (2017). StatQuest: Fitting a curve to data, aka lowess, aka loess. YouTube.

Review History

2024–08–15: I referenced the article that ports the Python code to Rust.
