Regressing With K-Nearest Neighbors

Earl Radina
Published in Human Systems Data
Feb 28, 2017

Having never heard of "K-Nearest Neighbors" in my statistical career before, I was interested in what this method could show that simple least squares regression could not. The K in K-Nearest Neighbors (KNN) is the number of neighboring observations that are averaged to form each prediction. If K=1, only the single closest "neighbor" is taken into account, so the fit hugs the training data and has very little bias. However, the resulting graph looks like a step function, essentially a staircase stepping from data point to data point, and because it chases every wiggle in the training data it is highly variable and can, in theory, overfit your data. On the other hand, setting K=10 averages the 10 closest data points and creates a smoother graph that does not track the training data quite as closely, but is far less sensitive to noise in the data. G. James et al. (2013) refer to this as the "bias-variance tradeoff."
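To make that tradeoff concrete, here is a minimal sketch (my own illustration, not from the original post) using scikit-learn's KNeighborsRegressor on made-up noisy data; the sine curve, noise level, and the choices K=1 and K=10 are all assumptions picked for demonstration.

```python
# Bias-variance tradeoff in KNN regression on hypothetical noisy data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: a noisy nonlinear relationship.
X_train = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.3, 80)

X_grid = np.linspace(0, 10, 500).reshape(-1, 1)

# K=1: the fit passes through every training point (low bias, high variance),
# producing the step-like, overfit "staircase" described above.
knn_1 = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
pred_1 = knn_1.predict(X_grid)

# K=10: each prediction averages the 10 nearest points, giving a smoother,
# more stable curve at the cost of some added bias.
knn_10 = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
pred_10 = knn_10.predict(X_grid)

print("Training R^2, K=1: ", knn_1.score(X_train, y_train))   # typically ~1.0
print("Training R^2, K=10:", knn_10.score(X_train, y_train))  # lower, smoother fit
```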

So when is KNN better than other methods of regression? According to G. James et al. (2013), as the relationship between the variables grows less linear, KNN can outperform parametric methods such as least squares. However, KNN is vulnerable to increases in the number of dimensions; its greatest asset becomes its greatest flaw. When observations are spread thinly across many dimensions, a given data point's "nearest" neighbors may actually lie quite far away, so the average they produce is skewed and the resulting fitted curve can end up nowhere near the actual data.
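As a rough sketch of that vulnerability (again my own illustration, not taken from James et al.), the snippet below embeds the same one-variable linear signal in a growing number of pure-noise dimensions and compares the test error of KNN with ordinary least squares; the sample sizes, noise level, and K=10 are arbitrary assumptions.

```python
# Curse of dimensionality: KNN vs. least squares as noise dimensions are added.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def test_mse(n_noise_dims, n_train=200, n_test=200):
    d = 1 + n_noise_dims
    X_train = rng.uniform(-1, 1, (n_train, d))
    X_test = rng.uniform(-1, 1, (n_test, d))
    # Only the first feature carries signal; the rest are pure noise dimensions.
    y_train = 2 * X_train[:, 0] + rng.normal(0, 0.1, n_train)
    y_test = 2 * X_test[:, 0] + rng.normal(0, 0.1, n_test)

    knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
    ols = LinearRegression().fit(X_train, y_train)
    mse = lambda model: np.mean((model.predict(X_test) - y_test) ** 2)
    return mse(knn), mse(ols)

for p in (0, 2, 5, 20):
    knn_mse, ols_mse = test_mse(p)
    print(f"{p:2d} noise dims -> KNN MSE {knn_mse:.3f}, OLS MSE {ols_mse:.3f}")
```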

This can also become a problem when data are "clustered." Longitudinal data, for example, are often cited as being clustered (Galbraith, Daniel, & Vissel, 2010). Because death is distributed disproportionately across the lifespan, data on death from infancy to old age would show two clusters: one at the beginning of life and one at the end. If one tried to use KNN to fit a curve to these data, the nearest neighbors of points in the sparse middle of the lifespan would be drawn from the infant and elderly clusters, which could produce a fitted curve that overestimates mortality across the middle of the lifespan.
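Here is a hypothetical illustration of that concern: ages at death are simulated from two clusters, tallied per year of age, and fit with KNN regression using a deliberately large K, so predictions for mid-life ages borrow from both clusters; all of the numbers below are invented for the example.

```python
# Clustered data: KNN can overestimate mortality in the sparse middle of the lifespan.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Simulated ages at death: most deaths cluster near infancy or old age.
infant = rng.normal(1, 1, 300)
elderly = rng.normal(80, 8, 700)
ages = np.concatenate([infant, elderly]).clip(0, 100)

# Deaths tallied per year of age, 0-100.
counts = np.bincount(ages.astype(int), minlength=101).astype(float)
age_grid = np.arange(101).reshape(-1, 1)

# With K=41, the neighbors of age 50 span ages 30-70, reaching into the
# elderly cluster, so the prediction at mid-life is pulled above the
# near-zero death counts actually observed there.
knn = KNeighborsRegressor(n_neighbors=41).fit(age_grid, counts)
print("Predicted deaths at age 50:", knn.predict(np.array([[50.0]]))[0])
print("Observed deaths at age 50: ", counts[50])
```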

Overall, having just learned about KNN from this reading, I can see its benefits and drawbacks, but it seems very case-specific. It works best with data that contain only a few dimensions, have a non-linear relationship, and are not too spread out. Given these drawbacks, KNN seems to be a specialized tool that only suits certain analyses; otherwise it might be better to stick with simple least squares regression.

Citations

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York, NY: Springer.

Galbraith, S., Daniel, J. A., & Vissel, B. (2010). A study of clustered data and approaches to its analysis. Journal of Neuroscience, 30(32), 10601-10608. DOI: 10.1523/JNEUROSCI.0362-10.2010
