What Are Nonparametric AI Models?

Including K-Nearest Neighbors (KNN) implementation in scikit-learn (sklearn)

Oliver Lövström

Published in Internet of Technology

5 min read · Jun 12, 2024

Nonparametric AI Models

Nonparametric AI models, such as lookup tables and K-Nearest Neighbors (KNN), are supervised machine learning algorithms known for their simplicity, flexibility, and adaptability. This article compares parametric and nonparametric AI models, explains and implements the KNN algorithm, and describes when nonparametric models are a good fit and when they are not.

Parametric vs. Nonparametric AI Models

Parametric models, such as linear regression, use training data to estimate a fixed set of parameters; once those are estimated, the training data is no longer needed to make predictions. In contrast, nonparametric models retain all the data points as part of the model, allowing them to adapt to the data’s structure. These types of models are often referred to as instance-based learning methods.
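To make the contrast concrete, here is a minimal sketch in plain Python. The data points are hypothetical and the helper `fit_line` is an illustrative name, not part of any library. It fits a one-feature least-squares line (a parametric model) and compares what has to be stored against a model that simply keeps every observation.

```python
# Hypothetical (x, y) observations, used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Parametric: after fitting, two numbers are the entire model.
slope, intercept = fit_line(xs, ys)
parametric_model = {"slope": slope, "intercept": intercept}

# Nonparametric: the "model" is the stored training data itself.
nonparametric_model = list(zip(xs, ys))

print(len(parametric_model))     # 2, regardless of how many observations we have
print(len(nonparametric_model))  # 4, grows with every new observation
```

The parametric model keeps the same size no matter how much data it was fitted on, while the nonparametric one grows with the dataset.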

Lookup Tables

The simplest form of a nonparametric model is a lookup table. Imagine we have a table of observed house sizes and prices in an area. To estimate the price of a house, we look up its size in the table. For example, a 1500 sq ft home costs $300,000.

The problem with lookup tables is that they do not generalize well to new data. If we instead want to estimate the price of a 1600 sq ft home, the table lacks this information.
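As a rough sketch, a lookup table can be modeled as a Python dictionary. Only the two rows quoted in this article are included; the name `lookup_price` is an assumption made for illustration.

```python
# Lookup table keyed by house size in sq ft.
# Both rows are observations quoted in this article; a real table would hold more.
price_table = {
    1500: 300_000,
    1800: 350_000,
}


def lookup_price(size_sqft):
    """Return the stored price, or None if this exact size was never observed."""
    return price_table.get(size_sqft)


print(lookup_price(1500))  # 300000
print(lookup_price(1600))  # None
```

The second call shows the generalization problem: a size that was never observed has no answer at all.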

K-Nearest Neighbors

To solve this generalization problem, we can use the nonparametric model K-Nearest Neighbors. KNN finds the K data points closest to the target and uses the mean or median of their values as the prediction. This approach assumes that similar data points have similar target values.

As the name suggests, we must measure the distances between the target point and the points in the underlying dataset to find the K nearest ones. One common distance measure is the Minkowski distance, which is identical to the Euclidean distance when its order parameter p is 2 (the default in scikit-learn).
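The Minkowski distance of order p between two points is the p-th root of the sum of the absolute coordinate differences raised to the power p. A minimal sketch in plain Python follows; the function name `minkowski_distance` is chosen here for illustration.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# p=2 gives the Euclidean distance, p=1 the Manhattan distance.
print(minkowski_distance((0, 0), (3, 4), p=2))  # 5.0
print(minkowski_distance((0, 0), (3, 4), p=1))  # 7.0

# With a single feature (house size), the distance is the absolute difference.
print(minkowski_distance((1600,), (1500,)))  # 100.0
```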

Example: Estimating House Prices Using KNN

To solve the initial problem, where we want to find the price for a 1600 sq ft home, we take the mean of the two closest neighbors.

Solving House Prices Using K-Nearest Neighbours. Image by Oliver Lövström.

The closest points are (1500 sq ft, $300,000) and (1800 sq ft, $350,000), which are visualized in the graph above. The estimated price is the mean of the two neighbors' prices: ($300,000 + $350,000) / 2 = $325,000.

Solution Using 2-Nearest Neighbors. Image by Oliver Lövström.

Here we have already encountered a limitation of nonparametric modeling: estimating the price of a 1700 sq ft home yields the same result, since its two nearest data points are the same. This can be mitigated by having more data.
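The hand calculation can be sketched in a few lines of plain Python. The 1500 and 1800 sq ft rows come from the article; the other two rows are hypothetical, added only so that picking the nearest neighbors is not trivial. The name `knn_predict` is likewise an assumption.

```python
observations = [
    (1000, 200_000),  # hypothetical row, for illustration only
    (1500, 300_000),  # from the article
    (1800, 350_000),  # from the article
    (2500, 500_000),  # hypothetical row, for illustration only
]


def knn_predict(query_size, data, k=2):
    """Average the prices of the k observations closest in size to the query."""
    nearest = sorted(data, key=lambda row: abs(row[0] - query_size))[:k]
    return sum(price for _, price in nearest) / k


print(knn_predict(1600, observations))  # 325000.0
print(knn_predict(1700, observations))  # 325000.0
```

Both queries share the same two nearest neighbors, so they receive the same estimate.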

Coding

However, more data makes the calculations harder to do by hand, especially when the problem has more than two dimensions. To handle this, we turn to code. Specifically, we will use Pandas and scikit-learn to implement the K-Nearest Neighbors regression algorithm on a larger dataset for the same problem. The data is visualized in the image below.

House Price Observations. Image by Oliver Lövström.

We start by loading the data using Pandas:

import pandas as pd

data = pd.read_csv("data/houses.csv")
print(data)
print(data.shape)

     Size          Price
0    3674  701480.235685
1    1360  278590.638446
2    1794  358791.708572
3    1630  303157.220416
4    1595  264450.702912
..    ...            ...
995  3677  622883.786597

(1000, 2)

Next, we split the data into features and labels, where house sizes are the features and prices are the labels (targets):

sizes = data["Size"].values.reshape(-1, 1)
prices = data["Price"]

print(sizes.shape)
print(prices.shape)

(1000, 1)
(1000,)

Next, we fit the KNN model, here using the three nearest neighbors (n_neighbors=3), and predict the prices for 1600 sq ft and 1700 sq ft homes:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=3)
model.fit(sizes, prices)

test_set = np.array([[1600], [1700]])
predictions = model.predict(test_set)

for size, price in zip(test_set, predictions):
    print(f"Predicted price for size {size[0]} (sq ft): ${price:.2f}")

Predicted price for size 1600 (sq ft): $292702.17
Predicted price for size 1700 (sq ft): $336948.93
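To see which observations drive a prediction, a fitted `KNeighborsRegressor` exposes a `kneighbors` method that returns the distances to, and the indices of, the nearest training points. The sketch below uses a small toy dataset rather than `data/houses.csv`: the 1500 and 1800 sq ft rows are from the article, the other two are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: two observations from the article plus two hypothetical rows.
toy_sizes = np.array([[1000], [1500], [1800], [2500]])
toy_prices = np.array([200_000, 300_000, 350_000, 500_000])

toy_model = KNeighborsRegressor(n_neighbors=2)
toy_model.fit(toy_sizes, toy_prices)

query = np.array([[1600]])

# Distances to and indices of the 2 nearest training points, closest first.
distances, indices = toy_model.kneighbors(query)
neighbor_prices = toy_prices[indices[0]]

# With the default uniform weights, the prediction is the neighbors' mean price.
prediction = toy_model.predict(query)[0]

print(distances[0], indices[0])
print(neighbor_prices.mean(), prediction)
```

On this toy data the prediction matches the hand-calculated mean of the two neighbors.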

Finally, we plot the result using Matplotlib:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

# Training observations
plt.scatter(sizes, prices, color="blue", label="Data Points")

# KNN predictions for the 1600 and 1700 sq ft homes
plt.scatter(test_set, predictions, color="magenta", s=200, edgecolor="black", label="Predicted Points")
plt.title("House Prices Based on Size")
plt.xlabel("Size (sq ft)")
plt.ylabel("Price ($)")
plt.legend()
plt.grid(True)

plt.show()

Solving House Prices Using KNN. Image by Oliver Lövström.

Applicable Use Cases for Nonparametric Models

This section focuses on the advantages and disadvantages of using KNN for regression and classification. However, most of them also apply to other types of nonparametric models.

Advantages

Simplicity: Nonparametric models such as KNN are simple to implement and tune, with few hyperparameters to adjust.
Flexibility: When more data is added, the algorithm adjusts automatically, which can improve the predictions.

Disadvantages

Memory: Nonparametric models store all training data, which can become memory-intensive when using larger datasets.
Scalability: Predictions become much slower with larger or higher-dimensional datasets, since each prediction has to search the stored data for its nearest neighbors.
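The scalability cost can be illustrated with a brute-force neighbor search in plain Python. The sizes are hypothetical and `nearest_neighbors` is an illustrative helper. Every query computes one distance per stored training point, so the work per prediction grows with the dataset.

```python
def nearest_neighbors(train_sizes, query, k):
    """Brute-force search: one distance computation per stored point."""
    distances = [(abs(size - query), index) for index, size in enumerate(train_sizes)]
    comparisons = len(distances)
    nearest = [index for _, index in sorted(distances)[:k]]
    return nearest, comparisons


for n in (1_000, 10_000, 100_000):
    train = list(range(n))  # hypothetical house sizes, for illustration only
    _, comparisons = nearest_neighbors(train, query=500, k=3)
    print(f"{n} stored points -> {comparisons} distance computations per query")
```

Libraries can reduce this cost with tree-based indexes; scikit-learn's `KNeighborsRegressor`, for instance, accepts `algorithm="kd_tree"` or `algorithm="ball_tree"`, although the stored data still has to be kept in memory.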

Practical Examples

Recommendations and Similarity: Nonparametric models are well suited to recommendation systems, since they are easy to implement and flexible. For example, they can recommend products based on previous purchases.
Natural Language Processing: Nonparametric models are less effective for tasks that require very large datasets, such as the natural language processing behind ChatGPT or Gemini.


