What Are Nonparametric AI Models?
Including K-Nearest Neighbors (KNN) implementation in scikit-learn (sklearn)
Nonparametric AI Models
Nonparametric AI models, such as lookup tables and K-Nearest Neighbors (KNN), are supervised machine learning algorithms known for their simplicity, flexibility, and adaptability. This article compares parametric and nonparametric AI models, explains and implements the KNN algorithm, and describes when nonparametric models are and are not a good fit.
Parametric vs. Nonparametric AI Models
Parametric models, such as linear regression, use training data to estimate a set of parameters. In contrast, nonparametric models retain all the data points as part of the model, allowing them to adapt to the data’s structure. These types of models are often referred to as instance-based learning methods.
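The difference can be sketched in a few lines of NumPy. This is only an illustration: the dataset is a hypothetical toy example (only the 1500 and 1800 sq ft rows come from this article), and the dictionary names are made up for the comparison.

```python
import numpy as np

# Hypothetical toy data: house sizes (sq ft) and prices ($).
# Only the 1500 and 1800 sq ft rows appear in the article; the rest are assumed.
sizes = np.array([1000.0, 1500.0, 1800.0, 2400.0])
prices = np.array([200_000.0, 300_000.0, 350_000.0, 480_000.0])

# Parametric: linear regression compresses the data into two numbers.
slope, intercept = np.polyfit(sizes, prices, deg=1)
parametric_model = {"slope": slope, "intercept": intercept}

# Nonparametric (instance-based): the "model" is the training data itself.
nonparametric_model = {"sizes": sizes.copy(), "prices": prices.copy()}

# The parametric model stays at two values no matter how much data we add.
print(len(parametric_model))
# The nonparametric model grows with the number of observations.
print(nonparametric_model["sizes"].size)
```

After fitting, the parametric model could discard the training data, while the nonparametric one cannot.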
Lookup Tables
The simplest form of a nonparametric model is a lookup table. Imagine we have the following observations of house prices in an area. To estimate the price of a house, we look at the table. For example, a 1500 sq ft home costs $300,000.
The problem with lookup tables is that they do not generalize well to new data. If we instead want to estimate the price of a 1600 sq ft home, the table lacks this information.
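A lookup table can be sketched as a plain Python dictionary. The table below is partial and contains only the two observations quoted in this article; `price_table` and `lookup_price` are hypothetical names used for illustration.

```python
# Partial lookup table: only the two observations quoted in the text.
price_table = {1500: 300_000, 1800: 350_000}


def lookup_price(size_sqft):
    """Return the stored price, or None if this size was never observed."""
    return price_table.get(size_sqft)


print(lookup_price(1500))  # a size that is in the table
print(lookup_price(1600))  # a size that is not in the table
```

The second call returns `None`, which is the generalization problem in miniature: the table can only answer questions it has already seen.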
K-Nearest Neighbors
To solve this generalization problem, we can use the nonparametric model K-Nearest Neighbors. For regression, KNN finds the K data points closest to the target and uses the mean or median of their values as the prediction. This approach assumes that similar data points will have similar target values.
As the name suggests, we must measure the distances between the target point and the points in the underlying dataset to find the K nearest ones. One common distance measure is the Minkowski distance, which with order p = 2 is the same as the Euclidean distance.
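The Minkowski distance has an order parameter p: p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance. The helper below is a minimal sketch of that formula, not scikit-learn's implementation; the function name and the example points are made up for illustration.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points of equal dimension."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

# p = 2: Euclidean distance (the straight-line distance).
print(minkowski_distance(point_a, point_b, p=2))
# p = 1: Manhattan distance (sum of absolute coordinate differences).
print(minkowski_distance(point_a, point_b, p=1))
```

In scikit-learn, `KNeighborsRegressor` uses the Minkowski metric with `p=2` by default, so it measures Euclidean distance unless told otherwise.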
Example: Estimating House Prices Using KNN
To solve the initial problem, where we want to find the price for a 1600 sq ft home, we take the mean of the two closest neighbors.
The closest points are (1500 sq ft, $300,000) and (1800 sq ft, $350,000), which are visualized in the graph above. The estimated price is their mean: ($300,000 + $350,000) / 2 = $325,000.
We have already encountered a problem using nonparametric modeling here. For example, estimating the price of a 1700 sq ft home would yield the same result since the two nearest data points are the same. This can be solved by having more data.
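Both the estimate and its limitation can be reproduced with a small from-scratch sketch. It uses only the two observations quoted above, and `knn_predict` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

# Only the two observations quoted in the text.
train_sizes = np.array([1500.0, 1800.0])
train_prices = np.array([300_000.0, 350_000.0])


def knn_predict(query_size, k=2):
    """Average the prices of the k training homes closest in size."""
    distances = np.abs(train_sizes - query_size)  # 1-D distance in sq ft
    nearest = np.argsort(distances)[:k]           # indices of the k closest homes
    return float(train_prices[nearest].mean())


print(knn_predict(1600))
print(knn_predict(1700))
```

Both calls return the same value because, with so little data, the two queries share the same two neighbors.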
Coding
However, with more data the calculations become tedious to do by hand, especially when the problem has more than two dimensions. To solve this, we turn to code. Specifically, we will use Pandas and scikit-learn to implement the K-Nearest Neighbors regression algorithm. To illustrate this, we will use a larger dataset for the same problem. The data is visualized in the image below.
We start by loading the data using Pandas:
import pandas as pd
data = pd.read_csv("data/houses.csv")
print(data)
print(data.shape)
Size Price
0 3674 701480.235685
1 1360 278590.638446
2 1794 358791.708572
3 1630 303157.220416
4 1595 264450.702912
.. ... ...
995 3677 622883.786597
(1000, 2)
Next, we split the data into features and labels, where house sizes are the features and prices are the labels (targets):
sizes = data["Size"].values.reshape(-1, 1)
prices = data["Price"]
print(sizes.shape)
print(prices.shape)
(1000, 1)
(1000,)
We continue by training the KNN model and predicting the prices for 1600 sq ft and 1700 sq ft homes:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3)
model.fit(sizes, prices)
test_set = np.array([[1600], [1700]])
predictions = model.predict(test_set)
for size, price in zip(test_set, predictions):
    print(f"Predicted price for size {size[0]} (sq ft): ${price:.2f}")
Predicted price for size 1600 (sq ft): $292702.17
Predicted price for size 1700 (sq ft): $336948.93
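To see where a prediction comes from, we can ask the fitted model which neighbors it used via its `kneighbors` method. The sketch below is self-contained and runs on synthetic data, since `data/houses.csv` is not reproduced here; the `demo_` names, the random seed, and the price formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the housing data (an assumption, not the real file).
rng = np.random.default_rng(0)
demo_sizes = rng.integers(800, 4000, size=200).reshape(-1, 1)
demo_prices = 190.0 * demo_sizes.ravel() + rng.normal(0, 20_000, size=200)

demo_model = KNeighborsRegressor(n_neighbors=3)
demo_model.fit(demo_sizes, demo_prices)

# Ask the model for the 3 training points closest to a 1600 sq ft home.
query = np.array([[1600]])
distances, indices = demo_model.kneighbors(query)
neighbor_prices = demo_prices[indices[0]]

print("Neighbor sizes:", demo_sizes[indices[0]].ravel())
print("Mean of neighbor prices:", neighbor_prices.mean())
print("Model prediction:", demo_model.predict(query)[0])
```

With the default uniform weights, the prediction is simply the mean of the neighbors' prices, so the last two printed values agree.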
Finally, we plot the result using Matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(sizes, prices, color="blue", label="Data Points")
plt.scatter(test_set, predictions, color="magenta", s=200, edgecolor='black', label="Predicted Points")
plt.title('House Prices Based on Size')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
Applicable Use Cases for Nonparametric Models
This section focuses on the advantages and disadvantages of using KNN for regression and classification. However, most of these also apply to other types of nonparametric models.
Advantages
- Simplicity: Nonparametric models such as KNN are simple to implement and tune, with few hyperparameters to adjust.
- Flexibility: When more data is added, the algorithm adjusts automatically, which typically improves the predictions.
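As a rough illustration of how little there is to tune, the sketch below searches over a single hyperparameter, `n_neighbors`, with scikit-learn's `GridSearchCV`. The data is synthetic, and the candidate values, seed, and scoring choice are assumptions, not recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic house data (an assumption made for this sketch).
rng = np.random.default_rng(42)
X = rng.uniform(800, 4000, size=(300, 1))
y = 190.0 * X.ravel() + rng.normal(0, 20_000, size=300)

# K is the main hyperparameter; try a handful of candidate values.
param_grid = {"n_neighbors": [1, 3, 5, 10, 20]}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,                                # 5-fold cross-validation
    scoring="neg_mean_absolute_error",   # higher (closer to 0) is better
)
search.fit(X, y)

print("Best K:", search.best_params_["n_neighbors"])
```

A small K follows the data closely but is sensitive to noise, while a large K gives smoother but less local estimates.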
Disadvantages
- Memory: Nonparametric models store all training data, which can become memory-intensive when using larger datasets.
- Scalability: Predictions become much slower with larger or higher-dimensional datasets, since each prediction requires comparing the target against the stored training data.
Practical Examples
- Recommendations and Similarity: Nonparametric models are great for recommendation systems since they are easy to implement and flexible. For example, recommending products based on previous purchases.
- Natural Language Processing: Nonparametric models are less effective for tasks that require very large datasets, such as natural language processing, the domain of systems like ChatGPT or Gemini.
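The recommendation use case can be sketched with scikit-learn's `NearestNeighbors`. Everything in the example is hypothetical: the product names, the tiny purchase matrix, and the choice of cosine distance are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical purchase history: rows are customers, columns are products,
# and 1 means the customer bought the product.
products = ["laptop", "mouse", "keyboard", "novel", "cookbook"]
purchases = np.array([
    [1, 1, 1, 0, 0],  # customer 0
    [1, 1, 0, 0, 0],  # customer 1
    [0, 0, 0, 1, 1],  # customer 2
    [0, 0, 1, 1, 1],  # customer 3
])

# Find the customers whose purchase patterns are closest (cosine distance).
finder = NearestNeighbors(n_neighbors=2, metric="cosine")
finder.fit(purchases)

target = 1
_, neighbor_idx = finder.kneighbors(purchases[[target]])

# The closest match is the target itself, so skip it.
most_similar = [i for i in neighbor_idx[0] if i != target][0]

# Recommend what the similar customer bought that the target has not.
recommended = [
    products[j]
    for j in range(len(products))
    if purchases[most_similar, j] == 1 and purchases[target, j] == 0
]
print(f"Customer {target} is most similar to customer {most_similar}")
print("Recommended:", recommended)
```

No training step is needed beyond storing the purchase matrix, which is what makes this approach easy to set up and to update as new purchases arrive.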
Further Reading
If you are interested in reading more about parametric models:
If you want to learn more about nonparametric models, such as Gaussian Processes (GPs) and XGBoost, I suggest the following article:
If you want to learn more about programming and, specifically, machine learning, see the following Coursera course:
Note: If you use my links to order, I’ll get a small kickback.