KNN Algorithm

Viswa
Jun 19, 2024

K Nearest Neighbors in Machine Learning for Classification

In the vast landscape of machine learning algorithms, the K Nearest Neighbors (KNN) algorithm stands as a versatile and intuitive tool for classification tasks. Rooted in the concept of similarity, KNN has found applications in various domains, ranging from medical diagnosis to recommendation systems. This article delves into the intricacies of the K Nearest Neighbors algorithm, its underlying principles, advantages, limitations, and best practices for implementation.

Understanding the K Nearest Neighbors Algorithm

At its core, the K Nearest Neighbors algorithm operates based on the assumption that similar data points tend to share similar labels. Given a new data point, the algorithm identifies the ‘K’ closest training data points (neighbors) based on a chosen distance metric, such as Euclidean distance. The class label of the new data point is then determined by the majority class among its K nearest neighbors.

Steps of the KNN Algorithm

1. Choose the Value of K: The value of K is a critical parameter in KNN. A small K might lead to noisy predictions, while a large K could lead to overly generalized results. The optimal K value often depends on the dataset and can be determined using techniques like cross-validation.

2. Calculate Distances: For each new data point, calculate its distance from all the training data points. Common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity.

3. Select K Neighbors: Identify the K training data points with the shortest distances to the new data point.

4. Majority Voting: Determine the class label of the new data point by majority voting among its K nearest neighbors. This label will be assigned to the new data point.
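
To make these steps concrete, below is a minimal from-scratch sketch in Python using NumPy. It assumes numeric feature arrays and Euclidean distance; the function knn_predict and the toy data are purely illustrative, not part of any library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: compute Euclidean distances from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: take the indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # prints 0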

Advantages of KNN

1. Simplicity: KNN is relatively easy to understand and implement, making it an ideal starting point for beginners in machine learning.

2. No Training Phase: Unlike many other algorithms, KNN doesn’t require an explicit training phase. The model is the training data itself.

3. Non-parametric: KNN is non-parametric, meaning it doesn’t assume a specific underlying data distribution. This makes it suitable for a wide range of data types.

4. Flexibility: KNN can handle multi-class classification problems and can also be adapted for regression tasks.
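
For example, scikit-learn provides KNeighborsRegressor, which predicts by averaging the target values of the K nearest neighbors instead of taking a majority vote. A minimal sketch with made-up data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy 1-D regression data: y is roughly 2*x with a little noise
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # average of the two nearest targets (3.9 and 6.2)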

Limitations of KNN

1. Computational Intensity: Calculating distances for all data points can be computationally intensive, especially for large datasets.

2. Choosing K: Selecting an appropriate value for K can be challenging and may require experimentation.

3. Sensitivity to Noise: KNN can be sensitive to noisy data or outliers, which might affect the quality of predictions.

4. Curse of Dimensionality: KNN’s performance can deteriorate as the number of dimensions in the data increases, due to the increased sparsity of the feature space.

Best Practices for KNN Implementation

1. Feature Scaling: Normalize or standardize features before applying KNN to ensure that no feature dominates the distance calculations due to its scale.

2. Distance Metric Selection: Choose an appropriate distance metric based on the nature of the data. Euclidean distance is commonly used, but alternatives like cosine similarity might be more suitable for certain data types.

3. Cross-Validation: Utilize techniques like cross-validation to determine the optimal value of K and assess the model’s generalization performance (a combined sketch follows this list).

4. Handling Imbalanced Data: If dealing with imbalanced classes, consider using techniques like weighted KNN or sampling strategies to improve prediction accuracy.
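
The sketch below ties several of these practices together: features are standardized inside a pipeline, and cross-validated grid search picks the value of K, the voting scheme (uniform vs. distance-weighted, a common way to implement weighted KNN), and the distance metric. The candidate values in the grid are illustrative assumptions, not recommendations.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# scale first, then classify, so the scaler is refit inside every CV fold
pipe = Pipeline([("scaler", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],       # candidate values of K
    "knn__weights": ["uniform", "distance"],    # plain vs. distance-weighted voting
    "knn__metric": ["euclidean", "manhattan"],  # candidate distance metrics
}

# X_train and y_train are assumed to come from a train/test split like the one in Step 3 below
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)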

K Nearest Neighbors: Practical Implementation

This section walks through how to implement the KNN algorithm in Python.

We will analyze the breast cancer dataset from the UCI Machine Learning Repository and attempt to develop a predictive model that classifies whether a tumor is malignant or benign.

Step 1: Importing Python Libraries

The first step is to start your Jupyter notebook and load all the prerequisite libraries. Here are the important libraries that we will need for this KNN implementation.

  • NumPy (to perform certain mathematical operations)
  • pandas (to store the data in pandas DataFrames)
  • matplotlib.pyplot (to plot and visualize the data)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

Let us now import the data into a DataFrame. A DataFrame is a pandas data structure; the simplest way to understand it is that it stores all your data in a tabular (rows and columns) format.

df = pd.read_csv("Data.csv")    # load the dataset into a DataFrame
X = df.iloc[:, :-1].values      # all columns except the last are the features
y = df.iloc[:, -1].values       # the last column is the target label
df.info()                       # column names, dtypes, and non-null counts

Step 3: Splitting the dataset into the Training and Test set

from sklearn.model_selection import train_test_split
# hold out 25% of the data for testing; random_state fixes the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

This line imports the function train_test_split from the sklearn.model_selection module. This module provides various methods for splitting data into subsets for model training and evaluation.

Here, X and y represent your input features and corresponding target values, respectively. The test_size parameter specifies the proportion of the data that should be allocated for testing. In this case, test_size=0.25 means that 25% of the data will be used for testing, while the remaining 75% will be used for training.

The random_state parameter is an optional argument that allows you to set a seed value for the random number generator. By providing a specific random_state value (e.g., random_state=42), you ensure that the data is split in a reproducible manner.

The train_test_split function returns four separate arrays: X_train, X_test, y_train, and y_test. X_train and y_train represent the training data, while X_test and y_test represent the testing data.

Step 4: Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training data, then transform it
X_test = sc.transform(X_test)        # apply the same training-derived scaling to the test data

Feature scaling is essential in machine learning when the features (variables) in a dataset have different ranges or units. This process is known as standardization or normalization and is useful for algorithms that rely on distance calculations or gradient-based optimization, as it ensures that no single feature dominates due to its scale.

In simple terms, this code snippet showcases how to standardize the features of a dataset using the StandardScaler class, a crucial preprocessing step that can lead to improved machine learning model performance by ensuring that features are on a consistent scale.

Step 5: Fitting/Training the data to the KNN model

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()   # defaults to 5 neighbors (n_neighbors=5)
classifier.fit(X_train, y_train)      # stores the scaled training data for neighbor lookups

It’s important to understand that KNeighborsClassifier is a lazy learner. Unlike models such as logistic regression, it does not learn coefficients during training; instead, it memorizes the training data and defers the real work to prediction time, when distances to the stored points are computed.

The fit() method is then used to “train” the KNN model, which in practice means storing the data (and, depending on the algorithm setting, building a search structure such as a KD-tree or ball tree to speed up neighbor lookups). It takes two main components: the X_train dataset, containing the input features against which new points will be compared, and y_train, containing the corresponding class labels used during majority voting.
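
If you want a different value of K, you can pass it explicitly when constructing the classifier. A minimal sketch; the choice of 7 neighbors and the Euclidean metric here are arbitrary illustrations, not values taken from this article:

classifier = KNeighborsClassifier(n_neighbors=7, metric="euclidean")
classifier.fit(X_train, y_train)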

Step 6: Predicting the Test set results

y_pred = classifier.predict(X_test)
print(y_pred)

This line of code uses the predict() method of the trained model to generate predictions for the test data X_test. The predict() method takes the input features (X_test) as an argument and returns the predicted values for the target variable (y_pred).
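
Because each prediction is a vote among neighbors, the classifier can also report the fraction of neighbors voting for each class via predict_proba, which is handy when you want a confidence estimate rather than just a hard label:

y_proba = classifier.predict_proba(X_test)  # one row per test sample, one column per class
print(y_proba[:5])                          # neighbor vote fractions for the first five samples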

Step 7: Evaluating the Model Performance

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)   # rows are actual classes, columns are predicted classes
print(cm)
print(accuracy_score(y_test, y_pred))   # fraction of test samples classified correctly

The code snippet makes use of the scikit-learn library’s metrics module to evaluate the performance of a machine learning model’s predictions. Specifically, it calculates and prints the confusion matrix and the accuracy score.

Firstly, the confusion_matrix function from the metrics module is employed. It takes the actual labels (y_test) first and the predicted labels (y_pred) second, following scikit-learn’s (y_true, y_pred) convention. The function constructs a matrix that summarizes the counts of true positive, true negative, false positive, and false negative predictions. This matrix provides insights into how well the model’s predictions align with the actual outcomes.

The calculated confusion matrix, stored in the variable cm, is then printed to the console, offering a visual representation of the model’s performance in a tabular format.

Subsequently, the accuracy_score function is used to compute the accuracy of the model’s predictions. This function also takes the actual labels (y_test) and predicted labels (y_pred) as inputs. It calculates the ratio of correctly predicted instances to the total number of instances, indicating the overall correctness of the model’s predictions.
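
The same accuracy can also be recovered directly from the confusion matrix, since its diagonal holds the correctly classified counts. A quick sketch, assuming cm was computed as above and np is the NumPy import from Step 1:

accuracy = np.trace(cm) / cm.sum()  # correct predictions (diagonal) divided by all predictions
print(accuracy)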

An accuracy score of 0.9532 indicates that the model’s predictions were correct for approximately 95.32% of the instances in the evaluated dataset. This suggests that the model has performed well in terms of predicting the correct class labels, achieving a high level of accuracy in its predictions.

Conclusion

The K Nearest Neighbors algorithm provides a straightforward yet powerful approach to classification tasks in machine learning. Its reliance on the similarity between data points makes it a valuable tool for various applications. By understanding its principles, advantages, limitations, and following best practices, practitioners can harness the potential of KNN to build effective and interpretable classification models.
