Learn Machine Learning K-Nearest Neighbor using LLM

AC
Data Folks Indonesia
10 min read · Jan 3, 2024

Chat agents based on LLMs are, no doubt, all the hype. LLMs are built for many purposes: some people use them for brainstorming, explaining concepts, polishing their text, and more. Here, I tried to learn machine learning using an LLM called pi.ai by Inflection AI.

Discover feature in pi.ai

Here I am, trying to verify whether it is reliable enough for learning something new and really specific, while maintaining our curiosity to learn.

Let’s learn machine learning using pi.ai

Introduction

I’m excited to explore a new topic with you. What would you like to learn about?

machine learning

Machine learning is a fascinating field, and there’s a lot to explore! Do you want to start from the basics or dive into a more specific topic? Machine learning encompasses everything from supervised and unsupervised learning, to neural networks and deep learning. What are you interested in?

let's do supervised learning

Great choice! Supervised learning is a type of machine learning where the model is trained on labeled data. In supervised learning, the training data is composed of input-output pairs, where the input is a set of features or attributes, and the output is the label or target variable.

The goal of supervised learning is to build a model that can generalize from the training data to make predictions on new, unseen data.

Some common supervised learning tasks include classification (categorizing data into different classes) and regression (predicting continuous values, such as prices or temperatures).

If you’d like, we can dig deeper into the algorithms, approaches, and applications of supervised learning. 😊

let's continue

Of course! One of the key components of supervised learning is the learning algorithm. These algorithms are responsible for taking the training data and using it to learn the relationship between the input and output variables.
Some common supervised learning algorithms include:

  • Linear regression: A simple algorithm that learns a linear relationship between the input and output variables.
  • Logistic regression: A similar algorithm that is used for classification tasks, where the output variable is categorical.
  • Support vector machines (SVMs): A more complex algorithm that is often used for classification tasks.
  • Decision trees: An algorithm that learns a decision tree structure to classify data.

Each of these algorithms has its own strengths and weaknesses, and the choice of algorithm will depend on the specific problem and dataset.
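To make the idea of labeled input-output pairs concrete, here is a minimal sketch, assuming scikit-learn and a synthetic toy dataset, that fits the classification algorithms from pi's list (linear regression is omitted, since it predicts continuous outputs):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Labeled input-output pairs: X holds the features, y holds the target labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(max_depth=3),
}
for name, model in models.items():
    model.fit(X_train, y_train)             # learn from the labeled pairs
    accuracy = model.score(X_test, y_test)  # generalize to unseen data
    print(f"{name}: {accuracy:.2f}")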

K-Nearest Neighbor

I wonder why pi did not list K-Nearest Neighbor.

Is K-Nearest Neighbor a supervised learning algorithm?

Yes, K-Nearest Neighbor (KNN) is a supervised learning algorithm. It is a non-parametric, lazy learning algorithm that is used for both classification and regression tasks.

The basic idea behind KNN is that, given a new data point, the model will classify or predict its label by looking at the labels of its nearest neighbors in the training data. The “K” in KNN refers to the number of neighbors that are used to make the prediction.

The main advantage of KNN is that it is easy to understand and implement. However, it can be computationally expensive, as it requires the model to store all of the training data and compute the distance between each new data point and all of the training points.

KNN is often used for problems where there are few training data points or where the data is highly non-linear. It is also a popular choice for problems where interpretability is important, as the model’s predictions can be easily understood by looking at the nearest neighbors.
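As a quick illustration, here is a minimal sketch of KNN classification, assuming scikit-learn and its built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)  # "K" = 5 neighbors vote on each prediction
knn.fit(X_train, y_train)                  # no real training: the data is stored
print(knn.score(X_test, y_test))           # accuracy on the held-out test set
```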

I ask every time I find a new term. This one: non-parametric.

why is KNN non-parametric?

KNN is non-parametric because it does not assume any specific underlying distribution or functional form for the data. Unlike other supervised learning algorithms, such as linear regression or logistic regression, KNN does not require the data to be normally distributed or to have any particular functional form.

This means that KNN can handle data that is non-linear, non-Gaussian, or contains outliers, without making any assumptions about the underlying distribution.

The non-parametric nature of KNN allows it to be used for a wide variety of problems and datasets, without requiring extensive knowledge about the data or the problem domain.

However, this flexibility also means that KNN can be more prone to overfitting, as the model may be too flexible and capture random noise in the data rather than the underlying pattern.

This one is lazy learning

why is KNN a lazy algorithm?

KNN is considered a lazy learning algorithm because it does not explicitly build a model during the training phase.

Unlike other supervised learning algorithms, such as decision trees or neural networks, KNN does not learn a model that generalizes from the training data. Instead, it simply stores the training data and uses it to make predictions when new data is presented.

This lazy approach means that KNN does not require any explicit training phase and can make predictions on new data immediately. However, it also means that KNN requires all of the training data to be stored in memory, which can become impractical for large datasets.

Another consequence of KNN’s laziness is that it is not as scalable as other algorithms. Because it has to store all of the training data in memory, it can become slow and computationally expensive for large datasets.

Overall, KNN’s lazy nature makes it simple and efficient to use, but it also has some limitations in terms of scalability and generalizability.

I can confirm that the explanation from pi.ai is correct.
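To see the laziness concretely, here is a small sketch (scikit-learn assumed; the dataset and sizes are made up) that times fit against predict with the brute-force neighbor search:

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 20))  # 50k points, 20 features
y_train = rng.integers(0, 2, size=50_000)
X_new = rng.normal(size=(1_000, 20))

# algorithm="brute" makes fit a pure memorization step.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="brute")

t0 = time.perf_counter()
knn.fit(X_train, y_train)  # essentially just stores the data
t1 = time.perf_counter()
knn.predict(X_new)         # all the distance computation happens here
t2 = time.perf_counter()

print(f"fit:     {t1 - t0:.4f} s")  # near-instant
print(f"predict: {t2 - t1:.4f} s")  # much slower: the "training" was deferred
```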

Method

how does KNN work?

Here’s a quick overview of how KNN works:

  • First, a distance metric, such as Euclidean distance, is chosen to measure how close data points are to one another.
  • Then, when a new data point is presented, the K nearest neighbors are identified in the training data.
  • The class or value of the new data point is then predicted based on the majority class or value of its K nearest neighbors.

For example, if the new data point is a customer record and the task is to predict whether the customer will churn or not, the K nearest neighbors would be the customer records with the closest attributes to the new record. The prediction would be based on the churn status of the majority of the K nearest neighbors.
This simple yet effective approach makes KNN a popular choice for many real-world problems, such as customer segmentation, fraud detection, and image classification. 😊
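The whole procedure is short enough to sketch from scratch. The following is an illustrative implementation (NumPy assumed; the tiny customer dataset is made up) of Euclidean distance plus majority vote:

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest
    training points, using Euclidean distance."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # distance to every training point
    nearest = np.argsort(distances)[:k]                        # indices of the k closest
    votes = Counter(y_train[nearest])                          # count the neighbors' labels
    return votes.most_common(1)[0][0]                          # majority class wins

# Tiny illustration: two "stay" (0) and two "churn" (1) customers.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.1, 5.1]), k=3))  # -> 1 (churn)
```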

what is Euclidean distance?

Euclidean distance is a metric used to measure the distance between two points in a multi-dimensional space. It is named after the ancient Greek mathematician Euclid, who developed the concept of distance in geometry.

Mathematically, the Euclidean distance between two points (x_1, y_1) and (x_2, y_2) in a two-dimensional space can be calculated using the following formula:

d = sqrt((x_2-x_1)² + (y_2-y_1)²)

Euclidean distance — Wikipedia

In this formula, d represents the Euclidean distance between the two points, and x_1, y_1, x_2, and y_2 represent the coordinates of the two points.

In a multi-dimensional space, the Euclidean distance is calculated by summing the squared differences between the corresponding coordinates of the two points, and then taking the square root of the sum.

Euclidean distance is a commonly used metric in machine learning and statistics, particularly in clustering and classification algorithms. 😊

I inserted the image for you to understand visually, pi could not generate the image
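The formula is straightforward to compute with NumPy. A small sketch (the sample points are arbitrary):

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance between two points of any dimension."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(((p - q) ** 2).sum())

print(euclidean((1, 2), (4, 6)))         # 2-D: sqrt(3² + 4²) = 5.0
print(euclidean((0, 0, 0), (1, 2, 2)))   # 3-D: sqrt(1 + 4 + 4) = 3.0
```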

Parameter K in KNN

Choosing the value of K in KNN can be a challenging task, as it directly affects the performance of the algorithm. Here are some tips for deciding the best value of K:

  • Start with a small value of K, such as 3 or 5, and gradually increase it until the performance stabilizes or starts to decline.
  • Use cross-validation to test different values of K and choose the one that performs best on the validation set (a short sketch of this follows the list).
  • Consider the size and dimensionality of the data. For small datasets, a lower value of K may be more appropriate, while for high-dimensional data, a higher value of K may be necessary to capture the underlying structure.
  • Experiment with different distance metrics, such as Euclidean distance or Manhattan distance, to see which works best for your data.
  • Keep in mind that the choice of K is a trade-off between accuracy and computational efficiency. A higher value of K may lead to more accurate predictions, but it can also be more computationally expensive.
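Here is a minimal sketch of the cross-validation tip, assuming scikit-learn and its Iris dataset: score a few candidate values of K and keep the best one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in (1, 3, 5, 7, 9, 11):  # odd values of K help avoid tied votes
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()  # mean 5-fold accuracy

best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```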

K-NN Distances

There are several distance metrics that can be used with KNN, each with its own strengths and weaknesses. Here are some of the most commonly used distance metrics:

  • Euclidean distance: This is the most commonly used distance metric in KNN. It measures the straight-line distance between two points in multi-dimensional space.
  • Manhattan distance: Also known as city block distance, this metric calculates the distance between two points by summing the absolute value of the differences between their coordinates.
  • Minkowski distance: This metric is a generalization of both Euclidean and Manhattan distance, and can be tuned by adjusting a parameter called p.
  • Hamming distance: This metric is used for categorical data, and calculates the distance between two points based on the number of positions where their categorical values differ.

Each of these distance metrics has different properties and is more suitable for certain types of data and tasks. It is important to experiment with different metrics to find the one that works best for your problem.
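SciPy ships implementations of all four metrics, so a quick sketch (the sample vectors are arbitrary) looks like this:

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

print(distance.euclidean(u, v))       # straight-line distance: 5.0
print(distance.cityblock(u, v))       # Manhattan: |3| + |4| + |0| = 7.0
print(distance.minkowski(u, v, p=3))  # Minkowski, tunable via p (p=2 is Euclidean)
print(distance.hamming([1, 0, 1], [1, 1, 1]))  # fraction of differing positions: 1/3
```

Note that SciPy's hamming returns the fraction of differing positions rather than the raw count; multiply by the vector length to get the count.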

K-Nearest Neighbor Pros and Cons

what are the pros and cons using KNN?

KNN has several advantages and disadvantages, which can make it more or less suitable for different types of problems.

Pros:

  • KNN is simple to understand and implement, making it a good choice for beginners in machine learning.
  • It is non-parametric, which means it can handle non-linear data without requiring any assumptions about the underlying distribution.
  • KNN can be used for both classification and regression tasks.
  • It does not require any explicit training phase, making it fast and efficient for prediction tasks.

Cons:

  • KNN can be computationally expensive, as it requires storing all of the training data in memory.
  • It is prone to overfitting, especially with small values of K, since predictions can latch onto noise in the training data.
  • KNN is sensitive to the choice of K, and choosing the wrong K can lead to poor performance.
  • It does not work well for high-dimensional data, as distances between points become less and less informative as the number of dimensions grows (the curse of dimensionality).

Overall, KNN is a useful and versatile algorithm, but it is not always the best choice for every problem. It is important to consider the specific characteristics of the data and the task at hand when choosing a machine learning algorithm.

Evaluation Method

This is general evaluation method for supervised learning, you can use for other algorithm too other than K-Nearest Neighbor

Evaluating supervised learning algorithms is an important step in the machine learning process, as it helps to determine the effectiveness of the model and identify areas for improvement. Here are some common methods for evaluating supervised learning algorithms:

  • Accuracy: This is the most common metric used to evaluate supervised learning algorithms. It measures the percentage of correctly classified instances out of the total number of instances in the test dataset.
  • Precision and recall: These metrics are particularly useful for binary classification tasks. Precision measures the proportion of predicted positives that are truly positive, while recall measures the proportion of actual positives that the model correctly identifies.
  • F-score: This metric combines precision and recall into a single score, and is often used to measure the performance of a model on imbalanced datasets.
  • ROC curve: The receiver operating characteristic curve is a plot of the true positive rate against the false positive rate, and is used to evaluate the trade-off between sensitivity and specificity.
  • Cross-validation: This is a technique that involves partitioning the dataset into multiple folds, training the model on all but one fold, and evaluating it on the held-out fold, rotating until every fold has served as the validation set. (edit: cross-validation falls under model selection; evaluating your model across K folds helps you determine whether it is overfitting)
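As a sketch of these metrics in practice (scikit-learn and its built-in breast cancer dataset assumed), evaluated on a KNN classifier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
# ROC analysis needs scores or probabilities rather than hard labels:
print("roc auc  :", roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))
# Cross-validation for model selection: mean accuracy over 5 folds.
print("5-fold cv:", cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean())
```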

Metrics Explanation

In this section, I asked pi.ai to explain precision in layman's terms, which means this can be useful when you explain your model's score to a non-technical person.

Precision score is a metric used to evaluate the performance of a classification model in machine learning. It can be explained in layman’s terms as follows:

Imagine you are a detective trying to catch a criminal. You have a list of suspects, and you want to figure out which ones are actually guilty. The precision score measures how good you are at correctly identifying the guilty suspects.

For example, let’s say you accuse 4 of your 10 suspects, and 3 of them turn out to be guilty. Your precision score would be 3/4, or 75%. This means that you are pretty good at accusing only guilty suspects, but you still have some room for improvement.

In machine learning, the precision score is used to measure how accurate the model is at correctly classifying the positive class (guilty suspects), without including too many false positives (innocent suspects). A high precision score indicates that the model is good at correctly identifying the positive class, while a low precision score indicates that the model is generating too many false positives.
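Putting the detective example into code (the label vectors below are made up to match the story):

```python
from sklearn.metrics import precision_score

# 10 suspects: 4 truly guilty (1), 6 innocent (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The detective accuses 4 suspects: 3 are guilty, 1 is innocent (a false positive).
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))  # 3 correct out of 4 accusations = 0.75
```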

Conclusion

This is the second article in which I try to use an LLM to learn something. I found it useful, and the thing that drives our learning is purely our curiosity. Learning from an LLM alone is not enough: it may give false information, so you need to double-check and confirm with other experts to understand the method fully and correctly. But it is indeed a good start. Beyond that, you can also go back to textbooks to understand the method deeply.
