Writing KNN in Python from Scratch
In just about 20 lines of code!
The KNN Classification algorithm can be broadly divided into three steps:
- Calculating Euclidean distance between two rows (vectors) of the dataset
- Getting K nearest neighbors to the new piece of data
- Predicting the label of the new piece of data (Classify the new data point)
1. Calculating Euclidean Distance
Euclidean Distance = sqrt(sum for i = 1 to N of (x1_i - x2_i)²), where x1 and x2 are two rows with N features each
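The formula above can be sketched in a few lines of Python. This is a minimal version that assumes each row is a plain list of numeric features with the label stored separately; the function name `euclidean_distance` is just an illustrative choice.

```python
from math import sqrt


def euclidean_distance(row1, row2):
    """Euclidean distance between two feature vectors of equal length."""
    # Sum the squared differences feature by feature, then take the root.
    return sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))


# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```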
2. Getting K nearest neighbors by sorting the euclidean distances
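One way this step could look: compute the distance from the new point to every training row, sort, and keep the first K. The names `get_neighbors`, `train_X` and `train_y` are assumptions made for this sketch, and the features and labels are assumed to live in two parallel lists.

```python
from math import sqrt


def euclidean_distance(row1, row2):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))


def get_neighbors(train_X, train_y, new_row, k):
    """Return the k (distance, label) pairs closest to new_row."""
    distances = [
        (euclidean_distance(x, new_row), label)
        for x, label in zip(train_X, train_y)
    ]
    # Sort by distance only, smallest first, and keep the first k entries.
    distances.sort(key=lambda pair: pair[0])
    return distances[:k]


# Toy data: two points near (1, 1) labelled "a", two near (9, 9) labelled "b".
train_X = [[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [9.0, 9.0]]
train_y = ["a", "a", "b", "b"]

neighbors = get_neighbors(train_X, train_y, [1.5, 1.5], k=3)
print([label for _, label in neighbors])  # ['a', 'a', 'b']
```

Sorting the whole list is the simplest approach and is fine for small datasets; it does mean every prediction scans all of the training data.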
3. Predicting or classifying the new data point
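The last step can be sketched as a majority vote over the labels of the K nearest neighbors. As before, the function names and the parallel-list layout (`train_X` for features, `train_y` for labels) are assumptions of this sketch, not the only way to structure it.

```python
from collections import Counter
from math import sqrt


def euclidean_distance(row1, row2):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))


def get_neighbors(train_X, train_y, new_row, k):
    """Return the k (distance, label) pairs closest to new_row."""
    distances = [
        (euclidean_distance(x, new_row), label)
        for x, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])
    return distances[:k]


def predict(train_X, train_y, new_row, k):
    """Classify new_row as the most common label among its k nearest neighbors."""
    neighbors = get_neighbors(train_X, train_y, new_row, k)
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]


train_X = [[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [9.0, 9.0]]
train_y = ["a", "a", "b", "b"]

print(predict(train_X, train_y, [1.5, 1.5], k=3))  # a
print(predict(train_X, train_y, [8.5, 8.5], k=3))  # b
```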
Some Fundamental pointers about KNN:
- KNN is a supervised learning algorithm (the training data provides the labels, Y) and a non-parametric one (it assumes no fixed functional form for the data and learns only from the patterns in the data).
- The value of K should ideally be odd, so that a majority vote between two classes cannot end in a tie. If the sqrt of the number of data points is even, add or subtract one.
- KNN requires feature scaling and is highly sensitive to outliers: it calculates the Euclidean distance between rows, so features with bigger values would otherwise carry more weight in the distance.
- KNN can also be used for continuous output variables. This is KNN Regression, where the output value is calculated by taking the average of the nearest neighbors' values.
- KNN is called a Lazy Learner algorithm because it uses instance-based learning: it does not build a model up front, but stores the training data and only uses it when making predictions.
- Choosing the K value: this can be done in two ways. (a) Take the sqrt of the number of data points. (b) Use cross-validation for hyperparameter tuning: try each K in a range, e.g. (1, 21), and select the K value that gives the minimal error.
- The bias-variance tradeoff can be noted in KNN too. A small K follows the training data very closely, which can lead to overfitting and hence high variance. A bigger K smooths the prediction and reduces variance, but if K is too big the model underfits, resulting in high bias.
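To make the two ways of choosing K concrete, here is a small sketch. It uses leave-one-out cross-validation (one particular flavor of cross-validation, picked here only because it needs no extra libraries) and a sqrt rule that always bumps an even result up by one. All function names and the toy dataset are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def predict(train_X, train_y, new_row, k):
    """Majority vote among the k training rows closest to new_row."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: sqrt(sum((a - b) ** 2 for a, b in zip(pair[0], new_row))),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_error(X, y, k):
    """Leave-one-out error rate: predict each row from all the other rows."""
    mistakes = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if predict(rest_X, rest_y, X[i], k) != y[i]:
            mistakes += 1
    return mistakes / len(X)


def choose_k(X, y, candidates):
    """Return the candidate with the lowest error (the first one wins ties)."""
    return min(candidates, key=lambda k: loo_error(X, y, k))


def sqrt_rule_k(n_points):
    """Rule of thumb: sqrt of the number of points, moved up to the next odd number."""
    k = int(sqrt(n_points))
    return k if k % 2 == 1 else k + 1


# Toy data: two well-separated clusters of four points each.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]

for k in [1, 3, 5, 7]:
    print(k, loo_error(X, y, k))

print("best K:", choose_k(X, y, [1, 3, 5, 7]))
print("sqrt rule:", sqrt_rule_k(len(X)))
```

On this toy data the error is 0.0 for K = 1, 3 and 5 but jumps to 1.0 for K = 7: once a point is held out, the other cluster outnumbers its own, so an overly large K votes every point into the wrong class. That is the high-bias end of the tradeoff in miniature.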
Hope this was helpful!