The data set for this KNN application comes from a telecommunications provider. Let’s assume that the provider has segmented its customer base into four groups according to service usage patterns. If demographic data can predict which group a customer belongs to, the provider can develop special campaigns and marketing strategies for each group. The focus here is on using demographic data such as region, age, and marital status to predict usage patterns. The target variable, called custcat, has four possible values corresponding to the four customer groups:
- Basic Service
- E-Service
- Plus Service
- Total Service
Our goal is to build a KNN model that predicts the class of unseen cases.
Let’s start by importing the necessary libraries.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Let’s continue with reading and preprocessing the data. If the data set contains missing values, we can flag them with “.isnull()” and count them per column by chaining “.sum()”. Our target variable is “custcat”, so let’s also look at how many rows fall into each class.
df = pd.read_csv('telecommunication.csv')
df.head()
df.isnull().sum()
df['custcat'].value_counts()
X = df.drop('custcat', axis=1).values
Standardization is an important step, especially for distance-based methods such as KNN, because it rescales every feature to zero mean and unit variance so that no feature dominates the distance calculation simply because of its scale. I discussed this method in detail in my article “Feature Scaling”.
scaler = StandardScaler()
X = scaler.fit_transform(X.astype(float))
y = df['custcat'].values
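To make “zero mean and unit variance” concrete, here is a minimal sketch that standardizes a small made-up matrix by hand and compares the result with StandardScaler. The values in X_toy are invented for illustration and are not taken from the telecom data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values): two columns on very different scales,
# for example an age-like column and an income-like column.
X_toy = np.array([[25.0, 30000.0],
                  [40.0, 52000.0],
                  [33.0, 41000.0],
                  [58.0, 98000.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same transformation with scikit-learn.
X_scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, both columns contribute on a comparable scale to the Euclidean distances that KNN relies on.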
We split the data set into 80% training and 20% test data with the train_test_split function. The random_state parameter makes the random split repeatable: if you use the same random_state value as I do, you will get the same split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
Now it’s time to determine our K value. We will perform 5-fold cross-validation for each of the K (number of neighbors) values from 1 to 99 and the accuracy scores will be kept in a list.
k_values = range(1, 100)
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
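One caveat about the search above: the scaler was fitted on the whole data set before splitting, so each validation fold has already influenced the scaling. A possible variation, sketched here on synthetic stand-in data, puts the scaler inside a pipeline so that it is re-fitted on the training part of every fold. The make_classification data, the names X_demo, y_demo, and pipe_scores, and the smaller K range are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the telecom data (assumption: 4 classes, numeric features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=12)

# Split the unscaled data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12)

pipe_scores = {}
for k in range(1, 31, 2):
    # The pipeline fits the scaler only on the training part of each fold.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    pipe_scores[k] = cross_val_score(
        pipe, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# K with the highest mean cross-validation accuracy.
best_k = max(pipe_scores, key=pipe_scores.get)
print(best_k, round(pipe_scores[best_k], 3))
```

On a data set of this size the difference is usually small, but the pipeline keeps the cross-validation estimate free of information from the validation folds.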
To find the best K value, let’s pick the K that corresponds to the highest accuracy score in the cv_scores list. The result is a list, because several K values can tie for the highest score.
optimal_k_values = [k_values[i] for i in range(len(cv_scores)) if cv_scores[i] == max(cv_scores)]
print(optimal_k_values)
[24]
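If a single K is all you need, np.argmax is a shorter alternative to the list comprehension. The sketch below uses made-up scores (k_values_demo and cv_scores_demo are hypothetical) to show that, when there is a tie, argmax returns the first and therefore smallest K.

```python
import numpy as np

# Hypothetical cross-validation scores for K = 1..5 (not from the telecom data).
k_values_demo = range(1, 6)
cv_scores_demo = [0.30, 0.34, 0.33, 0.34, 0.31]

# np.argmax returns the index of the first maximum, so ties resolve to the smallest K.
best_index = int(np.argmax(cv_scores_demo))
best_k_demo = k_values_demo[best_index]

print(best_k_demo)  # 2
```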
Let’s plot the cross-validation accuracy for each K value to see how it changes.
plt.plot(k_values, cv_scores)
plt.xlabel('K Value')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Accuracy - Cross Validation Plot')
plt.show()
The data has been standardized and we have found our optimal K value. Let’s finish the process by fitting the model.
# If several K values tie for the best score, use the first (smallest) one.
knn = KNeighborsClassifier(n_neighbors=optimal_k_values[0]).fit(X_train, y_train)
Let’s store the model’s predictions in a new column. Note that we predict on the entire data set X, which includes the rows the model was trained on.
y_pred = knn.predict(X)
df["custcat_predict"] = y_pred
df[["custcat","custcat_predict"]].head(15)
Let’s calculate the accuracy score (accuracy_score) between the actual and predicted values, create a confusion matrix, and visualize it as a heat map with sns.heatmap().
acc = round(accuracy_score(y, y_pred), 2)
cm = confusion_matrix(y, y_pred)
sns.heatmap(cm, annot=True, fmt=".0f")
plt.xlabel('y_pred')
plt.ylabel('y')
plt.title('Accuracy Score: {0}'.format(acc), size=10)
plt.show()
Let’s interpret the graph using my example. The 238 in the upper-left cell means that 238 of the samples belonging to class 1 were predicted as class 1, that is, they were predicted correctly. Similarly, the cell showing 12 means that 12 of the samples belonging to class 1 were predicted as class 2, that is, they were predicted incorrectly. Rather than examining these values one by one, let’s summarize the ratios in a table.
print(classification_report(y, y_pred))
Let’s evaluate the table. Precision measures the proportion of samples predicted as positive that are actually positive. It is the proportional form of the reading we made on the heat map: 80% of the samples predicted as “1” are actually “1”.
I explain what these values mean in more detail in my article “Evaluating Success in Classification Problems”.
Recall measures how many of the truly positive samples the model detected. Here, 89% of the true “1”s were predicted correctly.
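Both precision and recall can be read directly off a confusion matrix. The sketch below uses a tiny set of made-up labels (y_true_demo and y_hat_demo are hypothetical) to show the convention scikit-learn uses: rows are actual classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes, for illustration only.
y_true_demo = [1, 1, 1, 2, 2, 3, 3, 3]
y_hat_demo = [1, 1, 2, 2, 1, 3, 3, 1]

# Rows are actual classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_hat_demo)
print(cm_demo)

# Recall: correct predictions divided by the row total (all actual members of the class).
recall_per_class = cm_demo.diagonal() / cm_demo.sum(axis=1)

# Precision: correct predictions divided by the column total (all predictions of the class).
precision_per_class = cm_demo.diagonal() / cm_demo.sum(axis=0)

print(recall_per_class.round(2))
print(precision_per_class.round(2))
```

The diagonal holds the correct predictions; everything off the diagonal is a misclassification.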
F1-score balances the two metrics by taking the harmonic mean of precision and recall. Unlike the arithmetic mean, the harmonic mean is pulled toward the smaller of the two values, so the F1-score is high only if both precision and recall are high.
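As a quick worked example, the snippet below computes the F1-score from the precision (0.80) and recall (0.89) quoted above for class “1”, and then from a deliberately lopsided pair of made-up values to show how the harmonic mean differs from the arithmetic mean. The helper name f1_from is my own.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values quoted in the article for class "1".
f1_class1 = f1_from(0.80, 0.89)
mean_class1 = (0.80 + 0.89) / 2

# Made-up lopsided case: high precision, very low recall.
f1_lopsided = f1_from(0.90, 0.10)
mean_lopsided = (0.90 + 0.10) / 2

print(round(f1_class1, 2), round(mean_class1, 3))
print(round(f1_lopsided, 2), round(mean_lopsided, 2))
```

When precision and recall are close, the two means nearly coincide; when they diverge, the F1-score drops well below the arithmetic mean.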
Support shows how many examples are in the actual dataset for each class. This value represents the number of data points belonging to that class. For example, class “1” has 266 data points.
Accuracy shows the ratio of correctly predicted samples to the total number of samples. That is, it expresses the rate of correct prediction of all classes. For example, the accuracy rate is 71%, meaning that the proportion of samples that the model predicts correctly is 71%.
Macro average is the unweighted arithmetic mean of the per-class metric values, so every class counts equally. Weighted average weights each class’s metric value by the number of samples in that class (its support), so it reflects the class distribution when the classes are imbalanced.
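The difference between the two averages is easiest to see with numbers. The per-class recall values and supports below are invented for illustration and do not come from the telecom data.

```python
# Hypothetical per-class recall values and class sizes (supports).
recalls = [0.90, 0.50, 0.70, 0.80]
supports = [300, 50, 100, 50]

# Macro average: every class counts equally.
macro_avg = sum(recalls) / len(recalls)

# Weighted average: each class counts in proportion to its support.
weighted_avg = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(round(macro_avg, 3))     # 0.725
print(round(weighted_avg, 3))  # 0.81
```

Here the large, well-predicted first class lifts the weighted average, while the macro average exposes the weak second class more clearly.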
As a result, we can say that the performance of the model is unbalanced between classes. Precision and recall are particularly low for class “2”, indicating that the model has difficulty predicting this class accurately. However, since this is a beginner-level model with very little pre-processing, its performance can be considered reasonable.
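Keep in mind that the scores above were computed on the whole data set, including the rows the model was trained on, so they are likely to be optimistic. A minimal sketch of evaluating only on the held-out test split follows. It uses synthetic stand-in data with class labels 0 to 3, and the names X_demo, y_demo, knn_demo and the choice of n_neighbors=15 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the telecom data (assumption: 4 classes, numeric features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=12)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

knn_demo = KNeighborsClassifier(n_neighbors=15).fit(X_tr_s, y_tr)

# Accuracy on rows the model has seen versus rows it has not seen.
train_acc = accuracy_score(y_tr, knn_demo.predict(X_tr_s))
y_te_pred = knn_demo.predict(X_te_s)
test_acc = accuracy_score(y_te, y_te_pred)

print(round(train_acc, 2), round(test_acc, 2))
print(classification_report(y_te, y_te_pred))
```

In the article’s own code, the equivalent step would be to call knn.predict(X_test) and compare the result with y_test.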
We reinforced our knowledge with the KNN application. Thank you for reading.