Classify different types of Glasses
This problem is about classifying glass by type from a set of measured features. It can be solved with a classical machine learning algorithm such as SVC or Random Forest, or any other classifier, but I used a neural network. A simpler algorithm would probably be the better choice, however, because we don't have much data.
The data is publicly available on Kaggle, where you can easily download it.
The purpose of the dataset is to predict the class of the glass from the given features. Besides the Id number there are 9 features (RI, Na, Mg, Al, Si, K, Ca, Ba, Fe). All columns except the Id column play an important role in determining the type of glass, which is our target variable. The dataset description lists 7 types of glass, but the data contains no samples of type 4. Each type of glass has its own name, but in the data the target variable is numbered from 1 to 7. So, based on the available features, we have to predict the target variable (type of glass).
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical, normalize
from tensorflow.keras.callbacks import Callback, EarlyStopping
The dataset I downloaded did not contain a header row. If the file you downloaded has a header, leave out “header=None”.
The Id column is of no use, so dropping it is a good option. One thing to notice is that once the Id column is dropped, the data turns out to contain some duplicate rows as well.
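As a rough sketch of reading a headerless file: the column names below are assumptions based on the dataset description (only `Glass_type` is the name used later in this article), and the two inline rows are invented placeholders standing in for the downloaded file.

```python
import io
import pandas as pd

# Column names are assumptions; 'Glass_type' is the target name used later on.
columns = ['Id', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'Glass_type']

# Two invented rows standing in for the real file (not actual measurements).
raw = io.StringIO(
    "1,1.50,13.0,4.0,1.0,72.0,0.1,9.0,0.0,0.0,1\n"
    "2,1.60,14.0,3.0,1.5,73.0,0.5,8.0,0.0,0.0,2\n"
)

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(raw, header=None, names=columns)
print(df.shape)
```

With a real file you would pass its path in place of `raw`.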
A data summary is one of the most useful operations on a dataframe. It gives us the count, mean and standard deviation of each feature along with its 5 number summary.
The 5 number summary contains the minimum, the first quartile (25%), the median (50%), the third quartile (75%) and the maximum.
So the describe function returns the 5 number summary together with the other statistics: count, mean and standard deviation.
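To make the 5 number summary concrete, here is a small sketch on toy numbers (not taken from the glass data). It uses NumPy percentiles, which correspond to the min, 25%, 50%, 75% and max rows that describe reports.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0])  # toy numbers only

# minimum, first quartile, median, third quartile, maximum
five_num = np.percentile(values, [0, 25, 50, 75, 100])
print(five_num)
```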
There are no null values in the dataset.
After dropping the Id column from the dataframe we are left with a duplicate row, so it is better to drop it to avoid redundant data.
How to deal with duplicate records
There are multiple ways to deal with duplicate records. We keep the last occurrence of each duplicated row and drop the rows that occurred earlier in the dataset.
The info function of the dataframe provides a concise summary of the features: how many non-null values there are, the data type of each feature, and the memory usage.
A pairplot shows the pairwise relations among the features. The features are laid out along a grid of axes, so each feature is plotted along the rows as well as along the columns.
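The keep-the-last approach can be sketched with pandas as follows. The toy frame and its values are made up for illustration; only the `keep='last'` behaviour is the point.

```python
import pandas as pd

# Toy frame in which row 0 and row 2 are identical (illustrative values only).
toy = pd.DataFrame({'RI': [1.5, 1.6, 1.5],
                    'Na': [13.0, 14.0, 13.0],
                    'Glass_type': [1, 2, 1]})

# With keep='last', every copy except the last occurrence is flagged.
n_dropped = int(toy.duplicated(keep='last').sum())
deduped = toy.drop_duplicates(keep='last')

print(n_dropped)
print(deduped.index.tolist())
```

Passing `keep='first'` instead would keep row 0 and drop row 2, and `keep=False` would drop both copies.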
sns.heatmap(corr_mat,annot=True,fmt='.2f',alpha = 0.7, cmap= 'coolwarm')
The distribution of the glass types shows how many times each type of glass occurs in the dataset. It shows us that the data is imbalanced.
sns.countplot(x='Glass_type', data=df, order=df['Glass_type'].value_counts().index);
Separating Features and target variable
We separate the features from the target variable: all the independent variables are stored in X, and the dependent variable is stored in y.
The independent variables are normalized with the normalize function from the tensorflow.keras.utils API. Note that with its default arguments this function rescales each row (sample) to unit L2 norm, whereas scikit-learn's StandardScaler, MinMaxScaler and RobustScaler rescale each feature (column). There are many ways to handle this step.
Normalization is usually performed to bring all the features onto the same scale.
The benefit of bringing all the features onto the same scale is that the model treats every feature equally.
As the class distribution above shows, the classes are imbalanced. If we train the model on an imbalanced dataset, it will be biased towards the class with the most samples, so dealing with the imbalance helps in developing a fair model.
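To see how the normalization options differ, here is a small NumPy sketch on toy values. It compares row-wise L2 normalization, which to my understanding is what the Keras normalize function does with its default arguments, against per-feature min-max scaling, which mirrors what MinMaxScaler does. The array is made up for illustration.

```python
import numpy as np

X_toy = np.array([[3.0, 4.0],
                  [6.0, 8.0],
                  [0.0, 10.0]])  # illustrative values only

# Row-wise L2 normalization: each sample is divided by its own Euclidean norm.
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)
X_l2 = X_toy / row_norms

# Per-feature min-max scaling: each column is mapped onto [0, 1].
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_minmax = (X_toy - col_min) / (col_max - col_min)

print(X_l2)
print(X_minmax)
```

Notice that the first two rows become identical after row-wise normalization because they point in the same direction, while min-max scaling keeps them apart.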
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
x_ros, y_ros = ros.fit_resample(X, y)
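For intuition, random oversampling can be sketched in plain NumPy: rows of the smaller classes are duplicated at random until every class is as large as the biggest one. The helper `random_oversample` and the toy arrays are hypothetical, and this is only an illustration of the idea, not the imblearn implementation.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # all original rows are kept
    for cls, count in zip(classes, counts):
        if count < target:
            cls_idx = np.flatnonzero(y == cls)
            # sample with replacement from the rows of this class
            keep.append(rng.choice(cls_idx, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

X_toy = np.arange(12, dtype=float).reshape(6, 2)  # toy features
y_toy = np.array([1, 1, 1, 1, 2, 3])              # toy, imbalanced labels
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.unique(y_bal, return_counts=True))
```

One design choice worth weighing: if resampling happens before the train/test split, copies of the same row can land on both sides of the split, whereas resampling only the training portion avoids that.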
Now that the data is balanced, we split it into training, test and validation sets by calling scikit-learn's train_test_split twice: 75% of the data becomes the training set, and the remaining 25% is then split into test and validation sets.
Printing the shapes shows that 75% of the data lies in the training set and the remaining 25% is split in half between the test and validation sets.
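X_train, X_test, y_train, y_test = train_test_split(x_ros, y_ros, test_size=0.25, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, stratify=y_test, test_size=0.5, random_state=42)
# One-hot encode the labels for categorical_crossentropy
y_train, y_test, y_val = to_categorical(y_train), to_categorical(y_test), to_categorical(y_val)
for name, arr in [('X_train', X_train), ('y_train', y_train), ('X_test', X_test),
                  ('y_test', y_test), ('X_val', X_val), ('y_val', y_val)]:
    print(name, ':', arr.shape)
# Output: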
# X_train : (342, 9)
# y_train : (342, 8)
# X_test : (57, 9)
# y_test : (57, 8)
# X_val : (57, 9)
# y_val : (57, 8)
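The label arrays above have 8 columns even though only 6 glass types occur. Assuming to_categorical is called without num_classes, it uses the largest label plus one as the number of columns, so labels 1 to 7 produce columns 0 to 7. A NumPy equivalent, shown purely as an illustration:

```python
import numpy as np

labels = np.array([1, 2, 3, 5, 6, 7])  # the glass types present in the data
num_classes = labels.max() + 1          # 8, because index 0 is counted as well

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)
print(one_hot.sum(axis=0))
```

Columns 0 and 4 stay all zero, since no sample carries those labels.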
Choosing a model that neither overfits nor underfits is largely a matter of trial and error: try several options and keep the one that gives the best results.
One could search for a good model with the Keras tuning libraries, tuning the hyperparameters and the number of layers. But I adopted a simple approach because our problem is not very complex.
model = tf.keras.models.Sequential([
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Before training the model we define an early stopping callback. Training the model again and again can be very time consuming when the number of epochs is large, so early stopping on the validation loss is a better option: whenever the validation loss has not improved for 20 epochs, which is set with the patience parameter, the model stops training. The model is fitted on the training data and validated at the same time on the validation set.
early_stop = EarlyStopping(
    monitor='val_loss',  # quantity to be monitored
    mode='auto',         # direction is inferred automatically from the name of the monitored quantity
    verbose=1,           # verbosity mode
    patience=20          # number of epochs with no improvement after which training is stopped
)
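The patience rule used by the callback can be illustrated in plain Python. This is only a sketch of the idea with a hypothetical helper `stopping_epoch` and made-up loss values, not the Keras implementation.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it
    runs through every epoch (hypothetical helper, not the Keras code)."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]  # made-up validation losses
stop_small = stopping_epoch(losses, patience=3)
stop_large = stopping_epoch(losses, patience=20)
print(stop_small, stop_large)
```

With a patience of 3 the run stops at epoch 6 and never reaches the better loss at epoch 7, which is why a generous patience such as 20 can be worth the extra training time.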
Now the model architecture is finalized and early stopping is set up, so it's time to train the model on the training data and validate it on the validation data. The model trains for up to 1000 epochs as long as the validation loss keeps decreasing. Once the validation loss stops decreasing, training stops after waiting a further 20 epochs.
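history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),  # validated on the validation set during training
                    epochs=1000,                     # upper bound, as described above
                    callbacks=[early_stop])          # stops early when val_loss stops improving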
The model is now trained and validated, so we can plot its accuracy on the training set and on the validation set. The graph below shows both accuracies, with the legend at the lower right.
plt.legend(['Training-Accuracy', 'validation-Accuracy'], loc='lower right')
The model loss is shown in the graph below, again with a legend.
plt.legend(['Training-Loss', 'validation-Loss'], loc='upper right')
Now that we're done with model training, it's time to test the model on the test data, which the model has not seen, to check how it performs on unseen data. The model achieves an accuracy of 63% on the test data. Accuracy is not a good metric for a model trained on an imbalanced dataset, but since we have balanced the dataset we can consider it reliable here.
The confusion matrix gives an idea of which classes the model predicted correctly and incorrectly. The values on the diagonal are correct predictions, whereas the off-diagonal values are incorrect predictions.
y_pred = model.predict(X_test)  # model predictions on the test data
y_pred_max = np.argmax(y_pred, axis=1)  # class with the highest predicted probability
y_test_max = np.argmax(y_test, axis=1)  # recover the class labels from the one-hot y_test
confusion_matrix(y_test_max, y_pred_max)
# Output
array([[1, 2, 8, 0, 0, 0],
[0, 6, 6, 0, 0, 0],
[0, 1, 5, 0, 0, 0],
[0, 0, 0, 7, 0, 0],
[0, 0, 0, 0, 8, 2],
[0, 0, 1, 1, 0, 9]])
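The overall accuracy can be read straight off this matrix: the diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total. The sketch below recomputes it with NumPy from the matrix printed above, along with the per-class recall (the share of each true class that was predicted correctly).

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted).
cm = np.array([[1, 2, 8, 0, 0, 0],
               [0, 6, 6, 0, 0, 0],
               [0, 1, 5, 0, 0, 0],
               [0, 0, 0, 7, 0, 0],
               [0, 0, 0, 0, 8, 2],
               [0, 0, 1, 1, 0, 9]])

accuracy = np.trace(cm) / cm.sum()     # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)  # per true class (row-wise)

print(round(accuracy, 4))
print(recall.round(2))
```

That is 36 correct out of 57 test samples, or about 63%. The first row also shows where the model struggles most: 8 of the 11 samples of the first class are predicted as the third class.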
With the neural network we achieved an accuracy of about 63%, using trial and error to find an architecture that neither overfits nor underfits. The accuracy could be improved by adding more data, as neural networks are considered best suited to large amounts of data. For smaller datasets, traditional machine learning algorithms tend to work better. The more data we have, the more it helps the model learn good parameters.