Mushroom Dataset — Data Exploration and Model Analysis (OneHot Encoded)

Nitin Vashisth
Published in Analytics Vidhya · 7 min read · Apr 6, 2020


This article walks through several data exploration techniques for a categorical dataset. We demonstrate how to handle null values (if any), how to convert categorical values into numerical features, and how to check the clustering tendency of the dataset.

If you are looking for the entire code repository, it can be found here: Classification and Model Analysis (one-hot vector).

To demonstrate these methods, we use the mushroom dataset from a well-known Kaggle competition. A detailed description of the dataset is available on Kaggle, where each individual feature is explained explicitly. A data scientist must understand the background of the data before structuring his/her hypotheses around it.

Let us first read the file and view the top 5 records of the data.

Reading the mushroom dataset and displaying the top 5 records
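Concretely, the loading step can be sketched as follows (a minimal sketch, assuming the Kaggle CSV has been downloaded locally as "mushrooms.csv"; the file name is an assumption):

import pandas as pd

# Load the mushroom dataset from a local CSV (hypothetical path)
df = pd.read_csv("mushrooms.csv")
print(df.head())    # display the top 5 records
print(df.shape)     # number of rows and columns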

Let us explore the data in detail (data cleaning and data exploration).

Data Cleaning and Data Exploration

A data scientist will usually check first for “NA” values in the dataset, and we do the same here. We also check the distinct classes into which the mushrooms need to be classified. In our case, the classes are “poisonous” and “edible”.

Counting “NA” values in each column
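A minimal sketch of these checks; the target column name "class" and its 'e'/'p' labels are taken from the Kaggle CSV layout:

print(df.isnull().sum())            # count "NA"/NaN values per column
print(df["class"].value_counts())   # 'e' (edible) vs 'p' (poisonous)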

We need to pre-process the data so that it can be ingested by a machine learning model. The dataset contains categorical values, which cannot be used directly for visualisation or model analysis, so we need to convert them into numerical data. We used scikit-learn's LabelEncoder to convert all categories into numerical values.

LabelEncoder to convert all categories into numerical values
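The conversion can be sketched as below; applying fit_transform column by column gives every category in every column an integer code:

from sklearn.preprocessing import LabelEncoder

# Each column gets its own integer codes (0, 1, 2, ...)
encoder = LabelEncoder()
df_encoded = df.apply(encoder.fit_transform)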

Next we perform feature analysis, which can be done using a correlation matrix. It provides good information about the positive and negative correlations between different features. We can also check for outliers using a box plot.

Outlier box plot for Mushroom dataset
Correlation Matrix depicting correlation among features
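The two plots can be reproduced along these lines (the use of seaborn for the heatmap is an assumption; any heatmap routine would do):

import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of the label-encoded features to eyeball outliers
df_encoded.plot(kind="box", figsize=(15, 6), rot=90)
plt.show()

# Correlation matrix of the encoded features as a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(df_encoded.corr(), cmap="coolwarm")
plt.show()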

After we plotted the box plot and correlation matrix, it became clear that this data should not be label encoded. These methods rely on statistics such as the mean and covariance, and for categorical data it does not make sense to calculate a mean or variance. This post provides a good explanation of how a categorical feature (nominal or ordinal) can be converted into a numerical feature (OrdinalEncoder, OneHotEncoder) depending on the data.

One-hot Encoder for the nominal categorical data
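A sketch of the one-hot step; pd.get_dummies is a convenient equivalent of scikit-learn's OneHotEncoder for a DataFrame of categoricals, and the "class" target column name is again taken from the Kaggle CSV:

# One-hot encode the nominal features; the target stays label encoded
features = pd.get_dummies(df.drop("class", axis=1))
label = LabelEncoder().fit_transform(df["class"])
print(features.shape)   # dimensionality grows to roughly 117 columns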

But the one-hot encoder has a few disadvantages: it significantly increases the dimensionality. Hence we need to look for ways to reduce this dimensionality in our analysis. We applied Principal Component Analysis (PCA), and the results can be seen below. Here we can see that most of the variance is captured by 45 components.

Principal Component Analysis (PCA) to reduce dimensionality
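Something along these lines produces the cumulative-variance curve (pca_features is reused in the clustering steps below; its use there is an assumption on my part):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the one-hot encoded features and inspect the variance
pca = PCA()
pca_features = pca.fit_transform(features)

cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()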

We also checked whether our data has a tendency to form clusters, with the help of the Hopkins statistic (to read in detail, click here). With a value of 0.98, the dataset is clearly highly clusterable.

Hopkins statistic for the mushroom dataset
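scikit-learn has no built-in Hopkins statistic, so here is a small sketch of one common formulation (conventions vary between sources): values near 1 suggest a strong clustering tendency, values near 0.5 essentially random data.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_ratio=0.1):
    # H = sum(u) / (sum(u) + sum(w)), where u are nearest-neighbour
    # distances from uniformly random points to the data, and w are
    # nearest-neighbour distances within the data itself
    X = np.asarray(X)
    n, d = X.shape
    m = max(1, int(sample_ratio * n))
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)

    idx = np.random.choice(n, m, replace=False)
    # take the second neighbour: the first is the sampled point itself
    w = nbrs.kneighbors(X[idx])[0][:, 1]

    # random points drawn uniformly inside the data's bounding box
    rand = np.random.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nbrs.kneighbors(rand, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

print(hopkins(pca_features))   # the article reports ~0.98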

We also looked at the silhouette score (to read in detail, click here). From the graph it can be clearly seen that the highest peak occurs at two clusters or more, from which we can infer that the data forms at least two clusters. We also performed the elbow test (to read in detail, click here), which gives us a count of the possible clusters.

Elbow Method (Possible clusters 2) and the Silhouette score (2 or more cluster)
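Both diagnostics can be computed in one loop; this sketch assumes the clustering is run on the PCA-reduced features from above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Elbow method: within-cluster sum of squares (inertia) for growing k,
# plus the silhouette score at each k
inertias, sil_scores = [], []
ks = range(2, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(pca_features)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(pca_features, km.labels_))

plt.plot(ks, inertias, marker="o"); plt.title("Elbow method"); plt.show()
plt.plot(ks, sil_scores, marker="o"); plt.title("Silhouette score"); plt.show()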

We then plotted 3 and 5 clusters using the K-means algorithm. The implementation can be seen below.
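A minimal version of that step, again assuming the PCA-reduced features and colouring points by cluster on the first two components:

# Fit K-means with 3 and with 5 clusters and visualise the assignments
for k in (3, 5):
    km = KMeans(n_clusters=k, random_state=42).fit(pca_features)
    plt.scatter(pca_features[:, 0], pca_features[:, 1],
                c=km.labels_, cmap="viridis", s=5)
    plt.title(f"K-means with {k} clusters")
    plt.show()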

Model Analysis and Comparison

Model selection was based on popularity and frequency of use in real-world cases. Below are the evaluation techniques used, in order:

  • K-fold Cross Validation: a method for resampling a limited data sample in order to evaluate machine learning models. Its parameter “K” is the number of groups the data is split into; each group in turn serves as the test set while the remaining groups are used for training (for K=5 this amounts to an 80:20 train/test split), and this repeats across all groups.
  • Precision and Recall: it is always wise to check what fraction of predicted positives are truly positive, TP/(TP+FP), which is called precision. Recall is the percentage of all relevant results correctly identified by the algorithm, TP/(TP+FN). A toy example follows this list.
  • F1 Score: one always hopes for a model with both the highest precision and the highest recall. When it is not possible to decide between them, we use the F1 score, the harmonic mean of precision and recall: 2*Precision*Recall/(Precision+Recall).
  • ROC: the Receiver Operating Characteristic (ROC) curve plots the TPR (true positive rate) on the y-axis against the FPR (false positive rate) on the x-axis.
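As a quick sanity check on the formulas above, here is a toy sketch using scikit-learn's metric functions (the labels below are made up purely for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

# TP = 3, FP = 1, FN = 1 for these made-up labels
y_true = [1, 0, 1, 1, 0, 1]
y_hat  = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_hat))   # 3/(3+1) = 0.75
print(recall_score(y_true, y_hat))      # 3/(3+1) = 0.75
print(f1_score(y_true, y_hat))          # harmonic mean = 0.75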

The above criteria will be applied to all of the selected algorithms, out of which a final algorithm will be chosen:

  • Gaussian Naive Bayes Classifier
  • Logistic Regression Classifier
  • Decision Tree Classifier
  • Random Forest Classifier
  • XGBoost Classifier
  • Linear Discriminant Classifier
  • Gaussian Process Classifier
  • Ada-boost Classifier

Now that we are done with the data exploration, let us move on to model analysis. We considered many different models for the analysis. Below is the implementation of the different machine learning algorithms used for mushroom classification.

# Importing required classification algorithms
import numpy as np
import xgboost
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn import tree
from sklearn import metrics

# Creating an object for each classifier and storing it in a list
classifiers = []
nb_model = GaussianNB()
classifiers.append(("Gaussian Naive Bayes Classifier", nb_model))
lr_model = LogisticRegression()
classifiers.append(("Logistic Regression Classifier", lr_model))
dt_model = tree.DecisionTreeClassifier()
classifiers.append(("Decision Tree Classifier", dt_model))
rf_model = RandomForestClassifier()
classifiers.append(("Random Forest Classifier", rf_model))
xgb_model = xgboost.XGBClassifier()
classifiers.append(("XG Boost Classifier", xgb_model))
lda_model = LinearDiscriminantAnalysis()
classifiers.append(("Linear Discriminant Analysis", lda_model))
gp_model = GaussianProcessClassifier()
classifiers.append(("Gaussian Process Classifier", gp_model))
ab_model = AdaBoostClassifier()
classifiers.append(("AdaBoost Classifier", ab_model))

# Stores all the cross-validation scores
cv_scores = []
names = []
for name, clf in classifiers:
    print(name)
    clf.fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]  # positive-class prediction probabilities
    y_pred = np.where(y_prob > 0.5, 1, 0)     # threshold the probabilities into class predictions
    print("Model Score : ", clf.score(X_test, y_test))
    print("Number of mislabeled points from %d points : %d" % (X_test.shape[0], (y_test != y_pred).sum()))
    scores = cross_val_score(clf, features, label, cv=10, scoring='accuracy')
    cv_scores.append(scores)
    names.append(name)
    print("Cross validation scores : ", scores.mean())
    cm = metrics.confusion_matrix(y_test, y_pred)
    print("Confusion Matrix \n", cm)
    report = metrics.classification_report(y_test, y_pred)
    print("Classification Report \n", report)

After executing the above code on the mushroom dataset, we obtained a set of results for each algorithm, which can be used to compare them. The winning algorithm can then be considered the go-to algorithm for deployment.

Gaussian Process, Adaboost, LDA, Logistic Regression and Decision Tree Classifiers Evaluation
Naive Bayes, Random Forest, XG Boost Classifiers Evaluation

The main takeaway from this article is the set of techniques for handling categorical data: LabelEncoder, OrdinalEncoder and OneHotEncoder. We also spoke about Principal Component Analysis, where we considered only 2 components out of 117 and thus effectively reduced the dimensionality. We then looked into methods that tell us about the clustering tendency of the dataset: the Hopkins statistic, the silhouette score and the elbow method all pointed us towards clustering, and we finally applied the K-means clustering method. Lastly, we performed model analysis using the different evaluation methods discussed above.

If you liked the data exploration and model analysis, then please clap, share and comment with feedback. Stay tuned for more blogs!
