Classification with Machine Learning (Python) “Mushroom Dataset”

Aysen Çeliktaş
Published in Become Better · 11 min read · Feb 16, 2024

In this article, I share an application of supervised machine learning to a classification problem on a data set I have chosen. You can also refer to my article “Supervised Machine Learning for Beginners” for theoretical background, and my article “EDA (Python)” for the basics of exploring a data set through EDA.

[created by the author in Canva]

Which problems can the data set at hand be used to solve? Which features lead us to meaningful information for solving the identified problem?

1. Data Knowledge
2. EDA
3. Model
4. Result

1. Data Knowledge

The first thing to do is to examine the data you want to classify and collect information about it. Knowing what the classes are, and which features may affect them, is very important for making sense of the results of the study. The data set examined here was downloaded from the UC Irvine Machine Learning Repository and is called Mushroom. Using information drawn from The Audubon Society Field Guide, mushrooms were classified according to their physical characteristics as edible or poisonous. First, we examined what the variables were. For this purpose, the properties in Table 1 were compiled by reading the file that holds the data description. They relate to the physical characteristics of different anatomical parts of the mushrooms, such as shape and color, as well as their habitats and populations. A little research was done on what these features mean. (I have no knowledge of the mushroom literature.) The file containing information about the data set was also examined; there it was noted that in the “stalk_root” property, missing data is denoted by “?”. We are ready to move to the Python environment.

Table 1. Dataset [screenshot by author]
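As a first step in the notebook, the file can be read with pandas. A minimal sketch; the file name here is an assumption, so use the name of your local copy of the UCI download:

import pandas as pd

# File name is an assumption; adjust to the local copy of the UCI download.
# Reading with the default header consumes the first data row as column
# labels, so duplicate labels get suffixes such as 'p.1' and 's.1'.
mushroom = pd.read_csv("mushroom.csv")
mushroom.head()

This would also explain the abbreviated column names like ‘p’, ‘x’ and ‘s.1’ that the rename step below maps to their full names, and why the row count is 8123 rather than the repository’s 8124.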

2. EDA

First, it was seen that the column names in the file were the abbreviations of the variables. To avoid confusion when examining the dataframe, the columns were renamed to their full names, and the target column was moved to the end (Figure 1).

# Rename the abbreviated columns to their descriptive names
mushroom = mushroom.rename(columns={'p':'target','x':'cap_shape','s':'cap_surface','n':'cap_color',
                                    't':'bruises','p.1':'odor','f':'gill_attachement','c':'gill_spacing',
                                    'n.1':'gill_size','k':'gill_color','e':'stalk_shape','e.1':'stalk_root',
                                    's.1':'stalk_surface_above_ring','s.2':'stalk_surface_below_ring',
                                    'w':'stalk_color_above_ring','w.1':'stalk_color_below_ring','p.2':'veil_type',
                                    'w.2':'veil_color','o':'ring_number','p.3':'ring_type','k.1':'spore_print_color',
                                    's.3':'population','u':'habitat'})

# Swap the first and last columns so the target ends up at the end
sp = list(mushroom.columns)
sp[0], sp[22] = sp[22], sp[0]
mushroom = mushroom[sp]
mushroom.tail()
Fig.1. Image of the dataset in notebook [from author’s notebook]

While examining the data, it was learned that the “stalk_root” variable contained missing values, and it was counted how many of the 8123 rows were affected. It was checked whether the target classes were balanced or unbalanced, both before and after these rows were dropped. After the missing rows were dropped, leaving 5643 mushrooms, the class distribution changed and became unbalanced (Figure 2). With this information in mind, this step can be revisited later when trying to improve the resulting metrics. First, we continued by dropping the rows with missing values from the entire data set; the results obtained by instead dropping the variable as a column are given at the end. Additionally, looking at the info() of the data, it was observed that all variables are categorical.

mushroom["stalk_root"].value_counts()
mushroom_new = mushroom.drop(mushroom[(mushroom["stalk_root"] == "?")].index, axis=0)
(mushroom_new["stalk_root"] == "?").sum()

# Class distribution before and after dropping the missing rows
mushroom.target.value_counts().plot.pie(autopct='%1.1f%%')
mushroom_new.target.value_counts().plot.pie(autopct='%1.1f%%')
Fig.2. The graph on the left is the class distribution of the target variable in which the missing data is not dropped, and the graph on the right is the class distribution of the dropped version. e:edible, p:poisonous [from author’s notebook]

The distribution of each variable across the population was examined with the help of countplot. In this way, the general structure of the data set can be understood, and it can be observed whether one class is over-represented compared to the others. With the help of barplot, the relationship of each feature to the target variable was interpreted. The interact function from the ipywidgets library was used to browse these plots.

Although ‘s: sunken cap shape’ appears almost 100% associated with one target class, only 32 of the 5643 mushrooms have this cap shape. Compared to the other cap shapes, ‘f: flat cap shape’ shows the closest distribution across the data set and with respect to the target variable (Figure 3). Such situations should be taken into account, because understanding the differences between observed values can reveal an abnormal situation. When features other than cap shape were examined, more suitable features for classification were found. For this reason, this feature was dropped before the data was given to the model.

mushroom_new["target"] = mushroom_new.target.map({"p":0, "e":1})

def … _box(col):
sns. … plot(data = mushroom_new,
y= mushroom_new[col],
… ,
palette='bright')

cols = mushroom_new.columns[:-1]
interact(column_box, col=cols);
Fig.3. The graph on the left shows the distribution of the ‘cap_shape’ variable across the data set with countplot, and the graph on the right shows the relationship of the cap shape variable with the target variable with barplot. [from author’s notebook]

Since missing values were detected in the ‘stalk_root’ variable, whose distribution is visualized in Figure 4, the rows containing these values were removed from the entire data set. This increased the unbalanced tendency of the data set. The missing values could perhaps have been filled in using the mode of the existing root types in the data set, but if there are more suitable variables for classification, the analysis can continue without taking such a risk. In fact, if it is predicted that this variable would reduce the performance of the result relative to the others, it can be dropped directly as a column; in that way, the analysis continues without changing the number of mushrooms. Now let us continue examining the variables with the rows dropped. When the distribution of the two classes (universal/partial) of ‘veil_type’ is examined, all 5643 mushrooms belong to the partial veil type (Figure 5); there is no data on the universal veil type. For this reason, this variable is meaningless in terms of its effect on classification.

Fig.4. The graph on the left shows the distribution of the ‘stalk_root’ variable across the data set with countplot, and the graph on the right shows its relationship with the target variable with barplot. [from author’s notebook]
Fig.5. ‘veil_type’ is divided into u: universal and p: partial. All 5643 mushrooms belong to the partial veil type, which makes the variable meaningless in terms of its effect on classification. [from author’s notebook]

It can be said that the poisonous status of mushroom species cannot be determined only by the shape or appearance of their caps; likewise, cap color alone is not a determining feature for a mushroom to be edible. The variables chosen for classification here were ‘cap_surface’, ‘cap_color’, ‘bruises’, ‘gill_size’, ‘gill_color’, ‘stalk_shape’ and ‘habitat’. They were selected not only based on their overall distribution in the data set, but also on their impact on the target variable. Figure 6 shows the graphs of the general distribution across the data set, and Figure 7 shows the distributions with respect to the target variable. The metrics obtained by classifying with the same variables, but dropping ‘stalk_root’ as a column instead, without reducing the number of mushrooms, are given at the end.

Fig.6. Distribution of the variables selected for the classification model across the population. [from author’s notebook]
Fig.7. Distribution of the variables selected for the classification model according to the target variable. [from author’s notebook]
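Before modeling, the selected features and the target must be separated into X and y, which the train/test split below expects. A minimal sketch, assuming the dataframe and column names used above (the variable name features is illustrative):

# Features chosen in the EDA step and the binary target (0: poisonous, 1: edible)
features = ["habitat", "cap_surface", "cap_color", "bruises",
            "gill_size", "gill_color", "stalk_shape"]
X = mushroom_new[features]
y = mushroom_new["target"]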

3. Model

Logistic Regression, a parametric model, and Random Forest, a model developed on the basis of decision trees, were used as models.

3.1. Logistic Regression

Under this heading, the pre-processing of the data is also covered. First, the data set was split into train and test sets, with 20% of the data reserved as unseen during training. Using the Pipeline class of the Scikit-learn library, the encoding of the variables was written in the same block as the model; a pipeline offers a more organized and repeatable code structure. Since the variables in the data have more than two categories, OrdinalEncoder was chosen for transforming the categorical data. Logistic Regression was used as the model (Figure 8).

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ordinal-encode the categorical features; unseen categories map to -1
oren = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
trans = make_column_transformer((oren, ["habitat","cap_surface","cap_color","bruises","gill_size","gill_color",
                                        "stalk_shape"]), remainder="passthrough")

# Encoding and model combined in a single pipeline
modelLogit = Pipeline([("encoder", trans), ("lr", LogisticRegression(random_state=42))])
modelLogit.fit(X_train, y_train)
Fig.8. Model used (Logistic Regression) [from author’s notebook]

3.2. Random Forest

Here, Random Forest, a decision tree-based algorithm, was applied (Figure 9), and its result was compared with that of Logistic Regression, a parametric algorithm.

from sklearn.ensemble import RandomForestClassifier

modelForest = Pipeline([("encoder", trans), ("rf", RandomForestClassifier(random_state=42))])
modelForest.fit(X_train, y_train)
Fig.9. Model used (Random Forest) [from author’s notebook]

3.3. Random Forest (Grid Search)

In case of overfitting, the grid search method can be used to optimize the hyperparameters. The hyperparameter controlling the number of decision trees is “n_estimators”; although the model improves as this number increases, beyond a certain limit it causes overfitting. The parameter that limits the maximum depth of each tree is “max_depth”. The minimum number of samples a node must contain before it can be split is set by “min_samples_split”; with this hyperparameter, the model can be simplified and overfitting prevented. Random Forest has several such hyperparameters, and the best performance is achieved by selecting the relevant ones according to need and experimenting with their values.

params = {"rf__n_estimators":[150,155,160],
"rf__max_depth":[6,7],
"rf__min_samples_split":[2,3,4],
"rf__min_samples_leaf":[5,6,7],
"rf__max_features" :["sqrt",0.5,0.6,0.7]}

# with 5-fold cross validation
grid_search = GridSearchCV(estimator=modelForest3,
param_grid=params,
cv=5,
scoring=recall_five,
return_train_score=True,
n_jobs= -1)

grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
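Once the search has finished, the refitted pipeline can be retrieved and scored on the held-out test set. A short usage sketch; the variable name best_forest is illustrative:

from sklearn.metrics import recall_score

# best_estimator_ is the pipeline refitted on the whole training set
# with the best hyperparameter combination found by the search
best_forest = grid_search.best_estimator_
print("best CV recall:", grid_search.best_score_)
print("test recall:", recall_score(y_test, best_forest.predict(X_test)))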

4. Result

Here, it will be seen how successful the model results are according to certain metrics. The confusion matrix contains: True Positive (TP), values predicted positive that are actually positive; False Positive (FP), values predicted positive that are actually negative; True Negative (TN), values predicted negative that are actually negative; and False Negative (FN), values predicted negative that are actually positive. The metrics derived from the confusion matrix show how reliable the model’s results are, and give insight into how the model can be improved, especially if there is imbalance between the classes. They are:

— Accuracy: The ratio of correct predictions to all predictions. (TP+TN) / (TP+TN+FP+FN)

— Precision: The ratio of true positives to all cases predicted as positive. TP / (TP+FP)

— Sensitivity (Recall) or True Positive Rate: The ratio of true positives to all actually positive cases. TP / (TP+FN)

— F1 Score: The harmonic mean of precision and recall. 2 * (Precision * Recall) / (Precision + Recall)
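As a quick sanity check of these formulas, they can be computed by hand from a small confusion matrix; the counts below are purely illustrative and not from the mushroom model:

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, tn, fn = 450, 50, 400, 100

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 0.85
precision = tp / (tp + fp)                                 # 0.90
recall    = tp / (tp + fn)                                 # ~0.82
f1        = 2 * precision * recall / (precision + recall)  # ~0.86
print(accuracy, precision, recall, f1)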

4.1. Logistic Regression

When the scores on the train and test sets were compared, the values were close to each other. While the model gave good scores on the train set, it did not give bad scores on the test set, and the metrics for both were above 80% (Figure 10/Figure 11). It cannot be said that overfitting or underfitting is observed. The recall score shows what proportion of the truly edible mushrooms the model classifies as edible. The ROC (Receiver Operating Characteristic) curve and the precision-recall curve are used to see how well the classes separate from each other. Of these, the ROC curve is preferred for data sets with a balanced distribution, and the precision-recall curve for data sets with an unbalanced distribution. Our data set tends to be unbalanced, so the precision-recall curve is shown. For the test set, the AUC (Area Under the Curve) obtained from the ROC curve was 87%, while the AUC obtained from the precision-recall curve was 86% (Figure 12). When these values are between 80–90%, the model is considered a good model. However, if it is a matter of accidentally eating a poisonous mushroom, how high should this rate be to be acceptable? At this point, what is given up in return for what is gained matters, and the interpretation of the metrics may vary depending on the data set.

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report

predictedVals = modelLogit.predict(X_test)
predictedProbs = modelLogit.predict_proba(X_test)

predictedVals_train = modelLogit.predict(X_train)
predictedProbs_train = modelLogit.predict_proba(X_train)

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is the current equivalent
ConfusionMatrixDisplay.from_estimator(modelLogit, X_test, y_test)

print("TEST CM\n", confusion_matrix(y_test, predictedVals))
print("-*"*30)
print("TRAIN CM \n", confusion_matrix(y_train, predictedVals_train))
print("TEST REPORT \n", classification_report(y_test, predictedVals, digits=3))
print("-*"*30)
print("TRAIN REPORT \n", classification_report(y_train, predictedVals_train, digits=3))
Fig.10. Confusion matrix for Logistic Regression model [from author’s notebook]
Fig.11. Values of model metrics using Logistic Regression [from author’s notebook]
# scikit-plot's helper draws the precision-recall curve per class
import scikitplot as skplt

y_probas = modelLogit.predict_proba(X_test)
skplt.metrics.plot_precision_recall(y_test, y_probas)
plt.show()
Fig.12. Precision-Recall Curve chart of the model using Logistic Regression [from the author’s notebook]
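For reference, the two AUC values quoted above can be computed directly from the predicted probabilities. A minimal sketch, assuming the predictedProbs array from the block above, where column 1 holds the probability of the edible class and average precision is used as the summary of the precision-recall curve:

from sklearn.metrics import roc_auc_score, average_precision_score

print("ROC AUC:", roc_auc_score(y_test, predictedProbs[:, 1]))
print("PR AUC (average precision):", average_precision_score(y_test, predictedProbs[:, 1]))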

4.2. Random Forest

Among the metrics achieved with this model, the recall value was 99%. Since accuracy, precision and F1 score all gave 100%, and the ROC curve area gave a value of 1, there is a risk that the model’s results are too good to be trusted (Figure 13/Figure 14). In such a case, it should be checked whether there is data leakage, that is, features that show a perfect correlation with the target variable (a quick check is sketched after Figure 14).

ypred_test = modelForest.predict(X_test)
ypred_train = modelForest.predict(X_train)

print("Evaluation metrics for test dataset:")
print("Confusion matrix:")
print(confusion_matrix(y_test, ypred_test))
print("Classification report:")
print(classification_report(y_test, ypred_test))

print("Evaluation metrics for train dataset:")
print("Confusion matrix:")
print(confusion_matrix(y_train, ypred_train))
print("Classification report:")
print(classification_report(y_train, ypred_train))

y_probas = modelForest.predict_proba(X_test)
skplt.metrics.plot_precision_recall(y_test, y_probas)
plt.show()
Fig.13. Values of model metrics using Random Forest [from author’s notebook]
Fig.14. Precision-Recall Curve graph of the model using Random Forest [from the author’s notebook]
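One simple way to look for the leakage suspected above is to cross-tabulate each feature against the target and flag categories that fall almost entirely into one class. A minimal sketch, assuming the X_train and y_train split used above; the 99% threshold is chosen purely for illustration:

import pandas as pd

# Flag feature values that occur (almost) exclusively in one target class
for col in X_train.columns:
    ct = pd.crosstab(X_train[col], y_train, normalize="index")
    suspicious = ct[(ct > 0.99).any(axis=1)]
    if not suspicious.empty:
        print(col, "->", list(suspicious.index))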

4.3. Random Forest (Grid Search)

As a result, the scores decreased compared to the Random Forest model without grid search: the F1 score was 1 without grid search, and 97% with it (Figure 15). The feature importances for the tuned model can also be examined. Here it is seen that the largest contribution comes from the “stalk_shape” variable and the smallest from the “cap_surface” variable (Figure 16). The suspicion of data leakage should still not be ignored.

Fig.15. Values of Random Forest model metrics using grid search [from author’s notebook]
Fig.16. Visualization of the contributions of features to the model [from the author’s notebook]
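The chart in Figure 16 can be reproduced from the fitted pipeline. A minimal sketch, assuming the step name "rf" and the features list defined earlier; make_column_transformer keeps the encoded columns in the order they were listed, so the names line up with the importances:

import pandas as pd
import matplotlib.pyplot as plt

# Pull the tuned forest out of the pipeline and rank its importances
best_rf = grid_search.best_estimator_.named_steps["rf"]
importances = pd.Series(best_rf.feature_importances_, index=features).sort_values()
importances.plot.barh(title="Feature importances")
plt.show()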

Finally, the “stalk_root” variable was dropped as a column and the data was prepared again; in this way, all 8123 mushrooms in the data set were preserved. Since this variable was not considered important for the model, it was dropped as a column above as well, even though the missing values had already been deleted from the rows. It is a feature that was sacrificed for this data set, but this is not always the case. Sometimes the variable with missing data is very important for a successful classification. In such cases, the options are either to fill in the missing data, with methods such as mode or mean depending on whether the data is categorical or numeric, or to drop the rows at the cost of reducing the amount of data; either choice changes the entire data set. Returning to our case, let us continue with the version where the variable was dropped as a column, preserving the initial distribution of the target variable. This time, the class distribution of the target variable is balanced (Figure 17).

Fig.17. The class distribution of the target variable in the data set where the number of data is preserved while missing data are dropped. [from author’s notebook]
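A minimal sketch of this alternative preparation, where stalk_root is dropped as a column so that all 8123 rows are kept; the names mushroom_col, X2 and y2 are illustrative:

# Keep every row; remove the problematic variable instead of the rows
mushroom_col = mushroom.drop(columns=["stalk_root"])
mushroom_col["target"] = mushroom_col["target"].map({"p": 0, "e": 1})

X2 = mushroom_col[features]
y2 = mushroom_col["target"]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)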

Since the article is already long, the effect of this change on the result is shown only through the ROC curve. Logistic Regression was applied as the model, as above, and the ROC curve of the earlier model was compared side by side with that of the final version of the data. The graph on the left is the ROC curve of the model trained on the reduced data set, where the missing data was dropped from the rows. The graph on the right is the ROC curve of the model where the missing data was handled by dropping the variable’s column, without reducing the number of rows. The model on the right scored higher (Figure 18).

Fig.18. The graph on the left is the ROC curve of the model in which missing data are dropped from the rows; The graph on the right is the ROC curve of the model studied by dropping the column of the variable to which the missing data belongs. [from author’s notebook]
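The side-by-side comparison in Figure 18 can be drawn with scikit-learn’s ROC display helper. A sketch assuming a second pipeline, here called modelLogit2, fitted on the column-dropped split above:

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# modelLogit2: hypothetical second pipeline trained on X2_train/y2_train
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(modelLogit, X_test, y_test, ax=ax1)      # rows dropped
RocCurveDisplay.from_estimator(modelLogit2, X2_test, y2_test, ax=ax2)   # column dropped
plt.show()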
