DATA STORIES | MUSHROOM CLASSIFICATION | KNIME ANALYTICS PLATFORM

Using Data Science to Identify Features Most Indicative of Edible or Poisonous Mushrooms

From data preparation to model training, testing and evaluation with just a few clicks

James Phillip Sanders
Low Code for Data Science

--

POISONOUS MUSHROOMS & FUNGI DISCLAIMER. IMPORTANT — PLEASE READ THIS FIRST!

Many mushrooms and fungi are poisonous, and some are deadly. By viewing this post you agree that its author accepts no liability for any injury or death occurring as a result of ingesting and/or exposure to any mushroom or fungus, whether mistakenly believed to be edible or as the result of an unforeseeable reaction or allergy to any mushrooms or fungi.

This post is for educational purposes only!

The toxic mushroom Amanita muscaria, the most easily recognised “toadstool”.

Introduction

The standard for the name “mushroom” is the cultivated white button mushroom, Agaricus bisporus; hence the word “mushroom” is most often applied to those fungi (Basidiomycota, Agaricomycetes) that have a stem (stipe), a cap (pileus), and gills (lamellae, sing. lamella) on the underside of the cap. “Mushroom” also describes a variety of other gilled fungi, with or without stems, and the term is also used for the fleshy fruiting bodies of some Ascomycota (Wikipedia, 2021).

Agaricus is a genus of mushrooms containing both edible and poisonous species, with possibly over 300 members worldwide. Members of Agaricus are characterized by having a fleshy cap or pileus, from the underside of which grow several radiating plates or gills on which are produced the naked spores (Wikipedia, 2021).

Lepiota is a genus of gilled mushrooms in the family Agaricaceae. Around 400 species of Lepiota are currently recognized worldwide. Many species are poisonous, some lethally so. All Lepiota species are ground-dwelling saprotrophs with a preference for rich, calcareous soils. Basidiocarps (fruit bodies) are agaricoid with whitish spores, typically with scaly caps and a ring on the stipe (Wikipedia, 2021).

While modern identification of mushrooms is quickly becoming molecular, the standard methods for identification are still used by most and have developed into a fine art harking back to medieval times and the Victorian era, combined with microscopic examination. The presence of juices upon breaking, bruising reactions, odors, tastes, shades of color, habitat, habit, and season are all considered by both amateur and professional mycologists (Wikipedia, 2021).

Inspiration

✓ What types of machine learning models perform best on this dataset?
✓ Which features are most indicative of a poisonous mushroom?

Goal

The goal is to build a predictive model that determines whether a given mushroom is edible. Such a model can help users avoid accidental poisoning caused by a lack of knowledge in identifying mushroom types.

Acknowledgements

This dataset was originally donated to the UCI Machine Learning Repository, where you can learn more about past research using the data.

1. Data Preparation

1.1 Missing Values & Treatments

After the initial reading of the CSV file into KNIME, I performed exploratory data analysis and found that stalk-root contains 2480 missing values, which equates to 30.53% of the records. Considering the various options for handling missing values, I narrowed them down to these four:

  1. Remove the records containing missing values, an option when the dataset is very large and such records make up only a small percentage of it.
  2. Ignore the variable if insignificant.
  3. Treat missing data as just another category.
  4. Build a model to predict missing values.

As this percentage is far too large to simply remove the affected records from the dataset, which could significantly reduce the quality of the final prediction model due to reduced input data, I have chosen option 4: build a model to predict the missing values. The algorithm I selected to model this data is Gradient Boosted Trees, which was very straightforward to implement in KNIME.

First, I split off all the records containing missing values; I will run these through the prediction model later, once it has been trained and tested. Next, I ran the output table through an Equal Size Sampling node so that the values in the column containing the missing data are equally distributed. This helps prevent the model from learning a bias when the class distribution is unequal. From here, I partitioned the dataset into a training set and a test set at a ratio of 70:30. The test set is shuffled, and the answer (target) column is split off from the remaining columns.

This will later be used to evaluate the performance of the model. I then shuffled the training set and ran it through the training algorithm to create the learned model. Now I can test the model with the test set and check its performance using the Scorer node against the answers I separated previously. This produces 230 correct classifications and 1 wrong classification, as you can see in the confusion matrix below.
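For readers who prefer code to nodes, here is a minimal sketch of the same imputation flow in Python with pandas and scikit-learn, as an illustration under assumptions rather than the article's actual workflow: the file name "mushrooms.csv" is hypothetical, '?' is assumed to mark the missing stalk-root values as in the UCI distribution, and the balancing step stands in for KNIME's Equal Size Sampling node.

```python
# A minimal sketch of the stalk-root imputation flow, assuming pandas and
# scikit-learn in place of the KNIME nodes. "mushrooms.csv" is a
# hypothetical file name; '?' marks missing stalk-root values as in the
# UCI distribution.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

df = pd.read_csv("mushrooms.csv")
known = df[df["stalk-root"] != "?"]
unknown = df[df["stalk-root"] == "?"]  # split off, imputed at the end

# Stand-in for Equal Size Sampling: downsample every stalk-root class to
# the size of the smallest class to avoid biased learning
n_min = known["stalk-root"].value_counts().min()
balanced = (known.groupby("stalk-root", group_keys=False)
                 .apply(lambda g: g.sample(n_min, random_state=1)))

X = pd.get_dummies(balanced.drop(columns=["stalk-root"]))
y = balanced["stalk-root"]

# 70:30 shuffled split, as with the Partitioning node
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=1)

model = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))  # cf. Figure 1

# Impute the records that were split off at the start
X_unk = (pd.get_dummies(unknown.drop(columns=["stalk-root"]))
           .reindex(columns=X.columns, fill_value=0))
df.loc[unknown.index, "stalk-root"] = model.predict(X_unk)
```

Because the imputed stalk-root values feed back into the main dataset, any bias in this helper model propagates into the downstream classifiers, which is why the balanced sampling step matters.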

Figure 1. Gradient Boosted Trees Model Confusion Matrix.

1.2 Type Identification & Conversion

Table 1. Available Data Brief.

Using exploratory data analysis, I was able to formulate the table above, in which I have summarized the available data. Using this insight, I performed some preprocessing operations: adjusting the values of class from ‘p’ to “Poisonous” and ‘e’ to “Edible”, which improves readability further into the analysis, and adding colors for these two values. I also converted the ring-number values to discrete numerical values using the Rule Engine node; this attribute could have stayed categorical, but I converted it for consistency.
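As a rough Python equivalent of these two operations (an illustration, not the article's actual Rule Engine expressions; in the UCI coding, ring-number is n = none, o = one, t = two):

```python
# A sketch of the 1.2 preprocessing, assuming pandas in place of KNIME's
# Rule Engine node. In the UCI coding, ring-number is n = none, o = one,
# t = two; class is p = poisonous, e = edible.
import pandas as pd

df = pd.read_csv("mushrooms.csv")  # hypothetical file name
df["class"] = df["class"].map({"p": "Poisonous", "e": "Edible"})
df["ring-number"] = df["ring-number"].map({"n": 0, "o": 1, "t": 2})
```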

1.3 Attribute Selection & Reasoning

In addition to the above, I found that veil-type contains only one unique value, “partial”, for all 8124 records, so I omitted this attribute from the modelling dataset, as it will have very low impact on the outcome of the prediction model. I performed this operation using a Column Filter node in KNIME. The remaining low-impact attributes I will keep in the dataset, as they still have significance in relation to other attributes.
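A one-line equivalent of this Column Filter step, again as a sketch assuming pandas:

```python
# A sketch of the Column Filter step, assuming pandas: drop any attribute
# with a single unique value. In this dataset that is only veil-type,
# which is "partial" for all 8124 records.
import pandas as pd

df = pd.read_csv("mushrooms.csv")  # hypothetical file name
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)  # removes veil-type only
```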

1.4 Shuffling & Partitioning

Figure 2. Using KNIME to split dataset into smaller sets.

1.4.1 Procedure Explanation

The depiction above shows three subsets of the initial dataset, all of which are of equal size. Each of these three data subsets has also been split again into training and test subsets, with the answers (the target attribute identified previously) removed from the test subset.

How I split the data into these several subsets is relatively straightforward. First, I shuffled the data using the Shuffle node before partitioning, since in this case all Partitioning nodes use the linear sampling method. The shuffling could also come after the partitioning, but in that case I would recommend random sampling rather than linear sampling. Next, partitioning the data into six equal subsets may at first seem tricky, but the math is simple: each subset must hold 1/2 × 1/3 = 1/6 of the rows. The first Partitioning node therefore takes 50%, each second-level node takes 33.33% of its half (leaving 66.66%), and the final nodes split that remainder 50:50, which results in six evenly sized subsets. I have confirmed this through the statistics of each subset. The even distribution of rows was achieved by shuffling the data first and then partitioning with all nodes set to linear sampling, as described.
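The same shuffle-then-linear-split idea is easy to sketch outside KNIME; the snippet below, assuming pandas and NumPy, shuffles once and then takes six consecutive slices, each holding 1/6 of the 8124 rows (1354 each):

```python
# A sketch of the shuffle-then-split procedure, assuming pandas and NumPy:
# shuffle once, then take six consecutive (linear) slices. Each slice holds
# 1/2 x 1/3 = 1/6 of the 8124 rows, i.e. 1354 rows.
import numpy as np
import pandas as pd

df = pd.read_csv("mushrooms.csv")  # hypothetical file name
shuffled = df.sample(frac=1, random_state=1).reset_index(drop=True)
subsets = np.array_split(shuffled, 6)  # six equal linear slices
for i, part in enumerate(subsets, start=1):
    print(f"subset {i}: {len(part)} rows")
```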

2. Predictive Model

2.1 Selection

The initial dataset has been split into three equal subsets in order to compare three prediction models and find the best-performing model for the desired outcome. I will be comparing three classification algorithms:

  1. Random Forest
  2. Ensemble Learning with Gradient Boosted Trees
  3. Decision Tree

I have selected these three algorithms as they are specifically suited to classification and do not require the data to be converted from its current data types. After running the test data through each trained model, I can compare the accuracy of the models by analyzing the output statistics.

2.2 Train, Test & Evaluate

Using KNIME, I have run each of the three data subsets through a selected prediction algorithm to predict the outcome of the target attribute. To do this, I first shuffled the training subset of the data subset in question and ran it through the algorithm's learner node using the standard settings. The resulting model is then passed into the algorithm's predictor node, which uses it to predict the target attribute for the input data, namely the test subset. The output from the predictor node is then joined with the answers subset that was previously removed from the test subset, so that the answers can be cross-analyzed against the predicted outcomes.
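A condensed sketch of this train/predict/join loop, using scikit-learn classifiers as stand-ins for the corresponding KNIME learner and predictor nodes (the file name and prepared data are assumptions):

```python
# A condensed sketch of section 2, assuming scikit-learn classifiers as
# stand-ins for the KNIME learner/predictor nodes and a prepared CSV.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("mushrooms_prepared.csv")  # hypothetical prepared data
subsets = np.array_split(df.sample(frac=1, random_state=1), 3)  # one per model

models = {
    "Random Forest": RandomForestClassifier(random_state=1),
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

for (name, model), part in zip(models.items(), subsets):
    X = pd.get_dummies(part.drop(columns=["class"]))  # one-hot encode attributes
    y = part["class"]
    # 70:30 shuffled split; y_te plays the role of the separated answers
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, shuffle=True, random_state=1)
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```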

Back in KNIME, the final output dataset can now be analyzed to evaluate the performance of the prediction algorithm in question. I have applied three nodes to this final dataset to help inspect the metrics of the cross-verification; these are detailed below, grouped by the algorithm used.

Scorer

The Scorer node statistics obtained from running the algorithm in question on the test data subset show several key metrics that help evaluate its prediction performance, specifically the confusion matrix and the class statistics. Here we can see all the true positives, false positives, true negatives, and false negatives, along with precision and the overall statistics.
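A rough equivalent of the Scorer node's output can be produced with scikit-learn; y_true and y_pred below are toy stand-ins for the separated answers and the predictor node's output:

```python
# A rough equivalent of the Scorer node's confusion matrix and class
# statistics, assuming scikit-learn. y_true/y_pred are toy stand-ins for
# the separated answers and the predictor node's output.
from sklearn.metrics import confusion_matrix, classification_report

y_true = ["Edible", "Poisonous", "Edible", "Edible", "Poisonous"]
y_pred = ["Edible", "Poisonous", "Poisonous", "Edible", "Poisonous"]

print(confusion_matrix(y_true, y_pred, labels=["Edible", "Poisonous"]))
print(classification_report(y_true, y_pred))
```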

ROC Curve

The ROC Curve plots the true positive rate against the false positive rate. The positive class value is used to check the predicted class probabilities. To create a ROC Curve for a model, the input table is first sorted by the predicted probability of the positive class, so that the rows the model is most certain belong to the positive class come first. The sorted rows are then checked to see whether the real class value is the positive class: if so, the ROC Curve goes up one step; if not, it goes one step to the right. Ideally, all positive rows are sorted to the front, so the line goes up to 100% first and then straight to the right. As a rule of thumb, the greater the area under the curve, the better the model (KNIME, 2021).
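This stepwise construction is exactly what scikit-learn's roc_curve computes; a small self-contained sketch with toy data:

```python
# The stepwise ROC construction described above is what scikit-learn's
# roc_curve computes: sort by the positive-class probability, then step up
# for each true positive and right for each false positive.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = positive class (toy data)
p_pos = np.array([0.90, 0.80, 0.70, 0.65, 0.30, 0.20, 0.85, 0.40])

fpr, tpr, thresholds = roc_curve(y_true, p_pos)
print("AUC:", auc(fpr, tpr))  # closer to 1.0 means a better model
```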

2.3 Random Forest

Figure 3. Random Forest Scorer Statistics & Random Forest ROC Curve.
Figure 4. Random Forest Binary Classification Inspector.

2.4 Gradient Boosted Trees

Figure 5. Gradient Boosted Trees Scorer Statistics & Gradient Boosted Trees ROC Curve.
Figure 6. Gradient Boosted Trees Binary Classification Inspector.

2.5 Decision Tree

Figure 7. Decision Tree Scorer Statistics & Decision Tree ROC Curve.
Figure 8. Decision Tree Binary Classification Inspector.

Note. Besides the implementation of a correct data science pipeline and proper model training, the very high accuracy achieved by the models above is very likely also attributable to the quality and informativeness of the features in the dataset.

Conclusion

What types of machine learning models perform best on this dataset?

Classification models are clearly the best-suited method for prediction on this dataset. From the model performance statistics above, we can see there were some Type I errors on the positive class. Overall, however, the results show the models to be quite accurate in terms of precision. One thing I would like to highlight is the conversion of the ring-number attribute to an integer: this datatype conversion could be eliminated, since keeping the original type would produce the same output.

Using the Apriori algorithm, I was able to find the features most associated with edible and poisonous mushrooms. Filtering the results to a minimum support of 35% (2843 instances) and a minimum confidence of 85% resulted in the rules below (a code sketch of this mining step follows the two lists):

Edible Mushrooms

  1. odor=none, ring-number=one ==> class=Edible
  2. odor=none, gill-size=broad ==> class=Edible
  3. odor=none, gill-attachment=free, gill-size=broad ==> class=Edible
  4. odor=none, gill-size=broad, veil-color=white ==> class=Edible
  5. odor=none, gill-attachment=free, gill-size=broad, veil-color=white ==> class=Edible

Poisonous Mushrooms

  1. bruises=false, gill-attachment=free, gill-spacing=close, ring-number=one ==> class=Poisonous
  2. bruises=false, gill-spacing=close, veil-color=white, ring-number=one ==> class=Poisonous
  3. bruises=false, gill-attachment=free, gill-spacing=close, veil-color=white, ring-number=one ==> class=Poisonous
  4. bruises=false, gill-spacing=close, veil-color=white ==> class=Poisonous
  5. bruises=false, gill-attachment=free, gill-spacing=close ==> class=Poisonous
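For reference, here is a hedged sketch of this mining step using the mlxtend library (an assumption for illustration; KNIME offers comparable item-set and rule-mining nodes), one-hot encoding each attribute=value pair as an item:

```python
# A sketch of the rule-mining step, assuming the mlxtend library. Every
# attribute=value pair is one-hot encoded into an item before running
# Apriori; thresholds match those stated above (35% support, 85% confidence).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("mushrooms_prepared.csv")  # hypothetical prepared data
items = pd.get_dummies(df.astype(str), prefix_sep="=").astype(bool)

frequent = apriori(items, min_support=0.35, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.85)

# Keep only rules whose consequent is the class attribute
rules = rules[rules["consequents"].apply(
    lambda cons: any(item.startswith("class=") for item in cons))]
print(rules[["antecedents", "consequents", "support", "confidence"]])
```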

References

KNIME. (2021, May 04). ROC Curve. Retrieved from KNIME Hub: https://hub.knime.com/knime/extensions/org.knime.features.js.views/latest/org.knime.js.base.node.viz.plotter.roc.ROCCurveNodeFactory

Wikipedia. (2021, Apr 22). Agaricus. Retrieved from Wikipedia.org: https://en.wikipedia.org/wiki/Agaricus

Wikipedia. (2021, Apr 22). Lepiota. Retrieved from Wikipedia.org: https://en.wikipedia.org/wiki/Lepiota

Wikipedia. (2021, Apr 22). Mushroom. Retrieved from Wikipedia.org: https://en.wikipedia.org/wiki/Mushroom

--

James Phillip Sanders
Low Code for Data Science

My passion and expertise are in the domains of Data Science, Artificial Intelligence, Machine Learning, and Deep Learning. http://linkedin.com/in/jmspsndrs/