What Decision Trees Tell Us About Deadly Mushrooms

Evan Quinlan · Published in Analytics Vidhya · May 18, 2020
A decision tree algorithm recursively splits a set of data samples into subsets, aiming to end with leaves of similarly classified samples.

The UCI Machine Learning Repository hosts a dataset drawn from The Audubon Society Field Guide to North American Mushrooms that describes 8,425 species of mushroom using features like cap shape, gill size, odor, and habitat. Each record is also classified as either edible or poisonous, making it a fun choice for exercises in data science:

The data describes North American mushrooms with 22 features and a toxicity classification.

A common question asked of the mushroom data is, Can we formulate easy-to-memorize rules for determining whether a mushroom is edible or poisonous? In other words, analysts seek the smallest number of features that, taken together, give the largest indication of a given mushroom’s toxicity class. (Most analysts ignore the fact that some features, like odor, gill spacing, or population, even if 100% correlated with toxicity, would still be difficult for a layperson to determine unprepared in the wild… and we’ll ignore that as well 😉.)

First of all, can any algorithm predict toxicity for certain?

Leaving the “easy-to-memorize” requirement aside for a moment, it might be useful to know whether any model trained on the data can reliably classify a mushroom from the given features. Because the data is entirely categorical, a good fit for this could be a random forest: an ensemble algorithm that combines the predictions of many decision trees.

Using scikit-learn, we can easily construct a random forest model and fit it on training data. When I do that and then run the model on held-out test data, its F1 score comes back a perfect 1.0. So I’d say that counts as a yes — an algorithm can predict the toxicity of North American mushrooms based on these 22 features!
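
Here’s a minimal sketch of that experiment. It assumes the UCI file has already been loaded into a pandas DataFrame df with a class column holding the edible/poisonous label and the 22 categorical feature columns alongside it (the column names and preprocessing here are assumptions, not a copy of the original notebook):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Assumed starting point: a DataFrame `df` with a "class" column
# ("edible"/"poisonous") plus the 22 categorical feature columns.
X = pd.get_dummies(df.drop(columns="class"))  # one-hot encode every feature
y = df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(f1_score(y_test, forest.predict(X_test), pos_label="poisonous"))
# Prints 1.0 on this dataset, matching the result described above.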

Alright, so let’s take a look at one of those decision trees to see if any clear rules can be found for classifying toxicity that a person might memorize and use later:

A decision tree generated by a scikit-learn Decision Tree Classifier.

…oh. That’s not very easy for a person to read, let alone memorize and apply in the wild. We could scour that giant decision tree for simple rules, but before we do that, let’s ditch the fully trained model and ask, is there a single feature that reasonably indicates toxicity? That would make things really simple!

What’s the #1 indicator of toxicity?

Usually, to discover which single feature best predicts the class, we might calculate the covariance of each feature-class pair and choose the one with the highest absolute value. In this case, however, the entire dataset consists of categorical — as opposed to quantitative — data, so covariance isn’t an option. We could try calculating uncertainty coefficients, which is the approach a Kaggle user took when analyzing this same dataset, but I’m going to suggest another strategy instead: exploring relationships between variables using only decision trees.

At each node, a decision tree asks of the data, which value of which feature best splits the samples into two groups that are each as internally similar as possible? In our case, the ideal split would cause the samples to fall into two “camps”: the “edible camp” or the “poisonous camp”. Let’s create a new decision tree based on all the samples (the tree printed above used only a subset of the training data), and limit it to a depth of 1, which means we only want it to make one split on the data:
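
As a rough sketch, here’s how such a depth-1 tree can be built and rendered, reusing the one-hot encoded features X and labels y from the random forest sketch above (the Graphviz step assumes the graphviz Python package is installed):

import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Fit a "stump": a tree allowed to make exactly one split, on all samples.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y)

# Export the tree to DOT format and render it with Graphviz.
dot = export_graphviz(
    stump,
    out_file=None,
    feature_names=list(X.columns),
    class_names=list(stump.classes_),
    filled=True,
)
graphviz.Source(dot).render("mushroom_depth1", format="png")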

A decision tree with a depth of 1 rendered by Graphviz.

Well that’s much easier to read! Now let’s decode the rather cryptic output in these boxes. Here’s what the top one says (reordered for clarity):

  • samples → We’re looking at 8,416 samples (the entire dataset).
  • gini → Those 8,416 samples have a Gini impurity of 0.498 (for two classes the scale runs from 0, perfectly pure, to 0.5, a perfect 50/50 split), which indicates our data is split rather evenly between the edible and poisonous classes.
  • value → 4,488 of the samples are edible, and 3,928 are poisonous.
  • class → The most common class is edible.
  • Odor_NONE ≤ 0.5 → The best way to split the samples into “camps” is by whether or not the mushroom is odorless (i.e. if you smelled it, you’d smell nothing). NOTE: This makes more sense if you know I one-hot encoded all the categorical features in the data, meaning each possible value of each categorical feature became its own Boolean feature with a value of True or False.

Most machine learning models require the data be in numerical form, and for Boolean values True = 1 and False = 0. “Odor_NONE ≤ 0.5”, therefore, can be translated as “Odor_NONE is False”. With that in mind, let’s shift our attention to the two bottom boxes.
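
To see where a column like Odor_NONE comes from, here’s a toy illustration of that encoding (the readable value names are an assumption; the raw UCI file uses single-letter codes, which the column names in the tree suggest were mapped to words before encoding):

import pandas as pd

# Three made-up rows, just to show the shape of the one-hot encoding.
toy = pd.DataFrame({"Odor": ["NONE", "FOUL", "ALMOND"]})
print(pd.get_dummies(toy).astype(int))
#    Odor_ALMOND  Odor_FOUL  Odor_NONE
# 0            0          0          1
# 1            0          1          0
# 2            1          0          0

# The split test "Odor_NONE <= 0.5" is True exactly when Odor_NONE == 0,
# i.e. when the mushroom does have some odor.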

We can see that if the statement “Odor_NONE is False” is indeed True (follow the True arrow), we get a nice, relatively small Gini impurity of 0.287, because 3,808 of the 4,608 qualifying samples are poisonous. This is our “poisonous camp”. The other box, containing samples for which the statement “Odor_NONE is False” is False (follow the False arrow) has an even smaller Gini impurity of 0.061, because 3,688 of the 3,808 qualifying samples are edible. This is our “edible camp”, and it’s pretty darn pure!
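
Those impurity numbers are easy to check by hand from the counts in the boxes, since Gini impurity is just one minus the sum of squared class proportions:

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Counts read off the depth-1 tree above (edible, poisonous).
print(round(gini([4488, 3928]), 3))  # root node        -> 0.498
print(round(gini([800, 3808]), 3))   # "poisonous camp" -> 0.287
print(round(gini([3688, 120]), 3))   # "edible camp"    -> 0.061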

So what does all this mean? Will smelling a mushroom to see if it has an odor tell us everything we need to know? The answer is: Sure, if you’re feeling lucky. In this data, 120 of the 3,808 odorless samples (about 3%) are still poisonous, so I wouldn’t recommend it as a safe method. But in a desperate situation, maybe it’s better than nothing.

So what if it’s not a desperate situation, and we want to be way more accurate?

What are some accurate but easy-to-memorize rules for predicting toxicity?

Let’s say you’re hungry in North America and you really want to eat a mushroom, but you want to know for certain it isn’t poisonous. What’s the easiest rule we can come up with for this?

One thing we could try is allowing our decision tree algorithm to make a few more splits before terminating. In other words, we can increase the tree’s depth until we find a satisfactory outcome: a leaf node with class edible and a Gini impurity of 0.
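
As a rough sketch of that search, we can grow slightly deeper trees until some leaf is both pure (Gini impurity of 0) and predicts edible. This reuses X and y from the earlier sketches and assumes the class labels are the strings "edible" and "poisonous":

from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 6):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    tree = clf.tree_
    for node in range(tree.node_count):
        is_leaf = tree.children_left[node] == -1  # leaves have no children
        if is_leaf and tree.impurity[node] == 0:
            majority = clf.classes_[tree.value[node][0].argmax()]
            if majority == "edible":
                print(f"depth {depth}: pure edible leaf holding "
                      f"{int(tree.n_node_samples[node])} samples")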

First, let’s try a depth of 2:

A decision tree with a depth of 2 rendered by Graphviz.

No luck. The leaf node in the very bottom right does have a Gini impurity of 0, but it contains only poisonous mushrooms. That will tell us what to avoid, but not what to eat!

Let’s try a depth of 3:

A decision tree with a depth of 3 rendered by Graphviz and modified with highlights.

Look at that! At depth 3, we have two rules to choose from:

  • If the mushroom isn’t odorless, has a stalk root that isn’t club-shaped, and has a non-scaly stalk surface below the ring, it’s guaranteed to be edible! There are 192 different species in that group.
  • If the mushroom isn’t odorless, has a stalk root that is club-shaped, and has no bruises, it’s also guaranteed to be edible! There are 512 different species in that group.

Notice our #1 rule was to look for odorless mushrooms, but both the simple rules we found avoid odorless mushrooms! That’s because now we’re looking only for rules that will lead to 100% success. Odorless is the best single-feature rule of thumb, but the most accurate multi-feature rules say your mushroom should have a distinctive odor (in the data, the options are foul, fishy, spicy, anise, almond, pungent, creosote, and musty).
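
If squinting at rendered images gets old, scikit-learn can also print a tree’s split logic as plain text, which makes rules like these easy to read off and double-check (again reusing X and y from the earlier sketches):

from sklearn.tree import DecisionTreeClassifier, export_text

# Refit the depth-3 tree and print its splits as indented text rules.
tree3 = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree3, feature_names=list(X.columns)))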

One other thing to remember: our two rules won’t net you all the edible mushrooms. Rather, using them, you’ll be limiting yourself to 192 + 512 = 704 species. However, at least you’ll know that what you find is edible (at least based on this data and analysis).

Anyway, we found two rules for finding edible mushrooms in North America, just by examining decision trees! I hope this exercise has illustrated what can be learned from decision trees and perhaps shed some light on how the algorithm works!

View the GitHub repo for this work.

Eat mushrooms responsibly! 💀
