TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Decision Trees and Dinosaurs

13 min read · Jan 6, 2019

Imagine that you and I are planning a trip together. This trip will take a lot of preparation, because it is a trip through time, to Earth’s distant prehistory. Our journey is a sightseeing expedition. We’re planning on seeing all kinds of wonderful creatures, and in particular the stars of the primeval bestiary, the dinosaurs. But we’ll have to be careful! This is not a trip to the zoo! The dinosaurs are seeing us as much as we are seeing them, and not all of them have kind intentions. If we are to remain unmolested by the inhabitants of this antediluvian world, we’ll have to accurately discriminate between the two primary clades of dinosaur: The harmless herbivores, which we can expect to simply stare back at us as we watch them, and the deadly carnivores, which would prefer to eat us.

This could be us

How best to achieve this determination of friend and foe? There are many kinds of dinosaur, some little-known to science, and an exhaustive study of them would be impractical. We won’t have access to power or internet in the Jurassic, and even if we could remember every detail of every species known to current science, any unknown species would be a complete mystery to us. We need something that can generalise from what we know now, to handle new situations. What we need are some simple rules that we can apply to the problem. Rather than learning every detail of dinosaur biology, we’ll create a “model” — a set of rules — which will help us more easily make a decision. With the help of a computer to crunch the numbers, we can come up with an optimal set of rules for identifying dinosaurs’ diets, allowing us to travel safely through time, without having to pack a stack of encyclopedias in our time machine.

The name of this kind of problem is “classification”, and it is one of the earliest examples of widely-adopted computer learning. Its applications include identifying spam emails in your inbox, flagging likely leads for telemarketers, and, at its most sophisticated, picking faces out of photographs.

To prevent our prehistoric adventure from ending in death at the jaws of a carnivorous reptile, we are going to construct a “two-class decision tree classifier”, an extremely simple model which, from a few numeric inputs, will tell us if any given dinosaur species is a harmless curiosity, or a deadly threat. It is “two class” because it identifies just two different dinosaur classes: Carnivorous, or not. It is a “decision tree” because the method it uses to distinguish these two classes takes the form of a branching series of decisions, like a flow chart (or a tree) that guides us to a decision about the most likely class.

The heart of this computer-aided classification is the insight that computers, aided by some clever maths, can much more accurately find patterns in a set of data than a human can. We could have an expert examine each dinosaur species, record their dietary preference, and then every time we meet a new dinosaur, leaf through our almanac until we find the entry for this species. But for species not found in the book, we are out of luck. Instead, we could have our human expert teach us a set of rules by which we could make a determination. But parsing this large amount of data, weighting each exception and permutation, is a huge task which humans often perform poorly. A computer can do this same job far more quickly, far more accurately, and for large volumes of data, at a scale that humans cannot even attempt.

To begin constructing a classifier, we must gather some facts about a set of dinosaur species, and encode them in such a way as to make them palatable for our algorithm. We’ll make a table in which each row represents a single dinosaur, and each column is a numeric value representing some information about that dinosaur. So, for example, Stegosaurus, that gentle giant of the Jurassic, and Tyrannosaurus Rex, the ferocious tyrant king of the Late Cretaceous, might be represented by the following two rows:

Table One: Data for two dinosaur species

You’ll notice that some of this data is represented in a slightly unusual way. Rather than a column called “Gait”, or similar, containing the word “bipedal” or “quadrupedal”, we have a column called “Bipedal”, containing either a one (this dinosaur is bipedal) or zero (it’s not bipedal). This encodes the category in a numeric format that the algorithm is able to interpret.
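As a rough illustration of that encoding, here is a minimal Python sketch. Only the “Bipedal” column name comes from the table above; the `gait` field, the `encode_gait` helper, and the row layout are assumptions made up for this example.

```python
# Hypothetical helper: turns a gait description into the 1/0 "Bipedal" flag.
def encode_gait(gait: str) -> int:
    """Return 1 for a bipedal dinosaur and 0 for a quadrupedal one."""
    gait = gait.strip().lower()
    if gait not in ("bipedal", "quadrupedal"):
        raise ValueError(f"unknown gait: {gait!r}")
    return 1 if gait == "bipedal" else 0


# Raw, human-readable rows (the field names are assumptions for this sketch).
raw_rows = [
    {"name": "Stegosaurus", "gait": "quadrupedal"},
    {"name": "Tyrannosaurus Rex", "gait": "bipedal"},
]

# Encoded rows: the category is now a number the algorithm can compare.
encoded_rows = [
    {"name": row["name"], "Bipedal": encode_gait(row["gait"])}
    for row in raw_rows
]

for row in encoded_rows:
    print(row["name"], row["Bipedal"])
```

The same trick works for any yes/no trait: one column per question, with a one for “yes” and a zero for “no”.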

I gathered information on 20 of the best-known dinosaur species to train our classifier. These included a mix of herbivorous and carnivorous dinosaurs, quadrupeds and bipeds, small and large, and those from both the lush jungles of the primitive Jurassic and the leafy forests of the much later Cretaceous.

On these, I let the decision tree algorithm — a series of simple steps, repeated — do its work. The algorithm is simple enough that we can follow it ourselves:

We examine each of the columns in our table of data, and find the single column, and the single value of that column, which best separates the carnivores from their leaf-eating brethren. In this case, it is the “Bipedal” column. Quadrupedal dinosaurs are, with no exceptions in our current data, strict vegetarians. This decision becomes the “trunk” of our decision tree: the first decision node from which the others branch. For our quadrupeds, the algorithm is complete, and no further decisions are required. But for the bipeds, we are left with a mixed bag of carnivores and herbivores, and we must construct additional nodes. In this case, the best split for the remaining examples is to measure the length of the subject. If your dinosaur is more than 12 meters from nose to tail, then you are in trouble: it is a carnivore, and you are standing next to it with a step ladder and a tape measure. However, the smaller bipeds are still a mix of carnivores and herbivores, and we can continue constructing nodes.
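The search for the “best” column and value can be sketched in a few lines of Python. This is only a sketch under assumptions: the essay does not say how the quality of a split was scored, so Gini impurity (a common choice) is used here, and the six rows are made-up toy values, not the twenty species in the real table.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 = pure, 0.5 = evenly mixed)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels, feature_names):
    """Try every column and every candidate value; keep the purest split."""
    best = None
    n = len(rows)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            # Impurity of the two groups, weighted by their sizes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, threshold)
    return best


# Made-up toy rows, NOT the essay's dataset: [bipedal, length_m, weight_kg]
feature_names = ["bipedal", "length_m", "weight_kg"]
rows = [
    [0, 9.0, 5000.0],
    [0, 25.0, 30000.0],
    [0, 7.0, 4000.0],
    [1, 12.5, 8000.0],
    [1, 2.0, 50.0],
    [1, 8.0, 3000.0],
]
labels = [0, 0, 0, 1, 1, 0]  # 1 = carnivore, 0 = herbivore

score, name, threshold = best_split(rows, labels, feature_names)
print(f"best first split: {name} <= {threshold}")
```

On these toy rows the search also picks the bipedal column first. Building the whole tree is a matter of repeating the same search on each group that is still mixed.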

The final decision tree is a bit like a “choose-your-own-adventure” book. To decide if a given dinosaur is safe or not, we navigate through the nodes, making a single choice at each one, until we reach a “leaf” node, which tells us the most likely class. Here’s the tree:

Figure One: Decision tree diagram

The algorithm has learned some rules about dinosaurs: all quadrupeds are herbivores, and bipeds over 12 meters long are carnivores. For bipeds under 12 meters, the lightest ones (under 800 kg) are more likely to be carnivores, and the heavier ones herbivores.
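Written out as code, these rules amount to three nested checks. The function name and return strings below are my own, and what happens to a dinosaur of exactly 12 meters or exactly 800 kg is an assumption, since the rules as stated do not cover those edge cases.

```python
def classify_v1(bipedal: bool, length_m: float, weight_kg: float) -> str:
    """Apply the first tree's rules to one dinosaur."""
    if not bipedal:
        return "herbivore"   # all quadrupeds are herbivores
    if length_m > 12:
        return "carnivore"   # very long bipeds are carnivores
    # Smaller bipeds: the lightest are more likely carnivores.
    # Assumption: exactly 12 m and exactly 800 kg fall on the herbivore side.
    return "carnivore" if weight_kg < 800 else "herbivore"


# Illustrative measurements only, not taken from the essay's table.
print(classify_v1(bipedal=False, length_m=9.0, weight_kg=5000.0))
print(classify_v1(bipedal=True, length_m=12.5, weight_kg=8000.0))
print(classify_v1(bipedal=True, length_m=3.0, weight_kg=70.0))
```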

This is all very well for our small set of twenty dinosaurs, but how well will these rules perform out in the wild? Will we have an uneventful and pleasant trip through time, or will our holiday be spoiled by the unanticipated appetites of a gigantic reptile? We can test this by finding a new group of dinosaurs to classify with these rules that we have learned, and check how many we predict correctly. Remember, one of the key qualities we’re looking for in a model is generalisability. Even if we found data on every known species of dinosaur, in the distant past we’re bound to meet all kinds of hitherto undiscovered species. We want to make sure, as best we can, that our model will work for them as well.

To test this, I made another list of slightly less common dinosaurs, and ran each of them through the decision tree model, making a prediction for them based on their gait (bipedal vs quadrupedal), their length, and their weight. The results were… not good.

The small and bird-like bipeds Deinonychus and Albertonykus were correctly classified as carnivores (though, being around the size of a turkey, the latter was unlikely to pose much of a threat). Likewise, the ankylosaurian Dracopelta was correctly assigned herbivore, on the basis of its quadrupedal gait. But the harmless Pachycephalosaurus, with its thickened skull thought to be for head-butting rival males, was classed as a carnivore, because it is under 12 meters and weighs less than 800 kg. More troublingly, three dangerous carnivores, Albertosaurus, Megalosaurus, and Yangchuanosaurus, were all classified as herbivores, which might have presented us with an unwelcome surprise.
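To put a number on “not good”, we can tally the results. The sketch below counts only the seven species named in this paragraph (the full test list may have been longer), and the variable names are my own.

```python
# (species, actual diet, diet predicted by the first tree), as described above.
results = [
    ("Deinonychus", "carnivore", "carnivore"),
    ("Albertonykus", "carnivore", "carnivore"),
    ("Dracopelta", "herbivore", "herbivore"),
    ("Pachycephalosaurus", "herbivore", "carnivore"),
    ("Albertosaurus", "carnivore", "herbivore"),
    ("Megalosaurus", "carnivore", "herbivore"),
    ("Yangchuanosaurus", "carnivore", "herbivore"),
]

correct = sum(1 for _, actual, predicted in results if actual == predicted)
accuracy = correct / len(results)

# The dangerous mistake: a carnivore we would have walked up to.
missed_carnivores = [
    name for name, actual, predicted in results
    if actual == "carnivore" and predicted == "herbivore"
]
# The harmless mistake: a herbivore we would have run away from.
false_alarms = [
    name for name, actual, predicted in results
    if actual == "herbivore" and predicted == "carnivore"
]

print(f"{correct} of {len(results)} correct ({accuracy:.0%})")
print("missed carnivores:", missed_carnivores)
print("false alarms:", false_alarms)
```

Separating the two kinds of mistake matters here: fleeing from a Pachycephalosaurus costs us a photo opportunity, while strolling up to an Albertosaurus costs rather more.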

Yangchuanosaurus. I hadn’t heard of it either.

Something has gone wrong with our model! What has happened is that the algorithm has learnt a pattern that appeared by chance in our data, but does not generalise to the wider population. While our training data had very few small but heavy carnivores, in fact these are quite common. This is a weakness of many algorithms, but especially of the decision tree, and especially with such a small set of data. When you have a small number of examples to learn from, and a large number of different ways you can split those examples up — by weight, by length, and so on, then it is very easy to find spurious rules.
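A small simulation shows how easily this happens. It is a sketch under assumptions, not a reproduction of the dinosaur analysis: every feature is pure random noise, half of the examples are labelled carnivore, and the function names, feature count, and trial count are all invented for the demonstration.

```python
import random


def perfectly_separable(values, labels):
    """True if a single threshold on `values` splits the two classes exactly."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    # A perfect threshold exists when the sorted labels switch class exactly once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes == 1


def spurious_rate(n_examples, n_features, trials=2000, seed=0):
    """Share of trials in which at least one noise feature 'explains' the labels."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_examples)]  # half carnivores, half herbivores
    hits = 0
    for _ in range(trials):
        found = any(
            perfectly_separable([rng.random() for _ in range(n_examples)], labels)
            for _ in range(n_features)
        )
        if found:
            hits += 1
    return hits / trials


small = spurious_rate(n_examples=6, n_features=5)
large = spurious_rate(n_examples=20, n_features=5)
print(f"6 examples:  a meaningless 'rule' was found in {small:.1%} of trials")
print(f"20 examples: a meaningless 'rule' was found in {large:.1%} of trials")
```

With only a handful of examples, random noise regularly produces a rule that looks perfect. With more examples that becomes far rarer, although a real tree can stack several splits and so still finds plenty of room for accidents.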

The algorithm found that quadrupeds are universally vegetarian, and it seems plausible that this rule might hold for other species, outside our dataset. But the rules about weights and lengths seem suspicious. The dividing lines that the algorithm found (a length of 12 meters and a weight of 800 kg) seem unlikely to be general rules of dinosaur evolution, and are more likely to be an accident of which species happened to be included in our data.

We have several options for attempting to fix this:

We could add a great deal more data. With more data, there’s less chance of finding accidental patterns and we’re more likely to discover real rules about dinosaurs. But getting more data can be difficult, and there’s no guarantee that we will find the rules that we want to — it’s possible that no real patterns exist in the features (weight, length, etc) we’re looking at.

We could try a different algorithm. There are lots of other ways that we can build classifying models, and many of them are much less susceptible to finding arbitrary rules than our decision tree is. But these algorithms are substantially more complex, and for the purposes of this chapter, I want to focus on this very simple approach.

The third option, and the one we will choose, is to look for new features to add to our data. That means finding out more about the dinosaurs in our set, and adding it to what we already know. Hopefully, some of this new information will be more useful in creating generalisable rules about carnivorous and herbivorous dinosaurs.

This led me into the confusing and contentious world of dinosaur evolution. There are few, if any, unassailable orthodoxies in palaeontology, and even the fundamental family tree of the dinosaurs is constantly uprooted, re-planted, grafted, and pruned.

Yet a few themes remain constant. The chief divide in the dinosaur lineage is between the Saurischians and the Ornithischians. These noble houses are divided mainly by the orientation of their hips — the “lizard hipped” Saurischians keep their hips oriented downwards and out from their bodies, while the “bird hipped” Ornithischians prefer theirs facing backwards, in an orientation similar to that of modern-day birds. Confusingly, however, it is the Saurischians who would become the ancestors of today’s birds, not the superficially more similar Ornithischians.

Usefully for our analysis, it is one branch of the Saurischians who represent the vast bulk of carnivorous dinosaurs. This branch of Saurischians, the Theropods, count among their number both the delicate and small Deinonychus, and the huge and robustly-built Tyrannosaurs. Theropods are almost exclusively meat-eaters. If we can identify the common features of the Theropods, we can vastly improve the performance of our model.

The first extra feature I added was a simple calculation of “tonnes per meter”. I reasoned that, while the rules the previous model found about weight and length might not have stood up, it’s possible that there are differences in the build of predatory dinosaurs, and that they might, in general, be either more or less heavily built than herbivores.
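The calculation itself is a single division. A minimal sketch, assuming weight is recorded in kilograms and length in meters as in the earlier rules; the function name and the example measurements are illustrative, not taken from the dataset.

```python
def tonnes_per_meter(weight_kg: float, length_m: float) -> float:
    """Weight-to-length ratio: a crude measure of how heavily built an animal is."""
    if length_m <= 0:
        raise ValueError("length must be positive")
    return (weight_kg / 1000.0) / length_m


# Illustrative measurements only: an 8,000 kg animal that is 12.5 m long.
print(tonnes_per_meter(weight_kg=8000.0, length_m=12.5))
```

A derived feature like this gives the tree a single number to split on where it previously needed two separate, and more arbitrary, cut-offs.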

I also looked at some of the more exotic traits of some dinosaurs. Stegosaurus had plate-like spines, which are thought to be either for the purpose of display or defense, or possibly regulating heat. Parasaurolophus, my favourite dinosaur, had a nasal crest that is thought to have allowed it a great honking mating call, or else it might have supported a decorative frill. I hoped that perhaps, living more often in herds, and having to defend themselves from predators, herbivores might be more likely to have these defensive or display features.

Parasaurolophus. The best dinosaur.

Lastly, and perhaps controversially, I researched which species, according to current research, are believed to have been extensively feathered. This is a list which included a surprising number of supposedly well-known dinosaurs. It also turned out to be incredibly complicated, with several points still contested.

But it was a promising avenue! Our herbivorous Ornithischians are all either scaled or covered in filamentous hairs. The lumbering sauropods are universally scaled. It is only among the carnivorous Theropods that we find vaned feathers like modern-day birds, and the “plumulaceous integument” (or “downy coat”) that is now believed may have made the terrifying Tyrannosaurus Rex and its relatives resemble something more like an enormous fluffy turkey.

With these new features encoded into my data, I re-ran the decision tree algorithm, letting it re-calculate the best way of splitting the rows of data to separate the herbivores from the carnivores.

Here’s the decision tree it generated:

Figure Two: A more effective tree

The first rule it found is the same as before — quadrupedal dinosaurs are never meat-eaters. But from there, it is very different. All of the feathered dinosaurs are carnivores, and of those un-feathered species that remain, we can neatly divide the herbivores from the carnivores by looking at their ratio of weight to length.
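The shape of this second tree can be sketched as code too, with one important caveat: the text does not give the cut-off on the weight-to-length ratio, nor say which side of it the carnivores fall on. In the sketch below that last decision is therefore passed in by the caller, and `demo_build_rule` is an invented placeholder, not the rule the algorithm actually found.

```python
from typing import Callable


def classify_v2(
    bipedal: bool,
    feathered: bool,
    tonnes_per_meter: float,
    build_rule: Callable[[float], str],
) -> str:
    """Apply the second tree: gait first, then feathers, then build."""
    if not bipedal:
        return "herbivore"   # quadrupeds are never meat-eaters
    if feathered:
        return "carnivore"   # every feathered species in the data is a carnivore
    # Un-feathered bipeds: decided by the weight-to-length ratio.
    return build_rule(tonnes_per_meter)


# PLACEHOLDER ONLY: both the 0.2 cut-off and its direction are invented here.
def demo_build_rule(ratio: float) -> str:
    return "carnivore" if ratio > 0.2 else "herbivore"


print(classify_v2(False, False, 0.5, demo_build_rule))
print(classify_v2(True, True, 0.05, demo_build_rule))
print(classify_v2(True, False, 0.1, demo_build_rule))
```

Note that the first two checks never consult the ratio at all, which is part of why this tree is easier to trust: two of its three rules rest on traits tied to the dinosaur family tree rather than on an arbitrary number.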

Checking these new rules against our test set, we now achieve perfect accuracy, with the harmless Dracopelta and Pachycephalosaurus classified as herbivores, while the dangerous Albertosaurus, Megalosaurus, and Yangchuanosaurus are all flagged as carnivores. Armed with this new classification system, we can traverse prehistory with confidence. We can sketch this flow chart on a sheet of paper, pack a measuring tape and a sturdy set of scales, fire up the time machine, and set off.

But is it really so simple? We might, having tested our model and achieved perfect accuracy, assume that we will continue to achieve similar results in practice. This is a dangerous assumption.

Any classification model is only as good as the data that it was trained on, and the assumptions that were made about that data. There are countless biases, blind-spots, and oversights implicit in any model, and this one is no exception.

For example, if during an impromptu prehistoric swimming expedition we encounter a Plesiosaurus, we will struggle to classify it with our model. Is it bipedal or not? It likely evolved from quadrupedal, crocodilian ancestors in the pre-Jurassic Permian or Triassic periods, but in the Jurassic, it swims with four great flippers. Our model, trained on terrestrial species, is silent on how to classify aquatic reptiles.

Similarly, our model assumes dinosaurs from the Cretaceous or Jurassic periods. The much earlier Permian was home to dinosaur-like synapsids (in fact more closely related to mammals), such as the sail-backed Dimetrodon. Dimetrodon was carnivorous and, like most of the apex predators of the Permian, quadrupedal. A traveller to the Permian will be badly misled by our model.

More insidious are the biases in our model that are invisible to us because of the cultural or historical context in which the model is created. Our model is trained on some of the best-known dinosaur species, and so on well-known species, it performs well. But which species are known well and which are not is a product of historical accidents. The native dinosaurs of countries in which the mainstream of palaeontology developed will almost certainly be better-known. Those first few species to be unearthed, identified, and publicised have inevitably captured public awareness in a way that more recent discoveries have not.

Our model tells us that feathered bipeds are invariably carnivores, and this is largely true of the species found in the United States and Western Europe. But recent research has cast this neat heuristic into disarray. These discoveries include Siberian dinosaurs such as the feathered herbivore Jianianhualong tengi, or Kulindadromeus, an Ornithischian with a coating of feather-like structures previously thought the exclusive province of Theropods. Chinese species, like the bizarre long-clawed and down-coated Theropod herbivore Therizinosaurus, are further confusing the picture. Had we trained our model on the dinosaur species that are common to China and Siberia, we might have created quite a different model, and Chinese time-travellers who wish to visit the prehistory of their homeland will be ill-served by our current algorithm.

A friendly Therizinosaurus, a feathered herbivore

Most modern implementations of classification algorithms are vastly more complex than the simple model we have created here. They involve a much larger number of features, much more sophisticated parsing of the patterns in that data, and much, much larger datasets to learn from. Many of these models are extremely accurate, and their classifications, which are based on rules derived algorithmically from the data, provide innumerable advantages over human-driven decision-making. They are faster, more reliable and consistent, and are frequently able to perceive patterns that even the most observant humans are unable to detect.

But the basic concepts that underpin our simple dinosaur model also apply to the most complex classifiers. The algorithm discovers patterns in the data it is given, and uses those patterns to establish rules about how to classify new examples. From the initial data, the model makes a generalisation about the wider population, and that generalisation is how it makes its predictions. These rules and generalisations can be very complex, and very accurate, but they are only ever as good as the data from which they were originally created, and the assumptions that were made about that data. These complex models are vulnerable to the same biases and limitations as our very simple model. As products of human minds, trained on human data, they are subject to human fallibility. On our trip back through time, we’ll leave behind the modern day world and its failings. But in a sense, we will also bring them with us, embedded in the assumptions we make.

Code for this essay can be found here.

Part Two, “Linear Regression and Lines of Succession” is available here.

Published in TDS Archive
