Predicting tornado magnitude with Machine Learning

Yves Jacquot
27 min read · Oct 10, 2017

Scope

Rating the intensity of tornadoes is a complex task, mainly because they are usually not predictable, and when they are, it is only a few minutes before they occur. Measuring their wind speed requires observers and Doppler radars on site, which is therefore not possible in most cases.

The Enhanced Fujita (EF) Scale was implemented in 2007 and is a revised version of the original Fujita scale, which was first introduced by Tetsuya Theodore Fujita in 1971. They are both based on the damage the tornadoes cause. The new scale expands the degrees of damage and better accounts for variables such as differences in construction quality. It contains six categories from EF0 to EF5 (weakest to strongest):

Enhanced Fujita Scale definitions

The main goal of my study was to try and predict the EF scale of tornadoes based on their characteristics and effects, such as location, date, time, source, damage costs, number of casualties, etc. This was achieved by implementing different Machine Learning models and techniques, which we will look at in more detail in the following sections.

For the complete set of Jupyter notebooks with Python code and detailed Markdowns of my process through this project, please refer to my GitHub page.

Data origin

I first came across the Severe Weather Data through Climate Mirror, a distributed effort conducted by volunteers and institutions to mirror and back up U.S. Federal Climate Data and to store sensitive climate change data. The Storm Events database is public domain. It contains information about U.S. storms as entered by the National Weather Service (NWS) since 1950. The National Centers for Environmental Information (NCEI) then perform data reformatting and standardization of event types.

Web Scraping

NCEI made three relational tables accessible, called “details”, “fatalities” and “locations”, provided as zipped .csv files for each year.

To retrieve those files in an automatic and repeatable way, I used web scraping techniques, in particular the Python libraries requests for efficient HTTP queries and BeautifulSoup for easy information extraction.

I extracted the events corresponding to tornadoes only: 65,268 records from January 1950 until May 2017. The next code excerpt was taken from the part 2 Jupyter notebook on my GitHub.

Web Scraping using Python’s libraries requests and BeautifulSoup
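As a rough sketch of that step (not the exact notebook code), the scraping could look like the following; the directory URL and the file-name filter are assumptions based on the NCEI file naming:

```python
import requests
from bs4 import BeautifulSoup

# Directory listing of the zipped yearly CSV files (URL is illustrative)
BASE_URL = "https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"

response = requests.get(BASE_URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Keep only the links to the yearly "details" archives
detail_links = [
    a["href"] for a in soup.find_all("a", href=True)
    if "StormEvents_details" in a["href"] and a["href"].endswith(".csv.gz")
]

# Download each archive to the working directory
for link in detail_links:
    file_name = link.split("/")[-1]
    with open(file_name, "wb") as f:
        f.write(requests.get(BASE_URL + file_name).content)
```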

Data storing in PostgreSQL

For easy access to the data, I saved them into a local PostgreSQL database using Python's SQLAlchemy toolkit. Below is an example of an SQL query to examine the occurrences of EF5 tornadoes:

SQL query of my local PostgreSQL database, showing EF5 and F5 tornadoes ordered from most to least recent
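Roughly, the storing and querying steps could look like the sketch below; the connection string, table name and column names are assumptions rather than the exact ones from the project:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical local connection string and table name
engine = create_engine("postgresql://localhost:5432/tornadoes")

# df holds the tornado records gathered during the scraping step
df = pd.read_csv("tornado_details.csv")  # hypothetical file name
df.to_sql("details", engine, if_exists="replace", index=False)

# Example query: EF5 and F5 tornadoes, from most to least recent
query = """
    SELECT begin_date_time, state, tor_f_scale
    FROM details
    WHERE tor_f_scale IN ('EF5', 'F5')
    ORDER BY begin_date_time DESC;
"""
ef5_tornadoes = pd.read_sql(query, engine)
```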

Interesting findings emerge from that simple query: the most destructive tornadoes are not common. The last one occurred in 2013, and there was a gap of 8 years without any between 1999 and 2007. 2011 was an exception, as there was a tornado outbreak that included eleven EF5 ones, most likely a consequence of the “La Niña” climatic phenomenon.

Tableau visualisation of U.S. tornadoes geographical distribution

Many insightful visualisations can be made from this impressive tornado database, and the geographical distribution is one of them. The latitude and longitude coordinates of the beginning and end point of each tornado are among the many features of the dataset. Using Python code, the latitude and longitude of the midpoint of each tornado's path were computed. The exact midpoint locations of all tornadoes from 1950 until May 2017 could then be plotted on a single map using Tableau. Colors, from dark blue to dark red, represent the EF scale from 0 to 5 respectively. It is worth noting that the scale “EFU” is entered in the database when the magnitude is unidentified, i.e. when not enough information about the damage was reported to the NWS.
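As an illustration of that computation, here is a minimal sketch, assuming Storm Events column names (BEGIN_LAT, END_LAT, etc.) and a simple coordinate average, which is a reasonable approximation for the short paths involved:

```python
import pandas as pd

# df is assumed to hold the tornado records scraped earlier
df = pd.read_csv("tornado_details.csv")  # hypothetical file name

# Midpoint of the straight path, taken as the average of the end points
df["mid_lat"] = (df["BEGIN_LAT"] + df["END_LAT"]) / 2
df["mid_lon"] = (df["BEGIN_LON"] + df["END_LON"]) / 2
```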

There is a much higher density of tornadoes to the east of the Rocky Mountains. Tornadoes are more common in the Great Plains, and more precisely in what is called the “tornado alley”. Tornadoes can only form in very special weather conditions, when cold, dry air from Canada collides with warm, moist air from the Gulf of Mexico.

Tableau geographical visualisation of EF5 tornadoes, mostly occurring in the “tornado alley” region

The above map shows where the strongest tornadoes have occurred since 1950. EF5 tornadoes are clearly rare events, and most of them take place in the so-called tornado alley. What is also striking here are the straight lines formed by some of the locations. Each line actually corresponds to a single severe storm episode, which can consecutively generate several strong tornadoes as it moves. The south-west to north-east direction of the lines indicates that most storms seem to follow the same flow pattern. We will confirm that with a statistical approach in a later section.

Tableau allows other insightful visualisations, like the following map where tornado locations are plotted with their intensity as the color and the number of deaths they caused as the point size.

Tableau geographical visualisation of EF3 to EF5 tornadoes with number of deaths

As expected, in general, the stronger the tornado, the higher the number of fatalities it causes. The deadliest and costliest tornado of the last half century took place on May 22, 2011, in Joplin, Jasper County, Missouri. The brownish color in the next aerial photograph shows the path the tornado took over the city and the extent of the damage it caused. 161 people died, 1,150 were injured and the estimated damage cost was $2.8 billion.

Aerial view after the Joplin, Missouri, EF5 tornado (May 22, 2011)

Multinomial target for Supervised Learning Classification

As the EF scale was introduced in February 2007, a subset of the database from that date onwards was extracted for our study, resulting in a total of 14,538 tornado records. The following bar plot, obtained with the Seaborn package, shows the distribution of tornado magnitudes.

Seaborn bar plot of the tornado magnitude distribution

There is a strong class imbalance. Most tornadoes are weak ones. As we had noticed with our Tableau maps, EF5 tornadoes are rare.

I decided to keep the unidentified EFU tornadoes as a separate class, to try and predict which class they are actually most likely to belong to.

Accuracy is the main metric used to validate a classification model. We will see later that it is not sufficient in the case of strong class imbalance, but it is still a good starting point for modelling. The baseline accuracy, to which the accuracy of all future models will be compared, is the probability of having an EF0 tornado (the majority class) without any prior knowledge. It is equal to the number of EF0 tornadoes divided by the total number of tornadoes, which is 0.527.
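As a quick sanity check, this baseline can be computed directly from the magnitude column (the column name below is an assumption):

```python
# df is the post-2007 subset of tornado records loaded earlier
baseline_accuracy = (df["TOR_F_SCALE"] == "EF0").mean()
print(round(baseline_accuracy, 3))  # ~ 0.527
```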

Feature engineering

The database is very complete and was well maintained by the NCEI. However, preprocessing steps were needed for the majority of features, either to impute missing or NaN values, or to change their format to a more usable one for modelling (for instance the date and time features), or to construct new meaningful features from existing ones (for example the azimuth of the tornado’s path).

Azimuth

The azimuth measures the angle of a line relative to geographic North; in other words, it represents the direction of the line. The heading of the tornado's straight path is not part of the original features. Nevertheless, I wanted to include it in my predictor matrix, as I thought it would be interesting to evaluate the influence the direction of the local main air flow has on producing tornadoes. The azimuth, in radians clockwise from North, can be obtained with trigonometry from the latitude and longitude of the beginning and end point of the tornado (respectively named A and B):

This new variable was derived with the pandas and numpy packages, and the latitude and longitude features were later dropped. As some tornadoes last very little time and are therefore almost stationary, it was not rare for both the latitude and longitude differences in the formula to equal zero, leading to division-by-zero errors. I imputed those missing values with the median (64.1 degrees), which seemed to represent the azimuthal distribution better than the mean.
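A minimal sketch of this derivation, using a flat-Earth approximation (arctan2 of the eastward over the northward displacement) and the median imputation described above; the column names are assumptions:

```python
import numpy as np

# Displacements between end point B and beginning point A
d_lat = np.radians(df["END_LAT"] - df["BEGIN_LAT"])
d_lon = np.radians(df["END_LON"] - df["BEGIN_LON"])

# Scale the longitude difference by cos(latitude) so east-west distances are
# comparable to north-south ones, then take the angle clockwise from North
mean_lat = np.radians((df["BEGIN_LAT"] + df["END_LAT"]) / 2)
azimuth = np.degrees(np.arctan2(d_lon * np.cos(mean_lat), d_lat)) % 360

# Stationary tornadoes (zero displacement) have no defined direction:
# mask them out and impute with the median azimuth
azimuth = azimuth.where((d_lat != 0) | (d_lon != 0))
df["azimuth"] = azimuth.fillna(azimuth.median())
```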

Windrose plot of the azimuth distribution of tornadoes’ straight path.

This plot was obtained with Python's windrose package. The length of the triangles represents the number of tornadoes in the corresponding direction range. The colors from dark blue to dark red correspond to the magnitude.

Interesting fact: the majority of tornadoes head East-NorthEast. This is a known fact, which is confirmed by the Storm Prediction Center:
“Most move from southwest to northeast, or west to east.”

The magnitudes seem distributed evenly across the different directions. This means the azimuth may not bring much to our predictions of the EF scale.

Other Exploratory Data Analysis (EDA) plots are available in the part 3 Jupyter notebook in my Github.

Pre-processing class

In order to make the gathering and processing of new data repeatable and automatic, I built a class containing all the preprocessing functions:

Atom view of the preprocessing class

In particular, the main goal of the _compute_duration function was to obtain the duration of each tornado in minutes from its beginning and end date and time. Additionally, the function outputs the year day and the midpoint occurrence time. The main challenge here was working with datetime formats and their corresponding methods. Another challenge was finding a way, for the year day calculation, to avoid a day shift between leap and non-leap years. This was achieved using the calendar library:

Python function used to transform datetime objects into meaningful features for the modelling
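A hedged re-creation of that logic is sketched below; the exact normalisation used in the notebook may differ, in particular the way the leap-year correction is applied:

```python
import calendar
import pandas as pd

def compute_duration(begin, end):
    """Return duration (minutes), normalised year day and midpoint time
    from two pandas Timestamps."""
    duration = (end - begin).total_seconds() / 60.0

    # Scale the day of year by the actual year length so that leap and
    # non-leap years stay aligned (assumed normalisation)
    days_in_year = 366 if calendar.isleap(begin.year) else 365
    year_day = begin.dayofyear * 365.0 / days_in_year

    # Time of day at the midpoint of the event, in fractional hours
    midpoint = begin + (end - begin) / 2
    midpoint_time = midpoint.hour + midpoint.minute / 60.0

    return duration, year_day, midpoint_time

# Hypothetical example
print(compute_duration(pd.Timestamp("2011-05-22 17:34"),
                       pd.Timestamp("2011-05-22 18:12")))
```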

For further details about the functions defined for the other transformations of features, please refer to my part 4 Jupyter notebook in Github.

The class can then easily be used either by instantiating it and applying its fit and transform methods, as with many scikit-learn objects (Python's machine learning package, also called sklearn), or within a pipeline, together with other processing steps.

Modelling methodology

Standardisation and dummification

After the preprocessing class was applied, the numerical features were standardised using sklearn's StandardScaler. Standardisation produces comparable, unitless features centered on zero and is recommended for easier interpretation, although it is not mandatory for most models.

The categorical features such as the state and the source (type of observer) were “dummified”. This process transforms a column containing n different categories into n columns, each titled after one of the categories and containing 0s and 1s for absence or presence respectively. The new columns are therefore numerical features and can be used for modelling. Dummification was performed with pandas' .get_dummies method. After dummification, it is recommended to drop one column because of multicollinearity: any column C can be deduced from all the others; for a given record, if the other columns all contain 0, then the value in C is 1, and if one of them contains 1, then the value in C is 0. It is usually advised to drop the column corresponding to the most prevalent category, i.e. “Texas” for the state and “NWS Storm Survey” for the source.
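Put together, the two steps could look like this sketch, where the column groups are hypothetical examples:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column groups; the real lists come from the preprocessed matrix
numeric_cols = ["TOR_WIDTH", "TOR_LENGTH", "DAMAGE_TOTAL", "duration", "azimuth"]
categorical_cols = ["STATE", "SOURCE"]

# Standardise the numerical features (zero mean, unit variance)
scaler = StandardScaler()
X_num = pd.DataFrame(scaler.fit_transform(df[numeric_cols]), columns=numeric_cols)

# Dummify the categorical features; drop_first removes one column per feature
# to avoid multicollinearity (dropping the most prevalent category instead
# can be done manually)
X_cat = pd.get_dummies(df[categorical_cols], drop_first=True)

X = pd.concat([X_num, X_cat], axis=1)
```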

Pandas transposed DataFrame display of the first 10 records and first 23 columns of the feature matrix.

Avoiding overfitting through cross validation and train/test split

Just putting a model in place is not enough. A precise methodology has to be followed in order to assess its quality and validate it. One frequent issue in statistical modelling is overfitting. It means the model fits the training data so closely that it does not generalise: we would not be able to use the same model on a new set of input data. The bias is low, the variance is too high and the model is probably too complex. For instance, in classification studies, simple decision tree models have a tendency to overfit the data, especially if their depth is too high. In linear regression studies, the concept of regularisation (Lasso, Ridge) was introduced specifically to avoid overfitting.

With multiple input variables, some of them categorical, it is not easy to create a clear visualisation that proves or disproves overfitting. In the data science community, the common way to tackle this is to carefully choose models via cross-validation and train/test splits.

The test set is extracted by taking a subset of records, usually a random selection of one third of the total dataset. The train set gathers all the remaining records. It was important in our case to use the stratification option of Python's train_test_split to maintain the same proportion of each tornado magnitude between the train and test sets. Basically, the test set, which we leave aside for most of the model testing, simulates newly obtained data. Models are then compared on the train set via cross-validation, and only at the end is the chosen model evaluated on the test set to confirm there is indeed little overfitting.

n-fold cross-validation divides the data into n randomly selected sets. Each of these sets is in turn treated as a separate validation set: the model is trained on the rest of the data and scored on that set, and the process is repeated n times. This is conveniently achieved with sklearn's cross_val_score function. We used 5 folds, which means each tested model was scored five times; the average of those scores gave a good estimation of how well each model performed.
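In sklearn terms, the split and the cross-validation could look like this (the model here is just a placeholder):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Stratified split: each EF class keeps the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

# 5-fold cross-validation on the train set only
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())
```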

Accuracy as a general validation of models

The main metric used for classifiers is the accuracy score. It is the number of correct predictions over the total number of records, regardless of class. As it is an overall measure of quality, I used accuracy as a first filter to select models and find the best performer:

Table showing the accuracy score of different models trained on train set and validated on test set.

The Random Forest model improved the proportion of correctly predicted EF scales by 6% compared to Logistic Regression, and by 39% compared to the baseline. These seem like great results. However, how do we know which magnitudes were predicted best? How well were the EF5 tornadoes, which are so rare, predicted? Clearly the accuracy score alone cannot answer those questions.

Recall as a tool to examine the quality of predictions, magnitude by magnitude

Other important classification performance measures are the precision, recall, F1-score (harmonic mean of precision and recall) and confusion matrix. In multinomial cases like ours, there is one precision, one recall and one F1-score per class, which makes interpreting the results heavier. In a perfect world, we want both precision and recall to be as close to one as possible. This is unfortunately rarely achievable, so different kinds of studies will focus on having either a good recall or a good precision. It is a trade-off.

It is important to first define them clearly. For a particular class, the precision measures how many tornadoes were correctly labeled out of the total number of tornadoes predicted as belonging to that class. The recall measures how many tornadoes were correctly labeled out of the total number of tornadoes actually belonging to that class. Let's take the example of the EF5 tornadoes. A low precision would mean many EF0 to EF4 tornadoes were wrongly predicted as EF5. A low recall would mean most EF5 tornadoes were wrongly predicted as weaker. Supposing the aim of our study is to help communities better predict the odds of a dangerous tornado in the future, then a low precision is more acceptable than a low recall. A low recall would underestimate EF5 tornadoes, and communities could be insufficiently prepared. On the other hand, even though not ideal, a low precision would merely predict some weak tornadoes as very strong. Recall is therefore our metric of choice.
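These per-class metrics, together with the confusion matrix discussed next, can be obtained in one go with sklearn (re-using the split and model from the previous sketch):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.fit(X_train, y_train).predict(X_test)

# Per-class precision, recall and F1-score
print(classification_report(y_test, y_pred))

# Rows correspond to the true EF scales, columns to the predicted ones
print(confusion_matrix(y_test, y_pred))
```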

Confusion matrix, as the best tool for precise result interpretation

Nevertheless, again, analysing both the accuracy and the recall is not sufficient in our study. Enter the confusion matrix! Interpretation of the results can be a very tricky task, and I am going to explain why with a concrete example from this project. Throughout the project, I studied and compared many different models and tried different parameters. For example, when doing Natural Language Processing (NLP) on the tornado narrative feature, a subject I will detail later, the Stochastic Gradient Descent (SGD) model gave good accuracy results. The two following confusion matrices represent the results of two tests, which we will call test A (the reference) and test B, which differs from A only in its input feature matrix. In NLP, one derives a matrix of significant words, or combinations of words, from a narrative; this matrix is used as input to the modelling. With test B, I wanted to check the quality of the predictions when using combinations of 1 to 3 words, as opposed to 1 or 2 words in test A. The SGD models used for both tests have the same hyperparameters. It is very important in the testing phase to change only one parameter at a time, in this case the number of combined words (called ngrams).

Test A — Confusion matrix of test set after SGD on NLP matrix (ngrams 1 to 2)
Test B — Confusion matrix of test set after SGD on NLP matrix (ngrams 1 to 3)

With test A, I obtained an accuracy score of 0.780. The rows of the confusion matrix correspond to the true EF scales, the columns to the predicted scales. Numbers on the diagonal, also called “true positives”, represent the correct predictions. EF0 tornadoes are well predicted (recall of 0.888), whereas only one EF5 tornado out of 4 was correctly predicted (recall of 0.25).

With test B, I obtained a higher accuracy of 0.784, a higher EF0 recall of 0.897, and the same EF5 recall. At first this seems like a better result, but looking closer at the confusion matrix, we notice that although one EF5 tornado was correctly predicted in both cases, in test B the other three EF5 tornadoes were predicted as weaker than in test A. One EF5 was even predicted as EF0! This is not such a good result after all, although the model performs better in general terms.

To conclude this model validation section: we should carefully inspect all the results and not rush into seemingly obvious interpretations. Accuracy, recall and the confusion matrix should be analysed together.

Modelling of numerical and categorical features

Now that we have described the methodology, I am going to give more details about the models I tested.

Logistic Regression

Logistic Regression is roughly the classification equivalent of Linear Regression. It is a good way to start investigating a multinomial problem: it is quite straightforward to implement, it runs fast compared to other models such as Support Vector Machines, it generally provides good results in various kinds of studies and, thanks to its rankable regression coefficients, its results are interpretable.

Similarly to Linear Regression, Ridge (“l2”) or Lasso (“l1”) penalties can be parameterized in the Logistic Regression. The parameter C, coded in sklearn's LogisticRegression, is inversely proportional to the regularisation strength. A lower value gives stronger regularisation, which means, in the Lasso case, that some regression coefficients of the logistic function may be zeroed out:

Logistic function expressing the probability of having the binomial target y equal to 1. beta0 = intercept, betas = regression coefficients, xi = the p features
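For reference, the formula described in the caption above is the standard logistic function:

$$P(y=1 \mid x_1,\dots,x_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}$$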

Interpretability is the main reason for this model's popularity. By ordering the features according to their beta coefficients, we can see which ones are the most influential for predicting the target and which ones are not. With a multinomial target, however, we get regression coefficients for each class, which can become confusing to interpret. Random Forest, through its feature importances output, allows an easier overall interpretation.
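As an illustration (hyperparameter values are assumptions, not the project's), an L1-penalised logistic regression and a rough ranking of features by coefficient magnitude could look like this:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularisation; with the "l1" penalty some
# coefficients are zeroed out (the saga solver supports l1 for multiclass)
logreg = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000)
logreg.fit(X_train, y_train)

# One row of coefficients per EF class; rank features by their largest
# absolute weight across classes
coefs = pd.DataFrame(logreg.coef_, columns=X_train.columns)
print(coefs.abs().max().sort_values(ascending=False).head(10))
```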

Random Forest

The name of this model alone is inspiring for a nature lover like me! Just like real forests, Random Forest algorithms are a grouping of trees, or more precisely Decision Trees. Decision Trees make sequential and hierarchical decisions about the outcome variable based on the predictor data. They are built with Hunt's algorithm, which determines an optimal choice at each node according to an impurity metric (either entropy or Gini).

Illustration of Random Forest modelling — ensemble of decision trees later aggregated by averaging.

The creativity of the name is obviously not the reason why I chose to test Random Forest classifiers in my study. Although computationally heavy, they are well known to provide good results on different kinds of problems, unlike Decision Trees which tend to overfit. The reason Random Forest models usually do not overfit lies in the way they are built: as an ensemble of Decision Trees trained on bootstrapped samples of the input data, each split considering only a random subset of the features. The results are then averaged, which reduces the model variance compared to a single Decision Tree.

sklearn's RandomForestClassifier was optimised with GridSearch, a very convenient tool for tuning hyperparameters. The following values provided our best accuracy score of 0.732 (a sketch of the grid search is shown after the list):

  • criterion=’entropy’ : the impurity metric used to optimise the choice at each node.
  • max_depth=18: maximum depth of each tree (number of node levels).
  • max_features=0.4: using 40% of the features at each split.
  • min_samples_leaf=1: minimum number of samples required in a leaf (i.e. terminal) node.
  • min_samples_split=9: minimum number of samples required to split an internal node.
  • n_estimators=72: number of decision trees.
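A sketch of such a grid search is shown below; the grid itself is illustrative and centred on the values listed above, not the exact search space used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [12, 18, 24],
    "max_features": [0.2, 0.4, 0.6],
    "min_samples_split": [3, 9, 15],
    "n_estimators": [36, 72, 108],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```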

Upsampling as a means to improve the recall of rare classes

Upsampling the rarer classes means artificially and randomly introducing more observations belonging to those classes. It gives more weight to those classes and is in general a good way to increase their recall. It was done on the train set only, through bootstrapping (i.e. sampling with replacement) using sklearn's resample function. In order to validate on correctly proportioned data, the scoring of the model was done on the untouched test set.
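A sketch of the upsampling step is shown below; here every minority class is bootstrapped up to the size of the majority class, which is one possible choice rather than the exact proportions used in the project:

```python
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train.rename("magnitude")], axis=1)
majority_size = (train["magnitude"] == "EF0").sum()

upsampled_parts = []
for scale, group in train.groupby("magnitude"):
    if len(group) < majority_size:
        # Sampling with replacement = bootstrapping
        group = resample(group, replace=True, n_samples=majority_size,
                         random_state=42)
    upsampled_parts.append(group)

train_up = pd.concat(upsampled_parts)
X_train_up = train_up.drop(columns="magnitude")
y_train_up = train_up["magnitude"]
```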

The table below summarizes the evolution of our modelling, first across the kinds of models used, and then with upsampling.

Accuracy and strong tornadoes recall scores of three tested models

Random Forest brings a great uplift compared to Logistic Regression, in terms of both accuracy and the recall of rare tornadoes. Upsampling helps better predict those rare ones, especially the EF4 tornadoes (their recall increased by 83%!). It seems, though, to come at the expense of the weaker tornadoes, as the accuracy score is not as good.

We can verify that by analysing the predictions in more detail, with confusion matrices:

Confusion matrices, computed on the test set, of (from left to right) Logistic Regression, Random Forest, and Random Forest trained on upsampled data
  • The first observation concerns the “EFU” tornadoes: these unidentified magnitudes are all predicted as EF0. Although this result is quite obvious when analysing the features of EFU tornadoes (short duration, no reported damage, no casualties), the models arrived at it on their own, which is an excellent result.
  • Apart from the EF0 magnitudes, Random Forest provided a great increase in all the other true positives (figures on the matrix diagonal representing correct predictions) compared to Logistic Regression.
  • Overall, the models tend to predict tornadoes at their own scale or weaker (numbers in the lower left triangle of the matrix are higher than in the upper right one).
  • Upsampling helps better predict all tornadoes of scale EF2 and above, but at the expense of the EF0 and EF1 predictions.
  • Upsampling led to a great improvement for the strongest EF5 tornadoes: unlike before, none were predicted below the EF3 scale, and 2 out of 4 were predicted correctly.

Random Forest feature importances

Random Forest's feature importances output is a great opportunity for result interpretation:

  • The most influential feature to predict the EF scale is the maximum width the tornado attains over its whole duration.
Tornado width distribution according to its EF scale

This is quite understandable when looking at the distribution plot: the wider the tornado, the higher its intensity.

  • Another important feature is the total damage cost, which is an estimate in dollars of the crop and property damages caused by the tornado. As the EF scale is a measure of damage, it is quite logical the cost would be a strong influencer.
  • The length of the tornado's straight path comes third. The same observations as for the width can be made.
  • Then come the location attributes. Again, it is quite normal that they appear high in the list of most important features. Indeed, as we saw in the geographical distribution analysis, most U.S. tornadoes are located in the “tornado alley” region, and the strongest ones occur only there. The longitude is therefore a good indicator for predicting the EF scale, and the latitude too, but to a lesser extent.
  • The least important numerical feature is the number of casualties, which is quite unexpected. This factor may predict EF4 and EF5 tornadoes well, but as casualties are rare for weaker tornadoes, it most likely does not distinguish the scales well.
  • The states and observers (source) do not help predictions. Actually, when removing those two categorical features, the accuracy of our model remains almost the same.

Recurrent Neural Network

Deep Learning is a fascinating and futuristic topic, even though it is more and more widely used, for instance for image recognition or self-driving vehicles.

Neural Networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes', each containing an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates to one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is output. Recurrent Neural Networks (RNNs) differ from traditional Neural Networks in that they keep in “memory” information about what has been computed so far.

Recurrent Neural Network illustration

This was a challenge I set myself: implementing a neural network on my dataset! I used Keras and took inspiration from a great tutorial written by Dr. Jason Brownlee.

Python code for Keras’s RNN implementation and result of the last iterations (ie epochs).

The model I built contains 3 layers. They have to be added one by one.

The first layer has 12 neurons and expects 12 input variables (I selected all the numerical features).

The second hidden layer has 12 neurons too.

Finally, the output layer has 1 neuron and predicts values between 0 and 5 corresponding to the tornado EF scale.
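A minimal Keras sketch reproducing the layer sizes described above is given below; the activations, loss, optimiser and training settings are assumptions (in the spirit of the Brownlee tutorial), and X_train_num / y_train_num are hypothetical names for the 12 standardised numerical features and the EF scale encoded as 0 to 5:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(12, input_dim=12, activation="relu"))  # 12 numerical inputs
model.add(Dense(12, activation="relu"))                # second hidden layer
model.add(Dense(1))                                    # output between 0 and 5

model.compile(loss="mean_squared_error", optimizer="adam")
model.fit(X_train_num, y_train_num, epochs=150, batch_size=10, verbose=1)
```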

Parameters have not been optimised, and the number of iterations was limited. However, the result was already convincing enough to pursue testing RNN further in the future:

Comparisons of accuracy score obtained by different models

Natural Language Processing for modelling of narratives

NLP

NLP is a wide topic. It is used in many modern algorithms and has many applications such as voice recognition and more generally Artificial Intelligence systems.

My goal in this section was not to use the most advanced NLP techniques but to show that even with just a simple but robust word counter (CountVectorizer in Python), we can already bring significant improvements to the modelling and our predictions.

The motivation for NLP also came from the presence of detailed narrative text as one of the variables in the retrieved dataset. Just by looking at a few example tornadoes, I realised that recurrent words seemed to appear. When few measurements of a tornado could be made, for various reasons such as its unexpectedness, its remoteness or the lack of tools, the observer's description of the tornado can be an important feature for determining its magnitude.

Example of the event narrative feature for 5 tornadoes

A few observations can be made from the examples of narrative text above:

  • Some references to the supposed EF scale (we should probably exclude those, as it is our target).
  • Some references to numbers, representing estimated wind speed or tornado length.
  • Care should be taken with combinations of words like “no reported damage”, which would likely have the opposite (and correct) influence on predictions compared to single-word extraction only (“damage” without the associated “no”).

Parts 3 and 5 of my notebooks present all the tests I performed. It is worth noting that the term frequency–inverse document frequency (tf-idf) method, which gives different weights to words based on their frequency across documents, was tried but was inconclusive, so it was left aside; further testing to improve it may be beneficial. I focused my analysis on token counts and model testing.

Count vectorizer

CountVectorizer is an sklearn class which extracts the words from a collection of strings. Various options control the tokens produced: stop words, like “the” or “and”, can be excluded; combinations of words can be introduced with the ngram_range parameter; and the number of extracted words can be limited, based on their frequency, with the max_features parameter.

When applying CountVectorizer to a vector containing all the observations (i.e. all tornado narratives), we get a sparse matrix whose columns are the extracted words and whose rows are the narratives, with values counting the occurrences of each word (or simply 1 and 0 for presence or absence when binary counting is used). Densifying the matrix makes it usable as an input to models that do not accept sparse data.
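A sketch of that step, using the parameter values discussed later in this section (the narrative column name is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english",
                             ngram_range=(1, 3),
                             max_features=20000)

# Sparse document-term matrix: one row per tornado narrative
X_words = vectorizer.fit_transform(df["EVENT_NARRATIVE"].fillna(""))

# Densify for models that require a dense input
X_words_dense = X_words.toarray()
```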

Among others, two models are popularly used in NLP: Naive Bayes and Stochastic Gradient Descent.

Naive Bayes

The Naive Bayes classifier is based on the probabilistic Bayes' theorem, which “describes the probability of an event, based on prior knowledge of conditions that might be related to the event”. It is a simplification of that theorem, assuming that “each feature is conditionally independent of every other feature”.

As the goal was to find an optimal “word matrix” for predicting the tornado magnitude, many tests were needed, so I built a function which automatically outputs the accuracy and recall scores. It converts, for both the train and test sets, the narrative feature into a sparse matrix of words using a given vectorizer. It then fits a multinomial Naive Bayes model on the train set using sklearn's MultinomialNB. Finally, the accuracy and recall scores are computed on the test set.

Python function used to score the different NLP with Naive Bayes modelling tests
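A hedged re-creation of that helper could look like this:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, recall_score

def score_nlp_model(vectorizer, train_text, y_train, test_text, y_test):
    """Vectorize the narratives, fit MultinomialNB on the train set and
    return the accuracy and per-class recalls on the test set."""
    X_train_words = vectorizer.fit_transform(train_text)
    X_test_words = vectorizer.transform(test_text)

    nb = MultinomialNB()
    nb.fit(X_train_words, y_train)

    y_pred = nb.predict(X_test_words)
    return (accuracy_score(y_test, y_pred),
            recall_score(y_test, y_pred, average=None))
```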

The function allowed quick implementation of the many tests I wanted to perform. The first one was to determine the optimum number of features. This was done by iterating over CountVectorizer's max_features parameter and comparing the obtained accuracy scores. The 20,000 most frequent features were kept, out of an original matrix containing 152,211 words or 2-word combinations.

Stochastic Gradient Descent

Gradient Descent is an optimisation algorithm that minimises a loss function by iteratively stepping along its derivatives (its gradient). Stochastic Gradient Descent is a popular approximation of Gradient Descent because it solves faster and can handle large datasets, as is the case with the huge matrices produced by NLP. It achieves this by updating the gradient using only one sample at a time instead of all of them.

Stochastic Gradient Descent illustration: iteratively computing the derivative (i.e. gradient) of the function J(w) and stepping down it until the function's minimum is reached.

Stochastic Gradient Descent was implemented in Python with sklearn’s SGDClassifier. When comparing with Naive Bayes modelling, using the same input vectorizer, SGD led to a better accuracy score, and overall better results when analysing the recalls and confusion matrices.
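In sklearn, swapping the Naive Bayes model for SGD amounts to something like the following (hyperparameter values are assumptions; by default SGDClassifier trains a linear SVM-like model):

```python
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
sgd.fit(X_train_words, y_train)   # sparse word matrix from the vectorizer
print(sgd.score(X_test_words, y_test))
```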

As, more often than not, the recall of one magnitude improves at the expense of another, it is difficult to judge which model or which test is better. I created the following graph to facilitate the analysis: the X axis lists some of the tests I performed, and the curves show how the accuracy and recall scores evolve across them.

Accuracy and Recall scores of various NLP and model tests

We can draw the following conclusions from those tests, commenting from the plot’s left to right:

  • Using stopwords from Python's Natural Language Toolkit (nltk) instead of CountVectorizer's default list produced slightly lower EF1 and EF3 recalls, so this test was discarded.
  • Creating a new word “nodamage” which takes into account all the possible negations of damage like “not damaged”, “any damage”, etc. improves the accuracy and recall of weak tornadoes. This test was kept.
  • Removing any reference to the target (“EF0”, “EF1”, etc.) obviously worsens most scores, but is necessary for a fair modelling methodology and results assessment.
  • Introducing the SGD model instead of Naive Bayes significantly increased the recall of EF3 and EF4 tornadoes. Only tornadoes of medium intensity (EF1 and EF2) are not modelled as well. The change was kept, and all subsequent tests therefore used SGD.
  • Removing references to numbers lowered all the scores. I therefore decided to keep references to numbers in the narratives. The wind speed or tornado measurements, estimated from the observer, help the predictions.
  • Combinations of 3 words, in addition to single words and 2-word combinations, greatly help the predictions of EF4 and EF5 tornadoes. ngrams 1 to 3 were thus chosen.
  • Introducing the text length as a new feature did not help the modelling.
  • Upsampling of the stronger tornadoes did help their recall, but at the expense of the weaker tornadoes and of the accuracy score. This test was not maintained.

To summarize, our best NLP and modelling was obtained with the following parameters:

  • Use of sklearn’s CountVectorizer.
  • A maximum of 20,000 features.
  • Use of “English stopwords”.
  • Creation of a new word “nodamage” to replace any reference to negation of damage in the text.
  • ngrams 1 to 3.
  • Remove reference to EF scales in the texts.
  • Stochastic Gradient Descent model.

Best performance: Random Forest Classifier using both numerical and narrative features

The best accuracy score of 0.812 was obtained by re-integrating the numerical features into the token count matrix, and running a Random Forest Classifier.

As we saw before, it does not mean this is the best model to predict strong tornadoes, but overall, the number of correctly predicted tornado magnitudes is higher.

The Random Forest’s most important features are again a great tool for interpretation and for knowing which variables have a higher influence on correct predictions:

Random Forest most important features

It is very interesting to notice that the numerical features have a higher impact than any text feature.

Words like “destroyed”, “trees” and “snapped” are strong predictors.

Many references to numbers, usually wind speeds, are among the top influencers.

Much more work could be done with NLP. For instance, both “damage” and “damaged” could be gathered under the same token; that is what lemmatization is for. I did test it, but without much improvement in accuracy.
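As an illustration of what lemmatization does (the article does not specify which library was used; nltk is one common choice):

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("damaged", pos="v"))  # -> "damage"
print(lemmatizer.lemmatize("trees"))             # -> "tree"
```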

A look at Time Series Analysis

With natural phenomena such as tornadoes, able to produce so much damage and so many casualties, studying how their count and strength evolve over time is important. This can be done with Time Series Analysis.

For this analysis, I extracted all the tornadoes from the original dataset, and plotted their count on a per month basis:

Tableau visualisation of the evolution of the total number of tornadoes per month

This plot clearly shows some kind of periodicity over time, with periods of low counts followed by higher ones.
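With pandas, both the monthly counts plotted above and the yearly aggregation used further down to expose the trend can be obtained by resampling (the datetime column name is an assumption):

```python
import pandas as pd

df["BEGIN_DATE_TIME"] = pd.to_datetime(df["BEGIN_DATE_TIME"])
events = df.set_index("BEGIN_DATE_TIME")

# Number of tornadoes per month (the series plotted above)
monthly_counts = events.resample("M").size()

# Aggregating per year removes the within-year seasonality and exposes the trend
yearly_counts = events.resample("A").size()
```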

Tornadoes form with special weather conditions:

  • global ones: when cold, dry air from Canada collides with warm, moist air from the Gulf of Mexico.
  • local ones: when air at different altitudes blows at different speeds and the updraft (air moving up from the storm) creates a funnel cloud, which, with the help of rain or hail, bends downwards and touches the ground.
Tableau bar plot of the number of tornadoes since 1950 as a function of the month of occurrence.

Those conditions are most frequently met during the spring season.

There are some slight variations depending on the latitude (the peak is shifted by a few weeks in the northernmost states), but overall the busiest months are April to June, which is obvious in the bar plot on the left.

When that yearly seasonality is removed by aggregating the monthly counts per year, a clear trend appears:

Increasing trend of number of tornadoes over the years

Many may think this increase is a consequence of Global Warming. However, most of it is more likely due to better and more accessible measurement tools and more observers.
There is no scientific consensus yet on whether global warming causes more or stronger tornadoes. Some studies even suggest climate change could actually have the opposite effect: the jet stream, one of the suspected factors in tornado formation, could weaken, and tornadoes could consequently become less frequent.

Further studies

The impressive severe storm dataset lends itself to numerous kinds of studies. I tested different supervised machine learning classifiers for predicting the EF scale of tornadoes, based on various features from that dataset. Natural Language Processing (NLP) was performed on the narratives, and we took a first glance at Time Series Analysis (TSA).

We saw how important the choice of a good metric is for model evaluation. This is a tricky task in classification when there is strong class imbalance. The accuracy score alone is not enough: it is an overall measure and does not tell us whether the stronger, less frequent tornadoes are well predicted. It should be analysed alongside the recall score of each magnitude and, most importantly, the confusion matrix.

Insightful interpretations and visualisations were made, such as the Tableau maps showing alignments of EF5 tornadoes generated consecutively by the same extremely severe storm, generally moving towards the north-east, or the fact that the top three features explaining the EF scale of a tornado are its maximum width, the total damage cost it causes and the length of its straight path.

We gave a first try at Recurrent Neural Networks, but further testing would be required to optimize parameters such as the number of nodes and the number of hidden layers. Even though many tests were performed with NLP, it was just a start, as it is a vast domain. Sentiment Analysis would be very interesting to try: deriving the degree of negativity of the narratives could further improve our modelling. With TSA, we could try to forecast the future count of tornadoes with autoregressive integrated moving average (ARIMA) modelling.

Integrating other datasets, such as demographic or weather data, would be valuable for modelling. We could also try to predict other variables, such as the damage cost. But mainly, what I would really like to achieve is generating probabilities of having each type of tornado in a given future timeframe. If this were done per county, it could help communities be better prepared in terms of infrastructure and quality of construction.
