Tornado Clustering: Disaster Management Insights

Dhruvit Patel
INST414: Data Science Techniques
May 2, 2024
Photo by NOAA on Unsplash

Introduction:

Under the right conditions, tornadoes become a destructive force, causing millions of dollars in damage to crops and property and even taking lives. Although tornadoes occur all around the world, the United States averages more per year than any other country: 1,200 tornadoes per year, compared with 100 per year in Canada, which ranks second. Through this analysis of 2017 United States tornado statistics, I hope to identify which locations are at higher risk of tornadoes so that they can better prepare for storms. The main stakeholders are emergency management teams and Americans living in these highly vulnerable areas. The answer will help stakeholders take the necessary precautions to keep themselves, their loved ones, and their communities safe if these storms hit.

Data Collection:

This dataset was created by NOAA’s National Weather Service, which makes information on every recorded tornado in the United States available, broken down by year. I selected the 2017 data at random. The dataset is available as a CSV file, and documentation of the column names is available on the official website. The fields include month, day, time, time zone, state, magnitude, number of injuries, number of fatalities, loss (property damage in millions of dollars), crop loss, starting latitude, starting longitude, ending latitude, ending longitude, width, length, and more. This information is enough to plot the location of each tornado and to cluster the tornadoes based on their characteristics.

Besides loading the data into a dataframe with pd.read_csv(), the main data manipulation and cleaning step was to create a new dataframe containing only the columns most useful for clustering. I identified magnitude, width, and length as the most important features for grouping the tornadoes, because these characteristics correspond to a tornado’s destructiveness.
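The column-selection step described above can be sketched roughly as follows. The short column names (mag, len, wid, slat, slon), the units in the comments, and the file name in the comment are assumptions based on my reading of the NOAA documentation, and the rows are made-up stand-ins so the snippet runs without the CSV.

```python
import pandas as pd

# Made-up stand-in rows. With the real file this would be something like
# tornadoes = pd.read_csv("2017_torn.csv")  (file name is an assumption).
tornadoes = pd.DataFrame({
    "st":   ["OK", "TX", "GA", "KS"],
    "mag":  [0, 2, 1, 3],               # magnitude rating
    "len":  [0.5, 12.3, 3.1, 25.0],     # path length (assumed miles)
    "wid":  [50, 400, 150, 900],        # path width (assumed yards)
    "slat": [35.2, 32.8, 33.1, 38.5],   # starting latitude
    "slon": [-97.4, -96.8, -84.3, -98.2],  # starting longitude
})

# Keep only the three columns used for clustering and drop incomplete rows.
feature_cols = ["mag", "len", "wid"]
features = tornadoes[feature_cols].dropna().copy()
print(features.head())
```

Keeping the location columns in the original dataframe rather than in the clustering dataframe means the clusters describe severity only, while latitude and longitude stay available for plotting later.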

Similarity and Clustering:

Euclidean distance was the similarity measure for this analysis, because KMeans, the algorithm I used to cluster the data, groups points by their Euclidean distance to the cluster centers. Euclidean distance was an appropriate choice for this dataset because the three features considered (magnitude, length, and width) are numeric measures in which differences in value reflect differences in the severity of a tornado.
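As a small worked example of what Euclidean distance means here, the sketch below compares two hypothetical tornadoes described by (magnitude, length, width). The values are invented for illustration and do not come from the dataset.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical tornadoes: (magnitude, length, width)
tornado_a = [1, 2.0, 100.0]
tornado_b = [3, 5.0, 106.0]

# Differences are 2, 3, and 6, so the distance is sqrt(4 + 9 + 36) = 7.0
print(euclidean_distance(tornado_a, tornado_b))  # 7.0
```

Because every feature contributes its raw difference, a feature measured in large units (such as width) can dominate the distance unless the features are rescaled first. Whether that matters depends on the ranges in the actual data.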

The number of clusters (k) was determined with an elbow plot, which plots candidate values of k against the inertia reported by KMeans(). The optimal number of clusters is at the ‘elbow’ of the plot, the point where the inertia starts to level off. Based on the graph, that value was 4.
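An elbow plot of this kind can be produced along the lines below. This is a minimal sketch, not the notebook's exact code: the data is synthetic (four invented groups of magnitude, length, and width values), and the range of k, n_init, and random_state are my own choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (magnitude, length, width) features.
rng = np.random.default_rng(0)
centers = np.array([[0, 1, 50], [1, 5, 200], [2, 12, 500], [4, 30, 1000]], dtype=float)
X = np.vstack([c + rng.normal(0, [0.3, 0.5, 15], size=(50, 3)) for c in centers])

# Fit KMeans for each candidate k and record the inertia
# (sum of squared distances from each point to its cluster center).
ks = list(range(1, 9))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot k against inertia; the elbow is where the curve starts to flatten.
fig, ax = plt.subplots()
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow plot")
```

Reading the elbow is a judgment call. On real data the bend is often less sharp than on synthetic groups like these.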

The four clusters depicted in the graph are colored purple, pink, orange, and yellow. Each cluster represents a level of storm severity, taking into account magnitude, length, and width. Purple indicates tornadoes of lower severity, and yellow indicates extremely severe storms. The cluster sizes make sense: there are far fewer tornadoes in the yellow cluster because extremely severe tornadoes are less common than weaker ones.
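One detail worth illustrating is that KMeans assigns cluster numbers arbitrarily, so label 0 is not automatically the least severe group. The sketch below, which uses synthetic data and a hypothetical severity_rank column of my own naming, ranks the four clusters by their mean magnitude and then counts the tornadoes in each.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in: four severity groups, with fewer tornadoes as severity rises.
rng = np.random.default_rng(0)
frames = []
for level, n in enumerate([120, 60, 30, 10]):
    frames.append(pd.DataFrame({
        "mag": np.full(n, level),
        "len": rng.normal(2 + 8 * level, 0.5, n),
        "wid": rng.normal(50 + 300 * level, 10, n),
    }))
features = pd.concat(frames, ignore_index=True)

# Cluster on the three features with k = 4.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features[["mag", "len", "wid"]])

# Rank clusters from least to most severe by their mean magnitude.
order = features.groupby(km.labels_)["mag"].mean().sort_values().index
rank = {int(label): r for r, label in enumerate(order)}
features["severity_rank"] = [rank[int(label)] for label in km.labels_]

# Number of tornadoes per severity rank (0 = least severe).
counts = features["severity_rank"].value_counts().sort_index()
print(counts)
```

With a ranking like this, a sequential colormap maps the lowest rank to its darkest color and the highest rank to its brightest, which keeps the colors tied to severity rather than to arbitrary label numbers.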

Analysis:

The cluster plot uses each tornado’s starting latitude and longitude as coordinates, which makes it look like a map of the United States. Combined with the KMeans clustering, this makes it simple to see the areas with the most tornado occurrences. The analysis shows that the Midwest and mid-South regions have the most tornadoes and some of the most devastating ones. This is widely known, since the large amount of farmland and plains in the area creates ideal conditions for tornadoes. However, it was interesting to see a large number of tornadoes in the eastern United States as well, specifically around and south of the Appalachian Mountains. This area also has a large number of tornadoes in the orange cluster, the second most devastating group. Homeowners, community members, and emergency management teams can take the geographic locations and severities of tornadoes into account to improve their storm preparations, such as bunkers, safe zones, and response times.
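A map-like scatter plot of this kind can be drawn roughly as follows. The coordinates and cluster ranks here are random placeholders, not the 2017 data, and the "plasma" colormap is an assumption chosen because it runs from purple through pink and orange to yellow.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values standing in for starting coordinates and cluster ranks.
rng = np.random.default_rng(1)
n = 200
slon = rng.uniform(-105, -75, n)       # starting longitude
slat = rng.uniform(28, 45, n)          # starting latitude
severity_rank = rng.integers(0, 4, n)  # 0 = least severe, 3 = most severe

# Longitude on the x-axis and latitude on the y-axis gives a map-like layout.
fig, ax = plt.subplots(figsize=(8, 5))
points = ax.scatter(slon, slat, c=severity_rank, cmap="plasma", s=12)
ax.set_xlabel("Starting longitude")
ax.set_ylabel("Starting latitude")
ax.set_title("Tornado clusters by starting location")
fig.colorbar(points, ax=ax, label="Cluster (0 = least severe)")
```

With the real data, the x and y values would come from the starting longitude and latitude columns, and the colors from the cluster assigned to each tornado.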

Limitations:

The main limitation comes from the dataset itself. The data covers only tornadoes from 2017, so it is outdated and will not have much impact on predicting where tornadoes will occur this year. In addition, tornadoes are a very spontaneous phenomenon that occurs only when weather patterns align, and predicting them is currently nearly impossible. That said, being prepared in the locations most affected by tornadoes is crucial for safety and response.

References:

https://www.spc.noaa.gov/wcm/#data

https://www.rmets.org/metmatters/tornadoes-around-world#:~:text=The%20United%20States%20averages%20over,average%20of%20100%20per%20year.

Github:

https://github.com/dhruvitpatel5/Mod4Tornados/blob/main/mod4.ipynb
