How to Build a Disaster Message Classification App with Supervised Machine Learning and Natural Language Processing

Olivia Carrubba · Published in The Startup · Jan 12, 2021

Following a disaster, millions of messages are sent, either directly or via Twitter and Facebook, right when disaster response organizations have the least capacity to filter them and pull out those with the highest priority. Bystanders post about what is happening, creating an influx of information on social media that is often faster and more informative than news reports. However, typically only about one in every thousand of those messages contains content that disaster response professionals can actually act upon.

Disaster message classification represents a significant challenge for resource distribution:

  • Relevant messages need to be routed to the different organizations that handle different parts of the same disaster, such as water and medical supplies;
  • The right level of assistance needs to be connected quickly with the people who have the highest priority.

The data provided by Figure Eight contains over 30,000 real messages sent to disaster response organizations during major disasters, including the 2010 Haiti earthquake, the 2010 Chile earthquake, the 2010 floods in Pakistan, and Superstorm Sandy in the U.S.A. in 2012, plus messages spanning over 100 different disasters. The messages have been combined and relabeled so that the categories are consistent across the different disasters, allowing us to investigate trends and build supervised machine learning models.

For instance, a keyword search on incoming messages might match the word “water”. However, that word will only rarely map to someone who actually needs fresh water, and the search will miss those who say they are thirsty but don’t use the keyword “water”.

ML in combination with Natural Language Processing (NLP) techniques is significantly more accurate than mere keyword searching, allowing disaster response organizations to pull out the most crucial messages, redirect specific requests to the proper organizations, and drastically reduce response time.

You can find the project repository on my GitHub.

Project structure

We divide this project into three sections:

Data Processing: I will build an ETL (Extract, Transform, and Load) pipeline that processes message and category data from CSV files and loads them into an SQLite database, which our ML pipeline will then read from to create and save a multi-output supervised learning model.

Machine Learning Pipeline: I will split the data into a training set and a test set, then create an ML pipeline that uses NLTK and GridSearchCV to output a final model that predicts message classifications for the 36 categories (multi-output classification).

Web Development: In the last step, I will display the final results in a Flask web app that classifies messages in real time.

ETL pipeline

The data provided by Figure Eight is divided into two CSV files:

  • disaster_categories.csv: Categories of the messages
  • disaster_messages.csv: Multilingual disaster response messages

The pipeline’s steps (sketched in code after the list):

  1. Merge the two datasets using the common id
  2. Split categories into 36 separate category columns
  3. Iterate through the category columns and convert category values to binary
  4. Replace the original categories column in the merged dataframe with the new category columns
  5. Drop duplicates
  6. Store the clean dataset in an SQLite database
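These steps map almost one-to-one onto pandas operations. Below is a minimal sketch, assuming the Figure Eight column layout (a categories string such as related-1;request-0;…) and hypothetical database and table names:

```python
import pandas as pd
from sqlalchemy import create_engine

# 1. Load and merge the two datasets on the common id
messages = pd.read_csv("disaster_messages.csv")
categories = pd.read_csv("disaster_categories.csv")
df = messages.merge(categories, on="id")

# 2. Split the single categories string into 36 separate columns
categories = df["categories"].str.split(";", expand=True)
categories.columns = categories.iloc[0].str.slice(0, -2)  # "related-1" -> "related"

# 3. Convert category values to binary (the last character of each cell);
#    clip guards against any stray values greater than 1
for column in categories:
    categories[column] = categories[column].str[-1].astype(int).clip(0, 1)

# 4. Replace the original categories column with the new binary columns
df = pd.concat([df.drop("categories", axis=1), categories], axis=1)

# 5. Drop duplicates
df = df.drop_duplicates()

# 6. Store the clean dataset in an SQLite database
engine = create_engine("sqlite:///DisasterResponse.db")  # name is an assumption
df.to_sql("messages", engine, index=False, if_exists="replace")
```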

At the end of the ETL pipeline, we obtain an SQL table containing the messages, each coded with 36 binary category labels.

Machine Learning Pipeline — Data Modelling

I will now use the data to train a model that will take the message column as input and output classification results on the other 36 message categories.

The components I will use in the ML pipeline are:

CountVectorizer: Converts a collection of text documents to a matrix of token counts [1]

TfidfTransformer: Transforms a count matrix to a normalized tf or tf-idf representation, which should reflect how important a word is to a document in a collection of texts. Tf means term frequency: the number of times a term occurs in a document divided by the total number of terms in that document. Tf-idf means term frequency times inverse document frequency, where the idf is computed by taking the logarithm of the total number of documents in the corpus divided by the number of documents in which the term occurs. [2]

MultiOutputClassifier: Multi-target classification. This strategy consists of fitting one classifier per target, and is a simple way of extending classifiers that do not natively support multi-target classification. [3]
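To make the first two components concrete, here is a toy example (not the project’s actual data) showing how two short documents become a token-count matrix and then tf-idf weights:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["we need water and food", "water flooded the street"]

counts = CountVectorizer().fit_transform(docs)    # sparse token-count matrix
tfidf = TfidfTransformer().fit_transform(counts)  # reweighted by rarity

print(counts.toarray())  # raw counts per token
print(tfidf.toarray())   # "water" appears in both docs, so it is down-weighted
```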

The steps are as follows:

  1. Load data from the SQLite database
  2. Extract the X (messages) and y (categories) variables from the data for modeling
  3. Use the nltk library to case-normalize, tokenize, and lemmatize the text and remove stopwords, to help reduce some of the burden on the model (a sketch of such a tokenizer follows)
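A sketch of the tokenizer described in step 3, assuming the usual NLTK resources (punkt, wordnet, stopwords) have been downloaded:

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def tokenize(text):
    """Case-normalize, strip punctuation, tokenize, drop stopwords, lemmatize."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token)
            for token in word_tokenize(text)
            if token not in STOPWORDS]
```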

4. Build an ML model that takes the message column as input and outputs classification results on the other 36 categories in the dataset, as sketched below.
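One plausible way to assemble the model, reusing the tokenize function above; the choice of RandomForestClassifier here is an assumption, and any scikit-learn classifier would work:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# X: the message column; y: the 36 binary category columns (step 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

pipeline = Pipeline([
    ("vect", CountVectorizer(tokenizer=tokenize)),  # the tokenizer sketched above
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier())),  # classifier is an assumption
])
pipeline.fit(X_train, y_train)
```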

5. Test the model: report the f1 score, precision, and recall for each output category of the dataset. We can do this by iterating through the columns and calling sklearn’s classification_report on each category, as shown below.
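A sketch of that per-category evaluation loop, assuming y_test is a dataframe whose columns are the 36 categories:

```python
from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)
for i, category in enumerate(y_test.columns):
    print(category)
    print(classification_report(y_test.iloc[:, i], y_pred[:, i]))
```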

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Of all messages predicted to fall into one of the 36 categories, how many actually fell into the right category? High precision corresponds to a low false-positive rate.

Recall is the ratio of correct positive predictions to all actual positive cases, and so indicates how many positive predictions were missed.

F1 score is the harmonic mean of precision and recall and is a measure of a model’s accuracy: the highest possible value is 1, indicating perfect precision and recall, and the lowest possible value is 0.
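In terms of true positives (TP), false positives (FP), and false negatives (FN), the three metrics are:

```
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * (Precision * Recall) / (Precision + Recall)
```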

You can read more about classification evaluation metrics here.

6. Tune the model: use GridSearchCV to find better parameters. GridSearchCV is a tool that lets you define a “grid” of parameters, i.e. a set of values to check, and automates the process of trying out all possible combinations. [4] A sketch follows.
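The parameter names below follow the Pipeline step names used earlier, and the grid values are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV

parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],        # unigrams vs. unigrams + bigrams
    "clf__estimator__n_estimators": [50, 100],    # trees in the random forest
}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X_train, y_train)
print(cv.best_params_)
```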

7. Test the tuned model with classification_report

Classification report by the author

8. Evaluate results

The accuracy of the model is quite high, whereas recall is very low, which reflects the fact that the dataset is highly imbalanced: some classes are represented far more often than others. In the real world, imbalanced datasets are very common, for example in fraud detection and medical diagnosis. For further insights on how to tackle an imbalanced dataset, see Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset and this Medium post.

Finally:

9. Export the final model as a pickle file
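A minimal sketch of that export, persisting the tuned model so the web app can load it without retraining (the file name is an assumption):

```python
import pickle

with open("classifier.pkl", "wb") as f:
    pickle.dump(cv.best_estimator_, f)
```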

Deploy a Flask Web App

A web application was developed as the user interface. When a user inputs a message into the app and clicks the Classify Message button, the app shows how the message is classified by highlighting the matching categories in green. For the example message “we are more than 50 people on the street. Please help us find tent and food” we get the following categories:
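For reference, a minimal sketch of the Flask route that performs this classification; the route, template, and file names are assumptions, as is the column layout used to recover the category names:

```python
import pickle

import pandas as pd
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)
model = pickle.load(open("classifier.pkl", "rb"))

# Recover the 36 category names from the database built by the ETL step;
# assumes the first four columns are id, message, original, and genre
df = pd.read_sql_table("messages", create_engine("sqlite:///DisasterResponse.db"))
category_names = df.columns[4:]

@app.route("/go")
def go():
    query = request.args.get("query", "")
    labels = model.predict([query])[0]          # one 0/1 label per category
    results = dict(zip(category_names, labels))
    return render_template("go.html", query=query, classification_result=results)
```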

Images by the author

Licensing and Acknowledgements

Thanks to Figure Eight for the dataset.

The libraries used, dataset, and detailed code breakdown are available on my GitHub: https://github.com/OliviaCrrbb/Disaster-Response-Pipeline-Webapp
