Using Data Science Pipelines for Disaster Response

Nouman
Jun 5, 2020 · 4 min read

Using pipelines for data loading and analysis is one of the most popular ways of building data science projects.


In this project, I used data from FigureEight to build a web app that classifies messages into appropriate disaster categories. This is extremely useful for responding better to different types of disasters.

Data Analysis

The data from FigureEight consists of two CSV files. ‘disaster_messages.csv’ contains the actual messages from users and their corresponding ids. ‘disaster_categories.csv’ contains the message ids and their categories in raw format.

The dataset is a multi-class, multi-label dataset: there are a total of 36 categories, and each message can have more than one category. The total number of messages in the dataset is 10,246.

After analyzing the count of each category, we see that the ‘related’ and ‘aid_related’ categories are the most common, with the ‘child_alone’ category having the fewest messages.
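A quick sketch of this analysis, assuming the cleaned data is already in a DataFrame ‘df’ with one binary column per category (‘df’ and the column layout are assumptions, not from the article):

```python
import pandas as pd

# Assumption: the first four columns are metadata (id, message, original,
# genre) and the remaining columns are the 36 binary category labels.
category_cols = df.columns[4:]

# Total number of messages tagged with each category, largest first
category_counts = df[category_cols].sum().sort_values(ascending=False)
print(category_counts.head())  # 'related' and 'aid_related' at the top
print(category_counts.tail())  # 'child_alone' at the bottom
```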


Another analysis I did was calculating the number of different categories per message. Interestingly, many messages didn’t have a category at all. Of the rest, most had either one category or five categories.
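Continuing from the sketch above, the per-message label counts can be computed like this:

```python
# Number of category labels attached to each message
labels_per_message = df[category_cols].sum(axis=1)

# Distribution of label counts: how many messages carry 0, 1, 2, ... labels
print(labels_per_message.value_counts().sort_index())
```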


Building an ETL pipeline

An ETL pipeline is a specific and very common kind of data pipeline. ETL stands for Extract, Transform, Load. This pipeline prepares the data for further analysis.

We will start by loading the data from the two CSV files and merging them into one DataFrame. After that, we will transform the data to be ready for model fitting.
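A minimal sketch of the loading and merging step with pandas, assuming both files share an ‘id’ column:

```python
import pandas as pd

# Load the two CSV files named earlier in the article
messages = pd.read_csv('disaster_messages.csv')
categories = pd.read_csv('disaster_categories.csv')

# Merge them into a single DataFrame on the shared message id
df = messages.merge(categories, on='id')
```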

For this, we will first extract all the different categories present in our data. Then we will create a separate column for each category, containing ‘1’ if the message has that category as a label and ‘0’ if it doesn’t. For this purpose, we will use a function like the one below.
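Here is a minimal sketch of such a function, assuming the raw ‘categories’ column holds semicolon-separated ‘name-value’ pairs such as ‘related-1;request-0;…’ (the FigureEight raw format):

```python
def expand_categories(df):
    """Split the raw `categories` column into one binary column per category.

    Assumes values look like 'related-1;request-0;offer-0;...'
    (semicolon-separated 'name-value' pairs).
    """
    # Split the single string column into one column per category
    categories = df['categories'].str.split(';', expand=True)

    # Use the first row to extract the category names ('related-1' -> 'related')
    categories.columns = categories.iloc[0].str.rsplit('-', n=1).str[0]

    # Keep only the trailing digit and convert to 0/1 integers,
    # clipping to guard against stray values above 1
    for col in categories.columns:
        categories[col] = (
            categories[col].str.rsplit('-', n=1).str[-1].astype(int).clip(0, 1)
        )

    # Replace the raw column with the expanded binary columns
    df = df.drop(columns='categories')
    return pd.concat([df, categories], axis=1)
```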

Finally, we will store the transformed DataFrame in an SQLite database. This ends our ETL pipeline.
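A short sketch of this step using SQLAlchemy; the database and table names here are illustrative, not necessarily those used in the project:

```python
from sqlalchemy import create_engine

# Store the cleaned DataFrame in an SQLite database
engine = create_engine('sqlite:///DisasterResponse.db')
df.to_sql('messages', engine, index=False, if_exists='replace')
```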

Building an ML Pipeline

Now we will build an ML pipeline that automates training our model on the processed data from the ETL pipeline.

For this purpose, we will make use of Scikit-learn’s Pipeline. First, we will load the data from the database and separate it into X (inputs) and Y (ground truth/labels). Our inputs are the actual messages, and our labels are the binary-encoded values that represent the categories of each message.
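A minimal sketch of the loading step, reusing the illustrative database and table names from the ETL sketch above and assuming the first four columns are metadata:

```python
import pandas as pd
from sqlalchemy import create_engine

# Read the cleaned table back from the SQLite database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages', engine)

# X: the raw message text; Y: the 36 binary category columns.
# Assumption: the first four columns are id, message, original, genre.
X = df['message']
Y = df.iloc[:, 4:]
```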

Now, we will build a pipeline for our model. The first thing we have to do is convert our text input into numerical features. For this, we will apply CountVectorizer and TfidfTransformer to our messages.

For our ML method, we will use a Random Forest classifier, wrapped in a multi-output classifier so that it can predict multiple labels per message, as our dataset requires. Finally, we will make use of Scikit-learn’s GridSearchCV to find the best parameters during training. The code for building such a pipeline looks like the following.
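Below is a sketch of that pipeline, using Scikit-learn’s MultiOutputClassifier to wrap the Random Forest; the parameter grid shown is illustrative, not necessarily the one used in the project:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV

# Bag-of-words counts -> TF-IDF weighting -> one Random Forest per label
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# A small, illustrative parameter grid for the search
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__n_estimators': [50, 100],
}

# Grid search over the whole pipeline to find the best parameters
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X, Y)
```

Because the vectorizer and the classifier live in one Pipeline, GridSearchCV can tune text-processing and model parameters together, and the best estimator can be pickled as a single object for the web app.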

Building a Flask App for Visualization

Finally, we will build a Flask app which accepts any message and shows the categories predicted for it. It will also show some of the visualizations from the dataset.
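A minimal sketch of such an app; here ‘model’ (the trained pipeline loaded from disk), ‘category_names’, and the template names are assumptions, not necessarily what the project uses:

```python
from flask import Flask, render_template, request

app = Flask(__name__)


@app.route('/')
def index():
    # Landing page with the dataset visualizations
    return render_template('master.html')


@app.route('/go')
def go():
    # Classify the message typed into the query box.
    # `model` and `category_names` are assumed to be loaded at startup.
    query = request.args.get('query', '')
    labels = model.predict([query])[0]
    results = dict(zip(category_names, labels))
    return render_template('go.html', query=query,
                           classification_result=results)
```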

Below is an example of classification on the web app.

Front-end of the Flask web app

Conclusion

In this article, I built ETL and ML pipelines along with a Flask web app that classifies messages into their categories for disaster response. There are still many improvements that could be made to this project.

The full code and additional analysis are available on my GitHub here.

Written by Nouman

Software Engineer who loves Data Science and building products related to data. Connect with me on LinkedIn here: https://www.linkedin.com/in/nouman10/