Using pipelines for data loading and analysis is one of the most popular ways of building data science projects.
In this project, I used the data from FigureEight to build a web app that classifies incoming messages into the appropriate disaster categories. This helps responders react more effectively to different types of disasters.
The data from FigureEight consists of two CSV files. ‘disaster_messages.csv’ contains the actual messages from users and their corresponding ids. ‘disaster_categories.csv’ contains the message ids and their categories in raw format.
The dataset is a multi-class, multi-label dataset: there are 36 categories in total, and each message can belong to more than one of them. The total number of messages in the dataset is 10246.
After analyzing the count of each category, we see that the categories ‘related’ and ‘aid_related’ are the most common, while the ‘child_alone’ category has the fewest messages.
Another analysis that I did was counting the number of different categories assigned to each message. Interestingly, most of the messages didn’t even have a category. Among the messages that did, most had either one category or five categories.
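Both counts can be computed with a few lines of pandas. The sketch below is only an illustration: it uses a made-up three-column label table instead of the real 36 categories, and it assumes the categories are already stored as one 0/1 column per category.

```python
import pandas as pd

# Toy stand-in for the label table (assumed layout: one 0/1 column per category).
labels = pd.DataFrame(
    {
        "related": [1, 1, 0, 1],
        "aid_related": [1, 0, 0, 1],
        "child_alone": [0, 0, 0, 0],
    }
)

# Number of messages per category, most common first.
category_counts = labels.sum(axis=0).sort_values(ascending=False)

# Number of categories assigned to each message ...
labels_per_message = labels.sum(axis=1)
# ... and how many messages have 0, 1, 2, ... categories.
count_distribution = labels_per_message.value_counts().sort_index()
```

Plotting `category_counts` and `count_distribution` as bar charts gives the two views described above.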
Building an ETL pipeline
An ETL pipeline is a specific and very common kind of data pipeline. ETL stands for Extract, Transform, Load. The pipeline extracts the raw data, transforms it into a clean format, and loads it into a store where it is ready for further analysis.
We will start by loading the data from the two csv files and merging them together into one DataFrame. After that, we will transform the data to be ready for model fitting.
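A minimal sketch of the extract step is shown here. It reads from small in-memory strings so that it runs on its own; in the project the two CSV files would be read from disk. The column names `id`, `message`, and `categories` are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the two CSV files (column names are assumed).
messages_csv = io.StringIO("id,message\n1,We need water\n2,Is the storm over?\n")
categories_csv = io.StringIO(
    "id,categories\n1,related-1;water-1\n2,related-1;water-0\n"
)

messages = pd.read_csv(messages_csv)
categories = pd.read_csv(categories_csv)

# Join on the shared message id so each row holds a message
# together with its raw category string.
df = messages.merge(categories, on="id", how="inner")
```

An inner join keeps only the ids that appear in both files, which avoids rows with a message but no labels.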
First, we will extract all the different categories that are present in our data. Then we will create a separate column for each category, which will contain ‘1’ if the message has that category as its label and ‘0’ if it doesn’t. For this purpose, we will use the following function:
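The original function is not reproduced in this text, so what follows is a sketch of one way to do the transformation. It assumes the raw column holds strings such as `related-1;water-0`, with one `name-flag` pair per category separated by semicolons; the function name `expand_categories` is hypothetical.

```python
import pandas as pd


def expand_categories(df: pd.DataFrame, column: str = "categories") -> pd.DataFrame:
    """Replace a raw 'name-flag;name-flag' column with one 0/1 column per category."""
    # Split each raw string into one column per 'name-flag' pair.
    parts = df[column].str.split(";", expand=True)
    # Take the category names from the first row, dropping the trailing '-flag'.
    parts.columns = parts.iloc[0].str.rsplit("-", n=1).str[0].tolist()
    # Keep only the flag after the last '-' and store it as an integer.
    for name in parts.columns:
        parts[name] = parts[name].str.rsplit("-", n=1).str[1].astype(int)
    # Swap the raw column for the expanded 0/1 columns.
    return pd.concat([df.drop(columns=[column]), parts], axis=1)


raw = pd.DataFrame(
    {
        "id": [1, 2],
        "message": ["We need water", "Is the storm over?"],
        "categories": ["related-1;water-1", "related-1;water-0"],
    }
)
clean = expand_categories(raw)
```

Splitting on the last `-` rather than the first keeps category names intact even if a name itself contains a hyphen.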
Finally, we will store the transformed DataFrame in a SQLite database. This ends our ETL pipeline.
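The load step can be sketched with pandas and the standard-library `sqlite3` module. The table name `messages` is a placeholder, and the in-memory database is used only so the example runs without touching the file system; a real run would pass a file path instead.

```python
import sqlite3

import pandas as pd

# Small stand-in for the transformed DataFrame.
clean = pd.DataFrame(
    {
        "id": [1, 2],
        "message": ["We need water", "Is the storm over?"],
        "related": [1, 1],
        "water": [1, 0],
    }
)

# ":memory:" keeps the example self-contained; use a file path in practice.
conn = sqlite3.connect(":memory:")

# Write the table, replacing it if the pipeline is run again.
clean.to_sql("messages", conn, index=False, if_exists="replace")

# Quick sanity check that every row was written.
row_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
conn.close()
```

Using `if_exists="replace"` makes the step safe to rerun, at the cost of discarding whatever the table held before.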
Building an ML Pipeline
Now we will build an ML pipeline that will automate the training of our model on the processed data from the ETL pipeline.
For this purpose, we will make use of Scikit-learn’s Pipelines. First, we will load the data from the database and separate it into X (inputs) and Y (ground truth/labels) values. Our inputs are the actual messages and our labels are the binary (0/1) category columns created in the ETL pipeline, one column per category.
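As a rough sketch of this loading step, the example below first writes a tiny table to an in-memory SQLite database and then reads it back. The table name `messages` and the column names `id` and `message` are assumptions; every other column is treated as a label.

```python
import sqlite3

import pandas as pd

# Set up a tiny database so the example runs on its own.
conn = sqlite3.connect(":memory:")
pd.DataFrame(
    {
        "id": [1, 2],
        "message": ["We need water", "Is the storm over?"],
        "related": [1, 1],
        "water": [1, 0],
    }
).to_sql("messages", conn, index=False)

# Load the processed table back into a DataFrame.
df = pd.read_sql("SELECT * FROM messages", conn)
conn.close()

# Inputs are the raw message texts; labels are all remaining 0/1 columns.
X = df["message"]
Y = df.drop(columns=["id", "message"])
category_names = list(Y.columns)
```

Keeping `category_names` around is useful later, when predictions need to be mapped back to readable labels.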
For our ML method, we will use a Random Forest Classifier, adapted to output multiple labels per message as our dataset requires. Finally, we will make use of Scikit-learn’s GridSearchCV to find the best parameters during training. Following is the code used for building such a pipeline:
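The original code is not included in this text. The sketch below shows one plausible way to assemble such a pipeline, not necessarily the exact one used in the project. The TF-IDF text vectorizer, the use of `MultiOutputClassifier` as the multi-label wrapper, the parameter grid, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def build_model() -> GridSearchCV:
    """Text features -> one random forest per label, tuned by grid search."""
    pipeline = Pipeline(
        [
            # Turn each message into a TF-IDF weighted bag of words (assumed step).
            ("tfidf", TfidfVectorizer()),
            # Fit one random forest per label column.
            ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
        ]
    )
    # Deliberately tiny grid; a real search would cover more values.
    param_grid = {"clf__estimator__n_estimators": [10, 20]}
    return GridSearchCV(pipeline, param_grid=param_grid, cv=2)


# Toy data: six messages, two labels ("related", "water").
X = [
    "we need water",
    "water please",
    "nice weather today",
    "the storm passed",
    "need water now",
    "hello there",
]
Y = np.array([[1, 1], [1, 1], [0, 0], [1, 0], [1, 1], [0, 0]])

model = build_model()
model.fit(X, Y)
prediction = model.predict(["please send water"])
```

Parameter names in the grid follow the pipeline's nesting: `clf` is the pipeline step, `estimator` is the forest wrapped by `MultiOutputClassifier`, and `n_estimators` is the forest's own parameter.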
Building a Flask App for Visualization
Finally, we will build a Flask app that accepts any message and shows the categories predicted for it. It will also show some visualizations of the dataset.
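A stripped-down sketch of the classification endpoint could look like the following. It returns JSON instead of rendering a page with visualizations, and the factory name `create_app`, the `/classify` route, and the `predict` callable are hypothetical; in the project, `predict` would wrap the trained model.

```python
from typing import Callable, Sequence

from flask import Flask, jsonify, request


def create_app(
    predict: Callable[[str], Sequence[int]], category_names: Sequence[str]
) -> Flask:
    """Build a Flask app around any function that maps a message to 0/1 flags."""
    app = Flask(__name__)

    @app.route("/classify")
    def classify():
        # Read the message from the query string, e.g. /classify?message=...
        message = request.args.get("message", "")
        flags = predict(message)
        # Pair each category name with its predicted flag.
        return jsonify(
            {name: int(flag) for name, flag in zip(category_names, flags)}
        )

    return app


# Stand-in predictor used only for this example: flags "water" by keyword.
def keyword_predict(message: str) -> list:
    return [1, int("water" in message.lower())]


app = create_app(keyword_predict, ["related", "water"])
```

Passing the predictor in as an argument keeps the web layer independent of the model, so the app can be exercised with Flask's test client before a trained model is available.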
Below is an example of classification on the web app.
In this article, I used ETL and ML pipelines along with a Flask web app to classify messages into their categories for disaster response. Many improvements can still be made, such as:
- Use advanced classifiers such as Neural Networks
- Connect the messages to appropriate disaster response organizations
- Deploy the website using a Cloud Provider such as AWS
To see the full code and additional analysis, see the code available on my GitHub here