Part 2: Creating a Machine Learning Pipeline

In part 2 of the Classifying Disaster Messages series, I handle the NLP tasks and build a machine learning pipeline that classifies messages sent during disasters so they can be routed to the appropriate relief agencies. In part 1, I showed how to build an ETL pipeline to prepare the model's data.
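Before diving in, here is a rough sketch of the kind of text-classification pipeline this part builds. The toy messages, labels, and the choice of TF-IDF plus logistic regression are illustrative assumptions, not the exact components used in the project:

```python
# Hypothetical sketch of an NLP classification pipeline (scikit-learn).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real disaster messages.
messages = [
    "We need water and food urgently",
    "The weather is nice today",
    "People trapped after the earthquake, send help",
    "Just watched a great movie",
]
labels = [1, 0, 1, 0]  # 1 = aid-related, 0 = not

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # turn raw text into TF-IDF features
    ("clf", LogisticRegression()),  # simple classifier on top
])
pipeline.fit(messages, labels)
prediction = pipeline.predict(["send food and water"])[0]
```

Wrapping the vectorizer and classifier in a single `Pipeline` keeps the text transformation and the model fit together, which also makes cross-validation and grid search straightforward later.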

Loading Data

Reviewing the data loaded from the SQLite database, we can see that the dataset contains messages pulled from social media and direct texts, translated into English, along with flags for the categories each message was placed in. A computer can analyze numbers quite…
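A minimal sketch of this loading step, using pandas against SQLite. The table name `messages` and the column layout here are assumptions for illustration; an in-memory database stands in for the file the ETL pipeline produced:

```python
import sqlite3

import pandas as pd

# Build a tiny stand-in database; in the real project the ETL pipeline
# from part 1 produced this file, and the table name is assumed.
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({
    "message": ["Need water", "Road blocked"],
    "related": [1, 1],
    "request": [1, 0],
})
demo.to_sql("messages", conn, index=False)

# Load the messages and split the raw text from the category flags.
df = pd.read_sql("SELECT * FROM messages", conn)
X = df["message"]                 # raw text to feed the NLP steps
Y = df.drop(columns=["message"])  # one binary column per category
```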


Part 1: Building an ETL Pipeline

Disaster response professionals currently face a challenge in triaging the messages sent out during natural disasters: some are important or relevant, while many are not. After a disaster there are typically millions of communications, usually direct messages or social media posts, that response organizations have to filter for importance. Such a problem is well suited to the power of data science and machine learning.

In this series, I will show some ways that we can attempt to classify messages sent out during a disaster so that they…
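The ETL stage from part 1 can be sketched roughly as follows. The column names and the packed `category-flag` string format are assumptions for illustration, and small in-memory DataFrames stand in for the raw CSV files:

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the two raw tables the real project would load from CSV.
messages = pd.DataFrame({
    "id": [1, 2],
    "message": ["Need water", "All is fine"],
})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["related-1;request-1", "related-0;request-0"],
})

# Merge, then expand the packed category string into one binary column each.
df = messages.merge(categories, on="id")
expanded = df["categories"].str.split(";", expand=True)
expanded.columns = [c.split("-")[0] for c in expanded.iloc[0]]
expanded = expanded.apply(lambda col: col.str[-1].astype(int))
df = pd.concat([df.drop(columns=["categories"]), expanded], axis=1)

# Persist the cleaned table for the modeling stage in part 2.
conn = sqlite3.connect(":memory:")
df.to_sql("messages", conn, index=False)
```

Storing the cleaned result in SQLite gives the modeling notebook a single, tidy table to load, rather than re-running the cleaning every time.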


Predicting user churn, or the rate of attrition, is a challenging problem for any company to deal with.

Many different factors come into play in why a particular user may or may not churn. In this project I use PySpark to analyze and predict churn using data similar to that of companies like Spotify and Apple Music.

Why PySpark?

I chose PySpark for its scalability and speed. PySpark works with Resilient Distributed Datasets (RDDs), which let computation be distributed across a cluster, and Spark is faster than disk-based frameworks because it keeps intermediate data in memory.

Data

For the first part of my analysis, I worked on a smaller subset of the whole 12GB dataset. Once I finish my analysis and modeling, I will deploy…


The Coronavirus took a huge toll on the United States and showed that the country was not prepared for such an event. I decided to look for data that could be used to identify counties that are more vulnerable to the spread of a pandemic than others.

Using county-level data collected from the New York Times, the Census Bureau, the CDC, and Google, I created a dataset to predict whether a particular county would have 0.1%, 0.2%, or 1% of its population infected with COVID-19.
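Deriving those targets amounts to thresholding each county's infection rate. The toy counties, column names, and numbers below are made up for illustration; the thresholds are the 0.1%, 0.2%, and 1% levels described above:

```python
import pandas as pd

# Toy county data; the real sources were NYT, Census, CDC, and Google.
counties = pd.DataFrame({
    "county": ["A", "B", "C"],
    "population": [100_000, 50_000, 200_000],
    "cases": [150, 20, 2500],
})

# One binary target column per infection-rate threshold.
rate = counties["cases"] / counties["population"]
for threshold in (0.001, 0.002, 0.01):
    counties[f"over_{threshold:.1%}"] = (rate >= threshold).astype(int)
```

Framing each threshold as its own binary label lets a separate classifier (or one multi-output model) be trained per severity level.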

Arman Berek

Certified Data Science Professional with a passion for learning and teaching.
