Performing Deep NLP and other ML Models on Tweets

Reuben Kavalov
Published in The Startup
3 min read · Feb 17, 2020

This week a friend and I noticed a Kaggle competition on analyzing Twitter posts. Since I had already completed a project involving Twitter and natural language processing, this piqued my interest and I dove right in. The goal of the challenge is to predict whether a tweet is about a real disaster or emergency. A deceptive example is a tweet that says “The sky was ablaze last night” alongside a photo of a vibrant sunset: the wording suggests a disaster, but there is no emergency.

Our first step was to look at the data we were given. It comes as two .csv files, one containing the training data and the other the testing data. After reading in the training data, we can inspect it as a pandas DataFrame.
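As a rough sketch of that loading step, the snippet below reads a CSV into a pandas DataFrame. To keep it runnable on its own, it reads a tiny made-up sample from memory instead of the competition file. The column names `id`, `keyword` and `location`, and the helper `load_tweets`, are assumptions for illustration; in practice you would pass the path to the downloaded training file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the training file. The rows are invented
# for illustration, and the column names are assumed.
SAMPLE_TRAIN_CSV = """id,keyword,location,text,target
1,,,Forest fire near the highway evacuations underway,1
2,ablaze,,The sky was ablaze last night,0
3,flood,Houston,Streets are under water after the storm,1
"""


def load_tweets(source):
    """Read a tweets CSV (a file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Hypothetical usage with the real data: load_tweets("path/to/training.csv")
train_df = load_tweets(io.StringIO(SAMPLE_TRAIN_CSV))
print(train_df.head())
```

Empty fields in the CSV come back as `NaN`, which is what makes the missing keywords and locations easy to count later.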

There are ~7600 rows in total, with many missing locations and some missing keywords. The tweet itself is in the text column, and the target column is 1 if the tweet is about a real disaster and 0 if it is not. The .csv file containing the testing data has the same layout, except that it is unlabeled, so there is no target column (that is the column we are trying to predict).
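One way to check these properties is sketched below: count the missing values per column, tally the two target classes, and confirm which column the test data lacks. It again uses a small invented sample rather than the real files, and the column names `id`, `keyword` and `location` are assumptions.

```python
import io

import pandas as pd

# Invented sample rows; the real training file has thousands of tweets.
SAMPLE_TRAIN_CSV = """id,keyword,location,text,target
1,,,Forest fire near the highway evacuations underway,1
2,ablaze,,The sky was ablaze last night,0
3,flood,Houston,Streets are under water after the storm,1
4,crush,London,That workout crushed me today,0
"""

train_df = pd.read_csv(io.StringIO(SAMPLE_TRAIN_CSV))

# Number of missing (NaN) entries in each column.
missing_counts = train_df.isna().sum()
print(missing_counts)

# How many tweets fall into each class (1 = real disaster, 0 = not).
target_counts = train_df["target"].value_counts()
print(target_counts)

# Stand-in for the unlabeled test data: same columns minus the target.
test_df = train_df.drop(columns=["target"])
columns_only_in_train = set(train_df.columns) - set(test_df.columns)
print(columns_only_in_train)
```

On the real data, the same three checks show how sparse the location and keyword columns are and whether the two classes are roughly balanced.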

Reuben Kavalov
The Startup

Data scientist and machine learning engineer with a passion for connecting people through technology and information.