🔥Lit or Arson? Disaster Tweet Classification Part One: Data Exploration

Alex Lau
9 min read · Mar 29, 2020


Natural language processing (NLP) in various forms has been around for decades, but has seen immense growth in recent years thanks to software libraries that make it easier than ever to analyze and perform machine learning on free text. A recent Kaggle contest to classify tweets between those discussing a real emergency or not appears simple on the surface, but provides compelling challenges even for people used to working with text and looking to experiment with newer techniques.

Part 1: Data Exploration — You’re here!

Part 2: Starting Feature Engineering & Selection

What we’ll be doing in this section

  • Overview of the challenge
  • Importing the data into Python
  • Exploratory data analysis (EDA)

In following parts, we’ll explore feature engineering & selection, take a look at some great workhorse ML models, then move on to trying out H2O’s AutoML package and BERT embeddings.

Why do we care?

The Kaggle link for this challenge provides a concise example of why we would even want to classify these tweets in the first place, and I encourage you to give it a quick read before continuing.

Let’s get started!

I’ll be using a combination of the wonderful Jupyter Lab and Plotly Express, an easy-to-use wrapper for the wonderfully handy Plotly visualization package.
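If you don't already have these packages, a minimal setup might look like the following (my assumption of a standard pip environment; adapt for conda as needed):

# Install dependencies: run in a terminal, or prefix with "!" in a Jupyter cell
# pip install pandas plotly jupyterlab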

# Pandas for data manipulation
import pandas as pd
# Plotly Express for interactive charting inside jupyter lab
import plotly.express as px

Plotly Express has built-in support for offline charting inside Jupyter; meeting the same goal with vanilla Plotly required many other packages and dependencies.

Importing tweet data

The Kaggle dataset is a preprocessed version of an open dataset from Data for Everyone. Kaggle has conveniently split the tweets into training and test .csv files. The train.csv file will be used for building our models, while test.csv will be used for model evaluation on Kaggle’s leaderboard.

# Read in train.csv and take a look at the first few rows
train_df = pd.read_csv("data/train.csv")
train_df.head()

Not counting the id column, we’re given just three features and our binary target.

  • id: Unique identifier for each tweet
  • keyword: A single keyword from the tweet. This can be blank.
  • location: The location of the tweet. This field may also be blank.
  • text: The raw text of the tweet
  • target: 1 for disaster-related, and 0 for not disaster

High-level look at our data

train_df.info()

We can see that over 99% of keyword is filled in, a good sign if we want to try using it as a predictive feature. However, about a third of the location values are missing. We also see that we’re not missing any tweets or targets.
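To put exact numbers on this, a quick check (standard pandas, not from the original output) is:

# Percentage of missing values in each column
train_df.isna().mean() * 100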

Pandas’ DataFrame.describe() method provides a simple and elegant way of getting basic exploratory statistics on a dataframe. This is especially useful when working with numeric features (which we currently don’t have).

By default the method ignores non-numeric columns such as text, but we can force it to include those columns by passing the include='all' parameter:

train_df.describe(include='all')
Table is truncated as we don’t have meaningful numeric statistics

We see that there are only 221 keywords being used, while many thousands of unique locations exist.

A categorical feature with a large list of potential values can be described as having high cardinality, and is generally less desirable than one with low cardinality. We’ll see later on how to encode a categorical feature into a format usable for a machine learning algorithm, and the effects of high cardinality vs. low.
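As a quick preview of why cardinality matters, here's a minimal sketch using pd.get_dummies, the simplest one-hot encoder in pandas. Each unique value becomes its own column, so the high-cardinality location feature produces a far wider matrix than keyword (the exact counts depend on the data):

# One-hot encode each feature and compare how many columns result
keyword_dummies = pd.get_dummies(train_df['keyword'], prefix='kw')
location_dummies = pd.get_dummies(train_df['location'], prefix='loc')
print(f"keyword  -> {keyword_dummies.shape[1]} columns")
print(f"location -> {location_dummies.shape[1]} columns")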

Check for class imbalances

Classification can be much more difficult when one prediction label is rare, a situation referred to as class imbalance. We’ll do a quick check on our target column to see if we need to perform any imbalance mitigation.

# Instantiate a histogram using Plotly Express
fig = px.histogram(train_df,
                   x="target",
                   width=500,
                   height=500)
fig.update_layout(xaxis_type='category',
                  title_text='Class distribution')
fig.show()

Lucky for us, our 1 class is not rare. It’s important to check this early with each new dataset as it can drastically change how you approach the problem.

Always remember to check for class imbalances!
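For a numeric version of the same check, normalized value counts give the class proportions in one line:

# Proportion of tweets in each target class
train_df['target'].value_counts(normalize=True)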

Missing data

We did a quick check early on by running train_df.info() and saw there was some amount of missing data in keyword and location, with the latter exhibiting a much higher level of blank values.

There are many approaches to dealing with missing data. Choosing between them depends on one’s goals, whether the feature appears important to the task, domain knowledge, and the time available (a quick sketch of each approach follows the list):

  • Fill in missing data
  • Drop the rows that are missing data
  • Drop the feature column
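Here is a hedged sketch of what each option looks like in pandas (illustrative only; the 'no_keyword' placeholder is just an example value, and we haven't committed to any of these yet):

# 1. Fill in missing values with a placeholder
filled_keywords = train_df['keyword'].fillna('no_keyword')
# 2. Drop rows that are missing a value in a given column
rows_dropped = train_df.dropna(subset=['location'])
# 3. Drop the feature column entirely
col_dropped = train_df.drop(columns=['location'])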

To decide on a strategy, we’ll take a closer look at the keyword and location data.

"Keyword" exploration

The Series.value_counts() method is another great Pandas tool for checking out the most and least common values in a column.

# Value counts
train_df['keyword'].value_counts(dropna=False)

Pandas Tips!

Remember that selecting a single column of our dataframe returns a Pandas series. The .value_counts() method only works with a series.

The dropna=False parameter counts up missing values. By default the method will ignore those.
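To see the series vs. dataframe distinction concretely:

# Single brackets return a Series; double brackets return a DataFrame
print(type(train_df['keyword']))
print(type(train_df[['keyword']]))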

Interesting things we see even from the first 10 entries:

  • We kept our missing values and see that as our most common “value” for the column
  • The %20 string appears twice; unsurprisingly, it is the URL (percent) encoding for a space, which we can decode as shown after this list
  • Epicentre uses the British spelling, which may be useful information downstream if we are looking to combine similar words or perform other NLP operations
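Python's standard library can decode those URL-encoded characters for us. A minimal example (the keyword below just illustrates the pattern):

from urllib.parse import unquote

# %20 decodes back to a plain space
unquote("airplane%20accident")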

Let’s also take a look at our keywords in alphabetical order.

# Sort keywords alphabetically
train_df['keyword'].value_counts(dropna=False).sort_index()

Probably the most interesting thing we see is that separate keywords exist for wreck, wreckage, and wrecked. This may give us ideas about combining keyword terms to further shrink the cardinality of the column.

Interactive bar chart of keywords

If you installed Plotly Express, running this code block will create an interactive chart showing all the keywords from least common to most common.

# Create a new dataframe with our count data
train_keywords_bar_df = train_df['keyword'].value_counts().reset_index()
train_keywords_bar_df.columns = ['keyword', 'count']
# Bar chart
fig = px.bar(train_keywords_bar_df,
             x="keyword",
             y="count")
fig.update_layout(xaxis_type='category',
                  title_text='Keywords')
fig.show()

Pandas tip!

The px.bar() function works very easily with Pandas dataframes: by passing in a dataframe, we can simply tell the function which columns to use for the axes. But train_df['keyword'].value_counts() returns a Pandas series. We can convert it to a dataframe by chaining .reset_index(), which creates a new numeric index for each row and moves our keywords into a regular column (named "index" here; newer versions of pandas choose friendlier names). We overwrite the new dataframe's column names with the list ['keyword', 'count'], then pass "keyword" as our x parameter and "count" as our y.

Chart will be interactive when running in one’s own Jupyter environment

Exploring the keywords interactively in your own Jupyter notebook may be easier than exporting and reading through a .csv dump, especially in these early stages.

Is merely having a keyword predictive of the target?

Before going deeper into each keyword, we can also do a quick check to see if merely having or not having a keyword has any predictive value. In fact, this represents a common pattern we will use often: create easy-to-derive features and check them for usefulness before moving to more complex and time-consuming feature engineering.

# Copy a slice of our dataframe
keyword_target_df = train_df[['keyword', 'target']].copy()
# Add an indicator column flagging whether the keyword is missing
keyword_target_df['keyword_missing'] = keyword_target_df['keyword'].isna()
# Drop the actual keyword column
keyword_target_df.drop('keyword', axis=1, inplace=True)
# Check our output
keyword_target_df.head()
# Visualize as bar charts
fig = px.histogram(keyword_target_df,
                   x="keyword_missing",
                   color="target",
                   facet_col='target',
                   histnorm='probability density')
fig.update_layout(xaxis_type='category',
                  yaxis_title='%',
                  width=1000,
                  height=500,
                  title_text='Keywords Missing vs. Available % by Target Class')
fig.show()

We find that both targets have small numbers of missing keywords, but the differences between the classes are not striking. The existence or lack of a keyword may still be predictive, but we may find more useful features elsewhere.
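To quantify "not striking", a normalized crosstab (standard pandas, added here for illustration) shows the share of missing keywords within each class:

# Share of missing vs. present keywords within each target class
pd.crosstab(keyword_target_df['keyword_missing'],
            keyword_target_df['target'],
            normalize='columns')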

Are individual keyword terms predictive?

Let’s take a closer look at our raw keywords and see if any terms are correlated with the target classes.

# Pivot table counting, for each unique keyword, the number of tweets in each class
keyword_by_target_dist = pd.pivot_table(
    train_df[['id', 'keyword', 'target']],
    index='keyword',
    columns='target',
    aggfunc=lambda x: len(x.unique()),
    fill_value=0
)
# Look at output
keyword_by_target_dist

We already see some interesting distributions from our quick check. The words wreck, wreckage, and wrecked all sound similar, but have decidedly one-sided class distributions, and wreckage has a distribution completely different from the other two words.

Aside: Lemmatization & Stemming

Two common NLP techniques are lemmatization and stemming, which can help normalize terms that share a semantic meaning but take different grammatical forms. We’ll talk about these in more detail later, but here are a couple of examples:

- Lemmatization: is and are are transformed to their shared lemma, be

- Stemming: Plural terms like ships are transformed to the singular version ship

These techniques can help reduce unneeded dimensionality in our data along with reducing noise.
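As a quick illustration, here is a minimal sketch using NLTK, one common choice for both techniques (this assumes nltk is installed and downloads the WordNet data the lemmatizer needs):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the lemmatizer's dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Compare the stemmed and lemmatized (as verbs) forms side by side
for word in ['wreck', 'wreckage', 'wrecked', 'ships', 'are']:
    print(word, '->', stemmer.stem(word), '|', lemmatizer.lemmatize(word, pos='v'))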

We can create a new feature to quantify the bias a keyword has towards a target of 1:

# Calculate the percentage of each keyword's instances that fall in target class 1
keyword_by_target_dist['keyword_bias'] = keyword_by_target_dist.apply(
    lambda row: row['id'][1] / (row['id'][0] + row['id'][1]),
    axis=1
)
# Sort from high to low bias
keyword_by_target_dist = keyword_by_target_dist.sort_values(
    by='keyword_bias', ascending=False
)
# Look at our output
keyword_by_target_dist

Like our earlier interactive bar chart of keywords, we can chart keyword bias and explore it interactively in Jupyter.

# Bar chart
fig = px.bar(keyword_by_target_dist.reset_index()[['keyword', 'keyword_bias']],
             x="keyword",
             y="keyword_bias")
fig.update_layout(xaxis_type='category',
                  title_text='Keyword Bias Towards Target')
fig.show()

We find a spectrum: some terms are highly correlated with disaster tweets, some are inversely correlated, and a large swath sits somewhere between the two extremes. We also saw earlier that similar terms can have vastly different correlations with our class targets. These qualities may make stemming and lemmatization of our keywords slightly less attractive, since we would lose some of the specificity inherent in the raw keywords. We do need to weigh that against model generality and the risk of overfitting later on; training a model on these very specific keywords may actually hurt accuracy down the line when the model faces new text and keywords.
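One quick sanity check along those lines (assuming test.csv was downloaded alongside train.csv as described earlier) is to see whether the test set uses any keywords that never appear in training:

# Read in the test set and compare keyword vocabularies
test_df = pd.read_csv("data/test.csv")
train_keywords = set(train_df['keyword'].dropna())
test_keywords = set(test_df['keyword'].dropna())
print("Keywords in test but not in train:", test_keywords - train_keywords)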

Next steps

In Part 2, we’ll look at Location along with some other meta-features of our dataset.

Much more can still be done with Keyword! One can try stemming and lemmatizing the keywords and re-examining class correlation, or checking and normalizing for any region-specific spelling differences that might exist.

Generally one will have many more EDA ideas than the time available to actually try them all out. Take advantage of these Kaggle datasets to explore ideas without the pressure of a deliverable deadline!


Alex Lau

Data scientist, cat foster father, D&D wannabe — California