Natural language processing (NLP) in various forms has been around for decades, but it has seen immense growth in recent years thanks to software libraries that make it easier than ever to analyze and perform machine learning on free text. A recent Kaggle contest to classify tweets as discussing a real emergency or not appears simple on the surface, but it provides compelling challenges even for people used to working with text who are looking to experiment with newer techniques.
Part 1: Data Exploration — You’re here!
What we’ll be doing in this section
- Overview of the challenge
- Importing the data into Python
- Exploratory data analysis (EDA)
In following parts, we’ll explore feature engineering & selection, take a look at some great workhorse ML models, then move on to trying out H2O’s AutoML package and BERT embeddings.
Why do we care?
The Kaggle link for this challenge provides a concise example of why we would even want to classify these tweets in the first place, and I encourage you to give it a quick read before continuing.
Let’s get started!
I’ll be using a combination of the wonderful Jupyter Lab and Plotly Express, an easy-to-use wrapper for the wonderfully handy Plotly visualization package.
# Pandas for data manipulation
import pandas as pd

# Plotly Express for interactive charting inside Jupyter Lab
import plotly.express as px
Plotly Express has built-in support for offline charting inside Jupyter. Using vanilla Plotly to meet the same goal required many other packages and dependencies.
Importing tweet data
The Kaggle dataset is a preprocessed version of an open dataset from Data for Everyone. Kaggle has conveniently split the tweets into training and test .csv files. The `train.csv` file will be used for building our models, while `test.csv` will be used for model evaluation on Kaggle’s leaderboard.
# Read in train.csv and take a look at the first few rows
train_df = pd.read_csv("data/train.csv")
train_df.head()
Not counting the `id` column, we’re given just three features and our binary target.
- `id`: Unique identifier for each tweet
- `keyword`: A single keyword from the tweet. This can be blank.
- `location`: The location of the tweet. This field may also be blank.
- `text`: The raw text of the tweet
- `target`: `1` for disaster-related, `0` for not disaster
High-level look at our data
train_df.info()
We can see that over 99% of `keyword` is filled in, a good sign if we want to try and use it as a predictive feature. However, we are missing values for about a third of the `location` feature. We also see that we’re not missing any tweets or targets.
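To put exact numbers on the missingness, `DataFrame.isna().mean()` gives the fraction of blank values per column. Here is a minimal sketch on a toy dataframe standing in for `train_df` (the values are invented for illustration; the real percentages will differ):

```python
import pandas as pd

# Toy stand-in for train_df; values are hypothetical
df = pd.DataFrame({
    "keyword": ["fire", None, "flood", "fire"],
    "location": [None, None, "NYC", None],
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = df.isna().mean() * 100
print(missing_pct)
```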
Pandas’ `DataFrame.describe()` method provides a simple and elegant way of getting basic exploratory statistics on a dataframe. This is especially useful when working with numeric features (which we currently don’t have).
By default the method will ignore non-numeric columns such as `text`, but we can force it to include those columns by passing the `include='all'` parameter:
train_df.describe(include='all')
We see that there are only 221 keywords being used, while many thousands of unique locations exist.
A categorical feature with a large list of potential values can be described as having high cardinality, and is generally less desirable than one with low cardinality. We’ll see later on how to encode a categorical feature into a format usable for a machine learning algorithm, and the effects of high cardinality vs. low.
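To see why cardinality matters, consider one-hot encoding, one common way of turning categories into model-ready columns: the encoded width equals the number of unique values. A quick sketch with made-up values:

```python
import pandas as pd

# Hypothetical low- and high-cardinality features
low_card = pd.Series(["fire", "flood", "fire", "storm"])    # 3 unique values
high_card = pd.Series(["NYC", "London", "Tokyo", "Paris"])  # every row unique

# One-hot encoding creates one column per unique value, so high
# cardinality means a much wider (and sparser) result
low_shape = pd.get_dummies(low_card).shape
high_shape = pd.get_dummies(high_card).shape
print(low_shape, high_shape)
```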
Check for class imbalances
Classification can be much more difficult when a prediction label is rare, a situation referred to as class imbalance. We’ll do a quick check on our `target` column to see if we need to perform any imbalance mitigation.
# Instantiate histogram using plotly express
fig = px.histogram(train_df,
                   x="target",
                   width=500,
                   height=500)
fig.update_layout(xaxis_type='category',
                  title_text='Class distribution')
fig.show()
Lucky for us, our `1` class is not rare. It’s important to check this early with each new dataset as it can drastically change how you approach the problem.
Always remember to check for class imbalances!
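If you prefer a numeric check over a chart, `value_counts(normalize=True)` gives the class proportions directly. A sketch on a toy target column (the real ratio in this dataset will differ):

```python
import pandas as pd

# Toy target column; the real train_df["target"] ratio will differ
target = pd.Series([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Relative frequency of each class; a class showing up at only a few
# percent would be a red flag calling for imbalance mitigation
class_ratio = target.value_counts(normalize=True)
print(class_ratio)
```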
Missing data
We did a quick check early on by running `train_df.info()` and saw there was some amount of missing data in `keyword` and `location`, with the latter exhibiting a much higher level of blank values.
There are many approaches to dealing with missing data. Deciding between them depends on one’s goals, whether the feature appears important to the task, domain knowledge, and the time available:
- Fill in missing data
- Drop the rows that are missing data
- Drop the feature column
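Each of the three strategies maps to a one-liner in Pandas. A sketch on a toy dataframe (column names match ours; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({"keyword": ["fire", None, "flood"],
                   "location": [None, "NYC", None]})

# 1) Fill in missing data with a placeholder value
filled = df.fillna({"keyword": "no_keyword"})

# 2) Drop the rows that are missing data in a given column
kept_rows = df.dropna(subset=["keyword"])

# 3) Drop the feature column entirely
no_location = df.drop(columns=["location"])
```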
To decide on a strategy, we’ll take a closer look at the `keyword` and `location` data.
"Keyword" exploration
The `Series.value_counts()` method is another great Pandas tool for checking out the most and least common values in a column.
# Value counts
train_df['keyword'].value_counts(dropna=False)
Pandas Tips!
Remember that selecting a single column of our dataframe returns a Pandas series. The `.value_counts()` method only works with a series. The `dropna=False` parameter counts up missing values; by default the method will ignore those.
Interesting things we see even from 10 entries:
- We kept our missing values, and the missing value is the most common “value” for the column
- The `%20` string appears twice; unsurprisingly, it is the URL encoding for a space
- `Epicentre` uses the British spelling, which may be useful information downstream if we are looking to combine similar words or perform other NLP operations
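Since `%20` is URL percent-encoding, the standard library can clean those keywords up. A sketch with hypothetical keyword strings (the exact keywords containing `%20` may differ in the real data):

```python
from urllib.parse import unquote

import pandas as pd

# Hypothetical keywords containing the %20 artifact
keywords = pd.Series(["airplane%20accident", "wild%20fires", "wreck"])

# unquote() decodes percent-encoded sequences such as %20 back to a space
decoded = keywords.map(unquote)
print(decoded.tolist())
```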
Let’s also take a look at our keywords in alphabetical order.
# Sort keywords alphabetically
train_df['keyword'].value_counts(dropna=False).sort_index()
Probably the most interesting thing we see is that separate keywords exist for `wreck`, `wreckage`, and `wrecked`. This may give us ideas about combining keyword terms to further shrink the cardinality of the column.
Interactive bar chart of keywords
If you installed Plotly Express, running this code block will create an interactive chart showing all the keywords from least common to most common.
# Create new dataframe with our count data
train_keywords_bar_df = train_df['keyword'].value_counts().reset_index()
train_keywords_bar_df.columns = ['keyword', 'count']

# Bar chart
fig = px.bar(train_keywords_bar_df,
             x="keyword",
             y="count")
fig.update_layout(xaxis_type='category',
                  title_text='Keywords')
fig.show()
Pandas tip!
The `px.bar()` function works very easily with Pandas dataframes. By passing in a dataframe, we can just tell the function which column to use for the x-axis. But running `train_df['keyword'].value_counts(dropna=False)` returns a Pandas series. We can convert it to a dataframe by adding `.reset_index()`, which creates a new numeric index for each row and moves our keywords into a new column named “index.” We overwrite the new dataframe’s column names with the list `['keyword', 'count']`. Finally, we pass “keyword” as our `x` parameter and “count” as our `y`.
Exploring the keywords interactively in your own Jupyter notebook may be easier than exporting and reading through a .csv dump, especially in these early stages.
Is merely having a keyword predictive of the target?
Before going deeper into each keyword, we can also do a quick check to see if merely having or not having a keyword has any predictive value. In fact, this represents a common pattern we will use often: create easy-to-derive features and check them for usefulness before moving to more complex and time-consuming feature engineering.
# Copy a slice of our dataframe
keyword_target_df = train_df[['keyword', 'target']].copy()

# Add an indicator column that flags rows with a missing keyword
keyword_target_df['keyword_missing'] = keyword_target_df['keyword'].isna()

# Drop the actual keyword column
keyword_target_df.drop('keyword', axis=1, inplace=True)

# Check our output
keyword_target_df.head()
# Visualize as bar charts
fig = px.histogram(keyword_target_df,
                   x="keyword_missing",
                   color="target",
                   facet_col='target',
                   histnorm='probability density')
fig.update_layout(xaxis_type='category',
                  yaxis_title='%',
                  width=1000,
                  height=500,
                  title_text='Keywords Missing vs. Available % by Target Class')
fig.show()
We find that both targets have small numbers of missing keywords, and the difference between the classes is not striking. The existence or lack of a keyword may still be predictive, but we may find more useful features elsewhere.
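The same comparison can be read off numerically with `pd.crosstab`. A sketch on toy rows shaped like `keyword_target_df` (values invented for illustration):

```python
import pandas as pd

# Toy rows shaped like keyword_target_df; values are hypothetical
df = pd.DataFrame({"keyword_missing": [True, False, False, False, True, False],
                   "target": [1, 1, 0, 0, 0, 1]})

# Share of missing keywords within each target class; similar rates
# across the classes suggest the indicator is only weakly predictive
rates = pd.crosstab(df["target"], df["keyword_missing"], normalize="index")
print(rates)
```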
Are individual keyword terms predictive?
Let’s take a closer look at our raw keywords and see if any terms are correlated with the target classes.
# Create a pivot table to count the numbers of classes per unique keyword
keyword_by_target_dist = pd.pivot_table(
    train_df[['id', 'keyword', 'target']],
    index='keyword',
    columns='target',
    aggfunc=lambda x: len(x.unique()),
    fill_value=0
)

# Look at output
keyword_by_target_dist
We already see some interesting distributions from our quick check. The words `wreck`, `wreckage`, and `wrecked` all sound similar, but have decidedly one-sided class distributions, and `wreckage` has a distribution completely different from the other two words.
Aside: Lemmatization & Stemming
Two common NLP techniques are lemmatization and stemming which can help normalize terms that have the same semantic meanings but have taken different grammatical forms. We’ll talk about these in more detail later, but here are a couple examples:
- Lemmatization: `is` and `are` are transformed to their shared base form `be`
- Stemming: Plural terms like `ships` are transformed to the singular version `ship`
These techniques can help reduce unneeded dimensionality in our data along with reducing noise.
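As a taste of what’s coming, here is a deliberately naive suffix-stripping “stemmer.” This is a toy for illustration only; in practice you would reach for a real stemmer such as NLTK’s PorterStemmer, or a lemmatizer such as spaCy’s:

```python
# Toy suffix stripper, illustration only; real stemmers handle far
# more cases (and lemmatizers use vocabulary plus part of speech)
def toy_stem(word: str) -> str:
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonable stem (3+ chars) remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ("ships", "wrecked", "is")])
```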
We can create a new feature to quantify the bias a keyword has towards a target of `1`:
# Calculate the percentage of keyword instances in target class 1
keyword_by_target_dist['keyword_bias'] = keyword_by_target_dist.apply(
    lambda row: row['id'][1] / (row['id'][0] + row['id'][1]), axis=1)

# Sort from high to low bias
keyword_by_target_dist = keyword_by_target_dist.sort_values(by='keyword_bias', ascending=False)

# Look at our output
keyword_by_target_dist
Like our earlier interactive bar chart of keywords, we can chart keyword bias and explore it interactively in Jupyter.
# Bar chart
fig = px.bar(keyword_by_target_dist.reset_index()[['keyword', 'keyword_bias']],
             x="keyword",
             y="keyword_bias")
fig.update_layout(xaxis_type='category',
                  title_text='Keyword Bias Towards Target')
fig.show()
We find a spectrum of terms: some are highly correlated with disaster tweets, some are inversely correlated, and a large swath sit someplace in between the two extremes. We saw earlier that similar terms could have vastly different correlations to our class targets as well. These qualities may make stemming and lemmatization of our keywords slightly less attractive, as we would lose some of the specificity inherent in the raw keywords. We do need to balance that against model generality and the risk of overfitting later on; training a model on these very specific keywords may actually hurt accuracy down the line when it’s faced with new text and keywords.
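One hedge against that overfitting risk would be to coarsen `keyword_bias` into a few buckets, trading specificity for generality. A sketch with invented bias values; the bucket edges here are an arbitrary choice:

```python
import pandas as pd

# Invented bias values standing in for the keyword_bias column
bias = pd.Series([0.95, 0.80, 0.50, 0.10],
                 index=["derailment", "wreckage", "fire", "aftershock"])

# Coarse buckets reduce the feature's cardinality and its ability
# to memorize individual keywords; the edges are arbitrary
bias_bucket = pd.cut(bias, bins=[0, 0.33, 0.66, 1.0],
                     labels=["low", "mid", "high"])
print(bias_bucket.tolist())
```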
Next steps
In Part 2, we’ll look at `location` along with some other meta-features of our dataset.
Much more can still be done with `keyword`! One could try stemming and lemmatizing the keywords and re-examining class correlation, or checking and normalizing for any region-specific spelling differences that might exist.
Generally one will have many more EDA ideas than the time available to actually try them all out. Take advantage of these Kaggle datasets to explore ideas without the pressure of a deliverable deadline!