🔥Lit or Arson? Disaster Tweet Classification Part Two: Starting Feature Engineering & Selection

Alex Lau
7 min read · Apr 27, 2020

Recap of Part 1

In Part 1 we:

  • Looked at the Kaggle problem
  • Imported the data using Pandas
  • Looked at high-level statistics of our dataset
  • Checked for class imbalances
  • Took a close look at the Keyword feature

Plan for Part 2

Now we’ll:

  • Look at the Location feature
  • Look at tweet character and token length
  • Use the Mann-Whitney U test for feature selection

Part 1: Data Exploration

Part 2: Starting Feature Engineering & Selection — You’re here!

Location exploration

Next, we’ll explore the Location feature in much the same way we explored Keyword in Part 1.

# Value counts
train_df['location'].value_counts(dropna=False)
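
To see what dropna=False changes, here is a minimal sketch on a toy DataFrame. The name toy_df and every value in it are made up for illustration and are not taken from the Kaggle data. It assumes only that missing locations are stored as NaN, which is how Pandas reads empty CSV cells by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "location": ["USA", "New York", np.nan, "USA", np.nan, "London", np.nan],
    "target": [1, 0, 1, 0, 0, 1, 0],
})

# dropna=False keeps missing locations as their own row in the counts,
# so we can see how large the "no location" group is.
location_counts = toy_df["location"].value_counts(dropna=False)
print(location_counts)

# Share of rows with no location at all.
missing_share = toy_df["location"].isna().mean()
print(f"Missing location: {missing_share:.1%}")

# Number of distinct non-missing locations.
unique_locations = toy_df["location"].nunique()
print("Unique locations:", unique_locations)
```

With the default dropna=True the NaN row would be silently dropped from the counts, which can hide how sparse a free-text feature like Location is.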

