An example of “expediting” data exploration when you supposedly can’t

Laurae: This post is about the results of exploring a dataset that would have required over 250GB to be loaded in memory. It uses H2O Flow, Tableau Public, and JMP on a 16GB laptop (memory usage never exceeded 8GB). What do you do when you have a dataset whose content you don’t know, but you know the objective? The post content was kept intact, but the formatting was changed to comply with the Medium text editor. You can find the original post at Kaggle.

Expeditious file analysis part

A quick analysis of the label (a counting sketch follows this list):

  • We are playing with an imbalance ratio of 1:172 (172 negatives for 1 positive): 99.41% accuracy if you predict the negative case only
  • 1183747 rows
  • 1176868 negative labels
  • 6879 positive labels
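
These counts fit comfortably on a laptop if you stream only the label column instead of loading the whole file. A minimal sketch, assuming the label lives in a Response column of train_numeric.csv as in the competition data:

```python
# Minimal sketch: count labels by streaming only the "Response" column in chunks,
# so memory stays far below the size of the full file. File/column names assumed.
import pandas as pd

positives, total = 0, 0
for chunk in pd.read_csv("train_numeric.csv", usecols=["Response"], chunksize=100_000):
    positives += int(chunk["Response"].sum())
    total += len(chunk)

negatives = total - positives
print(f"rows={total}, positives={positives}, negatives={negatives}")
print(f"imbalance ratio ~ 1:{negatives / positives:.0f}")
print(f"all-negative baseline accuracy = {negatives / total:.2%}")
```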

The evolution of the metric (MCC) as a function of True Positives vs True Negatives is the following (FP/FN follow directly, since the class totals are fixed):

Interactive MCC: here at Tableau
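
Because the class totals are fixed (6879 positives, 1176868 negatives), FN and FP are determined by TP and TN, so MCC can be treated as a surface over (TP, TN) alone. A minimal sketch of that function, using the counts above:

```python
# Minimal sketch: MCC as a function of (TP, TN) only, with the class totals fixed.
import math

P, N = 6879, 1176868  # positive / negative label counts from the analysis above

def mcc(tp, tn):
    fn, fp = P - tp, N - tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(tp=3000, tn=1_100_000))  # one point on the surface shown in the chart
```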

A quick analysis of train_numeric.csv:

  • Matrix size: [1183747 x 969] (969 features, 1183747 observations)
  • Missing values do not seem to be missing at random. See this webpage for more details (and the NA-counting sketch after this list)
  • Sparse
  • Contains the label we need to predict
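
The missing-value pattern can be inspected without holding all 969 columns in memory; a minimal sketch that accumulates per-column NA counts chunk by chunk (file name assumed from the competition):

```python
# Minimal sketch: per-column missing-value counts for train_numeric.csv,
# accumulated chunk by chunk so memory stays low.
import pandas as pd

na_counts = None
for chunk in pd.read_csv("train_numeric.csv", chunksize=50_000):
    counts = chunk.isna().sum()
    na_counts = counts if na_counts is None else na_counts + counts

# Columns with the most missing values first, to eyeball whether they look random.
print(na_counts.sort_values(ascending=False).head(20))
```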

Sneak peek:

A quick analysis of train_categorical.csv:

  • Matrix size: [1183747 x 2141] (2141 features, 1183747 observations)
  • Missing values do not seem to be missing at random. See this webpage for more details (edit: there seem to be import errors when loading the dataset; I still need to check, but these missing value counts are not reliable at all).
  • Many columns have 1183747 missing values or close to it, which means you can remove a lot of them already (see the filtering sketch after this list).
  • Extremely sparse
  • Lots of columns contain only zeroes
  • L1_S24_F7XX to L1_S24_F9XX (replace XX with numbers) seem linked (same zeroes)
  • Loads of features with NOTHING inside (just sparse, see below)
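
Given the edit above about unreliable imports, one way to make the counts more trustworthy is to force every column to be read as a string, then flag the (nearly) all-missing columns for removal. A minimal sketch, with the file name and the 0.1% threshold as assumptions:

```python
# Minimal sketch: read train_categorical.csv as strings (avoids type-inference
# surprises), count NAs per column, and flag columns that are almost always missing.
import pandas as pd

na_counts, n_rows = None, 0
for chunk in pd.read_csv("train_categorical.csv", dtype=str, chunksize=50_000):
    counts = chunk.isna().sum()
    na_counts = counts if na_counts is None else na_counts + counts
    n_rows += len(chunk)

# Arbitrary threshold: keep columns observed in at least 0.1% of rows.
keep = na_counts[na_counts < 0.999 * n_rows].index.tolist()
print(f"{len(keep)} of {len(na_counts)} columns would be kept")
```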

Sneak peek:

A quick analysis of train_date.csv:

  • Matrix size: [1183747 x 1157] (1157 features, 1183747 observations)
  • Missing values do not seem to be missing at random.
  • Sparse
  • Dates are numeric already; they seem “normalized” overall by days? (see the range-check sketch after this list)
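
A minimal sketch to check the “normalized by days” impression by looking at the overall range of the date features (file name and column layout assumed from the competition):

```python
# Minimal sketch: per-column min/max of the date features, accumulated in chunks,
# to see what range the already-numeric dates cover.
import pandas as pd

mins, maxs = None, None
for chunk in pd.read_csv("train_date.csv", chunksize=50_000):
    chunk = chunk.drop(columns=["Id"], errors="ignore")  # Id is not a date feature
    lo, hi = chunk.min(), chunk.max()
    mins = lo if mins is None else pd.concat([mins, lo], axis=1).min(axis=1)
    maxs = hi if maxs is None else pd.concat([maxs, hi], axis=1).max(axis=1)

print(f"date features span roughly {mins.min()} to {maxs.max()}")
```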

Sneak peek:

Expeditious modeling part

Elastic Net on raw numeric data, clearly bad:
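
This run was presumably done through H2O Flow (mentioned in the intro); a rough equivalent with the H2O Python API could look like the sketch below, where the file name, label column, and memory cap are assumptions:

```python
# Minimal sketch: elastic net (binomial GLM) on the raw numeric data with H2O.
# The post used H2O Flow; this is a Python API equivalent under assumed names.
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init(max_mem_size="8G")                       # keep the laptop under 8GB
train = h2o.import_file("train_numeric.csv")
train["Response"] = train["Response"].asfactor()  # binary classification target

features = [c for c in train.columns if c not in ("Id", "Response")]
glm = H2OGeneralizedLinearEstimator(family="binomial", alpha=0.5, lambda_search=True)
glm.train(x=features, y="Response", training_frame=train)
print(glm.auc(train=True))
```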

Random Forest on raw numeric data, “not bad” but can do much better (10 trees, 7 minutes):
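
Similarly, the 10-tree random forest could be reproduced with the H2O Python API; a minimal sketch under the same assumed file and column names:

```python
# Minimal sketch: 10-tree random forest on the raw numeric data with H2O.
# Same assumed file/column names as the GLM sketch above.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init(max_mem_size="8G")
train = h2o.import_file("train_numeric.csv")
train["Response"] = train["Response"].asfactor()

features = [c for c in train.columns if c not in ("Id", "Response")]
rf = H2ORandomForestEstimator(ntrees=10, seed=1234)
rf.train(x=features, y="Response", training_frame=train)
print(rf.auc(train=True))
```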

Single decision tree with 5-fold cross-validation on the first 50000 observations, with all features:
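
The post does not say which tool was used for this step (JMP is mentioned in the intro); a rough scikit-learn sketch of the same idea, with an arbitrary tree depth and a sentinel value for missing data:

```python
# Minimal sketch: one decision tree, 5-fold cross-validation, first 50,000 rows.
# Depth, NA sentinel, and scoring metric are arbitrary choices, not the post's.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("train_numeric.csv", nrows=50_000)
X = df.drop(columns=["Id", "Response"]).fillna(-999)  # trees tolerate a sentinel value
y = df["Response"]

tree = DecisionTreeClassifier(max_depth=6, random_state=42)
scores = cross_val_score(tree, X, y, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())
```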
