How EDA and Spikes helped me with ML project estimation…

Waner Miranda
Sep 8, 2018

I have been working with and studying machine learning models since 2010: Weka running KNN and SVM, then Genetic Programming, RandomForests, and CNNs with my beloved Python scripts. All of those scenarios came with a wide variety of datasets, from toy ones such as IMDB, iris, and MNIST to non-curated datasets created by clients over the life cycles of their relational and NoSQL databases.

Regression Line vs Predicted Values.

“An ML project has a mind of its own, and along the way, estimating its effort can be full of surprises…”

Every new dataset comes with the product owner telling us: “I need to predict, classify or see the data,” and then hell breaks loose. Data scientists scream and run RandomForests or an MLP even before viewing the data, and data engineers want to remove zeros and outliers at first glance. In those scenarios, all I saw was one or two months of struggling with GridSearch on algorithms when, many times, the big jump in accuracy came from simple data cleaning or just talking with the data owner (yes, the business guy, not the DBA).

“EDA, or Exploratory Data Analysis, can be done with a set of graphs, by printing the data and describing its statistics, to help you understand the target domain.”

Columns Described using Pandas on a Toy Dataset.
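That first look takes only a few lines of pandas. Here is a minimal sketch, assuming a CSV file (the path is a placeholder for your own data):

```python
import pandas as pd

# Load the raw dataset (hypothetical path).
df = pd.read_csv("data/raw_dataset.csv")

# First rows: a feel for the columns and their values.
print(df.head())

# Summary statistics per column: count, mean, std, min/max, quartiles.
print(df.describe())

# Column types and missing values are often the first surprise.
df.info()
print(df.isna().sum())
```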

Without knowing it, I felt it was right to talk with the business guy who created the data. Every new chat gave me an insight, a new way to view the problem. Only recently did I realize I was using some sort of hidden pipeline, one without any sophisticated name, just my gut telling me to do it every time.

1. Separate around 15% of your data into a directory you won’t see until the project ends (a split sketch follows this list).
2. Create a notebook called EDA (Exploratory Data Analysis), print your data, create graphs (there are a lot of examples on Kaggle).
a. One easy graph that helps a lot is the histogram of the classes (it helps you see the class distribution across the dataset; see the sketch after the figure below).
b. Talk with the product owner and ask two things: “What does that class mean, and which columns do you feel are the best to work with?”
c. After a day of playing with the data, select a good classifier with a default configuration that is less prone to overfitting, such as an RF; create a simple train/validation pipeline, make a run, and see (a baseline sketch follows this list).
d. Cross-validation and grid search are not the main concern here; you want to understand the data.
e. With this in mind, look for good metrics for your problem (see the metrics sketch after this list):
I. Binary classification: sensitivity and specificity with a ROC curve; don’t be fooled by accuracy alone.
II. Multi-class classification: accuracy vs. recall (I like recall better than anything).
III. Regression: MAE, MSE, or RMSE could give you a wrong view; go for R² (explained variance) first, as it gives you a better understanding of your algorithm.
IV. Class histograms, confusion matrices, and regression lines are the must-haves for all problems.
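For step 1, a minimal sketch of the holdout split, assuming a pandas DataFrame with a target column (the file paths and column name are illustrative):

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/raw_dataset.csv")

# Stratify on the target so the holdout keeps the class distribution.
working, holdout = train_test_split(
    df, test_size=0.15, stratify=df["target"], random_state=42
)

# Park the holdout in a directory you won't touch until the project ends.
os.makedirs("data/holdout", exist_ok=True)
holdout.to_csv("data/holdout/holdout.csv", index=False)
working.to_csv("data/working.csv", index=False)
```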
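For item 2c, a sketch of the default RandomForest baseline with a plain train/validation split, under the same assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/working.csv")
X, y = df.drop(columns=["target"]), df["target"]

# One train/validation split is enough at this stage; no grid search yet.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Default configuration on purpose: we want a feel for the data,
# not a tuned model.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
print("Validation accuracy:", clf.score(X_val, y_val))
```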
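And for the metrics in item 2e, a sketch with scikit-learn, continuing from the baseline above. The uncommented part assumes a binary target; the multi-class and regression variants are sketched in comments:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = clf.predict(X_val)

# Binary case: sensitivity and specificity from the confusion matrix,
# plus ROC AUC on the predicted probabilities instead of raw accuracy.
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("ROC AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))

# Multi-class case: macro-averaged recall plus the full confusion matrix.
# from sklearn.metrics import recall_score
# print(recall_score(y_val, y_pred, average="macro"))
# print(confusion_matrix(y_val, y_pred))

# Regression case (with a regressor instead of clf): R² before MAE/MSE/RMSE.
# from sklearn.metrics import r2_score
# print(r2_score(y_val, reg.predict(X_val)))
```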

Class Distribution Histograms
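A class distribution histogram like the one above is nearly a one-liner with pandas and matplotlib (again assuming a target column):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/working.csv")

# Count the samples per class and plot them as bars.
df["target"].value_counts().plot(kind="bar")
plt.title("Class distribution")
plt.xlabel("class")
plt.ylabel("count")
plt.show()
```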

Play with the data for at least a couple of days before entering the project pipeline; understand the behavior and characteristics of your dataset.

Talk with the data owner: physicians, statisticians, managers, accountants, and marketers. They know the problem better than you do.

Beautiful coding and a good set of Docker containers won’t deliver value at this point; save that for the sprints and product building.

Notebooks are for playing; use them all the time, but don’t make them the core of your software.

After that Spike week, you go for the code, grab the good pieces, and organize the room. Now you have a glimpse of the data and the metrics are set, so the model will come easier to your team.
