Disrupting the Entry Point to Predictive Data Analytics

Pavel B.
10 min read · Apr 22, 2015

Business analytics are becoming ‘the air companies breathe and the oceans in which they swim’

Big Data, Data Analysis, Predictive Analytics are on everyone’s lips. There is a lot of ‘hype’ around these ‘buzz words’, and I felt like I was missing something valuable…

In this article I’ll share some details of my ‘deep dive’ into data analytics & machine learning, and I will unveil one secret weapon…

First of all, I had to surround myself with seasoned professional data analysts, and kaggle.com came to my fingertips. Its challenges ask you to predict something, based on requests from companies around the globe. Experience gained & prize money are the rewards — great motivation. Well, it seemed to me I was in the right place.

Insight #1: Data scientists do exist & they actually do something ☺

Getting started

As I’m the greatest analyst ever, I’ll choose something from the Getting Started section. Here it goes: Titanic: Machine Learning from Disaster. So romantic; I definitely should check this out.

I have to admit how clearly the material is presented & how easy it is to understand.

Ok. They want me to complete an analysis of what sorts of people were likely to survive (I already know the answer: Leonardo DiCaprio. No?) & to apply machine learning to PREDICT which passengers survived the tragedy.

The next stage is data, as seen in the tutorial. Now I have a few files:

  • Train.csv (61KB)
    Row count: 891
    Columns: [PassengerID, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
  • Test.csv (29KB)
    Row count: 418
    Columns: Same as Train.csv, excluding [Survived]
  • 3 Python scripts (I hope I won’t be forced to code today)

I’m done! I’m stuck! What am I supposed to do next?

Basic flow

Let me Wikipedia / Google what stands behind the model(s), modeling, predictive modeling & other material I currently can’t make head or tail of.

I’ll put it in a nutshell:

  • Model: Embodies a set of assumptions concerning the generation of the observed data;
  • Modeling: Creating a model to explain existing data;
  • Predictive modeling: Creating a model that can make a prediction on new data;

Basically a predictive analytics process consists of:

  1. Defining a problem
  2. Data preparation
  3. Modeling
  4. Deployment
  5. Evaluation

I’ll cover only the first three, because the 4th & 5th are for seasoned professionals.

Insight #2: Analytics solves real-world problems… & companies gladly invest money to find solutions for these problems.

Defining a problem

The first thing is to define a business goal. The second is a Target.

Target (Dependent variable) — represents the output: it reflects how a combination of Features (Independent variables) affects the outcome.

The output may be represented in a few ways:

1. Binary — Yes or No; Same as Boolean (Example: Survived or Not, Bought or Not, Finished or Not)
2. Regression — Number (Example: Amount of purchases, Insurance price, …)
3. Multi-class — A bit complicated to explain at this very moment

Our goal is to predict how likely people were to survive, and the Target is ‘Survived’ (Binary).
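As an aside, here is how separating the Target from the Features looks in code. This is just a sketch with pandas: the column names come from the file listing above, but the rows below are invented stand-ins for Train.csv, not real passengers.

```python
import pandas as pd

# Tiny invented stand-in for Train.csv (the real file has 891 rows)
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived":    [0, 1, 1, 0],   # the Target: binary, 1 = survived
    "Pclass":      [3, 1, 3, 2],
    "Sex":         ["male", "female", "female", "male"],
    "Age":         [22.0, 38.0, 26.0, 35.0],
})

# Separate the dependent variable (Target) from the Features
y = train["Survived"]
X = train.drop(columns=["Survived"])
print(y.unique().tolist())  # [0, 1] -> a binary Target
```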

Data preparation

The common process at this stage is to prepare a representative sample. This stage may take ages to complete & requires some skill — one does not simply take data from anywhere & predict the future. So I anticipated sleepless nights. Luckily, Kaggle took care of it (Train.csv, Test.csv), & we can proceed!

But I’m stuck again, what to do now that I have all the required data?

I respect my time, & as a professional designer lacking years of data science training, I needed a solution that would do the heavy lifting for me. After some time spent searching & trying products, I found a perfect one. From this moment on, it’s my powerful secret weapon.

Modeling

Again, modeling means creating a model. A model should be able to explain existing data, while a predictive model should be able to give me a prediction on NEW data.

To create one, first I need to train a model. That’s the terminology widely used by data scientists, so don’t be afraid & don’t change the channel.

Now is the moment when Train.csv enters the game.

Get on it

Let’s start using my secret weapon:

Browse the file, select Train.csv.

I can see the process there. It uploaded, read & converted my data into a usable form, then sampled it (I suppose it took the whole dataset, as it’s too small to sample). Afterwards it performed some internal magic that automatically analyzed my dataset, summarizing the characteristics of each feature. The next item is much more interesting than the previous: Exploratory Data Analysis (EDA). It’s a big deal, because DataRobot does it all automatically, instead of you doing it manually every time you start a new project. It’s a huge time-saver: it let me start in just a few seconds!

But don’t let me move on until I explain what EDA means. Primarily, EDA is about seeing what the data can tell me beyond the formal modeling or hypothesis-testing task. In other words, in the case of DataRobot, it’s an attempt to summarize the main dataset characteristics & calculate each feature’s importance relative to the target variable, including the variable type and the [unique, missing, mean, SD (standard deviation), median, min, max] values of each column in the dataset.
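I can’t show DataRobot’s internals, but the same per-column summary can be sketched with plain pandas. The mini-dataset below is invented for illustration, mimicking a few Titanic columns:

```python
import pandas as pd

# Invented mini-dataset mimicking a few Titanic columns
df = pd.DataFrame({
    "Age":  [22.0, None, 26.0, 35.0],   # one missing value, like real Age
    "Fare": [7.25, 71.28, 7.93, 8.05],
    "Sex":  ["male", "female", "female", "male"],
})

# The kind of per-column characteristics EDA reports
summary = {
    "types":     df.dtypes.astype(str).to_dict(),
    "missing":   df.isna().sum().to_dict(),
    "unique":    df.nunique().to_dict(),
    "mean_fare": df["Fare"].mean(),
}
print(summary)
```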

Insight #3: Exploratory Data Analysis is done in the blink of an eye, instead of minutes (hours?) of work…

A Var Type is the type of an independent variable (a feature, as I recall). It can be one of the following:

1. Numeric — Any number
2. Text — Obvious
3. Boolean — True Or False, Yes or No, 1 or 0
4. Categorical — Set of values, for example Size [Extra small, Small, Medium, Large, Extra large]. Additionally: date / time, percentage & length in terms of measurements.
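Hypothetically, these four types map onto pandas dtypes roughly as below; the columns are invented for illustration, and DataRobot’s own type inference may differ:

```python
import pandas as pd

# One invented column per Var Type
df = pd.DataFrame({
    "Fare":     [7.25, 71.28],                       # Numeric
    "Name":     ["Braund, Mr.", "Cumings, Mrs."],    # Text
    "Survived": [True, False],                       # Boolean
    "Size":     pd.Categorical(["Small", "Large"]),  # Categorical
})
print(df.dtypes.astype(str).to_dict())
```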

I was able to change a Feature (Var) Type, but I didn’t have to. They were set up right!

Did I mention, that I was provided with a Visual Interpretation for each feature?

It also showed me that [Name, Ticket] fields are redundant…

Now that I’m done with preparing my dataset, I can proceed to the next step. I know that Survived is my Target (Dependent Variable), so I selected it.

I clicked Push to Start, and DataRobot started to think!

There is something new to me: CV & Holdout.

  • CV stands for Cross-Validation — a process that performs multiple Validations, sums their scores & divides by the number of Validations (i.e., averages them). The closer the final result is to a single Validation score, the better: it means the model is well trained & stable on varied input data.
  • Validation — in statistics, model validation means determining whether a model fits the data well: whether the numerical results quantifying the hypothesized relationships between variables are acceptable as descriptions of the data.
  • Holdout — a chunk of data kept hidden & inaccessible during model training (i.e., it takes no part in training), used to check how well the trained model can predict. The closer the Holdout score is to the Validation & Cross-Validation scores, the better.
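To make these three terms concrete, here is a sketch using scikit-learn on synthetic data. This is not DataRobot’s actual machinery, just the standard open-source equivalent of the same idea:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary-classification data standing in for the Titanic set
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Holdout: carved off up front & never touched during training
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)

# Cross-Validation: several Validation scores averaged together
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV mean:", round(cv_scores.mean(), 3))

# A stable model scores about the same on the untouched Holdout
model.fit(X_train, y_train)
print("Holdout:", round(model.score(X_hold, y_hold), 3))
```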

While I was trying to get the idea on what stands behind CV, Validation & Holdout, DataRobot was thinking & calculating — applying machine learning algorithms, training models…

Insight #4: Machine Learning algorithms are applied to a dataset, to figure out which one is capable of providing better & more stable results.

After a few moments, the models had been computed, and their performance was represented by a score.

Btw, AUC (Area Under the Curve) is used in classification analysis to determine which of the models predicts the classes best:

  • AUC is used for Binary classification
  • The closer the AUC is to 1 (or to 0, which just means the model is consistently wrong & its predictions can be inverted), the better!
  • The closer the AUC is to 0.5, the worse: the model is no better than a random guess.
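A tiny sanity check of those claims, using scikit-learn’s roc_auc_score with toy labels & probabilities I made up:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]            # actual classes
perfect = [0.1, 0.2, 0.8, 0.9]    # ranks every positive above every negative
random_ = [0.5, 0.5, 0.5, 0.5]    # no discrimination at all

print(roc_auc_score(y_true, perfect))   # 1.0 -> great model
print(roc_auc_score(y_true, random_))   # 0.5 -> no better than a guess
```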

Let me take a look at the suggested best model by the AUC metric: RandomForest Classifier (Gini).

And what I have is:

  • Blueprint
    A visualization of what the model is doing at each particular step
  • Model info
    Self-explanatory
  • Model log
    Meh, just a log…
  • Lift Chart
    Remember: The first thing I should check every time I want to estimate how good a model is!
  • Lift Table
    Same as a Lift Chart, but in a raw format
  • ROC Curve
    A visualization of the AUC. In this particular case, I can quickly see that this model has a high level of predictive power, because the curve is far away from the diagonal line in the middle of the graph.
  • Importances
    This graph visualizes the machine learning conclusion about each feature’s importance relative to the Target.
    The model concluded that Sex & Pclass impact how likely a passenger was to survive (and yes, I came to the same conclusion myself).
    The model was able to reach a valid conclusion from its observations.
  • Grid Search
    I found it a bit complex, so I’ll be back to it later.
  • Predicted vs Actual
    This section gives me the ability to compare how the model’s predictions correlate with the actual data.
  • Model X-Ray
    I found it too hard to understand on the first day, so I skipped it.
  • Deploy Model
    Hell yeah, I want to integrate this model into my brand-new website ‘Discover your destiny on the Titanic, if you were not Leonardo DiCaprio’ & make predictions blazingly fast. Or I could tightly integrate another model with a real-time bidding platform to tell me whether it’s worth showing my ad to a particular online user… Hmm…
    I’m pretty sure it’s one of the killer features!
  • Predict
    Finally! I found it. Now I can make a prediction. Actually, this is what I came for.

Ok, I see. Again, Kaggle provided me with the necessary dataset. Test.csv, it’s your showtime!

Result / Prediction

Two seconds later:

Downloaded Prediction.

Let me look inside:

I suppose a Prediction value of less than 0.5 means the passenger was out of luck, and the opposite: a value greater than 0.5 means he/she was lucky that day.
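That guess is easy to sketch in code. The probabilities below are invented for illustration, not taken from the real prediction file:

```python
# Invented prediction column: one survival probability per passenger
predictions = [0.12, 0.87, 0.49, 0.51]

# Probability of 0.5 or more -> predicted survivor; below -> not
labels = [1 if p >= 0.5 else 0 for p in predictions]
print(labels)  # [0, 1, 0, 1]
```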

Awesome! I’m impressed. I was provided with predictions & DataRobot didn’t ask me for anything… So fluid…


Final thoughts

I feel like I’ve been given a speed superpower to help with my routine tasks. Yet I haven’t explored every single aspect of it.

Reality: me with a super weapon

But then I realize I’m like a monkey with a super-fast, rock-solid weapon in my hands: I can only do the basic stuff, but at a speed never seen before. And still with an accuracy never seen before!

I’m just imagining what results a real data scientist with a real data science education could achieve, considering that I was able to make my first prediction in just a day, without any education!

Next time: Visual data insights [Variable importance, Variable effects, Word clouds, Hotspots]; Creating custom models [R Studio, Python]; Learning curves & much much more.

Stay tuned!
