Big Intelligence: Data Classification

Companies earn their ROI on Big Data by repurposing the data stored in their lakes and leveraging it to solve new problems and answer pressing questions.

Machine learning is ideal for BI analysis when the data is too large and too complex for comprehensive manual analysis. The range of opportunities hidden in your data is vast, and only by exploiting those opportunities can you get a high ROI on your Big Data.

Through machine learning you can:

  • Discover hidden structures in your data through clustering
  • Predict values based on collected and current data (Regression)
  • Predict categories in your data based on collected and current data (Classification)
  • Detect exceptions and hidden anomalies

This post covers classification: predicting categories in your collected and current data. For example, you might use this kind of analysis in your marketing plans to target the customers who you believe will respond well to your ads.

As an example, we will take a data set consisting of individuals’ income records. We will be using ML Studio for this exercise.

To forecast income classes based on age, education, sex and work hours, we can select a tree-based binary classification algorithm.

A tree-based binary classification algorithm works by building decision trees that split the records on the features present in the data, until each record can be assigned to one of two classes.
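To make the idea of splitting on a feature concrete, here is a minimal sketch of how a tree might choose a single split. It assumes Gini impurity as the split criterion and uses made-up work-hour values; the function names are illustrative and this is not ML Studio's internal implementation.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must put records on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up toy data: weekly work hours and whether income is high (1) or not (0).
hours = [20, 25, 30, 38, 40, 45, 50, 60]
high_income = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_split(hours, high_income)
print(threshold, impurity)
```

A full tree repeats this search on each resulting group, across all features, until the groups are pure enough or a depth limit is reached.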

We will start by importing the income data set sample from ML Studio’s left menu and preparing it.

Data Preparation

Before training our model, we have to prepare our data set.

This involves:

  • Cleaning missing data by deleting rows or columns that contain empty values.
  • Projecting columns by selecting only the features needed in this exercise (age, education, sex, work hours and income).
  • Splitting the data: 60% for training the model and 40% for testing it.
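The three preparation steps can be sketched outside ML Studio in a few lines of plain Python. This is only an illustration under assumptions: the records, the extra "country" column, the column names and the `prepare` helper are all hypothetical, and rows (rather than columns) are dropped when a value is missing.

```python
import random

# Hypothetical raw records standing in for the income sample; the extra
# "country" column and the missing values are invented for illustration.
raw_rows = [
    {"age": 25 + i, "education": "Bachelors", "sex": "Female" if i % 2 else "Male",
     "hours": 35 + i, "income": ">50K" if i % 3 == 0 else "<=50K", "country": "n/a"}
    for i in range(10)
]
raw_rows.append({"age": None, "education": "Masters", "sex": "Male",
                 "hours": 40, "income": ">50K", "country": "n/a"})
raw_rows.append({"age": 51, "education": "", "sex": "Female",
                 "hours": 38, "income": "<=50K", "country": "n/a"})

FEATURES = ["age", "education", "sex", "hours", "income"]


def prepare(rows, columns, train_fraction=0.6, seed=42):
    # 1. Clean: drop every row that has an empty value in any column.
    cleaned = [r for r in rows if all(v is not None and v != "" for v in r.values())]
    # 2. Project: keep only the columns needed for the exercise.
    projected = [{c: r[c] for c in columns} for r in cleaned]
    # 3. Split: shuffle reproducibly, then cut into training and testing parts.
    shuffled = projected[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = prepare(raw_rows, FEATURES)
print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

Shuffling before the cut keeps the split from depending on the order in which the records were stored; the fixed seed only makes the example repeatable.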

Training the Model

Training uses the 60% of the data rows set aside for that purpose, together with a binary classification algorithm: Two-Class Boosted Decision Tree, which is well known for its accuracy and fast training.

The algorithm is trained against a single label column, the income, which is the value the model learns to predict.
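For readers curious about what "boosted" means, the following is a toy sketch of boosting with one-level trees (stumps). It deliberately swaps in classic AdaBoost, which is easy to write out in full, and should not be read as the algorithm behind the Two-Class Boosted Decision Tree module. The data is made up, labels are encoded as +1 (high income) and -1 (low income), and all function names are illustrative.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best


def stump_predict(stump, row):
    _, j, t, polarity = stump
    return polarity if row[j] > t else -polarity


def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n  # start with equal weight on every record
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # this stump's vote strength
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified records, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


# Made-up toy data: one feature (weekly work hours) and the income class.
X = [[10], [20], [30], [40], [50], [60]]
y = [-1, -1, 1, -1, 1, 1]

model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

No single threshold on this toy data separates the two classes, but three weighted stumps voting together classify all six training rows correctly, which is the core idea of boosting.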

Scoring the Model

Scoring runs the trained model against the 40% of rows held back for testing. As a result, each record is labeled with a scored label and a scored probability. By glancing through the records and comparing the scored label with the actual income, you can get a feel for the error ratio introduced by this algorithm.
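As a rough illustration of what scored records look like, the sketch below attaches a scored probability and a scored label to a few records and counts how often the label disagrees with the actual income. The records, the probabilities, the 0.5 threshold and the field names are all assumptions made for the example.

```python
def score_records(records, probabilities, threshold=0.5):
    """Attach a scored probability and a scored label to each record."""
    scored = []
    for record, p in zip(records, probabilities):
        label = ">50K" if p >= threshold else "<=50K"
        scored.append({**record, "scored_probability": p, "scored_label": label})
    return scored


# Made-up test records and made-up model outputs (probability of ">50K").
test_records = [
    {"age": 39, "income": "<=50K"},
    {"age": 52, "income": ">50K"},
    {"age": 28, "income": "<=50K"},
    {"age": 45, "income": ">50K"},
]
model_probabilities = [0.12, 0.87, 0.64, 0.41]

scored = score_records(test_records, model_probabilities)
mismatches = sum(1 for r in scored if r["scored_label"] != r["income"])
error_ratio = mismatches / len(scored)

for r in scored:
    print(r)
print("error ratio:", error_ratio)
```

The scored label is simply the probability compared against a cut-off, so moving the threshold trades one kind of mistake for the other.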

Evaluating the Model

To get an accurate evaluation of your predictions, evaluate the model and check both its accuracy and its precision.
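Both metrics are simple ratios, and a small sketch may help to tell them apart. Accuracy is the share of all predictions that are correct; precision is the share of predicted positives that really are positive. The labels below are made up (1 stands for high income) and the `evaluate` helper is illustrative.

```python
def evaluate(actual, predicted, positive=1):
    """Return (accuracy, precision) for two equally long lists of labels."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / len(pairs)
    predicted_pos = true_pos + false_pos
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# Made-up actual and predicted income classes for eight test records.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision = evaluate(actual, predicted)
print("accuracy:", accuracy, "precision:", precision)
```

In a marketing scenario, precision tells you how many of the customers you would target actually belong to the class you are after, which is why it is worth checking alongside accuracy.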

The overall training experiment looks like a simple map of components dragged into place and connected.

Monetizing your Experiment

Next, you can put your experiment to work by converting the training experiment into a predictive experiment so that it can score new data. You can then deploy it as a web service to integrate your newly built model into your system or app: you send data to the model and receive the scoring in return.
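From the calling app's side, using such a web service usually comes down to posting a record as JSON and reading the scoring from the response. The sketch below only builds the request. The URL, the API key, the bearer-token header and the payload shape are all placeholders invented for illustration; the real values and the request format come from the page of your deployed service.

```python
import json
import urllib.request

SCORING_URL = "https://example.invalid/score"  # hypothetical endpoint
API_KEY = "YOUR-API-KEY"                       # placeholder, not a real key


def build_scoring_request(record):
    """Build (but do not send) a POST request carrying one record as JSON."""
    body = json.dumps({"records": [record]}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        SCORING_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,  # assumed authentication scheme
        },
        method="POST",
    )


new_record = {"age": 39, "education": "Bachelors", "sex": "Male", "hours": 40}
request = build_scoring_request(new_record)
print(request.get_method(), request.full_url)

# Sending it would look like this once a real endpoint is in place:
# with urllib.request.urlopen(request) as response:
#     scoring = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the app at a live service.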