The relevance of feature engineering in building a predictive model

Feature engineering sits between “exploratory data analysis” and “modeling” in the machine learning pipeline. It’s a fundamental step in the data science process: with the right features, modeling becomes much easier and the resulting predictions are more accurate.

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Andrew Ng

For the practical example in this article I used a dataset with numerical features. They are the easiest type of data to work with and can be fed into algorithms most readily. You can process numerical features by rounding, binarization, binning, log transformation, scaling and interactions, and their missing values are easier to impute. Numerical features can be floats, counts, generic numbers, temporal variables or spatial variables. From dates you can derive a projection onto a circle, trendlines and closeness to major events. From geo-location you can derive a location category, closeness to hubs and spatially fraudulent behaviour.
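The transformations listed above can be sketched in base R. This is only an illustration: the values, the thresholds, the bin edges and every variable name below are made up for the example and are not taken from the taxi dataset.

```r
# Made-up numeric values, for illustration only
x <- c(0.4, 2.7, 15.2, 120.9, 3600.0)

rounded <- round(x)                            # rounding
binary  <- as.integer(x > 10)                  # binarization (threshold 10 is arbitrary)
binned  <- cut(x, breaks = c(0, 1, 10, 100, Inf),
               labels = c("tiny", "small", "medium", "large"))  # binning
logged  <- log1p(x)                            # log transformation, safe at zero
scaled  <- as.numeric(scale(x))                # scaling to mean 0 and sd 1
interaction_term <- x * binary                 # a simple interaction of two features

# Imputation: replace a missing value with the median of the observed ones
x_na <- c(1, NA, 3)
x_na[is.na(x_na)] <- median(x_na, na.rm = TRUE)

# Projection onto a circle: hour 23 and hour 0 end up close to each other
hour <- c(0, 6, 12, 18, 23)
hour_sin <- sin(2 * pi * hour / 24)
hour_cos <- cos(2 * pi * hour / 24)
```

The circular projection matters for models that treat the hour as a plain number: without it, 23:00 and 00:00 look as far apart as possible even though they are one hour away.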

New York City Taxi Trip Duration — Kaggle competition

To show how feature engineering works I used a dataset from Kaggle: “New York City Taxi Trip Duration”. The challenge in this competition is to build a model that predicts the total ride duration of taxi trips in New York City.

The job is divided into two steps:

  1. I fit a model directly on the dataset, without any handling;
  2. I predicted the outcome again after feature handling and pre-processing.

For this work I used the open source H2O software and its Gradient Boosting Machine algorithm to predict the outcome, working in R; the code is shown at rpubs.com.

The dataset is made up of 11 variables and has no missing values.

names(dataset)
##  [1] "id"                 "vendor_id"          "pickup_datetime"   
##  [4] "dropoff_datetime"   "passenger_count"    "pickup_longitude"  
##  [7] "pickup_latitude"    "dropoff_longitude"  "dropoff_latitude"  
## [10] "store_and_fwd_flag" "trip_duration"

The good news is that there are no empty values and most of the variables are numeric; the slightly bad news is that there aren’t many variables. Machine learning techniques are powerful with lots of data, in terms of features as well as rows.

# Percentage of missing values in a vector
pMiss <- function(x){sum(is.na(x))/length(x)*100}
# Apply it to every column (MARGIN = 2) of the dataset
apply(dataset,2,pMiss)
##                 id          vendor_id    pickup_datetime 
##                  0                  0                  0 
##   dropoff_datetime    passenger_count   pickup_longitude 
##                  0                  0                  0 
##    pickup_latitude  dropoff_longitude   dropoff_latitude 
##                  0                  0                  0 
## store_and_fwd_flag      trip_duration 
##                  0                  0 

The evaluation metric for this competition is the Root Mean Squared Logarithmic Error (RMSLE). It is the square root of the mean squared difference between log(predicted + 1) and log(actual + 1), so it penalizes the ratio between predicted and actual values rather than their absolute difference. The smaller the value, the better the prediction. With the first model the results are not very good, despite the use of a powerful machine learning algorithm. With feature engineering, the results changed.
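As a minimal sketch of the metric, here is RMSLE written in base R. The function name `rmsle` and the sample durations are assumptions made for this example; H2O reports the same metric on its own, as the outputs further down show.

```r
# RMSLE: root of the mean squared difference between log(1 + predicted)
# and log(1 + actual). Both inputs must be non-negative.
rmsle <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual),
            all(predicted >= 0), all(actual >= 0))
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up trip durations in seconds, for illustration only
actual    <- c(455, 663, 2124)
predicted <- c(500, 600, 2000)

rmsle(predicted, actual)   # small value: predictions are close in ratio
rmsle(actual, actual)      # a perfect prediction gives 0
```

Because the error is taken on the log scale, predicting 200 seconds for a 100-second trip costs about as much as predicting 2000 seconds for a 1000-second trip.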

In the first model, the pickup coordinates are the main variables explaining the outcome:

## Variable Importances: 
## 1 pickup_latitude 
## 2 pickup_longitude 
## 3 dropoff_longitude 
## 4 dropoff_latitude 
## 5 vendor_id 
## 6 passenger_count 

In the last model, instead, velocity and distance are the main variables explaining the outcome, with a weight of 96%; hence the importance of the feature engineering process.

## Variable Importances: 
## 1 velocity 
## 2 distance 
## 3 pickup_time 
## 4 traff_midday_distance 
## 5 pickup_longitude 
## 6 pickup_d 
## 7 passenger_count 
## 8 pickup_dayweek 
## 9 dropoff_longitude
## 10 pickup_latitude 
## 11 dropoff_latitude 

Distance is derived from the pickup and dropoff coordinates with the geodist function from the “gmt” R package, and velocity is the ratio between distance and trip duration.

library(gmt)  # provides geodist()
# Great-circle distance in km between pickup and dropoff points
dataset$distance <- geodist(dataset$pickup_latitude, dataset$pickup_longitude,
                            dataset$dropoff_latitude, dataset$dropoff_longitude, units="km")
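To make the two derived variables concrete without installing any package, here is a base R sketch. It uses the haversine formula as a stand-in for geodist, so results may differ slightly from the package. The function name `haversine_km`, the Earth radius of 6371 km, the sample coordinates and the choice of km/h as the velocity unit are all assumptions made for this example. Note that velocity uses trip_duration, so it can only be computed on rows where the duration is already known.

```r
# Haversine great-circle distance in km (a stand-in for gmt::geodist)
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# Made-up rows with the same column names as the competition data
trips <- data.frame(
  pickup_latitude   = c(40.7679, 40.7386),
  pickup_longitude  = c(-73.9822, -73.9804),
  dropoff_latitude  = c(40.7656, 40.7312),
  dropoff_longitude = c(-73.9646, -73.9995),
  trip_duration     = c(455, 663)          # seconds
)

trips$distance <- haversine_km(trips$pickup_latitude, trips$pickup_longitude,
                               trips$dropoff_latitude, trips$dropoff_longitude)

# Velocity as distance over duration, expressed here in km/h
trips$velocity <- trips$distance / (trips$trip_duration / 3600)
```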

The dataset has numerical features and, specifically, temporal and spatial ones; for both pickup and dropoff, date and time are embedded in a single variable. Before calculating the distance, I split date and time, then split the date into year, month and day and formatted it as a single number.
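The date and time split can be sketched in base R as follows. This is not the code from the original analysis: the new column names (`pickup_date_num`, `pickup_hour`, `pickup_minutes`, `pickup_wday`), the UTC time zone and the sample timestamps are assumptions made for this example.

```r
# Made-up rows in the "YYYY-MM-DD HH:MM:SS" format, for illustration only
trips <- data.frame(
  pickup_datetime = c("2016-03-14 17:24:55", "2016-06-12 00:43:35"),
  stringsAsFactors = FALSE
)

# Parse the single datetime string; a fixed time zone avoids DST surprises
ts <- as.POSIXct(trips$pickup_datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
lt <- as.POSIXlt(ts)

trips$pickup_date_num <- as.integer(format(ts, "%Y%m%d"))  # date as one number
trips$pickup_hour     <- lt$hour                           # hour of day, 0 to 23
trips$pickup_minutes  <- lt$hour * 60 + lt$min             # minutes since midnight
trips$pickup_wday     <- lt$wday                           # 0 = Sunday ... 6 = Saturday
```

The same steps apply to dropoff_datetime. Keeping the time of day and the day of week as separate numbers lets the model pick up rush-hour and weekend effects.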

For this exercise I used the whole training set as the dataset:

- with the first model, RMSLE was around 0.69

## H2ORegressionMetrics: gbm 
## MSE: 49685165 
## RMSE: 7048.77 
## MAE: 475.4797 
## RMSLE: 0.6907691 
## Mean Residual Deviance : 15.54497

- with the second model, RMSLE dropped to 0.14.

## H2ORegressionMetrics: gbm 
## MSE: 35921037 
## RMSE: 5993.416 
## MAE: 76.58464 
## RMSLE: 0.1426097 
## Mean Residual Deviance : 14.95484

This example shows the impact of the feature engineering process on the result of the predictive model.

The last chart shows how the deviance is reduced as the number of trees grows.


The code is available at rpubs.com.