Machine Learning to Predict Taxi Fare, Part One: Exploratory Analysis

Data Cleaning, Visualisation, Feature Engineering

Aiswarya Ramachandran
Analytics Vidhya


I was learning Python for data analysis and wanted to apply the concepts to a real data set. Browsing Kaggle, I found the New York Taxi Fare Prediction problem.

In this challenge we are given a training set of 55M taxi trips in New York since 2009 and a test set of 9914 records. The goal is to predict the fare of a taxi trip given the pickup and drop-off locations, the pickup date and time, and the number of passengers travelling.

In any analytics project, a large share of the time and effort (80% is the figure commonly quoted) goes into data cleaning, exploratory analysis and deriving new features. In this post, we clean the data, visualize the relationships between variables and derive new features that are better predictors of taxi fare.

The Data

The data for this problem can be found on Kaggle. For this analysis I imported only 6M of the 55M rows in the training data. The fields present in the data are listed below:

--

--
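Importing only part of the training data, as described above, can be sketched with pandas as follows. This is a minimal sketch, not necessarily how the original import was done: the helper name `load_sample` is hypothetical, and it assumes the training data is a CSV file read from the top, so the sample is the first rows of the file rather than a random draw.

```python
import pandas as pd


def load_sample(path, nrows=6_000_000):
    """Read only the first `nrows` data rows of a CSV file.

    `load_sample` is a hypothetical helper for illustration. The
    `nrows` argument of `pandas.read_csv` stops reading after that
    many rows, so the rest of the file never has to fit in memory.
    """
    return pd.read_csv(path, nrows=nrows)
```

A call such as `train = load_sample("train.csv")` would then give a 6M-row DataFrame, assuming the Kaggle training file is saved locally as `train.csv`. Because the rows are taken from the start of the file, any ordering in the file carries over into the sample.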