Machine Learning to Predict Taxi Fare — Part One : Exploratory Analysis
Data Cleaning, Visualisation, Feature Engineering
I was learning Python for data analysis and wanted to apply the concepts to a real data set, and lo, there I was on Kaggle, where I found the New York Taxi Fare Prediction problem.
In this challenge we are given a training set of 55M taxi trips in New York since 2009 and a test set of 9,914 records. The goal is to predict the fare of a taxi trip given the pickup and drop-off locations, the pickup date and time, and the number of passengers travelling.
In any analytics project, most of the time and effort (often quoted as 80%) goes into data cleaning, exploratory analysis and deriving new features. In this post, we aim to clean the data, visualize the relationships between variables and identify new features that are better predictors of taxi fare.
The data for this problem can be found on Kaggle. For the purposes of this analysis I have imported only 6M of the 55M rows in the training data. The fields that are present in the data are as below:
The next step in solving any analytics problem is to list a set of hypotheses, which in our case are factors that will affect the cost of a taxi trip.
- Trip distance : If the distance to be traveled is more, then fare should be higher.
- Time of Travel : During peak traffic hours, the taxi fare may be higher.
- Day of Travel : Fare amount may differ between weekdays and weekends.
- Weather Conditions : If it is snowing, there may be lower availability of cabs and hence higher fares.
- Is it a trip to/from airport : Trips to/from airport generally have a fixed fare.
- Pickup or Drop-off Neighborhood : Fare may be different based on the kind of neighborhood.
- Availability of taxi : If a particular location has a lot of cabs available, the fares may be lower.
Data Cleaning and Exploration
In this section, we discuss the steps used to clean the data, understand the relationships between variables, and use this understanding to create better features (refer: Introductory Jupyter notebook).
1. Distribution of Fare Amount
We first looked at the distribution of fare amount and found 262 records where the fare was negative. Since the cost of a trip cannot be negative, we removed these records from the data. Fare amount also follows a long-tailed distribution. To understand it better, we take a log transformation after removing the negative fares, which makes the distribution close to normal.
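The cleaning step above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the notebook's actual code: the column name `log_fare_amount` is hypothetical, and `np.log1p` is an assumed choice of log transform, picked so that a fare of exactly zero does not produce negative infinity.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the real file has millions of rows.
train = pd.DataFrame({"fare_amount": [4.5, -2.9, 16.9, 0.0, 52.0]})

# Drop trips with a negative fare, since the cost of a trip cannot be negative.
train = train[train["fare_amount"] >= 0].copy()

# log1p(x) = log(1 + x); an assumed variant of the log transform that is
# well defined for zero fares. "log_fare_amount" is a hypothetical column name.
train["log_fare_amount"] = np.log1p(train["fare_amount"])
```

Plotting a histogram of the transformed column is then a quick way to check how close the distribution is to normal.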
2. Distribution of Geographical Features
Valid latitudes lie between -90 and 90, and valid longitudes between -180 and 180. In the training data set, however, we observed latitudes and longitudes in the range (-3488.079513, 3344.459268), which is not possible. On further exploration, we also identified a set of 114K records that had both pickup and drop-off coordinates at the Equator. Since this data is for taxi rides in New York, we remove these rows from our analysis. Such anomalies were not found in the test data.
We can see that there is a high density of pickups near JFK and LaGuardia airports. We then compared the average fare for pickups from and drop-offs at JFK with the average for all trips in the train data, and observed that the fare was higher for airport trips. Based on this observation, we created features to flag whether a pickup or drop-off was at any one of the three airports serving New York: JFK, EWR or LaGuardia.
The next step was to check our hypothesis that fares from certain neighborhoods are higher than from others. Each pickup and drop-off location was grouped into one of the five boroughs of New York City: Manhattan, Queens, Brooklyn, Staten Island and the Bronx. And yes, our hypothesis was right: except for Manhattan, which had most of the pickups and drop-offs, every borough showed a difference between the pickup and drop-off fare distributions. Queens also had a higher mean pickup fare than the other boroughs.
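One way to express these coordinate filters in pandas is sketched below. The column names follow the Kaggle data set, the rows are made up for illustration, and "at the Equator" is interpreted here as pickup and drop-off latitude both equal to 0, which is an assumption about how the original check was done.

```python
import pandas as pd

# Toy rows: one normal trip, one with zero coordinates, one with an
# impossible latitude, and one more normal trip.
train = pd.DataFrame({
    "pickup_latitude":   [40.7614, 0.0, 3344.459268, 40.6413],
    "pickup_longitude":  [-73.9798, 0.0, -73.99, -73.7781],
    "dropoff_latitude":  [40.7128, 0.0, 40.75, 40.7769],
    "dropoff_longitude": [-74.0060, 0.0, -73.98, -73.8740],
})

# Keep only coordinates inside the physically valid ranges.
valid_lat = (train["pickup_latitude"].between(-90, 90)
             & train["dropoff_latitude"].between(-90, 90))
valid_lon = (train["pickup_longitude"].between(-180, 180)
             & train["dropoff_longitude"].between(-180, 180))

# Assumed reading of "at the Equator": both latitudes are exactly 0.
on_equator = (train["pickup_latitude"] == 0) & (train["dropoff_latitude"] == 0)

clean = train[valid_lat & valid_lon & ~on_equator].reset_index(drop=True)
```

A tighter bounding box around New York would remove more noise, but the limits of such a box are a modelling choice rather than something the data dictates.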
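A rough sketch of such airport flags is shown below. The airport coordinates are approximate, and the box size, the helper `near_any_airport` and the column names `pickup_is_airport` and `dropoff_is_airport` are all hypothetical choices for illustration; the original features may have been defined differently, for example one flag per airport.

```python
import pandas as pd

# Approximate (latitude, longitude) of the three airports.
AIRPORTS = {
    "jfk": (40.6413, -73.7781),
    "lga": (40.7769, -73.8740),
    "ewr": (40.6895, -74.1745),
}
BOX_HALF_WIDTH_DEG = 0.02  # hypothetical size of the box around each airport


def near_any_airport(lat, lon):
    """Return a boolean Series: True where (lat, lon) is inside any airport box."""
    flag = pd.Series(False, index=lat.index)
    for a_lat, a_lon in AIRPORTS.values():
        in_box = ((lat - a_lat).abs() <= BOX_HALF_WIDTH_DEG) & (
            (lon - a_lon).abs() <= BOX_HALF_WIDTH_DEG
        )
        flag = flag | in_box
    return flag


# Toy trips: midtown Manhattan to JFK, and midtown to the Financial District.
trips = pd.DataFrame({
    "pickup_latitude":   [40.7580, 40.7580],
    "pickup_longitude":  [-73.9855, -73.9855],
    "dropoff_latitude":  [40.6413, 40.7060],
    "dropoff_longitude": [-73.7781, -74.0090],
})
trips["pickup_is_airport"] = near_any_airport(
    trips["pickup_latitude"], trips["pickup_longitude"])
trips["dropoff_is_airport"] = near_any_airport(
    trips["dropoff_latitude"], trips["dropoff_longitude"])
```

A simple box is used here to keep the example short; a distance threshold around each airport would be a reasonable alternative.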
3. Distribution of Trip Distance
Using the pickup and drop-off coordinates, we calculate the trip distance in miles based on the Haversine distance. Trip distance, just like fare amount, follows a long-tailed distribution, so we take a log transformation to bring it close to a normal distribution.
One of our hypotheses was that the fare amount should increase with trip distance. A scatter plot of trip distance against fare amount showed that, although there is a linear relationship, the fare per mile (slope) was low, and there were a lot of trips whose distance was greater than 50 miles but whose fare was very low. To check whether airport trips were the cause, we removed them and plotted the distribution again. We then observed that the fare per mile was higher, and that a small cluster of trips with distance >50 miles remained.
The next step was to see whether trips with distance >50 miles were concentrated in a particular region. This showed that a lot of these pickups and drop-offs were in lower Manhattan, which led to two new features: pickup_is_lower_manhattan and dropoff_is_low_manhattan.
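For reference, the Haversine distance can be computed as in the sketch below. The function name `haversine_miles` and the Earth radius of 3958.8 miles (a commonly used mean value) are assumptions of this example, not taken from the original notebook.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees.

    Accepts scalars, NumPy arrays or pandas Series.
    """
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Example: midtown Manhattan to JFK (approximate coordinates).
distance = haversine_miles(40.7580, -73.9855, 40.6413, -73.7781)

# Log transform of the distance; log1p keeps zero-length trips finite.
log_distance = np.log1p(distance)
```

Note that this is a straight-line distance over the Earth's surface, so it will understate the distance actually driven along the road network.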
4. Distribution of Pickup date time
The first step in analysing how fares have changed over time is to create features like hour, day of the week, day, month and year from the pickup datetime. The code below parses the pickup datetime, which is the starting point for extracting these features:
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'], format='%Y-%m-%d %H:%M:%S UTC')
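Once the column is parsed, the individual features can be pulled out with the pandas `.dt` accessor. The sketch below repeats the parsing step on two made-up timestamps so that it runs on its own; the new column names (`pickup_hour` and so on) are hypothetical.

```python
import pandas as pd

# Two made-up timestamps in the same format as the Kaggle data.
train = pd.DataFrame({
    "pickup_datetime": ["2009-06-15 17:26:21 UTC", "2012-12-01 05:03:00 UTC"],
})
train["pickup_datetime"] = pd.to_datetime(
    train["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC")

# Derive calendar features from the parsed timestamp.
dt = train["pickup_datetime"].dt
train["pickup_hour"] = dt.hour
train["pickup_day_of_week"] = dt.dayofweek  # Monday = 0, Sunday = 6
train["pickup_day"] = dt.day
train["pickup_month"] = dt.month
train["pickup_year"] = dt.year
```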
As expected, the average taxi fare has increased over the years.
Across months, although there have been fewer pickups from July to December, the average fare is almost constant.
We observed that although the number of pickups is higher on Saturday, the average fare amount is lower. On Sunday and Monday, the number of trips is lower but the average fare amount is higher.
The average fare amount is highest at 5 AM, while the number of trips at 5 AM is the lowest. This is because 83% of the trips at 5 AM are to the airport. The number of trips is highest during hours 18 and 19.
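Comparisons like these, of trip counts against average fares per hour, weekday, month or year, come down to a group-by aggregation. A minimal sketch on made-up numbers follows; it assumes a `pickup_hour` column has already been derived from the pickup datetime, and that column name is hypothetical.

```python
import pandas as pd

# Made-up trips; the real analysis runs on the cleaned training data.
trips = pd.DataFrame({
    "pickup_hour": [5, 5, 18, 18, 18, 19],
    "fare_amount": [52.0, 45.0, 9.5, 12.0, 8.5, 10.0],
})

# Number of trips and mean fare for each pickup hour.
by_hour = trips.groupby("pickup_hour")["fare_amount"].agg(
    trip_count="count", avg_fare="mean")

busiest_hour = by_hour["trip_count"].idxmax()
priciest_hour = by_hour["avg_fare"].idxmax()
```

Replacing `pickup_hour` with a weekday, month or year column gives the other breakdowns in the same way.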
Based on the features created using this exploratory analysis, the baseline model using XGBoost scored an RMSE of 3.03760 on the public leaderboard, which is in the top 15 percent. The code for this post can be found here.
In the next part of this article, we will see how to use the features identified in this exploratory analysis to create machine learning models, and how to evaluate those models. I hope you found this article useful and that it helped you build confidence to solve this challenge on Kaggle. As always, all discussion and suggestions are welcome.