New York City Taxi Fare Prediction

Brij Patel · Published in Analytics Vidhya · Aug 21, 2019
Can we predict a rider’s taxi fare?

Ever since I was first exposed to this dense network of technologies, I have had many questions that needed answers. One of them was how my taxi fare was already decided whenever I booked a ride on OLA or Uber. Not only did they quote a fare up front, but the price also changed with time and traffic. I think I finally got my answer, so now it's time to explain to you all how it's done!

So, as usual, I was browsing datasets on Kaggle for fun and to learn something new. One beauty caught my attention, where the main task was to predict a rider's fare. The twist is that the data is genuinely huge, consisting of 55 million entries. Since I was very curious to learn and explore more about this dataset but couldn't process all of it, I decided to cut some corners, and here we go:

By the way, you can find the dataset here.

Import Relevant Libraries

import numpy as np               # numerical arrays and linear algebra
import pandas as pd              # dataframes for loading and cleaning data
import sklearn                   # scikit-learn (general ML utilities)
import seaborn as sns            # statistical visualization
import matplotlib.pyplot as plt  # plotting

As mentioned earlier, the given dataset consists of 55 million entries, and it was really not possible to load all of that at once, so after referring to some of the notebooks I decided to work with a sample (some notebooks take as much as 20 %, i.e. 11 million entries; the code below loads 1 million rows to keep things fast). We could take any sample of the data, but 55 million is simply too large, and a sample is sufficient for training purposes. The trick I used is the nrows parameter of pd.read_csv. For those who are really interested in loading the whole tank, here is how you can do it:
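A workable approach (a sketch of my own, not from the original notebook; the reduced dtypes are assumptions chosen to shrink memory use) is to stream the CSV in chunks and concatenate at the end:

# Hypothetical full-load sketch: read the 55M-row file in chunks so it never
# has to sit in memory at full float64 precision.
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}
chunks = pd.read_csv("../input/new-york-city-taxi-fare-prediction/train.csv",
                     usecols=list(dtypes) + ['key', 'pickup_datetime'],
                     dtype=dtypes,
                     chunksize=5_000_000)   # 5 million rows at a time
full_train = pd.concat(chunks, ignore_index=True)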

Loading Dataset

train = pd.read_csv("../input/new-york-city-taxi-fare-prediction/train.csv", nrows = 1000000)
test = pd.read_csv("../input/new-york-city-taxi-fare-prediction/test.csv")

The training set consists of 1,000,000 rows and 8 columns, while the test set consists of 9,914 entries and 7 columns (it has no fare_amount, since that is what we have to predict).

Well, now that we have loaded our guns, it's time to aim them at the target and make some adjustments, that is, to preprocess our data.

Data Pre-processing

Before making any changes, we should find out the number of null entries in the dataframe to avoid any confusion later. So what are we waiting for? Let's begin the cleaning phase!

train.isnull().sum()

So we have a negligible number of null entries. It’s more convenient to eliminate them.

train = train.dropna(how = 'any', axis = 'rows')
test.isnull().sum()

I am so clumsy: in the hurry of finding all the null entries, I completely forgot to take a glance at our dataset to learn more about it and its attributes.

train.head()
You will also see a passenger_count column in the output.

So, as seen from the dataframe, there are 7 independent columns and one dependent column, which is fare_amount.

Let’s begin analyzing our fare_amount to see if we find any outliers or extreme values in the data frame.

train['fare_amount'].describe()

The minimum value of fare_amount is -$44, which seems unrealistic.

Fares are always positive values, so let’s just drop all the negative values in the training data.

train = train.drop(train[train['fare_amount']<0].index, axis=0)

After going through fares, our next stop is the longitudes and latitudes. These, I think, will be the most important attributes for estimating the fare, since the larger the difference between the drop-off and pickup points, the higher the fare.

However, before analyzing the distance let’s just set some constraints.

Though I am not an expert in geography, after a quick search I found the following:

Latitudes range from -90 to 90, and longitudes range from -180 to 180.

Removing all the invalid locations leaves us with useful data that can be used to calculate the distance between two points.

train = train.drop(train[(train['dropoff_latitude'] < -90) | (train['dropoff_latitude'] > 90)].index, axis=0)
train = train.drop(train[(train['dropoff_longitude'] < -180) | (train['dropoff_longitude'] > 180)].index, axis=0)

After referring to some kernels and going through the discussions, I found that we need to add a couple more columns to our dataframe to get some major insight for predicting the fare.

Instead of using the pickup and drop-off longitudes and latitudes individually, let's create two new columns, 'diff_lat' and 'diff_long', holding the absolute difference between the corresponding records.

It is going to be very useful (though it was not my idea; whoever thought of it was a genius). Instead of using a distance calculator or the Euclidean distance, which could slow down our training process, we can work directly with the differences between the pickup and drop-off longitudes and latitudes, because a difference of 1 degree of latitude is roughly 69 miles (and 1 degree of longitude is about 52 miles at New York's latitude). And what's the total span of New York?

Far less than that.
train['diff_lat'] = ( train['dropoff_latitude'] - train['pickup_latitude']).abs()
train['diff_long'] = (train['dropoff_longitude'] - train['pickup_longitude'] ).abs()
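To make the degree-to-miles intuition concrete, here is a small illustrative sketch (my own addition, not part of the original pipeline; the conversion factors are the rough approximations mentioned above) that turns the degree differences into an approximate trip distance in miles:

MILES_PER_DEG_LAT = 69.0    # approx. miles per degree of latitude
MILES_PER_DEG_LONG = 52.0   # approx. miles per degree of longitude near NYC (~40.7° N)

# Approximate straight-line distance, for intuition only; the model below
# keeps using the raw diff_lat / diff_long values directly.
train['approx_miles'] = np.sqrt((train['diff_lat'] * MILES_PER_DEG_LAT)**2 +
                                (train['diff_long'] * MILES_PER_DEG_LONG)**2)
train['approx_miles'].describe()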

As observed from the values, almost all of them are between 0 and 1, and that is exactly how it should be: taxis are mostly used for traveling within the city, and as mentioned above, even a 1-degree difference covers far more distance than the city spans.

After going through this, I was curious to check whether there are some combinations with a difference of more than 1. Let's see:

plot = train.iloc[:2000].plot.scatter('diff_long', 'diff_lat')
I think we found one more outlier!

It's time to eliminate our outlier:

train = train[(train.diff_long < 5.0) & (train.diff_lat < 5.0)]

For now, our most important attribute is done, but one more attribute is left that needs to be analyzed and filtered: passenger_count.

train['passenger_count'].describe()
Maximum: 208? Was it a taxi or an airplane?

I think we need to get rid of this outlier too.

train = train.drop(train[train['passenger_count']==208].index, axis = 0)

Linear Regression

I recently unlocked a new power after referring to one of the kernels, and I am going to share it with you here:

We are going to use numpy's np.linalg.lstsq function to find the optimal weight vector w. But what is the optimal weight?

Here's the simple math that numpy's lstsq function implements: given a matrix A and a vector b, it finds the x that minimizes ||A·x − b||², i.e. the least-squares solution of A·x = b. In closed form, when AᵀA is invertible, that solution is x = (AᵀA)⁻¹Aᵀb.

In our case, A is our X (the set of all independent variables) and b is our y (the dependent variable, fare_amount), so the solution x is our optimal weight vector w.

This will be very useful and efficient in improving our model's accuracy. To implement the above formula, we first need to convert our dataframe into a matrix, and also add a column of ones to the matrix for the constant (intercept) term. Let's create a function that handles all of the above:

def get_input_matrix(df):
    # Stack diff_long, diff_lat, and a column of ones (the intercept term).
    return np.column_stack((df.diff_long, df.diff_lat, np.ones(len(df))))

train_X = get_input_matrix(train)
train_y = np.array(train['fare_amount'])

Finally, we are ready for the last step of our training phase:

(w, _, _, _) = np.linalg.lstsq(train_X, train_y, rcond = None)
print(w)
If we just want a quick hint of what this means: the weights are positive, which tells us that as the distance increases, the fare amount increases as well. So we are going in the right direction!

And for those who have still not understood how we got this w, I am going to explain it one more time, as it is really important to the training phase.

X·w = y

Consider the third column of X to be all 1's, so X holds all the diff_long and diff_lat values plus the 1's, and y holds the fare_amount values. With n training rows, w holds the three values we solve for, and the dimensions work out as:

(n × 3) · (3 × 1) = (n × 1)
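To tie this back to the closed-form formula above, here is a quick sanity check (my own sketch, not in the original post) confirming that lstsq's answer matches the normal-equations solution w = (XᵀX)⁻¹Xᵀy:

# Solve the normal equations directly and compare with lstsq's answer.
# np.linalg.solve is used instead of forming an explicit inverse, for stability.
w_normal = np.linalg.solve(train_X.T @ train_X, train_X.T @ train_y)
print(np.allclose(w, w_normal))   # expected output: True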

Testing Phase

So we have trained our model; now it's time for results. Let's check how our model performs on the test dataset.

# get_input_matrix is the function we defined above to convert a dataframe into a matrix.
test_X = get_input_matrix(test)
test_y = np.matmul(test_X, w).round(decimals = 2)
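A side note: Kaggle's test set has no fare labels, so we cannot score ourselves on it locally. If you want a rough local estimate of performance (my own sketch, not part of the original walkthrough), hold out a slice of the training data and compute the RMSE there:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out 10% of the training rows as a validation set.
tr, val = train_test_split(train, test_size=0.1, random_state=0)

# Fit on the 90% split, predict on the held-out 10%.
w_val, _, _, _ = np.linalg.lstsq(get_input_matrix(tr), tr['fare_amount'], rcond=None)
val_pred = np.matmul(get_input_matrix(val), w_val)

rmse = np.sqrt(mean_squared_error(val['fare_amount'], val_pred))
print(f"Validation RMSE: {rmse:.2f}")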

Results

submission = pd.DataFrame()
submission["key"] = test.key
submission["fare_amount"] = test_y
submission.to_csv('submission.csv', index = False)
We are getting this from just a small sample of the data.

Scope for Improvements

Well, the bigger the better. This was the result of training on only a sample of the data. If we increase the proportion of data used, the results will definitely improve, but the computation will just as surely get slower!

Well, that's it for now. If you have any suggestions, you can mention them below. Being a beginner, I am always open to changes and improvements, and your feedback would really motivate me to keep on writing!

And here's the link to the complete code for reference.
