Predicting Taxi Fare in New York City

Diego Hurtado
9 min read · Jan 30, 2023


Predict Taxi Fare in NY using Machine Learning Techniques

Visualization of 55,423,856 trip records in New York (Diego Hurtado)

Introduction

The general approach to predicting taxi fares in New York with machine learning is to train a model on historical data about past taxi rides, such as pickup and drop-off locations, time of day, and weather conditions.

The model would then be used to make predictions on future taxi fares based on input data, such as a requested pickup and drop-off location and time. Techniques such as regression analysis, decision trees, and neural networks could be used to train the model. Additionally, feature engineering and feature selection techniques may be used to identify the most relevant factors in determining taxi fares.

The goal of this approach is to create a model that can accurately predict taxi fares, so that both taxi companies and customers benefit from knowing the price of a ride beforehand.

Data

The dataset comes from a competition hosted in partnership with Google Cloud and Coursera. The goal is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City, given the pickup and dropoff locations.

The dataset contains 55,423,856 trip records (about 55 million) with their fare cost, from 2009-01-01 to 2015-06-30, and 8 features: key, fare_amount, pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, and passenger_count.

Visualization of 55,423,856 trip records in New York (Diego Hurtado)

Data Cleaning

Clean and preprocess the data to ensure that it is in a format that can be used to train the model.

Removing outliers

The process of removing outliers of coordinates that are in the water can be performed using a combination of data analysis and spatial analysis techniques. Here are the steps to perform this process:

  1. Identify the water bodies: First, you need to identify the water bodies that you want to remove the outliers from. This can be done by using a water body shapefile or by using a satellite imagery layer.
  2. Extract the coordinates: Next, extract the coordinates of the data points that you want to remove the outliers from.
  3. Perform spatial analysis: Use a spatial analysis tool, such as the “point in polygon” analysis, to determine if each coordinate is inside, outside, or on the boundary of the water bodies.
  4. Identify the outliers: Outliers are coordinates that are located inside the water bodies. These coordinates can be identified by using the results of the spatial analysis.
  5. Remove the outliers: Finally, you can remove the outliers from your dataset by deleting or excluding the coordinates that are located inside the water bodies.

This process can be useful for applications where you want to remove incorrect or erroneous data points from your dataset, or to ensure that your data is spatially accurate. By removing outliers in the water, you can improve the quality and accuracy of your data and avoid potential errors in your analysis.
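
As a minimal sketch of steps 2-5, assuming a water-body polygon is already loaded (the variable water_polygon below is hypothetical, e.g., a shapely geometry read from a shapefile with geopandas; column names follow this dataset):

```python
import pandas as pd
from shapely.geometry import Point

def drop_points_in_water(df: pd.DataFrame, water_polygon) -> pd.DataFrame:
    """Keep only rows whose pickup coordinates fall outside the water polygon.

    `water_polygon` is assumed to be a shapely Polygon/MultiPolygon; the
    same filter can be repeated for the dropoff coordinates.
    """
    on_land = df.apply(
        lambda row: not water_polygon.contains(
            Point(row["pickup_longitude"], row["pickup_latitude"])
        ),
        axis=1,
    )
    return df[on_land]
```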

Visualization of trips in New York (Diego Hurtado)
Density Map of trips in New York (Diego Hurtado)

Test Dataset

A test dataset is a set of data that is used to evaluate the performance of a machine-learning model. It is typically used to verify the accuracy and robustness of the model by comparing its predictions to known outcomes. The test dataset is only used to evaluate the final model and it should be representative of the real-world data that the model will encounter in production.

The test set contains the input features for about 10K rows; the goal is to predict fare_amount for each row.

Visualization of trips of the test Dataset (Diego Hurtado)
Trips between the places with the longest travel distances — Diego Hurtado

Feature engineering

Identify the most relevant factors that determine taxi fares and create new features based on these factors. For example, you can calculate the distance between pickup and drop-off, or create a new feature to indicate whether it is a weekend or not.

Create new features

Create new features that represent the interaction between two other features, such as the product of the distance and the time of a taxi ride.

Time

  • Year
  • Month
  • Day
  • Hour
  • Weekend
  • Rush Hour
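
As a sketch, these time features can be derived from pickup_datetime with pandas; the rush-hour window below (weekday trips starting 7-10am or 4-7pm) is my assumption, not a fixed definition:

```python
import pandas as pd

df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["year"] = df["pickup_datetime"].dt.year
df["month"] = df["pickup_datetime"].dt.month
df["day"] = df["pickup_datetime"].dt.day
df["hour"] = df["pickup_datetime"].dt.hour
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
df["weekend"] = (df["pickup_datetime"].dt.dayofweek >= 5).astype(int)
# Rush hour: weekday trips starting 7-10am or 4-7pm (an assumed window)
df["rush_hour"] = (
    (df["weekend"] == 0) & df["hour"].isin([7, 8, 9, 16, 17, 18, 19])
).astype(int)
```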

Distance

To calculate the distance we could use the travel distance from the Google Maps API, but since the dataset has millions of records the number of queries would be impractical, so I am using the haversine distance instead:

Haversine Distance

The haversine distance is a measure of the great-circle distance between two points on a sphere, such as the Earth. It is often used to calculate the distance between two GPS coordinates (latitude and longitude).

The formula for the haversine distance is as follows:

d = 2 * R * asin(sqrt(sin²((lat2-lat1)/2) + cos(lat1) * cos(lat2) * sin²((lon2-lon1)/2)))

Where:

  • d is the distance between the two points
  • R is the radius of the sphere (e.g., the Earth’s radius of approximately 6,371 km)
  • lat1 and lat2 are the latitudes of the two points
  • lon1 and lon2 are the longitudes of the two points

The haversine distance can be used to calculate the travel distance between two points on the surface of the Earth, taking into account the curvature of the Earth. This can be useful for applications such as determining the closest airport to a city, or the travel distance between two cities by air.
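
Translated into code, a vectorized NumPy implementation of the formula above (a sketch; column names follow the dataset):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    R = 6371.0  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))

df["distance_km"] = haversine_km(
    df["pickup_latitude"], df["pickup_longitude"],
    df["dropoff_latitude"], df["dropoff_longitude"],
)
```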

Labeling the coordinates of cities of New York

Labeling coordinates based on their location relative to a polygon involves determining if a given point (represented by its coordinates) is inside, outside, or on the boundary of the polygon. This process is commonly referred to as “point in polygon” analysis.

Check if the point is inside the polygon: This can be done using the “ray casting” method. A ray is drawn from the point to a point outside of the polygon. If the number of times the ray intersects with an edge of the polygon is odd, the point is inside the polygon. If the number of intersections is even, the point is outside of the polygon.
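
A from-scratch sketch of the ray-casting test (in practice a library such as shapely does this for you via Polygon.contains):

```python
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) lies inside `polygon`.

    `polygon` is a list of (lon, lat) vertices. A horizontal ray is cast to
    the right of the point and crossings with polygon edges are counted:
    an odd count means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:  # intersection lies to the right of the point
                inside = not inside
    return inside
```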

Labeling coordinates based on their city
Taxi Trips in Manhattan — Diego Hurtado

This process can be useful for various applications, such as mapping and spatial analysis, where it is important to determine the relationship between points and polygons. The labeling can be used to group points into different regions or to determine if a point is located inside a particular area of interest.

Trips by the hour in Manhattan — Diego Hurtado

Clustering the coordinates

Clustering the coordinates based on their locations is a process of grouping similar coordinates into clusters. This can be useful for various applications, such as spatial analysis, market segmentation, and pattern recognition.

This process can help you uncover patterns and relationships in the data that may not be immediately apparent. By grouping similar coordinates together, you can better understand the spatial distribution of your data and make informed decisions based on this information.
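
One common approach is k-means clustering. A minimal sketch with scikit-learn, where the number of clusters (15 here) is an assumption that would normally be tuned, e.g., with the elbow method:

```python
from sklearn.cluster import KMeans

coords = df[["pickup_latitude", "pickup_longitude"]].values
kmeans = KMeans(n_clusters=15, n_init=10, random_state=42)
df["pickup_cluster"] = kmeans.fit_predict(coords)
```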

EDA

Exploratory Data Analysis (EDA) is an approach to analyzing and understanding data, primarily through visualizations and statistical summaries. It helps to identify patterns, anomalies, relationships, and other features in the data that may not be immediately obvious.

The goal of EDA is to gain an understanding of the data and to form hypotheses about it that can be tested in later stages of analysis. EDA is an iterative process, as insights gained from one stage can lead to additional questions and new ways of looking at the data.

Taxi Fare by the hour in Manhattan — Diego Hurtado

I created four distance categories for the trips.
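
A minimal sketch of such binning with pandas.cut; the bin edges (in km) and labels below are illustrative, not the exact ones used for the figures:

```python
import numpy as np
import pandas as pd

bins = [0, 2, 5, 10, np.inf]  # assumed edges in km
labels = ["short", "medium", "long", "very long"]
df["distance_category"] = pd.cut(df["distance_km"], bins=bins, labels=labels)
```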

Fare based on the number of Passengers

Trips Distribution by Hour

Machine Learning Model to predict Fare Cost

There are several machine learning models that can be used to approach the problem of predicting taxi fares in New York, including linear regression, decision trees, random forests, and neural networks.

Linear regression is a simple model that tries to find a linear relationship between the input features and the target variable (taxi fare). This model can work well for basic prediction problems, but may not capture more complex relationships in the data.

I wrote a blog post about how to implement linear regression from scratch in Python, covering both the model representation and the implementation.

Here you can find the link.

Decision trees and random forests are more sophisticated models that can handle non-linear relationships in the data. They work by recursively dividing the data into smaller and smaller subsets based on the values of the input features until a prediction can be made for each subset.

Ultimately, the choice of model will depend on the complexity of the problem and the available data, and different models may need to be tried and compared to determine which one provides the best results.

Random Forest Regressor

I am using a Random Forest Regressor because it is a powerful machine learning algorithm well suited to predicting taxi fares for several reasons: it can capture non-linear relationships, it copes well with high-dimensional data, and its ensemble averaging improves accuracy and reduces overfitting.
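
A minimal training sketch; the feature list reflects the engineered columns above, and the hyperparameters are illustrative defaults rather than tuned values:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

features = ["distance_km", "year", "month", "day", "hour",
            "weekend", "rush_hour", "passenger_count"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["fare_amount"], test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
```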

Evaluate the model

Evaluating a model is the process of measuring its performance and accuracy in making predictions on new data. The goal is to determine whether the model is a good fit for the problem it was designed to solve.

Evaluation Metrics

Various metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and others can be used to evaluate the model’s performance and accuracy in making predictions.
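
These metrics can be computed with scikit-learn on a held-out split (continuing the sketch above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_val)
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)
print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}  R^2: {r2:.3f}")
```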

Visualizations

Plotting the predicted values against the actual values helps to visually assess the model’s performance.
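
As a sketch, a quick predicted-vs-actual scatter with matplotlib; points close to the red diagonal indicate accurate predictions:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(y_val, y_pred, s=1, alpha=0.3)
lims = [0, max(y_val.max(), y_pred.max())]
plt.plot(lims, lims, color="red", linewidth=1)  # perfect-prediction line
plt.xlabel("Actual fare ($)")
plt.ylabel("Predicted fare ($)")
plt.title("Predicted vs. actual taxi fares")
plt.show()
```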

Evaluation Metrics by subcategory

Feature Importance

Feature importance in Random Forest Regressor refers to the relative contribution of each feature to the prediction of the target variable. It provides insight into which features are most important in explaining the target variable, and how they are related to each other.

Random Forest Regressor calculates feature importance by aggregating results across the many decision trees in the ensemble. Each split in a tree is made on the feature that best reduces impurity, and a feature’s importance is obtained by summing the impurity reductions of all the splits that use it, averaged over the trees in the forest.

Gini Importance

This method calculates feature importance based on the decrease in impurity (measured by the Gini index) in the target variable after each split.

Permutation Importance

This method calculates feature importance by randomly permuting the values of a feature and measuring the change in the model’s performance.
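
Both flavors are available in scikit-learn; a sketch that continues the example above (impurity-based importances come directly from the fitted model, permutation importances are recomputed on the validation split):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Gini (mean decrease in impurity) importances from the fitted forest
gini = pd.Series(model.feature_importances_, index=features)
print(gini.sort_values(ascending=False))

# Permutation importances: shuffle each feature, measure the score drop
perm = permutation_importance(model, X_val, y_val,
                              n_repeats=5, random_state=42)
perm_imp = pd.Series(perm.importances_mean, index=features)
print(perm_imp.sort_values(ascending=False))
```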

Feature importance provides valuable information that can be used to identify and remove redundant or irrelevant features, reduce overfitting, and improve the interpretability of the model. It is an important tool for understanding the relationships between the features and the target variable and can help to guide the direction of further analysis.

Contact Me:

Bailey the Golden Retriever Sitting at a Desk (knowyourmeme, 2022)

LinkedIn: Diego Gustavo Hurtado Olivares
