Predicting Ride-Sharing Platform Pricing in New York City with Deep Learning

AllenZhou
Published in AI4SM · Oct 26, 2023

By Yanhao Li and Wangwenzan Zhou, as part of the course project for ECE1724H: Bio-inspired Algorithms for Smart Mobility. Dr. Alaa Khamis, University of Toronto, 2023.

Image credit: https://quotecatalog.com/

1 Introduction

In the context of smart mobility systems, especially within densely populated urban areas served by platforms like Uber and traditional taxis, ride pricing is an ill-structured optimization problem. The preliminary goal of the course project is to construct a dynamic pricing model that captures the dynamics of rides within ride-sharing platforms, incorporating an equilibrium analysis that encompasses the incentives of both passengers and drivers, along with the platforms’ objectives. To narrow the research scope, the course project and this article exclusively investigate pricing within ride-sharing platforms operating in New York City (NYC).

The objective of this article is to analyze and visually examine the relevant datasets to discover potential patterns, investigate relationships, and build a more solid understanding of the issue to be addressed in the course project. Additionally, this article presents a preliminary pricing prediction model based on the dynamics of rides on ride-sharing platforms, aiming for reduced model complexity and rapid adaptation to different grid topologies.

2 Problem Characterization

2.1 Problem Definition

The problem is to predict how much passengers will pay (not including tips) for a ride when using a ride-sharing platform in New York City.

2.2 Importance

Predicting ride-sharing platform pricing in NYC is an important problem with wide-reaching implications for consumers, service providers, and urban planning. Predicting the fare that passengers will pay for a ride may enable accurate pricing, budgeting, and planning for drivers, passengers, and platform operators, and may support the development of dynamic pricing strategies on these platforms. Specifically, fare predictions may provide transparency and cost estimation to passengers, help guarantee fair pay and compensation for drivers, and help platforms optimize their pricing strategies.

2.3 Challenging Aspects

In theory, fluctuations in demand and supply lead to rapid price changes, presenting a challenge for developing models capable of keeping pace with these real-time variations. Ride-sharing platforms generate extensive data, including historical trip records, user behavior, platform-specific information, and external variables such as weather and special events. The sheer volume and variety of this data make it challenging to analyze and integrate the diverse sources into a prediction model. External factors, such as special events, traffic incidents, or unforeseen demand surges, may significantly influence pricing and introduce unpredictability into the model, necessitating effective handling of outliers. Achieving a balance between model accuracy and complexity is challenging as well, especially given the complexity introduced by large amounts of geospatial data, as overly complicated models may compromise the computational efficiency required for real-time predictions.

3 Related Work

Numerous studies have focused on the benefits of implementing dynamic pricing in ride-sharing platforms such as Uber, Lyft, and Via, leading to the development, examination, and evolution of various pricing models and strategies.

Yan et al. [9] explore the significant impact of ride-sharing platforms on urban transportation, focusing on the evolving research and implementation of advanced matching and dynamic pricing algorithms. They highlight the crucial role of these algorithms in reducing waiting times for both riders and drivers, discuss the potential benefits of jointly optimizing the two techniques, and outline practical challenges and future research directions.

Banerjee et al. [1] examine dynamic pricing strategies for ride-sharing platforms, focusing on the need for economic models that account for the incentives of both drivers and passengers, and on stochastic models that capture the dynamic nature of the system. Their main finding is that the effectiveness of dynamic pricing lies in its resilience to uncertainties in system parameters, rather than in achieving optimal performance compared to static pricing models.

New methods and models, such as a novel Graph Neural Network (GNN) framework and Deep Reinforcement Learning, are being actively used to address dynamic pricing and real-time market challenges. For example, the GNN framework is used to improve efficiency and reliability in solving the optimal power flow (OPF) problem in real-time electricity markets, as demonstrated through numerical tests [7]. In the context of the electric vehicle (EV) industry, Deep Reinforcement Learning is employed to tackle issues like unbalanced utilization of fast charging stations (FCSTs) and long charging wait times. This approach proposes a dynamic pricing strategy that optimizes FCST operation profit, reduces road congestion, and enhances user satisfaction by considering real-time charging price changes and predicting traffic flow variations [4].

Deep learning algorithms outperform traditional methods in processing geospatial data with improved precision. For instance, deep learning models excel in predicting ride-sourcing demand for various origin-destination pairs, effectively handling spatial and temporal complexities; studies using Manhattan’s for-hire-vehicle datasets have demonstrated that this approach performs better than existing methods [5]. Furthermore, Graph Neural Networks (GNN) have successfully predicted taxi-out times at airports as part of the Airport Collaborative Decision Making (ACDM) initiative [6]. In comparative tests, both GNN and Gradient Boosted Machines (GBM) models outperformed standard FAA and EUROCONTROL methods, with GNN showing a slight edge in accuracy.

4 Problem Datasets

The primary data is vehicle trip data and the corresponding geospatial region division. A major data source is the Taxi and Limousine Commission (TLC), which is the authority tasked with licensing and overseeing various types of transportation services in the city, including Medallion (Yellow) taxi cabs, for-hire vehicles (black cars, community-based liveries, and luxury limousines), commuter vans, and paratransit vehicles [2]. Considering the influence of weather conditions on fares, it is essential to include the weather data as well. In order to align trip and weather data, datasets from the same year are selected. A copy of all datasets used for this midterm article can be found in: https://drive.google.com/drive/folders/1OUrBZZwuU90gGuhH0-Dx2NHEb2dZjuwB?usp=sharing

4.1 NYC for-hire vehicles (FHV) trip data

This 2021 dataset includes detailed records from high-volume for-hire vehicle (HVFHV) bases like Uber, Lyft, and others, featuring each trip’s date, time, pickup and drop-off zones, mileage, duration, fares, tolls, passenger sharing status, and wheelchair accessibility.

4.2 Taxi zone data

The NYC taxi zone dataset and shape files [3] include the detailed geometric information of NYC taxi zones, corresponding to the pickup and drop-off zones in the FHV trip data. These taxi zones are designed to approximately align with the Neighborhood Tabulation Areas (NTAs) established by the NYC Department of City Planning. The purpose of these taxi zones is to provide a convenient way to approximate neighborhoods, allowing people to distinguish the specific neighborhood where a passenger is initially picked up and the neighborhood to which they are subsequently dropped off.

4.3 Weather data

The 2021 NYC weather dataset [8] contains comprehensive weather details, including, but not limited to, temperature, dew point, humidity, precipitation, probability of precipitation, type of precipitation, snowfall, ground snow depth, wind speed, wind direction, visibility, and UV index.
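Before any analysis, the three datasets can be loaded into dataframes. The snippet below is a minimal sketch; the file names are placeholders for whatever local copies of the TLC HVFHV trip records, taxi zone shapefile, and weather data are used.

import pandas as pd
import geopandas as gpd

# Hypothetical local file names -- substitute the actual paths to the downloaded data.
tripdata01 = pd.read_parquet("fhvhv_tripdata_2021-01.parquet")  # HVFHV trips (Uber, Lyft, ...)
taxi_zones_gdf = gpd.read_file("taxi_zones.shp")                # NYC taxi zone polygons
weather01 = pd.read_csv("NYC_weather_2021.csv")                 # daily NYC weather

print(tripdata01.shape, taxi_zones_gdf.shape, weather01.shape)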

5 Problem Formulation and Modeling

In our model, we use weighted and directed graphs based on NYC taxi zones instead of the detailed open street map of NYC. Each node in the graph represents a taxi zone, with features like weather (temperature, precipitation, wind speed, visibility) and neighborhood statistics (population, road density). Edges between nodes signify trips in both directions between zones, weighted by trip details like date, time, mileage, duration, base fare, tolls, and sales tax.

Graph Collection

G = (V, E)

where V = {v_1, v_2, ..., v_N} is the set of nodes and E ⊆ V × V is the set of directed edges. Each node v_i represents a distinct taxi zone, and each directed edge (v_i, v_j) represents a trip from zone v_i to zone v_j.

Node Feature Matrix:

X ∈ R^(N×d)

where X_i (the i-th row of X) is the feature vector of node v_i and d is the number of node features, which include shape_length, shape_area, LocationID, price, weather, and demographics.

Edge Feature Matrix:

X^E ∈ R^(|E|×d_E)

where each row is the feature vector of one directed edge (trip) and d_E is the number of edge features, such as date, time, mileage, duration, base fare, tolls, and sales tax.

GNN modeling

A graph neural network will be developed and utilized. This method not only accommodates directed and weighted graphs but also retains and uses edge information during GNN propagation, which is essential for many real-world graphs that rely heavily on connection topology. The entire process, from feature normalization to final prediction, can be encapsulated in a single equation:

Y = GNN(G, Normalize(X), Normalize(X^E); Θ)

where:
Y represents the output dynamic pricing.
G is the graph collection mentioned before.
Normalize(·) is the normalization function applied to the node and edge features, and Θ denotes the trainable parameters of the GNN.

Computing the mean μ_X and standard deviation σ_X of the node features X, we normalize the node features:

X̂ = (X − μ_X) / σ_X

and, if edge features are used, we similarly compute μ_(X^E) and σ_(X^E) and normalize the edge features:

X̂^E = (X^E − μ_(X^E)) / σ_(X^E)

Then, the initial node embeddings H^(0) are set to the normalized feature vectors:

H^(0) = X̂

There is, of course, a more intuitive formula for the final prediction:

Y = H^(L) W_O + b_O

where H^(L) is the node embedding after the final message-passing layer, and W_O and b_O are the weight matrix and bias of the output layer; both belong to the parameter set Θ in the formulation above.

Mean Squared Error (MSE) loss function for regression:

L = (1/|D|) Σ_(i∈D) (Y_i − Ŷ_i)²

where D is the set of training samples, Ŷ_i is the predicted fare, and Y_i is the observed fare. The GNN will be trained to minimize this loss. This completes the mathematical formulation of using a GNN to model and predict dynamic pricing in NYC. By leveraging the spatial and feature-rich structure of the graph, the model can learn the intricate relationships that influence pricing across different locations, dates and times, weather conditions, and other varying factors.
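To make the normalization and loss computation above concrete, the following minimal PyTorch sketch standardizes a toy node feature matrix and evaluates an MSE loss between hypothetical predicted and observed fares; all numbers are illustrative only.

import torch

# Hypothetical node feature matrix: N = 4 zones, d = 3 features each.
X = torch.tensor([[1.0, 200.0, 0.5],
                  [2.0, 180.0, 0.7],
                  [3.0, 220.0, 0.2],
                  [4.0, 250.0, 0.9]])

# Normalize each feature column: X_hat = (X - mu_X) / sigma_X.
mu_X = X.mean(dim=0)
sigma_X = X.std(dim=0)
X_hat = (X - mu_X) / sigma_X  # these become the initial node embeddings H^(0)

# MSE loss between hypothetical predicted and observed fares.
predicted_fares = torch.tensor([12.5, 30.1, 18.0])
observed_fares = torch.tensor([13.0, 28.5, 19.2])
mse = torch.mean((predicted_fares - observed_fares) ** 2)

print(X_hat)
print(f"MSE loss: {mse.item():.3f}")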

6 Exploratory Spatial Data Analysis (ESDA)

The notebook containing detailed Python codes and generated plots, along with necessary data, can be found in:
https://drive.google.com/drive/folders/1OUrBZZwuU90gGuhH0-Dx2NHEb2dZjuwB?usp=sharing

6.1 Data preprocessing

In preprocessing our trip, weather, and taxi zone data, we structured the data for analysis and handled missing values carefully. We removed columns with many missing values, like the base number and arrival time in the FHV trip dataset, and the “severerisk” in the weather dataset, as they had limited predictive value. For columns with few missing values, we filled gaps based on related data, such as deducing missing “preciptype” in the weather data from the “snow” column.

The NYC taxi zone dataset, being smaller and unique, kept all records, labeling missing values in “service_zone” and “Zone” as “Unknown.” We also simplified date and time data in the FHV dataset into categories like weekdays/weekends and morning/afternoon/evening/night for better analysis.
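As an illustration, a pandas sketch of these preprocessing steps might look like the following; the exact column names (for example, originating_base_num and on_scene_datetime for the dropped FHV columns) are assumptions based on the description above rather than the project's verbatim code.

import pandas as pd

# Drop sparsely populated columns with limited predictive value (assumed names).
tripdata01 = tripdata01.drop(columns=["originating_base_num", "on_scene_datetime"], errors="ignore")
weather01 = weather01.drop(columns=["severerisk"], errors="ignore")

# Fill the few missing values that can be deduced from related columns:
# if snowfall was recorded, the missing precipitation type is assumed to be snow.
weather01.loc[weather01["preciptype"].isna() & (weather01["snow"] > 0), "preciptype"] = "snow"
weather01["preciptype"] = weather01["preciptype"].fillna("none")

# Keep every taxi zone record, labeling missing descriptive fields as "Unknown".
for col in ["service_zone", "zone"]:
    if col in taxi_zones_gdf.columns:
        taxi_zones_gdf[col] = taxi_zones_gdf[col].fillna("Unknown")

# Simplify pickup timestamps into coarse categories for analysis.
pickup_dt = pd.to_datetime(tripdata01["pickup_datetime"])
tripdata01["weekday"] = (pickup_dt.dt.dayofweek < 5).map({True: "Weekday", False: "Weekend"})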

6.2 Visualization, computation and analysis

After preprocessing, we will perform visualization, computation, and analysis to identify patterns, explore relationships, and set the stage for modeling. Our process includes:

Geographic Visualization: Retrieving the boundaries of NYC taxi zones and calculating their centroids for a visual map of the zones.

Data Analysis: Using the cleaned FHV trip dataset, we’ll calculate the average trip duration and distance for each zone.

Data Alignment: By aligning the taxi zones with trip data using the pickup location ID, we can analyze the pickup frequency in each zone.

Heatmap Creation: To visually represent pickup frequency, we’ll construct a heatmap (like Figure 1), where darker shades indicate zones with more pickups.

import matplotlib.pyplot as plt

# Merge the tripdata with taxi_zones_gdf
cols_to_merge = ["PULocationID"]
merged_data = tripdata01[cols_to_merge].merge(taxi_zones_gdf, left_on="PULocationID", right_on="LocationID")

# Calculate the number of trips per zone
pickup_counts = merged_data["PULocationID"].value_counts().reset_index()
pickup_counts.columns = ["LocationID", "pickup_count"]

# merge pickup_count and taxi_zones_gdf
taxi_zones_gdf = taxi_zones_gdf.merge(pickup_counts, on="LocationID", how="left")
taxi_zones_gdf["pickup_count"].fillna(0, inplace=True)

# Convert the pickup_count to float
taxi_zones_gdf['pickup_count'] = taxi_zones_gdf['pickup_count'].astype(float)

# Plot the pickup_count
fig, ax = plt.subplots(figsize=(10, 10))
taxi_zones_gdf.plot(ax=ax, column="pickup_count", legend=True, cmap="YlOrRd", edgecolor="black")
ax.set_title("Pickup Heatmap in NYC")
plt.show()
Figure 1: Pick-up heatmap
# Convert the pickup_datetime and weather datetime to date
tripdata01['pickup_datetime'] = pd.to_datetime(tripdata01['pickup_datetime']).dt.date
weather01['datetime'] = pd.to_datetime(weather01['datetime']).dt.date

# merge the data set
merged_data = tripdata01.merge(weather01, left_on="pickup_datetime", right_on="datetime")

# Filter the data set for temperatures above zero
above_zero = merged_data[merged_data['temp'] > 0]

# Calculate pickup counts for above zero temperature
above_zero_counts = above_zero["PULocationID"].value_counts().reset_index()
above_zero_counts.columns = ["LocationID", "pickup_count_above_zero"]

# Merge the counts with the taxi_zones_gdf
taxi_zones_gdf_above_zero = taxi_zones_gdf.merge(above_zero_counts, on="LocationID", how="left")
taxi_zones_gdf_above_zero["pickup_count_above_zero"] = taxi_zones_gdf_above_zero["pickup_count_above_zero"].astype(float)

# Plot the map for Temperature > 0°C
fig, ax = plt.subplots(figsize=(10, 10))
taxi_zones_gdf_above_zero.plot(ax=ax, column="pickup_count_above_zero", legend=True, cmap="YlOrRd", edgecolor="black")
ax.set_title("Heatmap for Temperature > 0°C")
plt.show()
Figure 2: Pick-up heatmap (given temperature over 0)
# Filter the merged_data set for temperatures below or equal to zero
below_zero = merged_data[merged_data['temp'] <= 0]

# Calculate pickup counts for below or equal to zero temperature
below_zero_counts = below_zero["PULocationID"].value_counts().reset_index()
below_zero_counts.columns = ["LocationID", "pickup_count_below_zero"]

# Merge the counts with the taxi_zones_gdf
taxi_zones_gdf_below_zero = taxi_zones_gdf.merge(below_zero_counts, on="LocationID", how="left")
taxi_zones_gdf_below_zero["pickup_count_below_zero"] = taxi_zones_gdf_below_zero["pickup_count_below_zero"].astype(float)

# Plot the map for Temperature ≤ 0°C
fig, ax = plt.subplots(figsize=(10, 10))
taxi_zones_gdf_below_zero.plot(ax=ax, column="pickup_count_below_zero", legend=True, cmap="YlOrRd", edgecolor="black")
ax.set_title("Heatmap for Temperature ≤ 0°C")
plt.show()
Figure 3: Pick-up heatmap (given temperature less than or equal to 0)
# Filter the merged_data set for snowfall greater than 0.5
snow = merged_data[merged_data['snow'] > 0.5]

# Calculate pickup counts for snowy days
snow_counts = snow["PULocationID"].value_counts().reset_index()
snow_counts.columns = ["LocationID", "snow_count"]

# Merge the counts with the taxi_zones_gdf
taxi_zones_gdf_snow = taxi_zones_gdf.merge(snow_counts, on="LocationID", how="left")
taxi_zones_gdf_snow["snow_count"] = taxi_zones_gdf_snow["snow_count"].astype(float)

# Plot the map for snowy days
fig, ax = plt.subplots(figsize=(10, 10))
taxi_zones_gdf_snow.plot(ax=ax, column="snow_count", legend=True, cmap="Blues", edgecolor="black")
ax.set_title("Heatmap for Snow")
plt.show()
Figure 4: Pick-up heatmap (given snowfall greater than 0.5)

We merged taxi pickup timestamps with weather data to analyze how weather affects taxi ride distribution. We created four heatmaps based on two conditions: temperature (above or at/below 0°C) and snowfall (more or less than 0.5 inches). These visualizations help us understand how temperature and snowfall influence taxi pickups. Each heatmap shows the number of pickups in different taxi zones under varying weather conditions, revealing patterns of higher taxi activity in certain areas during warmer days or snowy weather. This approach allows us to identify which zones see more ridership in specific weather scenarios.

# Filter the merged_data set for snowfall less than or equal to 0.5
not_snow = merged_data[merged_data['snow'] <= 0.5]

# Calculate pickup counts for days without significant snowfall
not_snow_counts = not_snow["PULocationID"].value_counts().reset_index()
not_snow_counts.columns = ["LocationID", "not_snow_count"]

# Merge the counts with the taxi_zones_gdf
taxi_zones_gdf_not_snow = taxi_zones_gdf.merge(not_snow_counts, on="LocationID", how="left")
taxi_zones_gdf_not_snow["not_snow_count"] = taxi_zones_gdf_not_snow["not_snow_count"].astype(float)

# Plot the map for days without significant snowfall
fig, ax = plt.subplots(figsize=(10, 10))
taxi_zones_gdf_not_snow.plot(ax=ax, column="not_snow_count", legend=True, cmap="Blues", edgecolor="black")
ax.set_title("Heatmap for Not Snow")
plt.show()
Figure 5: Pick-up heatmap (given snowfall less than or equal to 0.5)

In our data preprocessing, we categorized taxi pickup dates and times into “part of the week” (weekday or weekend) and “part of the day” (morning, afternoon, evening, night). This categorization helps us analyze taxi demand across NYC’s zones during different times, using heatmaps for comparison. In these heatmaps, shown in Figure 6, darker shades represent more pickups in a zone, while lighter shades indicate fewer pickups, with a color bar providing the number of pickups.

# Convert the 'pickup_datetime' back to datetime format to extract the hour
tripdata01['pickup_datetime'] = pd.to_datetime(tripdata01['pickup_datetime'])

# Define a function to assign a time period based on hour
def assign_time_period(hour):
    if 7 <= hour < 12:
        return "Morning"
    elif 12 <= hour < 18:
        return "Afternoon"
    elif 18 <= hour < 24:
        return "Evening"
    else:
        return "Night"

# Apply the function to the dataframe
tripdata01['time_period'] = tripdata01['pickup_datetime'].dt.hour.apply(assign_time_period)

# Now, for each time period, we will calculate the pickup counts and plot the heatmap
time_periods = ["Morning", "Afternoon", "Evening", "Night"]
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))

for time_period, ax in zip(time_periods, axes.ravel()):
    subset = tripdata01[tripdata01['time_period'] == time_period]
    counts = subset["PULocationID"].value_counts().reset_index()
    counts.columns = ["LocationID", "pickup_count"]

    # Merge counts with taxi_zones_gdf
    temp_gdf = taxi_zones_gdf.merge(counts, on="LocationID", how="left")
    temp_gdf["pickup_count"] = temp_gdf["pickup_count"].astype(float)

    temp_gdf.plot(ax=ax, column="pickup_count", legend=True, cmap="YlOrRd", edgecolor="black")
    ax.set_title(f"Heatmap for {time_period}")

plt.tight_layout()
plt.show()
Figure 6: Pick-up heatmap (given different date and time intervals)

The heatmap analysis in Figure 6 shows different taxi pickup patterns during the day. Mornings have more central pickups, likely for work commutes. Afternoons remain busy centrally, perhaps due to shopping or work travel. Evenings see high central demand and increased pickups in other areas, indicating post-work travel. Late nights have fewer central but more pickups in other zones, suggesting evening activities or returns home. Zones that are dark throughout the day indicate a constant, high demand for rides.

6.3 Graph

As mentioned in the Problem Formulation and Modeling section, weighted and directed graphs will be utilized. A graph can be created with each zone as a node. Each NYC taxi zone’s centroid is computed and used to locate the corresponding node.

import networkx as nx

# Create a graph with each zone as a node and each zone's center as the node's location.
G = nx.Graph()
for index, row in taxi_zones_gdf.iterrows():
    G.add_node(row['LocationID'], pos=row['center'].coords[0], label=row['zone'])

# Add edges to every pair of nodes
for i in range(len(taxi_zones_gdf)):
    for j in range(i + 1, len(taxi_zones_gdf)):
        G.add_edge(taxi_zones_gdf.iloc[i]['LocationID'], taxi_zones_gdf.iloc[j]['LocationID'])

# This method of plotting the graph is adapted from ChatGPT

from scipy.spatial import Delaunay

# Extract the centers of the taxi zones
centers = [center.coords[0] for center in taxi_zones_gdf['center']]
location_ids = taxi_zones_gdf['LocationID'].tolist()

# Compute the Delaunay triangulation
tri = Delaunay(centers)

# Create a new graph
G_delaunay = nx.Graph()

# Add nodes to the graph
for index, row in taxi_zones_gdf.iterrows():
    G_delaunay.add_node(row['LocationID'], pos=row['center'].coords[0], label=row['zone'])

# Add edges to the graph based on the Delaunay triangulation
for simplex in tri.simplices:
    G_delaunay.add_edge(location_ids[simplex[0]], location_ids[simplex[1]])
    G_delaunay.add_edge(location_ids[simplex[1]], location_ids[simplex[2]])
    G_delaunay.add_edge(location_ids[simplex[2]], location_ids[simplex[0]])

# Draw the graph
fig, ax = plt.subplots(figsize=(12, 12))
nx.draw(G_delaunay, nx.get_node_attributes(G_delaunay, 'pos'), node_size=5, node_color='blue', edge_color='gray', alpha=0.5, ax=ax)
ax.set_title("NYC Taxi Zones with Centroids using Delaunay Triangulation")
plt.show()

Such a graph abstraction has a simple visualization, shown in Figure 7.

Figure 7: Graph abstraction

7 Building Model

7.1 Data preprocessing

The foundational step in constructing a predictive model for dynamic taxi pricing involves the integration and preparation of multiple data sources. These data sources include trip records (tripdata01_new), taxi zones (taxi_zones_gdf), and weather information (weather01_new). The trip data contains essential information about each taxi trip, including pick-up and drop-off location IDs (PULocationID, DOLocationID), while the taxi zones dataset encompasses the geographical and administrative details of New York City’s taxi regions. Weather data contributes environmental factors that may significantly impact taxi demand and pricing.

The initial phase of the data integration process involves creating distinct copies of the taxi_zones_gdf dataset, differentiated as pick-up and drop-off zones, with relevant columns renamed using the add_prefix method to ensure clarity. This renaming strategy facilitates subsequent merging operations. We then merge the trip data with both pick-up (taxi_zones_gdf_pu) and drop-off (taxi_zones_gdf_do) zone datasets based on their respective location IDs. The final integration step combines the merged trip and zones data with the weather dataset, using the pickup_datetime and datetime columns as keys.

Upon merging, we eliminate redundant columns such as PU_LocationID, DO_LocationID, and pickup_datetime. We also conduct a preliminary null value analysis using isnull().sum() to ensure the completeness of the data.
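A minimal pandas sketch of this integration is given below; the PU_/DO_ prefixes and the merge keys follow the description above, but it is a simplified illustration rather than the project's exact code.

# Sketch of the data-integration step described above (column names assumed).
taxi_zones_gdf_pu = taxi_zones_gdf.add_prefix("PU_")   # pick-up zone copy
taxi_zones_gdf_do = taxi_zones_gdf.add_prefix("DO_")   # drop-off zone copy

# Attach zone attributes to each trip via the pick-up and drop-off location IDs.
merged = tripdata01_new.merge(taxi_zones_gdf_pu, left_on="PULocationID", right_on="PU_LocationID")
merged = merged.merge(taxi_zones_gdf_do, left_on="DOLocationID", right_on="DO_LocationID")

# Join daily weather on the pickup date.
final_merged_data = merged.merge(weather01_new, left_on="pickup_datetime", right_on="datetime")

# Drop redundant key columns and check completeness.
final_merged_data = final_merged_data.drop(columns=["PU_LocationID", "DO_LocationID", "pickup_datetime"])
print(final_merged_data.isnull().sum())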

7.2 Graph Construction

With the integrated dataset, we construct a graph representation of New York City’s taxi network using networkx, a powerful library for graph analysis. Each node in the graph represents a taxi zone, while the edges symbolize actual trips between these zones. The node attributes encompass both geographical properties (e.g., shape_length, shape_area, zone, borough) and metadata (objectid, center, geometry). Edge attributes are rich in detail, including not only the trip’s fare and duration but also weather conditions at the time of the trip, such as temperature (temp), precipitation (precip), and visibility (visibility).

# Create an empty graph
G = nx.Graph()

# Add nodes with all information from taxi_zones_gdf
for index, row in taxi_zones_gdf.iterrows():
    G.add_node(row['LocationID'],
               objectid=row['OBJECTID'],
               shape_length=row['Shape_Leng'],
               shape_area=row['Shape_Area'],
               zone=row['zone'],
               borough=row['borough'],
               geometry=row['geometry'],
               center=row['center'])

# Add edges representing trips, with weather information as edge attributes
for index, row in final_merged_data.iterrows():
    G.add_edge(row['PULocationID'], row['DOLocationID'],
               temp=row['temp'],
               precip=row['precip'],
               snow=row['snow'],
               windspeed=row['windspeed'],
               visibility=row['visibility'],
               total_fare=row['total_fare'],
               trip_time=row['trip_time'],
               trip_miles=row['trip_miles'],
               pickup_time=row['pickup_time'],
               weekday=row['weekday'],
               shared_match_flag=row['shared_match_flag'],
               wav_match_flag=row['wav_match_flag'])
len(G.nodes)
# after checking, the rows in final_merged_data where PULocationID or DOLocationID is 57, 103, 104, 105, or 199 are dropped
# visualize the graph
# set the figure size
plt.figure(figsize=(20, 20))

# set the layout
pos = nx.spring_layout(G, k=1)
nx.draw(G, pos, with_labels=True, node_color='skyblue', node_size=10, edge_color='gray')

# display the graph
plt.title("NYC Taxi Zones with Trips")
plt.show()

7.3 Define GCN Model
In our research, we’ve adopted a graph-based approach that aligns seamlessly with our data’s inherent structure. Yet, we faced a conundrum with the standard Graph Convolutional Network (GCN), as it predominantly processes node-centric features, posing a challenge for effective target feature prediction when these are edge-centric. To navigate this, we pivoted towards a Message Passing Graph Neural Network (MP-GNN) framework. The essence of MP-GNN lies in its capacity to enable nodes to exchange information via messages along their connected paths. This dynamic allows for the aggregation of neighborly data, adeptly capturing edge-based attributes within our network.
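For illustration, a generic PyTorch Geometric message-passing layer that folds edge attributes into its messages might be sketched as follows; this only shows the mechanism, not the exact layer used in the code below (which, given our resource constraints, falls back to GCNConv over node features).

import torch
from torch_geometric.nn import MessagePassing

class EdgeAwareConv(MessagePassing):
    """Toy message-passing layer mixing node and edge features (illustrative only)."""
    def __init__(self, node_dim_in, edge_dim_in, dim_out):
        super().__init__(aggr='mean')  # average the incoming messages at each node
        self.lin = torch.nn.Linear(node_dim_in + edge_dim_in, dim_out)

    def forward(self, x, edge_index, edge_attr):
        # Propagate messages along the directed edges of the graph.
        return self.propagate(edge_index, x=x, edge_attr=edge_attr)

    def message(self, x_j, edge_attr):
        # Each message combines the source node's features with the edge's trip features.
        return self.lin(torch.cat([x_j, edge_attr], dim=-1))

# Example shapes: x is [num_nodes, node_dim_in], edge_index is [2, num_edges],
# edge_attr is [num_edges, edge_dim_in].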

Confronted with the substantial volume of our dataset and the limitations of our computational resources, we devised an innovative approach to transpose edge information onto the nodes. Here, each node is conceptualized as a tensor, encompassing 258 distinct features (originally numbered from 1 to 263, adjusted for five previously excluded nodes). These features are reflective of varied trip data between nodes, including aspects like fare, distance, and meteorological conditions.

Addressing the variance in trip frequencies among nodes presented a unique challenge. Our solution entailed two distinct methodologies. The initial method involves condensing each node feature into a vector, encapsulating the average metrics of its respective trip data. While this method is computationally efficient, it might trade off some model accuracy. Alternatively, our second method involves augmenting features with lesser trip frequencies to match those of the most frequented node, ensuring a balanced dataset.

These tailored strategies are designed to harness the potential of MP-GNN effectively in our context, taking into account the specific challenges posed by the size of our dataset and the constraints of our hardware infrastructure.
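To make the first strategy concrete, here is a minimal sketch of how a single averaged feature vector per zone could be assembled; the column names and the use of final_merged_data are assumptions, and the node_features_tensor used in the code below may be built with a richer feature set.

import numpy as np
import torch

# Trip attributes averaged per pick-up zone (assumed columns from the merged data).
trip_feature_cols = ['temp', 'precip', 'snow', 'windspeed', 'visibility',
                     'total_fare', 'trip_time', 'trip_miles']
avg_per_zone = final_merged_data.groupby('PULocationID')[trip_feature_cols].mean()

# One averaged feature vector per taxi zone; zones without recorded trips get zeros.
rows = []
for loc_id in sorted(taxi_zones_gdf['LocationID']):
    if loc_id in avg_per_zone.index:
        rows.append(avg_per_zone.loc[loc_id].to_numpy())
    else:
        rows.append(np.zeros(len(trip_feature_cols)))

avg_node_features_tensor = torch.tensor(np.array(rows), dtype=torch.float)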

# Create a map mapping LocationID to 0-based index
location_to_index = {loc_id: idx for idx, loc_id in enumerate(sorted(taxi_zones_gdf['LocationID']))}

# Prepare the target matrix (num_destinations and the `grouped` fare table come from earlier cells)
target_matrix = np.zeros((num_destinations, num_destinations))
for loc_id in taxi_zones_gdf['LocationID']:
    for dest_id in taxi_zones_gdf['LocationID']:
        if (loc_id, dest_id) in grouped.index:
            # Use the mapping to convert LocationIDs to matrix indices
            loc_idx = location_to_index[loc_id]
            dest_idx = location_to_index[dest_id]
            target_matrix[loc_idx, dest_idx] = grouped.loc[(loc_id, dest_id), 'total_fare']

target_tensor = torch.tensor(target_matrix, dtype=torch.float)
# Build the graph
G = nx.Graph()

# Add nodes
for loc_id in taxi_zones_gdf['LocationID']:
    node_index = location_to_index[loc_id]
    G.add_node(node_index, features=node_features_tensor[node_index])

# Add edges
for loc_id in taxi_zones_gdf['LocationID']:
    for dest_id in taxi_zones_gdf['LocationID']:
        if (loc_id, dest_id) in grouped.index:
            G.add_edge(location_to_index[loc_id], location_to_index[dest_id])
# visualize the graph
plt.figure(figsize=(20, 20))
pos = nx.spring_layout(G, k=1)
nx.draw(G, pos, with_labels=True, node_color='skyblue', node_size=10, edge_color='gray')
plt.title("NYC Taxi Zones with Trips")
plt.show()

Then, we define the GCN model, split the data into training, validation, and test sets, train the model, and visualize the loss.

import torch
import torch.nn.functional as F
import torch.optim as optim
from torch_geometric.nn import GCNConv
from torch_geometric.data import Data

# define the GCN model
class GCN(torch.nn.Module):
    def __init__(self, num_features):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(num_features, 16)
        # output dim is the number of destinations
        self.conv2 = GCNConv(16, num_destinations)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = GCN(num_features=total_feature_length)
# prepare the data (edge_index is built from the graph's edges in an earlier cell)
data = Data(x=node_features_tensor, edge_index=edge_index, y=target_tensor)
# split the data into train and test and validate
num_nodes = data.x.shape[0]
indices = list(range(num_nodes))
np.random.shuffle(indices)

train_split = int(num_nodes * 0.6)
val_split = int(num_nodes * 0.2)

train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

train_mask[indices[:train_split]] = True
val_mask[indices[train_split:train_split + val_split]] = True
test_mask[indices[train_split + val_split:]] = True

data.train_mask = train_mask
data.val_mask = val_mask
data.test_mask = test_mask
# train the model
model = GCN(num_features=total_feature_length)
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
# record the losses
train_losses = []
val_losses = []

# initialize the best loss
best_val_loss = float('inf')
best_epoch = 0

# use the validation set to find the best model
for epoch in range(14000):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = loss_func(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())

    # evaluate on validation set
    model.eval()
    with torch.no_grad():
        pred = model(data)
        val_loss = loss_func(pred[data.val_mask], data.y[data.val_mask])
        val_losses.append(val_loss.item())

    # update the best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
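The loss visualization mentioned above can be produced from the recorded train_losses and val_losses lists; a minimal matplotlib sketch is:

import matplotlib.pyplot as plt

# Plot the recorded training and validation losses over the epochs.
plt.figure(figsize=(8, 5))
plt.plot(train_losses, label="Training loss")
plt.plot(val_losses, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.title("GCN training and validation loss")
plt.legend()
plt.show()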

Now, we can get the best epoch and the best validation loss, and use the best model to predict the test set.

# print the best epoch and the best validation loss
print(f'Best Epoch: {best_epoch}')
print(f'Best Validation Loss: {best_val_loss}')
# use the best model to predict the test set
model = GCN(num_features=total_feature_length)
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(best_epoch + 1):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = loss_func(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# switch to evaluation mode
model.eval()

# predict the total fare from location 1 to other locations
with torch.no_grad():
    pred = model(data)
    test_loss = loss_func(pred[data.test_mask], data.y[data.test_mask])
    print(f'Test Loss: {test_loss.item()}')

# convert the tensor to numpy array
predicted_fares = pred.numpy()

# print the predicted fares
for i, loc_id in enumerate(taxi_zones_gdf['LocationID']):
    print(f"Predicted fares from Location {loc_id} to others:")
    for j, dest_id in enumerate(taxi_zones_gdf['LocationID']):
        print(f"  To Location {dest_id}: {predicted_fares[i, j]:.2f}")
    print()

7.4 Second method: use the average to expand the trip features

We determine the number of destinations, calculate the average features for each node, and pad them to the maximum length. We then build a list storing the features for each node, create the graph, and visualize it as before.
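A rough sketch of this construction is shown below, assuming that grouped holds per (pick-up zone, drop-off zone) aggregates of the listed trip features; it illustrates the averaging-and-padding idea rather than reproducing the exact notebook code.

import numpy as np
import torch

# Assumed aggregated trip-feature columns; `grouped` holds per
# (pick-up zone, drop-off zone) aggregates from an earlier cell.
trip_feature_cols = ['temp', 'precip', 'snow', 'windspeed', 'visibility',
                     'total_fare', 'trip_time', 'trip_miles']
location_ids = sorted(taxi_zones_gdf['LocationID'])

# Gather each zone's per-destination feature vectors.
per_zone = {}
for loc_id in location_ids:
    vectors = [grouped.loc[(loc_id, dest_id), trip_feature_cols].to_numpy(dtype=float)
               for dest_id in location_ids if (loc_id, dest_id) in grouped.index]
    per_zone[loc_id] = np.array(vectors) if vectors else np.zeros((1, len(trip_feature_cols)))

max_len = max(len(v) for v in per_zone.values())

# Pad shorter zones with their own average vector so all nodes share one length.
node_features = []
for loc_id in location_ids:
    vectors = per_zone[loc_id]
    padding = np.tile(vectors.mean(axis=0), (max_len - len(vectors), 1))
    node_features.append(np.vstack([vectors, padding]).flatten())

node_features_tensor = torch.tensor(np.array(node_features), dtype=torch.float)
total_feature_length = node_features_tensor.shape[1]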

Now, we define GCN model again but with different way.

# Define GCN model
class GCN(torch.nn.Module):
    def __init__(self, num_features, num_destinations):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(num_features, 16)
        self.conv2 = GCNConv(16, num_destinations)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)
        return x
# Initialize and train the model
model = GCN(num_features=total_feature_length, num_destinations=num_destinations)
optimizer = optim.Adam(model.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
# Prepare target values (total fare from each node to other nodes)
target_matrix = np.zeros((num_destinations, num_destinations))
for loc_id in taxi_zones_gdf['LocationID']:
    loc_idx = location_to_index[loc_id]
    for dest_id in taxi_zones_gdf['LocationID']:
        dest_idx = location_to_index[dest_id]
        if (loc_id, dest_id) in grouped.index:
            target_matrix[loc_idx, dest_idx] = grouped.loc[(loc_id, dest_id), 'total_fare']

target_tensor = torch.tensor(target_matrix, dtype=torch.float)
# Split the dataset
num_nodes = len(G.nodes)
indices = list(range(num_nodes))
np.random.shuffle(indices)
train_split = int(num_nodes * 0.6)
val_split = int(num_nodes * 0.2)

train_mask = torch.zeros(num_nodes, dtype=torch.bool)
val_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[indices[:train_split]] = True
val_mask[indices[train_split:train_split + val_split]] = True
test_mask[indices[train_split + val_split:]] = True

data = Data(x=node_features_tensor, edge_index=edge_index, y=target_tensor,
train_mask=train_mask, val_mask=val_mask, test_mask=test_mask)
# Function to compute loss
def compute_loss(pred, target, mask):
    loss = torch.zeros(1, device=pred.device)
    for node_idx in torch.where(mask)[0]:
        node_loss = loss_func(pred[node_idx], target[node_idx])
        loss += node_loss
    return loss / torch.sum(mask)
# note: an AI tool helped with this part
# Training loop
train_losses = []
val_losses = []
best_val_loss = float('inf')
best_epoch = 0

for epoch in range(14000):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    train_loss = compute_loss(out, data.y, data.train_mask)
    train_loss.backward()
    optimizer.step()
    train_losses.append(train_loss.item())

    model.eval()
    with torch.no_grad():
        pred = model(data)
        val_loss = compute_loss(pred, data.y, data.val_mask)
        val_losses.append(val_loss.item())

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch

# Then print the best epoch and the best validation loss

After this, we retrain the model and evaluate on the test set.
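This retraining and evaluation mirror the first method; a brief sketch using the same variable names is:

# Retrain a fresh model for the best number of epochs found on the validation set.
model = GCN(num_features=total_feature_length, num_destinations=num_destinations)
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(best_epoch + 1):
    model.train()
    optimizer.zero_grad()
    out = model(data)
    train_loss = compute_loss(out, data.y, data.train_mask)
    train_loss.backward()
    optimizer.step()

# Evaluate on the held-out test nodes.
model.eval()
with torch.no_grad():
    pred = model(data)
    test_loss = compute_loss(pred, data.y, data.test_mask)
print(f"Test Loss: {test_loss.item()}")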

7.5 Solve the vehicle allocation problem

In addressing the complex issue of vehicle allocation for urban taxi and ride-sharing services, we’ve pivoted to a sophisticated approach, utilizing the principles of a genetic algorithm. This choice is motivated by the inherently dynamic and variable nature of urban transport demand, which is influenced by factors such as time, day, local events, and even weather conditions. Our methodology is encapsulated in several meticulously crafted steps:

Fitness Function Formulation: At this juncture, we compute the total potential revenue. This calculation is based on a dual input: the vehicle distribution across various zones and the pricing data as predicted by our Graph Neural Network model.

Population Initialization: We embark on this journey by crafting an initial scheme for the distribution of vehicles. This blueprint serves as our foundational model for subsequent enhancements.

Selection Strategy: We then embark on a process of selective enhancement. Here, allocation plans that demonstrate superior performance — in terms of revenue generation — are chosen for further development.

Crossover Technique: In this innovative phase, we exchange segments of the allocation strategies between pairs of chosen plans. This exchange is akin to genetic crossover and is aimed at spawning novel, potentially more effective, allocation configurations.

Mutation Dynamics: Introducing an element of randomness, we alter segments of certain chosen plans. This step is vital to inject diversity into our solutions, steering clear of potential local maxima.

Iterative Evolution: The process undergoes repeated cycles of selection, crossover, and mutation. This iteration continues until we reach a defined endpoint, which could either be a specific number of cycles or a plateau in the improvement of our solutions.

Through this evolutionary-inspired algorithm, our goal is to adeptly navigate the complexities of vehicle allocation in bustling urban landscapes, thereby enhancing the efficiency and economic viability of taxi and ride-sharing operations.
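The code below assumes that the genetic-algorithm hyperparameters and the GNN fare predictions pred have already been defined; a possible setup, with purely illustrative values, is:

import random
import numpy as np

# Illustrative hyperparameters for the genetic algorithm (assumed values).
num_locations = pred.shape[0]   # one allocation entry per taxi zone
num_vehicles = 1000             # total fleet size to distribute
population_size = 50            # number of candidate allocations per generation
num_generations = 200           # number of evolutionary iterations
crossover_rate = 0.8            # probability of recombining two parents
mutation_rate = 0.1             # probability of randomly perturbing a child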

# Fitness function
def fitness(individual, pred_prices):
    total_revenue = 0
    for i in range(num_locations):
        # Calculate total revenue for each location
        total_revenue += sum(pred_prices[i, :]) * individual[i]
    return total_revenue
# Initialize population
population = np.random.randint(0, num_vehicles, (population_size, num_locations))
population = np.array([ind / ind.sum() * num_vehicles for ind in population]) # Normalize to maintain total number of vehicles
# Genetic Algorithm main loop
for generation in range(num_generations):
    # Evaluate fitness
    fitness_values = np.array([fitness(ind, pred) for ind in population])

    # Selection
    sorted_idx = np.argsort(fitness_values)[::-1]  # Sort by fitness in descending order
    population = population[sorted_idx]  # Select individuals with highest fitness

    # Crossover
    new_population = []
    for _ in range(population_size // 2):
        parent1, parent2 = population[np.random.choice(range(population_size), 2, replace=False)]
        if random.random() < crossover_rate:
            crossover_point = random.randint(1, num_locations - 1)
            child1 = np.concatenate([parent1[:crossover_point], parent2[crossover_point:]])
            child2 = np.concatenate([parent2[:crossover_point], parent1[crossover_point:]])
            new_population.extend([child1, child2])
        else:
            new_population.extend([parent1, parent2])

    # Mutation
    for individual in new_population:
        if random.random() < mutation_rate:
            mutation_point = random.randint(0, num_locations - 1)
            individual[mutation_point] = random.randint(0, num_vehicles)

    # Update population
    population = np.array([ind / ind.sum() * num_vehicles for ind in new_population])  # Normalize
# Output the best solution
best_individual = population[np.argmax([fitness(ind, pred) for ind in population])]
print(f"Best Vehicle Distribution: {best_individual}")

8 Conclusion

In conclusion, our project addressed the challenge of inconsistent trip data across different nodes by implementing two methods: first, condensing each node's trip data into an averaged feature vector, and second, expanding smaller feature sets to align with the largest one. This approach streamlined our analysis and strengthened the robustness of our model.

References

[1] Siddhartha Banerjee, Ramesh Johari, and Carlos Riquelme. Dynamic pricing in ridesharing platforms. ACM SIGecom Exchanges, 15:65–70, September 2016.

[2] Taxi & Limousine Commission. About TLC. https://www.nyc.gov/site/tlc/about/about-tlc.page, 2023. Accessed: 2023-10-25.

[3] Taxi & Limousine Commission. TLC Trip Record Data. https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page, 2023. Accessed: 2023-10-25.

[4] Li Cui, Qingyuan Wang, Hongquan Qu, Mingshen Wang, Yile Wu, and Le Ge. Dynamic pricing for fast charging stations with deep reinforcement learning. Applied Energy, 346:121334, 2023.

[5] Jintao Ke, Xiaoran Qin, Hai Yang, Zhengfei Zheng, Zheng Zhu, and Jieping Ye. Predicting origin-destination ride-sourcing demand with a spatio-temporal encoder-decoder residual multi-graph convolutional network. Transportation Research Part C: Emerging Technologies, 122:102858, January 2021.

[6] Yixiang Lim, Fengji Tan, Nimrod Lilith, and Sameer Alam. Variable taxi-out time prediction using graph neural networks. December 2021.

[7] Shaohui Liu, Chengyang Wu, and Hao Zhu. Graph neural networks for learning real-time prices in electricity market. ArXiv, abs/2106.10529, 2021.

[8] Shuheng Mo. Uber NYC TLC data | Exploratory Data Analysis. https://www.kaggle.com/code/shuhengmo/uber-nyc-tlc-data-exploratory-data-analysis, 2023. Accessed: 2023-10-25.

[9] Chiwei Yan, Helin Zhu, Nikita Korolko, and Dawn Woodard. Dynamic pricing and matching in ride-hailing platforms. Naval Research Logistics (NRL), 67, November 2019.
