Share Ride Economy: Taxi and Uber Demands in NYC : Data Wrangling

To me, from my last post, some ideas are popped up. My objective is to predict pickups of taxi and Uber in each region of NYC. It seems like time series forecasting in case of machine learning. That is, finding the pattern in the past to predict in the future like predict pickups at Brooklyn on April 2016 that is based on pickups at Brooklyn on April 2015. To learn the pickups density of taxi and Uber in NYC, I collect data of taxi and Uber at the first.

For my work, I need to taxi and Uber data containing latitude and longitude of pickups, and pickup time. But I have difficulty to collect 2015 0f Uber data. It doesn’t include latitude and longitude. So, I turned to use 2014 of Uber data and Taxi data.

NYC Taxi data from Google BigQuery

I used SQL to extract NYC taxi data in 2014 from April to September. To make sure latitude and longitude withinNYC, I set the coordinates boundary (40.5774, -74.15) and (40.9176,-73.7004) for pickup and dropoff.

Green Taxi in 2014 from April to September

SELECT  
pickup_datetime, dropoff_datetime,
trip_distance,
round( pickup_longitude, 3 ) as pickup_long,
round( pickup_latitude, 3 )as pickup_lat,
round( dropoff_longitude, 3 )as dropoff_long,
round( dropoff_latitude, 3 )as dropoff_lat,
passenger_count, fare_amount, tolls_amount, total_amount
FROM `bigquery-public-data.new_york.tlc_green_trips_2014`
where
trip_distance > 0
and dropoff_datetime > pickup_datetime
and pickup_datetime between '2014-04-01' and '2014-09-30'
and dropoff_datetime between '2014-04-01' and '2014-09-30'
and passenger_count is not null
AND NOT ((pickup_longitude = dropoff_longitude) AND (dropoff_latitude = dropoff_latitude))
AND pickup_longitude BETWEEN -74.2563 AND -73.6543
AND pickup_latitude BETWEEN 40.4941 AND 40.9313
AND dropoff_longitude BETWEEN -74.2563 AND -73.6543
AND dropoff_latitude BETWEEN 40.4941 AND 40.9313

Yellow Taxi in 2014 from April to September

SELECT  
pickup_datetime, dropoff_datetime,
trip_distance,
round( pickup_longitude, 3 ) as pickup_long,
round( pickup_latitude, 3 )as pickup_lat,
round( dropoff_longitude, 3 )as dropoff_long,
round( dropoff_latitude, 3 )as dropoff_lat,
passenger_count, fare_amount, tolls_amount, total_amount
FROM `bigquery-public-data.new_york.tlc_yellow_trips_2014`
where
trip_distance > 0
and dropoff_datetime > pickup_datetime
and pickup_datetime between '2014-04-01' and '2014-09-30'
and dropoff_datetime between '2014-04-01' and '2014-09-30'
and passenger_count is not null
AND NOT ((pickup_longitude = dropoff_longitude) AND (dropoff_latitude = dropoff_latitude))
AND pickup_longitude BETWEEN -74.2563 AND -73.6543
AND pickup_latitude BETWEEN 40.4941 AND 40.9313
AND dropoff_longitude BETWEEN -74.2563 AND -73.6543
AND dropoff_latitude BETWEEN 40.4941 AND 40.9313

Uber data

I download data from Kaggle. But it provides data in separated way. I need to combine them into one table.

# u1: April u2: May u3: June, u4: July, u5: August, u6: September 
uber_raw = pd.concat([u1,u2,u3,u4,u5,u6], axis=0)
uber_raw.head()

data cleaning

uber_raw.columns=['pickup_datetime', 'Lat', 'Lon', 'Base']
uber_raw.drop(['Base'], axis=1, inplace=True)
uber_raw["type"]="uber"
uber_raw["pickup_datetime"]=pd.to_datetime(uber_raw["pickup_datetime"]).dt.strftime('%Y-%m-%d %H:%M:%S')

Next post, I explore the pickups density in different regions.