Taxi Demand Prediction on Time Series Data with Holt Winter forecasting (loss 0.02)

Pranay Sawant · Published in Analytics Vidhya · Sep 11, 2019 · 14 min read

Introduction

Today we are going to walk through an interesting case study: New York City taxi demand prediction. In simple words, a driver is waiting for passengers somewhere in the city, and we predict the number of pickups in that particular region for the next 10 minutes, so the driver knows how many pickups to expect. In this case, I reduced the MAPE (Mean Absolute Percentage Error) to 0.02. I used Triple Exponential Smoothing, which is a very effective technique when you are dealing with time-series data.

For this case study, we borrowed data from the NYC Taxi and Limousine Commission (TLC). The actual data can be found here. The TLC has collected a large amount of data on all taxi rides in the city.

You can find source code for this case study here.

1. Information on taxis:

Yellow Taxi: Yellow Medallion Taxicabs

These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.

For Hire Vehicles (FHVs)

FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

Green Taxi: Street Hail Livery (SHL)

The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.

Credits: Quora

For this case study, I am considering only the yellow taxis for the time periods Jan-Mar 2015 and Jan-Mar 2016.

2. Business/Real-world problem Overview

The end goal of this case study is to predict the number of pickups in a given location for each ten-minute time interval. We need to break the whole of New York City into small regions and predict the number of pickups for each of those regions.

3. Mapping to ML problem:

It is clearly a regression problem on time series data. We have the following challenges:

  1. How to break up NYC into regions.
  2. How to break the whole time span into 10-minute intervals for every region.

4. Performance metrics

  1. Mean Absolute Percentage Error (MAPE).
  2. Mean Squared Error (MSE).
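For reference, here is a minimal sketch of how both metrics can be computed with NumPy. The array names y_true and y_pred are placeholders, not variables from the actual notebook, and note that I report MAPE as a fraction (e.g. 0.02) rather than a percentage.

import numpy as np

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, reported as a fraction (multiply by 100 for %)
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def mse(y_true, y_pred):
    # Mean Squared Error
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)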

5. Data Information

We should spend a good amount of time understanding the data; a clear understanding of the data gives us a much better idea of how to solve the problem.

Fig.01. There are 19 features in a dataset. Each row in the dataset is a single ride in NYC

We can see in Fig.01 that there are 19 different features, such as dropoff longitude, dropoff latitude, pickup datetime, pickup latitude and longitude, and so on. Note that each row in the dataset is a single trip.

5.1 Data Cleaning

Data cleaning is one of the most important steps in ML. The data will inevitably contain outliers and errors. We will perform univariate analysis on each of the following features.

5.1.1. Pickup Latitude and Pickup Longitude

We will consider only NYC for this case study. New York is bounded by the location coordinates (lat, long) (40.5774, -74.15) and (40.9176, -73.7004), so any coordinates outside this bounding box are simply discarded. We are only concerned with pickups that originate within New York. Source code is here.

We can notice that some points place the taxi pickup location in the ocean, which is obviously an outlier, so I will remove such points and keep only coordinates within the (40.5774, -74.15) to (40.9176, -73.7004) bounding box.
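A minimal sketch of this filter with pandas is shown below; frame is a placeholder name for the raw trips DataFrame, and the latitude/longitude column names follow the schema shown in Fig.01.

# keep only pickups that fall inside the NYC bounding box
# (lat, long) between (40.5774, -74.15) and (40.9176, -73.7004)
mask = (frame['pickup_latitude'].between(40.5774, 40.9176) &
        frame['pickup_longitude'].between(-74.15, -73.7004))
frame_within_ny = frame[mask]
print("fraction of pickups retained:", len(frame_within_ny) / float(len(frame)))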

5.1.2. Dropoff Latitude & Dropoff Longitude

Similar to the pickup latitude and longitude, we need to filter out data points whose dropoff coordinates are not within our defined boundary; if we find any, we simply discard them. Source code is here.

5.1.3. Trip Duration

In our dataset, some trip durations are negative, although a trip duration should obviously be positive, while other trips show durations of around 50k minutes. Using a Box-Cox plot, we can visualize the outliers in the trip-duration feature and remove them. Source code is here.

5.1.4. Speed

Similarly, we used a Box-Cox plot for visualization. Some points show a speed of 192,857,142 miles/hour, which is obviously erroneous, so I will remove such points. The average speed in New York is about 12.45 miles/hr. Source code is here.

5.1.5. Trip Distance

Similarly, the maximum trip distance shown by the percentile analysis is 258.9 miles. We can say this is an outlier, so we will remove such points. Source code is here.

5.1.6. Total Fare

Again, the maximum total fare shown is $3,950,612, which is a practically impossible value, so I will remove those points as well.

After removing all outliers and erroneous points, the filtered data summary is as follows:

Removing outliers in the month of Jan-2015
----
Number of pickup records = 12748986
Number of outlier coordinates lying outside NY boundaries: 293919
Number of outliers from trip times analysis: 23889
Number of outliers from trip distance analysis: 92597
Number of outliers from speed analysis: 24473
Number of outliers from fare analysis: 5275
Total outliers removed 377910
---
fraction of data points that remain after removing outliers in % 97.03576425607495

6. Data-preparation

After removing outliers and erroneous points, our next task is to divide the whole of NYC into segments and to break the whole time span into 10-minute time bins.

6.1 Clustering/Segmentation

Here I will use the K-Means algorithm to break the city into regions, since it tends to produce compact clusters of similar spatial extent. While segmenting, we still need to make sure that clusters are neither too small nor too large. The following two conditions help us with that:

  1. The maximum inter-cluster distance between two neighbouring clusters should be < 2 miles (clusters are not too large).
  2. The minimum inter-cluster distance between two clusters should be > 0.5 miles (clusters are not too small).

The other thing we need to worry about is finding the optimal number of clusters ‘K’ for the K-Means algorithm. We have to tune this hyperparameter and find the smallest K that satisfies the conditions above. Source code is here.

# trying different cluster sizes to choose the right K in K-means
import gpxpy.geo
import numpy as np
from sklearn.cluster import MiniBatchKMeans

coords = frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']].values
neighbours = []

def find_min_distance(cluster_centers, cluster_len):
    # for each cluster center, count how many other centers lie within / beyond 2 miles
    less2 = []
    more2 = []
    min_dist = 1000
    for i in range(0, cluster_len):
        nice_points = 0
        wrong_points = 0
        for j in range(0, cluster_len):
            if j != i:
                # haversine_distance returns meters; divide by 1609.34 to convert to miles
                distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1],
                                                        cluster_centers[j][0], cluster_centers[j][1])
                min_dist = min(min_dist, distance / (1.60934 * 1000))
                if (distance / (1.60934 * 1000)) <= 2:
                    nice_points += 1
                else:
                    wrong_points += 1
        less2.append(nice_points)
        more2.append(wrong_points)
    neighbours.append(less2)
    print("On choosing a cluster size of ", cluster_len,
          "\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):",
          np.ceil(sum(less2) / len(less2)),
          "\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):",
          np.ceil(sum(more2) / len(more2)),
          "\nMin inter-cluster distance = ", min_dist, "\n---")

def find_clusters(increment):
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000, random_state=42).fit(coords)
    frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(
        frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

# we need to choose the number of clusters so that there are more cluster regions
# close to every cluster center, and make sure the minimum inter-cluster
# distance is not too small
for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    find_min_distance(cluster_centers, cluster_len)

The code above tries K from 10 to 100 (in steps of 10), and after running it we can see that K = 40 works best. So we choose K = 40; in simple words, we divide NYC into 40 regions and obtain 40 centroids. Plotting those centroids on the NYC map:

We can see that most centroids are allocated to the Manhattan region, which is the most crowded area in NYC, while the other regions are sparse.

6.2 Time-binning

We have segmented the regions; now we need to segment time into ‘time bins’. As mentioned, we segment time into 10-minute intervals (10 minutes = 600 seconds). We convert every datetime into a Unix timestamp and divide it by 600. So for January 2016 there are 31 days of 24 hours and 60 minutes each, giving 31*24*60/10 = 4464 bins. Similarly, February 2016 (a leap year, 29 days) gives 29*24*60/10 = 4176 bins.

So now we have two pieces: 40 clusters and 4464 bins for the January 2016 data, 40 clusters and 4176 bins for February, and 40 clusters and 4464 bins for March. Source code is here.
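A rough sketch of the binning step is shown below. I assume a pandas DataFrame with the TLC pickup-time column tpep_pickup_datetime and the pickup_cluster column added by K-Means; the exact timezone handling in the notebook may differ.

import pandas as pd

# convert pickup datetimes to Unix timestamps (seconds) and bucket them into 10-minute bins
pickup_sec = pd.to_datetime(frame_within_ny['tpep_pickup_datetime']).astype('int64') // 10**9
jan_2016_start = int(pd.Timestamp('2016-01-01').timestamp())        # start of the month, assumed UTC
frame_within_ny['time_bin'] = (pickup_sec - jan_2016_start) // 600  # 600 s = 10 minutes

# number of pickups per (region, time-bin) pair
pickups_per_bin = frame_within_ny.groupby(['pickup_cluster', 'time_bin']).size()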

7. Smoothing

Some bins have 0 pickups; around midnight, nearly all time bins show zero pickups. These zeros become a headache during the calculations because they can produce divide-by-zero errors (for example in the ratio features). Instead, we can smooth the zero values using their neighbouring values. Suppose t, t1, t2 are consecutive time bins with 50, 0 and 100 pickups respectively; we smooth them as (50 + 0 + 100)/3 = 50, so t = 50, t1 = 50 and t2 = 50. Source code is here.

Now we have 4464 time bins with the number of pickups in each of the 40 regions. We place all time bins in a queue and then handle the boundary cases below:

Case 1: the last value or the last few values are missing, so we have no right limit.
Case 2: the missing values lie between two known values.
Case 3: the first value or the first few values are missing, so we have no left limit.

Case 1: Consider 4 time bins t1, t2, t3, t4 with pickup values 10, 2, 0, 0. The values on the right are missing, i.e. t1 and t2 have 10 and 2 pickups while t3 and t4 have zero. We take the average (10+2+0+0)/4 = 3, so after smoothing the pickups are (3, 3, 3, 3).

Case 2: Similarly, for time bins t1, t2, t3, t4 with pickup values 10, 0, 0, 2, the middle values are missing (t1 = 10, t2 = 0, t3 = 0, t4 = 2). After smoothing we again get (3, 3, 3, 3).

Case 3: For the same four bins with values 0, 0, 2, 10, the values on the left are missing; after smoothing we again get (3, 3, 3, 3).

We use smoothing on the Jan 2015 data since it acts as our training data, and we fill missing values with zero for the Jan 2016 data.
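The snippet below is a simplified sketch of the idea, not the exact notebook code: it spreads each run of zero bins evenly over the run plus its immediate non-zero neighbours. The notebook's handling of runs at the very start or end of the series (Cases 1 and 3) averages over a slightly wider window, as in the worked examples above.

def smooth_zero_bins(pickups):
    # distribute pickups evenly across runs of zero bins and their
    # neighbouring non-zero bins, so no bin is left with a zero count
    values = list(pickups)
    i = 0
    while i < len(values):
        if values[i] == 0:
            j = i
            while j < len(values) and values[j] == 0:   # find the end of the zero run
                j += 1
            left = i - 1 if i > 0 else i                 # include known neighbours where they exist
            right = j if j < len(values) else j - 1
            window = values[left:right + 1]
            avg = sum(window) / float(len(window))
            for k in range(left, right + 1):
                values[k] = avg
            i = right + 1
        else:
            i += 1
    return values

# e.g. smooth_zero_bins([10, 0, 0, 2]) -> [3.0, 3.0, 3.0, 3.0]   (Case 2 above)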

8. Time series and Fourier Transforms

So far we have been saying that the data is a time series; let's plot it, with time on the X-axis and the number of pickups (amplitude) on the Y-axis.

Fig02

If you don’t know anything about the Fourier transform, please check here. In short, the Fourier transform can be used to decompose a waveform/signal (from now on I will use the words wave, waveform and signal interchangeably). Look at the repetitive waveform above: it actually consists of multiple waves with different frequencies and amplitudes. Suppose sine wave s1 has some amplitude and frequency, sine wave s2 has some amplitude and frequency, and s3 also has some amplitude and frequency; adding s1, s2 and s3 gives a composite sine wave, which we will call ‘S’.

When we apply the Fourier transform to the ‘S’ wave, it is decomposed and we get s1, s2 and s3 back, expressed in the ‘frequency domain’. In time-series problems, the Fourier transform can provide very valuable features, because in the frequency domain we can analyze the signal very effectively. We applied the Fourier transform to our signal.

fig. 03. Fourier Transform

We can observe in Fig02 that the time series is repetitive in nature. A single day has 1440 minutes (24 hours * 60 minutes). We divide it into 10-minute intervals, so the wave in Fig02 repeats every 144 bins (1440 mins / 10 mins = 144 bins). Fig03 is the Fourier transform of 24 hours of data.

In Fig03, the first and highest peak is the DC component of the Fourier transform, which we ignore. The second peak is at frequency 1/144, which corresponds to a 24-hour period; the third peak is at 1/72, corresponding to 12 hours; the fourth peak is at 1/36, corresponding to 6 hours; and so on.
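For illustration, here is roughly how such Fourier features can be extracted with NumPy; pickups is a placeholder for one region's smoothed 10-minute pickup series, not a variable from the notebook.

import numpy as np

# decompose one region's pickup series into frequency components
fft_vals = np.fft.fft(pickups)
freqs = np.fft.fftfreq(len(pickups), d=1)   # frequency in cycles per 10-minute bin
amplitudes = np.abs(fft_vals)

# take the top-5 peaks from the positive half of the spectrum
half = len(pickups) // 2
top_idx = np.argsort(amplitudes[:half])[::-1][:5]
top_freqs = freqs[top_idx]    # f_1 ... f_5 (f_1 is usually the DC component, dropped later)
top_amps = amplitudes[top_idx]  # a_1 ... a_5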

9. Baseline Model

We have two strategies for the baseline model:

  1. Ratio Features
  2. Previous value feature

Suppose the current timestamp is 15 January 2016, 5.40 pm, and I want to predict the taxi demand for the next 10 minutes. I will denote the pickups at 15 January 2016, 5.40 pm by P_t 2016.

Likewise, P_t 2015 is simply the pickups at 15 January 2015, 5.40 pm. I will use this notation from now on.

Also note that P_t or R_t represents the current value,

P_t-1 or R_t-1 represents a past value, and

P_t+1 or R_t+1 represents a future value.

As the formulation shows, for the ratio features, if the current timestamp is t, then R_t = P_t 2016 / P_t 2015. For the future timestamp t+1 this becomes R_t+1 = P_t+1 2016 / P_t+1 2015, where P_t+1 2016 is the predicted value and P_t+1 2015 is the actual, available value.

So P_t+1 2016 = R_t+1 * P_t+1 2015, where P_t+1 2015 is known.

The previous value feature is the simplest approach: it just uses past values. If the current timestamp is t, it uses the data at t-1, t-2, t-3, and so on.
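In code, the two baseline ideas look roughly like this; p_2015 and p_2016 are placeholder arrays of pickups per 10-minute bin for the same region in 2015 and 2016, and t is the index of the current bin.

# ratio feature: how 2016 demand compares to 2015 demand in the same bin
r_t = p_2016[t] / p_2015[t]

# to forecast the next bin, multiply a predicted ratio by the known 2015 value
# (r_next_pred would come from one of the moving-average models below)
p_2016_next = r_next_pred * p_2015[t + 1]

# previous value feature: simply reuse the last known 2016 value
p_2016_prev_baseline = p_2016[t - 1]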

9.1.1. Simple Moving Average

Simple Moving Average

The formulation is really simple: it is a plain average of the previous n values of the ratio feature. The best value of n turned out to be 3; I did some hyperparameter tuning on n and found that 3 works best for our case. Source code is here.
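A small sketch of the simple moving average over the ratio series; ratios is a placeholder list of R_t values for one region.

def simple_moving_average(ratios, t, n=3):
    # average of the previous n ratios R_{t-n} ... R_{t-1}; assumes t >= 1
    window = ratios[max(0, t - n):t]
    return sum(window) / float(len(window))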

Similarly, for the previous value feature we took an average of the previous n values; after tuning this parameter, I found that n = 1 performs best.

Note that in the simple moving average we give equal weight to t-1, t-2 and t-3. Can we instead try a different weighting scheme?

9.1.2. Weighted Moving Average

fig05 Weighted Moving Average

For the ratio features, we can see the formulation in fig05: the most recent timestamp gets the highest weight and older ones get progressively lower weights. It is very similar to the simple moving average, except that instead of equal weights at every step, different weights are assigned. Source code is here.
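A sketch of the weighted version, where the most recent ratio gets the largest weight; the window size n here is illustrative, not necessarily the value used in the notebook.

def weighted_moving_average(ratios, t, n=5):
    # weights n, n-1, ..., 1 for R_{t-1}, R_{t-2}, ..., R_{t-n}; assumes t >= n
    weights = list(range(n, 0, -1))
    window = list(reversed(ratios[t - n:t]))   # most recent value first
    return sum(w * r for w, r in zip(weights, window)) / float(sum(weights))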

9.1.3. Exponential Weighted Moving Average

fig06 Exponential Weighted Moving Average

Suppose I want to predict R'_t at the current timestamp, and I have R_t-1 (the past actual value) and R'_t-1 (the past predicted value). I can use both values for the prediction. In the formulation in fig06, note the role of alpha:

if alpha = 0, then R'_t = R_t-1 (we ignore the past prediction);

if alpha = 1, then R'_t = R'_t-1 (we ignore the past actual value).

Since we want to use both the actual and the predicted value, we need to tune alpha well and use it accordingly. Source code is here.

Suppose alpha = 0.8:

R'_t = α ∗ R'_t-1 + (1−α) ∗ R_t-1

R'_t = (0.8 ∗ R'_t-1) + (0.2 ∗ R_t-1)
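A sketch of this exponential weighting applied over the whole ratio series, matching the formula above (preds[-1] plays the role of R'_t-1 and r_prev the role of R_t-1):

def exponential_moving_average(ratios, alpha=0.8):
    preds = [ratios[0]]                  # seed the first prediction with the first actual value
    for r_prev in ratios[:-1]:
        # R'_t = alpha * R'_{t-1} + (1 - alpha) * R_{t-1}
        preds.append(alpha * preds[-1] + (1 - alpha) * r_prev)
    return preds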

10. Base model Result

We can clearly see that the previous value feature works better.

11. Apply ML Regression Models

So far we have used simple baseline models; now it's time to apply regression models. We will use Linear Regression, Random Forest regression and XGBoost regression models.

11.1. Data preparation

We are going to split the data into 70-30 parts: 70% of the data for training and 30% for testing. The features we computed are listed below.

# Now we compute 18 features for every data point:
# 1. cluster center latitude
# 2. cluster center longitude
# 3. day of the week
# 4. f_t_1: number of pickups that occurred in the previous (t-1)th 10-min interval
# 5. f_t_2: number of pickups that occurred in the previous (t-2)th 10-min interval
# 6. f_t_3: number of pickups that occurred in the previous (t-3)th 10-min interval
# 7. f_t_4: number of pickups that occurred in the previous (t-4)th 10-min interval
# 8. f_t_5: number of pickups that occurred in the previous (t-5)th 10-min interval
# 9. Exponential Weighted Moving Avg
# -------------------- Fourier Features --------------------------
# 10. a_1 : amplitude corresponding to 1st highest fourier transform value
# 11. f_2 : frequency corresponding to 2nd highest fourier transform value
# 12. a_2 : amplitude corresponding to 2nd highest fourier transform value
# 13. f_3 : frequency corresponding to 3rd highest fourier transform value
# 14. a_3 : amplitude corresponding to 3rd highest fourier transform value
# 15. f_4 : frequency corresponding to 4th highest fourier transform value
# 16. a_4 : amplitude corresponding to 4th highest fourier transform value
# 17. f_5 : frequency corresponding to 5th highest fourier transform value
# 18. a_5 : amplitude corresponding to 5th highest fourier transform value

Note: the Fourier features were described in section 8. Also note that we ignore the f_1 component, which is the DC component of the Fourier transform, as explained in section 8.
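A rough sketch of this step is shown below; X and y are placeholders for the 18-feature matrix and the pickup counts, and the data is split chronologically rather than randomly because it is a time series.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

split = int(0.7 * len(X))                 # first 70% train, last 30% test
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def mape(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) / np.asarray(y_true))

for model in [LinearRegression(), RandomForestRegressor(), XGBRegressor()]:
    model.fit(X_train, y_train)
    print(type(model).__name__, "Test MAPE:", mape(y_test, model.predict(X_test)))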

12. Result

We can observe that a simple model like the Exponential WMA works somewhat better than complex models like Random Forest and XGBoost. We want the MAPE value to be as small as possible.

13. Holt Winter forecasting

From the above analysis it is very clear that the ‘previous value’ feature is the simplest and most effective feature, and since the data is a time series we should try Holt-Winters forecasting, as it may improve our results.

Fig07. Triple Exponential Moving Avg

In the above formulation, the level can be thought of as the expected or predicted value. The trend is the rate of change of the series over time: if the values are increasing we have an uptrend, and if they are decreasing we have a downtrend. The seasonal component captures the fact that the time series repeats over a regular interval; if it does, we say the data is seasonal.

The idea behind triple exponential smoothing is to apply exponential smoothing to the seasonal components in addition to level and trend. The smoothing is applied across seasons, e.g. the seasonal component of the 3rd point into the season would be exponentially smoothed with the one from the 3rd point of last season, 3rd point two seasons ago, etc.
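Since Fig07 is an image, here are the standard additive Holt-Winters update equations for reference, written in the same plain style as the earlier formulas and using the same indexing as the code below (y_t is the observed value and slen is the season length). This is the textbook form, not a copy of the figure:

level_t = alpha * (y_t − seasonal_(t mod slen)) + (1 − alpha) * (level_(t-1) + trend_(t-1))

trend_t = beta * (level_t − level_(t-1)) + (1 − beta) * trend_(t-1)

seasonal_(t mod slen) = gamma * (y_t − level_t) + (1 − gamma) * seasonal_(t mod slen)   (previous value on the right-hand side)

forecast_(t+m) = level_t + m * trend_t + seasonal_((t+m) mod slen)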

The code below is used for Holt-Winters forecasting (triple exponential weighted moving average).

def initial_trend(series, slen):
    sum = 0.0
    for i in range(slen):
        sum += float(series[i+slen] - series[i]) / slen
    return sum / slen

def initial_seasonal_components(series, slen):
    seasonals = {}
    season_averages = []
    n_seasons = int(len(series)/slen)
    # compute season averages
    for j in range(n_seasons):
        season_averages.append(sum(series[slen*j:slen*j+slen])/float(slen))
    # compute initial values
    for i in range(slen):
        sum_of_vals_over_avg = 0.0
        for j in range(n_seasons):
            sum_of_vals_over_avg += series[slen*j+i] - season_averages[j]
        seasonals[i] = sum_of_vals_over_avg/n_seasons
    return seasonals

def triple_exponential_smoothing(series, slen, alpha, beta, gamma, n_preds):
    result = []
    seasonals = initial_seasonal_components(series, slen)
    for i in range(len(series)+n_preds):
        if i == 0:  # initial values
            smooth = series[0]
            trend = initial_trend(series, slen)
            result.append(series[0])
            continue
        if i >= len(series):  # we are forecasting
            m = i - len(series) + 1
            result.append((smooth + m*trend) + seasonals[i % slen])
        else:
            val = series[i]
            last_smooth, smooth = smooth, alpha*(val-seasonals[i % slen]) + (1-alpha)*(smooth+trend)
            trend = beta*(smooth-last_smooth) + (1-beta)*trend
            seasonals[i % slen] = gamma*(val-smooth) + (1-gamma)*seasonals[i % slen]
            result.append(smooth+trend+seasonals[i % slen])
    return result

The key to improving the result is fine-tuning the alpha, beta and gamma values. I tuned them over the following grid:

alpha = [0.1, 0.2, 0.3, 0.4]
beta  = [0.1, 0.15, 0.20, 0.25]
gamma = [0.1, 0.3, 0.4, 0.5, 0.65, 0.75, 0.85, 0.95]

The best values I found for our problem are alpha = 0.1, beta = 0.1 and gamma = 0.95.
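For reference, the tuning loop could look roughly like the following sketch, using the alpha, beta and gamma grids above; train_series and test_series are placeholders for one region's pickup series, and slen = 144 because one season is one day of 10-minute bins.

import itertools
import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true)

best_params, best_error = None, float('inf')
for a, b, g in itertools.product(alpha, beta, gamma):
    preds = triple_exponential_smoothing(train_series, slen=144, alpha=a, beta=b,
                                         gamma=g, n_preds=len(test_series))
    forecast = preds[len(train_series):]          # keep only the out-of-sample part
    error = mape(test_series, forecast)
    if error < best_error:
        best_params, best_error = (a, b, g), error

print("best (alpha, beta, gamma):", best_params, "MAPE:", best_error)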

Using the above values, I added one more feature to the list: a 19th feature named ‘triple_exp’. Then I trained the Random Forest and XGBoost models on all 19 features.

14. Final Results

+-------------+---------------+---------+
| Metric | Random Forest | XgBoost |
+-------------+---------------+---------+
| Train MAPE | 0.04707 | 0.03034 |
| Test MAPE | 0.04657 | 0.02932 |
+-------------+---------------+---------+

This is a big improvement in model performance. We were expecting a loss of around 0.12, and we reduced it to 0.02, which is brilliant.

So far our best model is XGboost Regressor with Test MAPE: 0.02932

15. Conclusion

For time-series data it is better to use the Moving Average, Weighted MA, Exponential WMA and Triple EWMA as features. The Fourier transform can also provide useful features, although in our case it did not work as well as expected.

Fine-tuning is a key part of improving model performance.

Source code for this case study can be found here.

16. References

  1. https://grisha.org/blog/2016/02/17/triple-exponential-smoothing-forecasting-part-iii/
  2. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
  3. https://www.youtube.com/watch?v=DUyZl-abnNM
  4. https://xgboost.readthedocs.io/en/latest/
  5. https://www.appliedaicourse.com/

You can check out similar interesting blogs here:

Memes Detection App using Deep Learning.

Zomato Rate Prediction Blog.
