# Lyft Surge Pricing Prediction

Imagine, just for a minute, that we live in a world where you order a Lyft and it doesn’t immediately tell you the surge pricing multiplier. How much will the ride be?

Now you are stuck in the rain wondering how much the ride will be. Well, I got the answer! Or, I can at least show you how to get the answer.

First, we need to get the dataset from Kaggle. This dataset contains two CSVs: one with the ride data (distance, cab type (Uber or Lyft), surge multiplier, price, etc.), and the other with the weather. The latter is useful because we will try to predict the surge multiplier based on the current weather and the type of service.

Let’s load the data. Like I mentioned above, there are two CSVs, cab_rides.csv and weather.csv.

```python
import pandas as pd

cab_df = pd.read_csv('cab_rides.csv')
weather_df = pd.read_csv('weather.csv')
```

Let’s check the unique ‘surge_multiplier’ values and their counts using pandas’ .value_counts():

```python
cab_df[cab_df['cab_type'] == 'Lyft']['surge_multiplier'].value_counts()
```

A very imbalanced dataset. We will be applying SMOTE (Synthetic Minority Oversampling Technique) later to oversample the minority classes. Also, as you can see, our target variable is made up of decimals. Wouldn’t that make it a regression problem? Well, seeing as Lyft’s surge multiplier is not a continuous function, but rather a set of discrete values (1.0, 1.25, 1.5, 1.75, 2.0, 2.5, and 3.0), this makes it a multi-class classification problem.

## Data Preprocessing

First, we need to preprocess the data. In the weather_df DataFrame we have some missing values in the rain column. These are empty because there was no rain, so we can fill in those values with 0:

```python
weather_df = weather_df.fillna(0)
```

Now, it’s time to merge the DataFrames. We need a matching feature to join on, and both have a ‘time_stamp’ column holding Unix time in milliseconds. We need to convert it into a DateTime feature using pandas’ .to_datetime() for it to make sense.

```python
cab_df['date_time'] = pd.to_datetime(cab_df['time_stamp'] / 1000, unit='s')
weather_df['date_time'] = pd.to_datetime(weather_df['time_stamp'] / 1000, unit='s')
```

Et voilà! Now both features are in DateTime and can be easily merged. We will join both DataFrames by place, day, and hour. The next bit of code might look a little overwhelming, but all it really does is build a matching key from the city, date, and hour in each DataFrame:

```python
cab_df['merge_date'] = cab_df['source'].astype(str) + ' - ' + cab_df['date_time'].dt.date.astype(str) + ' - ' + cab_df['date_time'].dt.hour.astype(str)
weather_df['merge_date'] = weather_df['location'].astype(str) + ' - ' + weather_df['date_time'].dt.date.astype(str) + ' - ' + weather_df['date_time'].dt.hour.astype(str)

weather_df.index = weather_df['merge_date']
merged_df = cab_df.join(weather_df, on=['merge_date'], rsuffix='_w')
```

This next part is optional. It just checks that our merged data is consistent and actually merged properly.

```python
print((merged_df['merge_date_w'] == merged_df['merge_date']).all())
>>> True
```

GREAT! Now we do a little feature engineering by extracting the date and hour from the ‘date_time’ columns, and deleting some unnecessary columns.

```python
merged_df['day'] = merged_df['date_time'].dt.dayofweek
merged_df['hour'] = merged_df['date_time'].dt.hour

del_cols = ['time_stamp', 'id', 'product_id', 'date_time', 'location',
            'time_stamp_w', 'date_time_w', 'merge_date_w', 'merge_date',
            'distance', 'price']
del_cols_df = merged_df.drop(del_cols, axis=1)
```

Like I mentioned earlier, this dataset contains data from both Uber and Lyft, but seeing as Uber’s ‘surge_multiplier’ column is filled with ‘1.0’s, it doesn’t concern us. So we are only using data related to Lyft.

```python
lyft_df = del_cols_df[del_cols_df['cab_type'] == 'Lyft']
lyft_df.shape
>>> (564730, 13)
```

Now onto the last bit of data preprocessing, and we’ll go into training, I promise!

```python
def onehot_encode(df, column, prefix):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

def preprocess_input(df, locations=False):
    df = df.copy()
    # Drop cab_type column
    df = df.drop('cab_type', axis=1)
    # One-hot encoding
    df = onehot_encode(df, 'name', 'name')
    # With locations
    if locations:
        df = onehot_encode(df, 'destination', 'd')
        df = onehot_encode(df, 'source', 's')
    else:
        df = df.drop('destination', axis=1)
        df = df.drop('source', axis=1)
    # Split the data
    y = df['surge_multiplier'].copy()
    X = df.drop('surge_multiplier', axis=1)
    return X, y

X, y = preprocess_input(lyft_df)
```

As mentioned earlier, our independent variables will be the weather and service type (Lux, Lyft, Lux Black, Shared, Lux Black XL, Lyft XL), and our dependent variable will be the Surge Multiplier.

This is how our features look:
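The screenshot of the feature table didn’t survive here, but you can inspect the encoded features yourself. Below is a minimal sketch with made-up values; the one-hot column names (`name_Lyft`, `name_Lyft XL`) just mirror the prefix passed to `onehot_encode`:

```python
import pandas as pd

# Tiny stand-in for X after preprocessing (values are made up)
X = pd.DataFrame({
    'temp': [42.3, 39.8],
    'clouds': [0.68, 1.0],
    'day': [2, 5],
    'hour': [9, 18],
    'name_Lyft': [1, 0],
    'name_Lyft XL': [0, 1],
})

# Every column should be numeric at this point
print(X.dtypes)
print(X.head())
```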

Great! All of our features are now numbers that the algorithms can work with, so let’s check the class balance. As the value_counts() output showed, we have a very imbalanced dataset, so we will use SMOTE to oversample our minority classes.
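The original imbalance plot is missing here, but value_counts() with normalize=True makes the same point. A sketch with a toy series (with the real data you would call this on the ‘surge_multiplier’ column):

```python
import pandas as pd

# Toy surge values standing in for the real column
y = pd.Series([1.0] * 90 + [1.25] * 6 + [1.5] * 3 + [2.0] * 1)

# normalize=True gives each class's share of the data
shares = y.value_counts(normalize=True)
print(shares)
```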

```python
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
X.shape, y.shape
>>> ((3685619, 14), (3685619,))
```

Let’s see how our class balance is looking after using SMOTE.
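The post-SMOTE plot is also missing, but a quick count check shows what it would have: every class at the same size. Sketched here with a hand-built list standing in for the resampled target:

```python
from collections import Counter

# Stand-in for y after SMOTE: every class oversampled to the same size
y_resampled = [1.0] * 500 + [1.25] * 500 + [1.5] * 500 + [2.0] * 500

counts = Counter(y_resampled)
print(counts)
```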

That looks way better! Now let’s get into encoding our target variable. We have 7 classes in total (1.0, 1.25, 1.5, 1.75, 2.0, 2.5, and 3.0). We will have to encode them using Scikit-Learn’s LabelEncoder class.

```python
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
y = label.fit_transform(y)
```

## Training

Let’s get into training by first using Scikit-Learn’s train_test_split() function to split the dataset into training and test sets, and then performing feature scaling with the StandardScaler() class. Scaling isn’t required for our tree-based algorithms of choice, but it’s always good practice.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

For our models, we will train a Random Forest and an Extra Trees classifier with the OOB (Out-of-Bag) score enabled (Navnina Bahtia has a great article explaining what the OOB score is; I’d suggest you give it a read!). These ensemble classifiers trade in a little bit of bias for less variance.

Random Forest training:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
```

Extra Trees training:

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(bootstrap=True, oob_score=True)
et.fit(X_train, y_train)
```

Let’s print out the OOB Scores to see how our classifiers did.

```python
print(f'OOB Score for Random Forest: {rf.oob_score_}')
print(f'OOB Score for Extra Trees: {et.oob_score_}')
>>> OOB Score for Random Forest: 0.8879819747257003
>>> OOB Score for Extra Trees: 0.8828345542306719
```

As you can see, our Random Forest classifier did a little better than the Extra Trees classifier. Let’s take a look at each classifier’s classification report.
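The report screenshots didn’t make it into this version, but generating one is a single call to Scikit-Learn’s classification_report. Here’s the pattern on toy labels; with the real models you would pass y_test together with rf.predict(X_test) or et.predict(X_test):

```python
from sklearn.metrics import classification_report

# Toy labels; in the article these would be y_test and rf.predict(X_test)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2]

report = classification_report(y_true, y_pred)
print(report)
```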

There you have it! The Random Forest classifier did a better job than the Extra Trees at predicting what the Surge Multiplier will be using our test set.

## Conclusion

Seeing as the surge multiplier is not 100% dependent on the weather or type of service (Uber’s is actually based on supply and demand, so it’s safe to assume Lyft’s is similar), our model didn’t do a bad job at predicting it based solely on the features we used.

It also seems that our classifiers overfitted the 2.5 and 3.0 classes. This is because those two classes combined contain only about 260 unique instances, even after oversampling; very few compared to the 560,000+ instances in the remaining classes.

Thank you so much for taking the time to read this article! If you have any doubts, or criticism, make sure to let me know!

## The Startup

Get smarter at building your thing. Join The Startup’s +799K followers.


Written by

## Arturo Rey

Mechanical Engineer | Data Scientist looking to learn more about data science every day!
