Restaurant Revenue Prediction using Gradient Boosting Regression and Principal Component Analysis

Paul Rivera
5 min read · May 12, 2018


Introduction

Link to Kaggle competition

TFI (Tab Food Investments) owns several restaurant chains in Turkey and China, including Burger King, Popeyes and Sbarro. The goal of this Kaggle competition was to predict the annual revenue of its restaurants across Turkey.

Even though this is a three-year-old competition, I found it very interesting since the training set consisted of only 137 entries while the validation set had 10,000. At the same time, all the numerical features provided were coded from P1 to P37, so there is no way to know what the numbers in those features mean. Therefore, the main challenge was to build a robust model, trained on 137 entries, that performed well on a dataset roughly 70 times larger, while carefully removing the features that were not relevant.

In the end, with an RMSE of 1,720,781.05, I was able to improve on the winner's result (1,727,811.485), borrowing one of his ideas (applying a root transformation to some variables) but also adding my own ideas to the process.

Data Exploration and Transformation

The data has 137 rows and 44 columns, and these columns can be divided into categorical and numerical data. First let's handle the numerical data, which originally includes P1 to P37 and Revenue, our target.

Before visualizing the numerical data, we can turn the ‘Open Date’ feature into a numerical one by computing the number of years each restaurant has been open, i.e. subtracting the year it opened from the current year:

import datetime
import pandas as pd

data = pd.read_csv('train.csv')
now = datetime.datetime.now()

# parse 'Open Date' and derive the number of years each restaurant has been open
data['Open Date'] = pd.to_datetime(data['Open Date'], format='%m/%d/%Y')
data['years_old'] = now.year - pd.DatetimeIndex(data['Open Date']).year

Now we segregate the numerical data (including the new ‘years_old’ feature) from the categorical data and visualize it in scatter plots with each feature as X and Revenue as Y.
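
The segregation step itself isn't shown in the post; the snippet below is a minimal sketch of how it could look (the exact column selection, the .copy() calls and the plot layout are my own assumptions, not the original code):

# sketch: split the frame into numerical and categorical parts, then
# scatter each numerical feature against revenue
import matplotlib.pyplot as plt

num_cols = ['P{}'.format(i) for i in range(1, 38)] + ['years_old', 'revenue']
num_data = data[num_cols].copy()
cat_data = data[['Open Date', 'City', 'City Group', 'Type']].copy()

fig, axes = plt.subplots(nrows=8, ncols=5, figsize=(20, 25))
for ax, col in zip(axes.ravel(), num_data.drop('revenue', axis=1).columns):
    ax.scatter(num_data[col], num_data['revenue'], s=10)
    ax.set_title(col)
plt.tight_layout()
plt.show()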

Scatter plots with each numerical variable on the X axis and Revenue on the Y axis.

Here we can see that some features have little to no relation with Revenue and hence can be removed from the training set.

to_drop = ['P3', 'P7', 'P8', 'P9', 'P10', 'P12', 'P13', 'P14', 
'P15', 'P16', 'P18', 'P26', 'P27', 'P29', 'P31', 'P34',
'P36']
num_data.drop(to_drop, axis=1, inplace=True)

Here I planned to log-transform all the numerical values, but since all the features, excluding ‘years_old’ and ‘revenue’, have values of zero or less, they can't be log-transformed. After reading a post by the winner of this competition, I found out he used a root transformation on the dataset, so I decided to root-transform the variables from P1 to P37 while log-transforming ‘years_old’ and ‘revenue’.

import numpy as np
# log-transform 'years_old' and 'revenue'; root-transform the P features
x = np.log(num_data[['years_old', 'revenue']])
y = np.sqrt(num_data.drop(['years_old', 'revenue'], axis=1))

Later on I scaled the P features with the StandardScaler() function while keeping ‘years_old’ unchanged.

from sklearn.preprocessing import StandardScaler

y_columns = y.columns
scaler = StandardScaler()
y = scaler.fit_transform(y)
y = pd.DataFrame(y, columns=y_columns)
num_data = pd.concat([y, x], axis=1)

Now, let’s move to explore the categorical data, which includes the following features:

  • ‘City’
  • ‘City Group’ (Big Cities and Others)
  • ‘Type’: (Food Court, Inline, Drive Thru, Mobile)

A visualization of the mean revenue by city shows the following:
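
The plot isn't reproduced here, but a simple way to generate it (my own sketch, not the original code) is to group by ‘City’ and plot the mean revenue:

# sketch: mean revenue per city as a bar chart
import matplotlib.pyplot as plt

mean_rev_by_city = data.groupby('City')['revenue'].mean().sort_values()
mean_rev_by_city.plot(kind='bar', figsize=(12, 5), title='Mean revenue by city')
plt.ylabel('Mean revenue')
plt.tight_layout()
plt.show()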

Since ‘City Group’ and ‘Type’ consist of only two and three categories respectively (no ‘Mobile’ restaurants appear in the training set), they can be encoded as numerical values. I used LabelBinarizer() for ‘City Group’ and pandas.get_dummies() to turn every Type category into its own feature (1 when the restaurant is of that Type and 0 when it's not).

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
cat_data['City Group'] = lb.fit_transform(cat_data['City Group'])  # big city = 0
cat_data = pd.concat([cat_data, pd.get_dummies(cat_data.Type)], axis=1)
cat_data.drop(['Open Date', 'City', 'Type'], axis=1, inplace=True)

Then we join the numerical and categorical data back into one data frame and separate our x and y variables:

df_training=pd.concat([cat_data, num_data], axis = 1)
x_values_training = df_training.drop('revenue', axis =1)
y = df_training.revenue

Dimensionality Reduction using Principal Component Analysis

When checking the dimensions of the transformed training set, there are still 25 features.

x_values_training.shape
Out[282]: (137, 25)

I used the PCA() class from scikit-learn to reduce the dimensionality even further, trying explained-variance ratios between 95% and 99%; 98% turned out to give the best accuracy.
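
As a reference for how those thresholds translate into numbers of components, one can inspect the cumulative explained variance of a full PCA fit. This is my own sketch rather than code from the original workflow:

# sketch: map several explained-variance thresholds to component counts
import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(x_values_training)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
for threshold in (0.95, 0.96, 0.97, 0.98, 0.99):
    n_components = int(np.searchsorted(cum_var, threshold)) + 1
    print('{:.0%} of variance -> {} components'.format(threshold, n_components))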

from sklearn.decomposition import PCA

pca = PCA(0.98)  # keep enough components to explain 98% of the variance
pca.fit(x_values_training)
x_values_training = pca.transform(x_values_training)
pca.n_components_
Out[288]: 16

The idea of trying variance ratios greater than 95% was to remove only the dimensions with the least variance. After transforming the training set we are left with only 16 dimensions, which account for 98% of the variance.

Model Selection, Training and Prediction

I compared Support Vector Regression, Linear Regression, Decision Tree Regression and Gradient Boosting Regression models. The one that performed best was Gradient Boosting Regression, followed by Linear Regression, so I decided to focus on GBR and improve its accuracy by tweaking its hyper-parameters, getting an RMSE of 0.2821 on the log-transformed revenue.
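
The comparison itself isn't shown in the post; the sketch below is my own reconstruction of how such a comparison could be run. The protocol (5-fold cross-validated RMSE on the log-transformed revenue) and the default hyper-parameters are assumptions, not the original setup:

# sketch: cross-validated RMSE for each candidate model
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

models = {
    'SVR': SVR(),
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    scores = cross_val_score(model, x_values_training, y,
                             scoring='neg_mean_squared_error', cv=5)
    print('{}: CV RMSE = {:.4f}'.format(name, np.sqrt(-scores.mean())))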

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

n = pca.n_components_
gbr_reg = GradientBoostingRegressor(max_depth=3, n_estimators=n,
                                    criterion='mse', random_state=42)
gbr_reg.fit(x_values_training, y)
y_hat_gbr = gbr_reg.predict(x_values_training)
mse_gbr = mean_squared_error(y, y_hat_gbr)
rmse_gbr = np.sqrt(mse_gbr)
rmse_gbr
Out[290]: 0.28215952408939343

I applied the same transformations used on the training set to the validation set: I dropped the features that I considered to have no impact on revenue, removed the features not shared with the training set (there are entries with the Type ‘MB’, which is not present in the training set) and finally applied the PCA transformation before making the predictions.
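
The construction of df_validation is not shown in the post; the sketch below is my own reconstruction of the steps just described (the names num_val and cat_val and the column handling are assumptions). It mirrors the training pipeline so that the columns line up before the PCA transform:

# sketch: rebuild the validation frame with the same transformations
test = pd.read_csv('test.csv')
test['Open Date'] = pd.to_datetime(test['Open Date'], format='%m/%d/%Y')
test['years_old'] = now.year - pd.DatetimeIndex(test['Open Date']).year

# numerical part: same root/log transforms and scaling as the training set
num_val = test[['P{}'.format(i) for i in range(1, 38)] + ['years_old']].copy()
num_val[y_columns] = scaler.transform(np.sqrt(num_val[y_columns]))
num_val['years_old'] = np.log(num_val['years_old'])

# categorical part: same encodings; drop the 'MB' Type, absent from training
cat_val = test[['City Group', 'Type']].copy()
cat_val['City Group'] = lb.transform(cat_val['City Group']).ravel()
cat_val = pd.concat([cat_val, pd.get_dummies(cat_val.Type)], axis=1)
cat_val.drop(['Type', 'MB'], axis=1, inplace=True)

df_validation = pd.concat([cat_val, num_val], axis=1)  # same column order as training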

# drop features that I consider to have no impact on revenue
df_validation.drop(to_drop, axis=1, inplace=True)
x_values_val = pca.transform(df_validation)  # PCA transformation
y_val = gbr_reg.predict(x_values_val)        # predict using the GBR model
y_val = np.exp(y_val)                        # revert the log transform
y_val_gbr = pd.DataFrame(np.round(y_val, 2), columns=['Prediction'])
y_val_gbr.index.names = ['Id']
y_val_gbr.to_csv('forecast_gbr.csv')

I predicted the values using the previously trained Gradient Boosting Regression model on the transformed validation data set and took the exponential of the results to bring them back from the log scale.

My final score was an RMSE of 1,720,781.05, which improves, three years later, on the first-place result of the competition (1,727,811.485).

Conclusions

After trying different models and playing with their hyper-parameters, one observation I made is that Principal Component Analysis helps reduce the time spent on feature selection. The result I obtained using PCA alone, without first removing the features I considered irrelevant, was 1,826,197.06: not as good as my final solution, but much faster once you account for the time spent deciding exactly which features to discard and trying the different combinations. So a combined methodology of critical thinking (maybe luck?), trial and error, and algorithms (PCA or any other you see fit) helped me cut the dimensions in half and achieve a good result, considering that, in my opinion, dimensionality reduction is the most difficult part of the modeling process.
