Price Reflections in AirBnb-Berlin

Ravi Vemuri
May 3, 2020

--

Predicting the price of a Berlin Airbnb listing for a host by taking features in the data into consideration, such as distance from the city center, number of guests accommodated, bedrooms, etc.

This was my first Capstone project after I decided to transition my career into the Data Science field.

I have stayed in Airbnbs many times, both while attending university and later while working, when travelling to conferences. I have always wondered how hosts determine their prices while taking cues from the wider hospitality industry.

Image by DniproDD on VectorStock

Airbnb is a popular way nowadays for many people to book homes for both official and non-official stays. It has effectively disrupted the traditional hospitality industry as an ever-increasing number of travelers choose to use Airbnb as their primary accommodation provider. Since its inception in 2008, Airbnb has seen tremendous growth, with the number of rentals listed on its site growing exponentially every year.

Berlin is one of the hottest Airbnb markets in Europe, and few cities are more popular. It had over 22,552 listings as of November 2018, according to the Airbnb website. For hosts who would like to offer their homes, the price at which they list is one of the decisions with the greatest impact.

Let us now consider a scenario. Suppose I have a home or an apartment in Berlin (considering I am a host) and I get a new job somewhere else in the city. I would have to relocate, yet I want to keep my present house, so I might wonder whether offering it on Airbnb would be justified, and how much it could be worth. Could this be a profitable choice? In any case, it is hard for potential hosts to understand what the true value of their house is. Moreover, since the location and, for the most part, the furniture are fixed, is there anything else a host can influence — such as commute distances, the number of guests accommodated, income, reviews, etc.?

So I considered whether any special features impact the price in any way, such as special amenities, whether a host is a superhost or not, or any communication patterns that might have an effect.

THE MACHINE LEARNING LIFE CYCLE

Life Cycle steps of Machine Learning

The figure above comprises the steps which I undertook to complete my first project. The most important phase is the data wrangling part, where I spent most of my time cleaning the data into a proper format that could later be used in predictive modeling.

DATA ACQUISITION AND MUNGING

The dataset was obtained from Inside Airbnb, via a Kaggle competition, which I ultimately selected since Kaggle competitions tend to provide good datasets that are cleaned and organized into different file formats. This data is the central asset of the project, from cleaning all the way to evaluating how well the model performs. The dataset contains a total of 6 different CSV files, but I used only one file from the whole dataset because it contained most of the information needed for my project. It consists of around 22,000 rows and 96 columns (features) to work with.

Price determination usually depends on some important factors such as the furniture, the location, the security deposit, how many guests will be staying, etc. So, here too, the price depends mostly on those factors.

Data cleaning

Even though the dataset is fairly clean and not much wrangling was required, there were still missing values in a few of the features that I considered important for further analysis. So I decided to go ahead and clean up those missing values. The data also has to be processed so that it can be used to train the regressor later on.

Besides this, I selected the required features from the DataFrame to create a new DataFrame for the analysis in the next step. You could loosely call this dimensionality reduction, though it is really just manual feature selection rather than something like PCA :)

Here the target variable (the variable to be predicted from the other, independent variables) is the ‘price’ feature.

I used 5 steps here to handle missing values in the selected columns:

  • A column is removed if more than 30% of its values are missing.
  • For numerical columns, an imputer fills the missing values with the mean or median, as appropriate.
  • For categorical features, missing values are filled with the mode.
  • If a feature has many outliers, I removed only the unnecessary data, keeping the meaningful portion that matters for the analysis.
  • Feature engineering is applied to the latitude/longitude columns to extract the distance from the city center.
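As a sketch, the first three steps above might look like this in pandas (the column names here are illustrative, not the actual dataset's):

```python
import pandas as pd

# Toy frame standing in for the listings data (column names are illustrative).
df = pd.DataFrame({
    "bedrooms": [1.0, 2.0, None, 3.0],
    "room_type": ["Entire home", None, "Private room", "Entire home"],
    "mostly_empty": [None, None, None, 1.0],
})

# 1. Drop columns with more than 30% missing values.
df = df.loc[:, df.isnull().mean() <= 0.30]

# 2. Numerical columns: fill missing values with the median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 3. Categorical columns: fill missing values with the mode.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])
```

The same could be done with scikit-learn's SimpleImputer; plain pandas keeps the sketch short.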

VISUALIZING THE DATA/EXPLORATORY ANALYSIS

The next step in the analysis is visualizing the data. This was interesting because I wanted to see the patterns and trends in the relationships between the different independent variables, and between the dependent and independent variables. For instance, I wanted to know the most common amenities offered by hosts renting out their homes, or the average price across different neighborhoods.

Fig 1: Amenity count used frequently by hosts
Fig 2: Average price per neighborhood and property count within each neighborhood

I also wanted to check how the average price behaves when the number of guests accommodated comes into the picture. Intuitively, we expect that as a house accommodates more guests, its price increases. Let's see if this holds in this case too.
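Such a check is a one-line groupby; a toy sketch (with made-up numbers, not the real listings):

```python
import pandas as pd

# Tiny illustrative frame; the real dataset has ~22,000 listings.
listings = pd.DataFrame({
    "accommodates": [2, 2, 4, 4, 6],
    "price": [40.0, 60.0, 80.0, 100.0, 150.0],
})

# Average price per number of guests accommodated.
avg_price = listings.groupby("accommodates")["price"].mean()
print(avg_price)
```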

Fig 3: Average price based on number of accommodates

Looks like it is. I can also check whether a host is a superhost or not, and whether this contributes in some way to how the price is affected. Fig 4 shows the count of whether a host is a superhost: ‘0’ indicates the absence of the superhost title and ‘1’ indicates its presence.

Fig 4: Count whether a host is a superhost

Along with the exploratory analysis, I also dealt with some statistical inference questions about which features or variables have a good correlation with the target variable. For this, I calculated the Pearson Correlation Coefficient (PCC), along with the p-value. According to Wikipedia, the PCC and p-value are defined as follows:

‘Pearson’s correlation coefficient is the test statistic that measures the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance.’

‘The p-value, or probability value, is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct.’
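Both quantities can be computed with scipy's `pearsonr`; a small sketch on synthetic data (not the actual listings), where price is constructed to fall with distance:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
distance = rng.uniform(0, 15, size=200)               # km from the center (synthetic)
price = 100 - 3 * distance + rng.normal(0, 10, 200)   # noisy negative trend

# r is the Pearson correlation coefficient, p_value the two-sided p-value.
r, p_value = pearsonr(distance, price)
print(round(r, 3), p_value < 0.05)
```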

This is an interesting calculation, because it quantifies the relationships between different variables, which proves to be an important factor in building the model later on. I feature-engineered the latitude and longitude to calculate the distance from the city center to a given location. This is a simple function; below is a small snippet of the code.

from geopy.distance import great_circle

def distance_to_mid_center(lat, lon):
    # Great-circle distance (in km) from Berlin's city center.
    berlin_center = (52.5200, 13.4050)
    accommodation = (lat, lon)
    return great_circle(berlin_center, accommodation).km

df_revised_columns['distance_to_midcenter'] = df_revised_columns.apply(
    lambda x: distance_to_mid_center(x.latitude, x.longitude), axis=1)
Fig 5: Regression plot for the city center distance and average price
Fig 6: Regression plot showing the relation between cleaning_fee and price

Fig 5 and Fig 6 plot the correlation between the dependent variable and an independent variable, to see how well correlated they are. Fig 5 shows a slight negative linear correlation, explaining how the price decreases as we move away from the city center, while Fig 6 depicts a positive correlation between cleaning_fee and price. The correlation coefficient here was 0.4041, which indicates a moderate positive correlation.

Besides this, I also wanted to check how the mean price varies across neighborhoods and how many outliers are present, to get a clear picture of the data. This is done with a box plot.

Fig 7: Varying Median price for the neighborhoods

I was also interested in checking the cancellation policy against how prices differ with it. Intuitively, we might expect that a house with a flexible 14-day cancellation policy would be priced differently from one requiring cancellation 2 months in advance. So, a plot was made to see how prices vary.

Fig 8: Cancellation policy and its counts with average prices
Fig 9: Heatmap showing price comparison with respect to bedrooms

A heatmap is an important tool for visualizing two or more features together. The figure above shows the differences in price with respect to the number of bedrooms, with the hue color representing the price.

PREDICTION EXPERIMENTS

Now comes the part where the models are trained, now that the acquired data has been wrangled properly and brought into a consistent format. Since this is about predicting the price, which is a continuous variable, I used regression models for this project. The data was split into 70% training and 30% testing sets using the scikit-learn library (cross-validation comes in later, during hyper-parameter tuning).

Just a note here: I also used ‘one-hot encoding’ to convert the categorical columns into numeric columns, since most machine learning algorithms can only operate on numeric data.
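A minimal sketch of these two steps with pandas and scikit-learn (the data and column names are toy stand-ins, not the real listings):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 100 toy rows standing in for the prepared listings frame.
df = pd.DataFrame({
    "room_type": ["Entire home", "Private room", "Entire home", "Shared room"] * 25,
    "accommodates": [2, 1, 4, 1] * 25,
    "price": [80.0, 40.0, 120.0, 25.0] * 25,
})

# One-hot encode the categorical column so every feature is numeric.
X = pd.get_dummies(df.drop(columns="price"), columns=["room_type"])
y = df["price"]

# 70% / 30% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```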

Mainly, I used all the features from the sub-DataFrame (as a reminder, a subset of features was extracted into a separate DataFrame for analysis) to predict our dependent variable, price. This is a supervised learning task, since our target variable is labelled and can be used to learn a mapping from inputs to outputs.

Models and which is better?

Before training the models, I also applied standard scaling. In my opinion, this is a useful step and should be performed wherever necessary: StandardScaler standardizes variables that are on different scales so that each has mean 0 and standard deviation 1.
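A quick illustration of what StandardScaler does, on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```

One practical note: in a real pipeline the scaler should be fitted on the training split only and then reused to transform the test split, to avoid leaking test information into training.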

There are a couple of models which I used for predicting the price. I used Linear Regression as the baseline model against which to compare the others.

  • OLS (Ordinary Least Squares) model: helps in identifying important features to select for the models.
  • Linear Regression
  • Regularization models (like Lasso and Ridge): useful for penalizing the coefficients and making the model less complex.
  • Random Forest: an ensemble of decision trees. It uses a method called ‘bagging’ to sample the data and aggregates all the decision trees into a well-performing model.
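A sketch of how such a comparison might be set up with scikit-learn, on synthetic data (note: on this linear toy problem the linear models will naturally score highest; the alphas and data are illustrative, not the project's actual configuration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared listings features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate model.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: {s:.3f}")
```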

I found out that Random Forest Model was performing better than the other models.

  • First, I implemented grid-search cross-validation (‘GridSearchCV’) to select the best hyper-parameters for the Random Forest Regressor.
  • Then I trained the model and used it to predict the apartment price to recommend to a host.
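A minimal sketch of the grid-search step (the parameter grid and synthetic data here are illustrative, not the exact ones used in the project):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=1)

# Candidate hyper-parameter values to try exhaustively.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
}

search = GridSearchCV(RandomForestRegressor(random_state=1),
                      param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rf = search.best_estimator_  # refitted on all data with the best params
print(search.best_params_)
```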

Hyper-parameter optimization is very important for the performance of the model we train.

Results

Fig 10: Different comparison of models with its training and testing scores

The negative sign (-) here indicates that those results are not good; this is the baseline model we wanted to check against. Also, as we can see from Fig 10, the Random Forest model performs better than the other models. I used two evaluation metrics to measure the performance of my models; the metrics below are the usual choices for a regression task like this one.

  • RMSE: the root mean squared error, which measures the average magnitude of the error the model makes when predicting the outcome for an observation. Lower is better.
  • R² score: corresponds to the squared correlation between the observed outcome values and the values predicted by the model. The higher the R-squared, the better the model.
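Both metrics are available in scikit-learn; a small sketch with made-up true and predicted prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy ground-truth prices and model predictions.
y_true = np.array([50.0, 80.0, 120.0, 60.0])
y_pred = np.array([55.0, 75.0, 110.0, 65.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of the mean squared error
r2 = r2_score(y_true, y_pred)                       # 1.0 would be a perfect fit
print(round(rmse, 2), round(r2, 3))
```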

From the fitted model, we can obtain the feature importances, which are crucial for understanding what drives the price. Below is a snippet of the code.

# .feature_importances_ is an attribute of a fitted RandomForestRegressor instance
random_forest_model.feature_importances_
Fig 11: Feature importance with Random Forest Regressor

The scatter plots show the actual and predicted values of the price, to assess how well the model predicts the price for each data point in the test set, i.e. on new and unseen data. The fitted line shows how well the predictions correspond to the actual values. I used the Random Forest model to predict the recommended price of a house in Berlin for a host, with n_estimators set to 500, which gave a good aggregation for the ensemble.

Fig 12: Predicted price values for training and testing set

WHAT’S NEXT?

I worked on determining the best model to predict the price of an Airbnb apartment in Berlin for a host who wants to rent out his/her apartment. While the model was quite good at determining the price, other features could still be included, such as whether a host greeting matters, or working with the other CSV files, like reviews, to find additional factors important for the price; this could improve the model's performance further and reduce the mean squared error. We could also check other algorithms for our ML models, like KNN, SVM, etc. Since the dataset does not contain a very high number of dimensions, PCA was not that useful for reduction here, but it's a worthwhile topic to revisit at a later date.

  • The GitHub code for all the work is right here in this repository. Please feel free to have a look.
  • A special thanks to my Springboard Mentor, Konstantin Palagachev who has supported me all the way during this.

Thanks for reading. Kudos!!
