Using Data Science to Understand the Price Gap of Airbnb Accommodations in Rio de Janeiro

Eduardo Ribeiro Vargas Duarte
Published in The Startup
5 min read · Jan 23, 2021
Picture of the city of Rio de Janeiro during the sunset

For this project, I was interested in using Airbnb data from a city in Brazil, the country where I currently live. The only Brazilian data available was for Rio de Janeiro.

The dataset used for this project is available at: http://insideairbnb.com/get-the-data.html

I once used the Airbnb platform to search for houses and apartments to rent in Rio, and I noticed a huge price difference between listings, mainly across neighbourhoods. I used this observation as the starting point for the project.

Apartment available to rent in Rio de Janeiro costing R$ 129,080 (~US$ 23,600) per night!

The dataset contains around 25,800 records of listed apartments, houses and rooms in the city of Rio de Janeiro.

To better understand the available data, I posed three main questions:

Question 1: Which neighbourhoods have the best average rating, and how does the average price influence it?

To answer this question, only neighbourhoods with more than 30 listed accommodations were considered, so that each average is based on a reasonable sample size.
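The filtering and averaging step could look roughly like the sketch below. It assumes the listings are loaded in a pandas DataFrame, that the columns are named `neighbourhood_cleansed`, `price` and `review_scores_rating` (names assumed here, not confirmed by the article), and that `price` has already been converted to a number. The function name and the toy data are made up for illustration.

```python
import pandas as pd


def neighbourhood_averages(listings: pd.DataFrame, min_listings: int = 30) -> pd.DataFrame:
    """Average rating and price per neighbourhood, for well-populated neighbourhoods only."""
    grouped = listings.groupby("neighbourhood_cleansed").agg(
        n_listings=("price", "size"),
        avg_rating=("review_scores_rating", "mean"),
        avg_price=("price", "mean"),
    )
    # Keep neighbourhoods with MORE than `min_listings` records.
    kept = grouped[grouped["n_listings"] > min_listings]
    return kept.sort_values("avg_rating", ascending=False)


# Toy data (made up): "A" has 31 listings and is kept, "B" has 30 and is dropped.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A"] * 31 + ["B"] * 30,
    "price": [100.0] * 31 + [200.0] * 30,
    "review_scores_rating": [4.5] * 31 + [4.9] * 30,
})
summary = neighbourhood_averages(toy)
print(summary)
```

Sorting the same table by `avg_price` instead gives the price ranking.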

Also, a technique known as the ‘z-score’ was used to remove price outliers from the dataset: only listings whose price has a z-score below 3 (z < 3) were kept.
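A minimal sketch of z-score filtering is shown below. It uses the common two-sided form (|z| < 3), which is an assumption, since the article only states z < 3; for prices, which cannot be negative and have a long upper tail, the two usually remove the same rows. The function name, the `price` column name and the toy data are hypothetical.

```python
import pandas as pd


def remove_price_outliers(listings: pd.DataFrame, column: str = "price",
                          threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows whose `column` value is `threshold` or more standard deviations from the mean."""
    values = listings[column]
    # z-score: distance from the mean in units of standard deviation.
    # pandas' std() is the sample standard deviation (ddof=1).
    z_scores = (values - values.mean()) / values.std()
    return listings[z_scores.abs() < threshold]


# Toy data (made up): 30 ordinary prices plus one extreme listing.
toy = pd.DataFrame({"price": [100.0 + i for i in range(30)] + [100000.0]})
cleaned = remove_price_outliers(toy)
print(len(toy), "->", len(cleaned))
```

Note that a single extreme value inflates both the mean and the standard deviation, so with very small samples an outlier may not reach z = 3 at all.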

The images below show the neighbourhoods grouped by average rating and average price.

Average rating for each neighbourhood
Average price for each neighbourhood

From the values and plots above, it is possible to see that the Joá neighbourhood has the highest average price. Joá is a luxurious neighbourhood in Rio, so this gap in the average price seems plausible.

It is also possible to see that, of the 3 neighbourhoods with the highest average rating (Engenho Novo, Alto da Boa Vista and Cosme Velho), only Alto da Boa Vista is among the most expensive neighbourhoods.

In the average price ranking, Engenho Novo is only in 47th position (of 50 neighbourhoods considered), while Alto da Boa Vista and Cosme Velho are 2nd and 18th, respectively.

The 3 most expensive neighbourhoods (Joá, Alto da Boa Vista and São Conrado) are in 6th, 2nd and 43rd position, respectively, in the average rating ranking.

Apparently, there is no direct relationship between rent price and user rating. Some neighbourhoods with a high average price are even near the bottom of the rating ranking (like São Conrado).

Question 2: Does the rent price normally increase as the number of people that the house/apartment can accommodate increases?

For this question, only ‘accommodates’ groups with more than 30 listed accommodations were considered, again so that each average is based on a reasonable sample size.

The image below shows the average price related to the number of people that the house/apartment/room can accommodate.

Average price for each ‘accommodates’ group

Based on the values above, it seems that the average price increases with the number of people that the apartment/house accommodates, for most of the groups considered.

It is not a linear relationship, but there seems to be some correlation between the number of people that the apartment/house accommodates and the price charged.
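One way to put a number on a relationship that is increasing but not linear is a rank (Spearman) correlation. This is not something the article reports; the sketch below only illustrates the idea, together with the group averages. The column names `accommodates` and `price`, the function names and the toy data are assumptions.

```python
import pandas as pd


def price_by_capacity(listings: pd.DataFrame, min_listings: int = 30) -> pd.Series:
    """Average price per 'accommodates' value, keeping groups with more than `min_listings` rows."""
    stats = listings.groupby("accommodates")["price"].agg(["count", "mean"])
    stats = stats[stats["count"] > min_listings]
    return stats["mean"].rename("avg_price")


def rank_correlation(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())


# Toy data (made up): three large groups and one small group that gets filtered out.
toy = pd.DataFrame({
    "accommodates": [1] * 31 + [2] * 31 + [4] * 31 + [16] * 5,
    "price": [100.0] * 31 + [150.0] * 31 + [400.0] * 31 + [90.0] * 5,
})
avg_price = price_by_capacity(toy)
rho = rank_correlation(toy["accommodates"], toy["price"])
print(avg_price)
print("rank correlation:", rho)
```

A value close to 1 means that price almost always rises with capacity, regardless of whether the rise is linear.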

Question 3: Is it possible to predict the rent price of a house/apartment based on its main attributes?

To answer this question, some of the numeric attributes available in the dataset were first used to try to predict the rent price.

The attribute ‘bathroom_text’ needed some treatment to extract numeric values from it.
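The article does not show how this field was treated. A minimal sketch, assuming the field holds free text such as "1 bath", "1.5 shared baths" or "Half-bath" (these example values are an assumption about the data, and `parse_bathrooms` is a hypothetical helper), could be:

```python
import re

# First integer or decimal number in the text, e.g. "1.5" in "1.5 shared baths".
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_bathrooms(text):
    """Return the number of bathrooms as a float, or None when it cannot be read."""
    if not isinstance(text, str):
        return None  # missing values (None or NaN) stay missing
    match = _NUMBER.search(text)
    if match:
        return float(match.group())
    if "half" in text.lower():
        return 0.5  # assumption: a "half-bath" counts as 0.5
    return None


for example in ["1 bath", "1.5 shared baths", "Half-bath", None]:
    print(repr(example), "->", parse_bathrooms(example))
```

With pandas, the helper could then be applied to the whole column, for example with `listings["bathroom_text"].map(parse_bathrooms)`.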

The image below shows the correlation between the numeric values present in the dataset.

Correlation between the numeric attributes

The variables related to reviews don’t seem to have a strong relationship with the price, while the number of beds, bedrooms, bathrooms and people that the apartment accommodates seem to have a strong relationship.
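A correlation table like the one in the image can be computed with pandas. The sketch below ranks the numeric columns by their correlation with the price; the column names, the function name and the toy values are made up for illustration.

```python
import pandas as pd


def correlation_with_price(listings: pd.DataFrame, target: str = "price") -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first."""
    numeric = listings.select_dtypes(include="number")  # ignore text columns
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Toy data (made up): 'beds' follows the price exactly, 'number_of_reviews' does not.
toy = pd.DataFrame({
    "price": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "beds": [1, 2, 3, 4, 5, 6],
    "number_of_reviews": [5, 1, 4, 2, 6, 3],
    "room_type": ["a", "b", "a", "b", "a", "b"],
})
ranking = correlation_with_price(toy)
print(ranking)
```

Correlation only captures linear association, so a low value does not rule out a non-linear relationship.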

These 4 variables (beds, bedrooms, bathrooms and accommodates) were used to try to predict the rent price with a linear regression model.

The metric used was R-squared. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of the total variation of outcomes explained by the model.

The closer this value is to 1, the better the model predicts the values.
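To make the metric concrete, the sketch below fits a linear regression with scikit-learn and computes R-squared on the training and test sets. It also spells out the formula, 1 minus the ratio of residual to total sum of squares. The 70/30 split, the random seed and the synthetic data are assumptions for illustration, not the article's setup or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def r_squared(y_true, y_pred) -> float:
    """R-squared from its definition: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_and_score(X, y, seed: int = 42):
    """Fit a linear model and return (train R-squared, test R-squared)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return (r2_score(y_train, model.predict(X_train)),
            r2_score(y_test, model.predict(X_test)))


# Synthetic data (made up): 4 features standing in for beds, bedrooms,
# bathrooms and accommodates, with a nearly linear "price".
rng = np.random.default_rng(0)
X = rng.uniform(1, 6, size=(200, 4))
y = X @ np.array([50.0, 80.0, 120.0, 30.0]) + rng.normal(0, 5, size=200)
train_r2, test_r2 = fit_and_score(X, y)
print("train:", train_r2, "test:", test_r2)
```

On real listings the relationship is much noisier than in this synthetic example, so far lower values are to be expected.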

After training the model, these were the results obtained for the training and test sets:

Results for the linear model with the numeric features

The results with just the numeric attributes were not very good. Thus, some categorical features were added to improve the model performance. A different algorithm (RandomForestRegressor) was also used to try to improve the model accuracy.

The categorical features were “room_type” and “property_type”. To add these categorical features to the model, dummy variables were created. Each dummy variable takes only the value 0 or 1 to indicate the absence or presence of a category that may be expected to shift the outcome.
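With pandas, dummy variables can be created with `pd.get_dummies`. The sketch below is one possible way to do it, not necessarily the one used for the article; the function name and the toy rows are made up.

```python
import pandas as pd


def add_dummies(listings: pd.DataFrame,
                categorical=("room_type", "property_type")) -> pd.DataFrame:
    """Replace each categorical column with one 0/1 column per category."""
    return pd.get_dummies(listings, columns=list(categorical), dtype=int)


# Toy data (made up) with the two categorical features and one numeric feature.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "property_type": ["Apartment", "House", "Apartment"],
    "beds": [2, 1, 3],
})
encoded = add_dummies(toy)
print(encoded.columns.tolist())
```

Each new column is named after the original column and the category, for example `room_type_Private room`. For linear models it is common to also pass `drop_first=True` to avoid redundant columns; tree-based models such as random forests do not need this.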

After training the RandomForestRegressor model, these were the results obtained for the training and test sets:

Results for the RandomForestRegressor model with numeric and categorical features

The results improved considerably with the changes. For the training data, the R-squared value was greater than 0.7, which means the model explains more than 70% of the price variation in that set.

For the test data, the results were not as good, which suggests that the model overfits the training data. A different model could be tried to improve this value, or the hyperparameters of the current models could be optimized.
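As a possible next step, the hyperparameters could be searched with cross-validation. The sketch below uses scikit-learn's `GridSearchCV` on a `RandomForestRegressor`; the grid values, the 70/30 split, the function name and the synthetic data are all assumptions for illustration, not settings or results from the article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_random_forest(X, y, n_estimators: int = 100, seed: int = 42) -> dict:
    """Grid-search two regularising hyperparameters and report train/test R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # Shallower trees and larger leaves usually reduce overfitting.
    grid = {"max_depth": [4, 8, None], "min_samples_leaf": [1, 5, 20]}
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=n_estimators, random_state=seed),
        grid,
        scoring="r2",
        cv=5,
    )
    search.fit(X_train, y_train)
    best = search.best_estimator_
    return {
        "best_params": search.best_params_,
        "train_r2": best.score(X_train, y_train),  # score() is R-squared for regressors
        "test_r2": best.score(X_test, y_test),
    }


# Synthetic data (made up), only to show that the function runs end to end.
rng = np.random.default_rng(1)
X = rng.uniform(1, 6, size=(200, 4))
y = 50 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 5, size=200)
result = tune_random_forest(X, y, n_estimators=20)
print(result)
```

A large gap between `train_r2` and `test_r2` after tuning would suggest that the available features, rather than the hyperparameters, are the limiting factor.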

Conclusion

In this article, we took a look at the Airbnb data from Rio de Janeiro to better understand the price gap between different accommodations.

  1. From the available data, it was possible to see that the average price of each neighbourhood is not directly related to its average rating. Also, there is a considerable price difference between neighbourhoods, with Joá standing out.
  2. It was also possible to see that the average price of an accommodation normally increases with the number of people it can accommodate.
  3. Finally, a machine learning model was used to try to predict the price of a house/apartment. The model had good results on the training set, but it didn’t perform very well on the test set.
