First Look at the Data pt. 3— Airbnb#005

Douglas Rocha
3 min read · Aug 30, 2022

--

Welcome to another one of my journals on my Airbnb project! Read more about it here. This is the following step to First Look at the Data pt. 2 — Airbnb#004.

Now there are only 3 files left, so let’s knock out a few of them here. As I’ve said, next in line is listings.csv, with around 50 MB of data. LibreOffice Calc could open it because it has only 24,882 rows, making it the smallest table so far by row count. On the other hand, it has by far the largest number of columns: exactly 74 (all the way from column A to column BV). That means I can’t go over every single column as I have in past journals. Instead, I’ll go over the columns I find most interesting.
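For anyone who would rather check those numbers without a spreadsheet, here is a minimal sketch using only Python's standard library. The helper names `csv_shape` and `column_letter` are made up for this example, and the in-memory sample is just a stand-in for the real file. It counts data rows and columns, then converts a column count into its spreadsheet label, which is how 74 columns ends at BV.

```python
import csv
import io


def column_letter(index):
    """Convert a 1-based column index to a spreadsheet label (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = []
    while index > 0:
        # Spreadsheet labels are base 26 without a zero digit, hence the -1.
        index, remainder = divmod(index - 1, 26)
        letters.append(chr(ord("A") + remainder))
    return "".join(reversed(letters))


def csv_shape(handle):
    """Return (data_rows, columns) for an open CSV file, not counting the header."""
    reader = csv.reader(handle)  # handles quoted fields that span several lines
    header = next(reader)
    data_rows = sum(1 for _ in reader)
    return data_rows, len(header)


# Tiny in-memory stand-in for listings.csv so the sketch runs without the file.
sample = io.StringIO('id,name,price\n1,"Flat in\nCopacabana",$100.00\n2,Studio,$80.00\n')
rows, columns = csv_shape(sample)
print(rows, columns, column_letter(columns))  # 2 3 C
print(column_letter(74))  # BV
```

With the real file, the same call would look like `csv_shape(open("listings.csv", newline="", encoding="utf-8"))`, assuming the file is UTF-8 encoded.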

Thankfully, the Data Dictionary documents these columns well, so I don’t have to wander through such a huge dataset to work out what each one represents.

The first column, of course, is the listing’s identifier, id. Next come its name and description, as well as neighborhood_overview, which describes the neighborhood around the listing. After that, we have information about the host, such as host_id, host_name, host_location, host_about, etc.

After those, we have information on the actual place being listed, such as its location (latitude and longitude), property_type, room_type, bathrooms, bedrooms, beds, and, most important of all, the price.
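Since only a handful of the 74 columns matter for now, a rough sketch of narrowing the table down could look like this. The column names come from the text above; the helpers `select_columns` and `parse_price` are hypothetical. One loud assumption: the price is treated as text such as "$1,200.00", which is not confirmed anywhere in this journal, so check the real file before relying on it.

```python
import csv
import io

# Column names mentioned in the journal text above.
INTERESTING = ["id", "name", "latitude", "longitude", "property_type",
               "room_type", "bathrooms", "bedrooms", "beds", "price"]


def select_columns(handle, wanted=INTERESTING):
    """Yield one dict per listing, keeping only the wanted columns that exist."""
    reader = csv.DictReader(handle)
    present = [c for c in wanted if c in (reader.fieldnames or [])]
    for row in reader:
        yield {c: row[c] for c in present}


def parse_price(raw):
    """Turn a price string into a float, or None when the field is blank.

    ASSUMPTION: prices are stored as text with a '$' sign and ',' as the
    thousands separator; adjust if the real column looks different.
    """
    cleaned = (raw or "").replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


# In-memory stand-in for listings.csv with made-up values.
sample = io.StringIO(
    'id,name,host_id,room_type,price\n'
    '1,Studio,9,Entire home/apt,"$1,200.00"\n'
)
for listing in select_columns(sample):
    listing["price"] = parse_price(listing["price"])
    print(listing)
```

Columns that are not in the wanted list (host_id in the sample) are simply dropped, and wanted columns missing from the file are skipped rather than raising an error.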

Finally, the table provides some interesting information on reviews.

There are, as I’ve said, a ton of columns I couldn’t go through here, but if I end up using them later I will make sure to explain what they represent.

This process took long enough that I could end the article here, but since listings_summary.csv is just a smaller version of listings.csv, I believe I can cover it briefly. It has 18 columns instead of 74 and really does give an interesting summary of the data in its bigger version.

It also has a reviews_per_month column that is neither listed nor described in the Data Dictionary. Because it is such a condensed version of the other dataset, it can easily be used to draw solid conclusions without having to deal with so much data. But, of course, as I’m mostly here to learn, I will use both.
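Spotting a column like reviews_per_month by eye is easy to get wrong, so here is a small sketch of doing the comparison in code. The helpers `read_header` and `undocumented_columns` are hypothetical, and the `documented` list below is a made-up placeholder: the real one would have to be typed in from the Data Dictionary.

```python
import csv
import io


def read_header(handle):
    """Return the column names from the first row of an open CSV file."""
    return next(csv.reader(handle))


def undocumented_columns(header, documented):
    """Return the columns in `header` that are missing from `documented`, in file order."""
    known = set(documented)
    return [name for name in header if name not in known]


# In-memory stand-in for the first line of listings_summary.csv (illustrative only).
sample = io.StringIO("id,name,price,reviews_per_month\n")
# Placeholder for the names described in the Data Dictionary (illustrative only).
documented = ["id", "name", "price"]

print(undocumented_columns(read_header(sample), documented))  # ['reviews_per_month']
```

The same two helpers can also compare the summary header against the header of listings.csv, to see which of the 74 columns made it into the smaller file.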

To finish looking at the datasets, neighborhoods.csv is a very straightforward table with literally nothing but a list of neighborhoods of Rio de Janeiro. It has 161 rows with 160 different neighborhoods, including the best-known ones like Barra da Tijuca, Copacabana, Complexo do Alemão, and Rocinha. Since the Data Dictionary doesn’t even mention this dataset, I’m not going to spend much time on it either.
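To see whether the gap between 161 rows and 160 neighborhoods is just the header row or an actual duplicate, a quick count like the sketch below would settle it. The helper `count_distinct` is hypothetical, and the column name `neighbourhood` is an assumption about the file's header; swap in whatever the real file uses.

```python
import csv
import io


def count_distinct(handle, column):
    """Return (data_rows, distinct_values) for one column of an open CSV file."""
    reader = csv.DictReader(handle)  # the header row is consumed, not counted
    values = [row[column] for row in reader]
    return len(values), len(set(values))


# In-memory stand-in for neighborhoods.csv; the column name is an assumption.
sample = io.StringIO("neighbourhood\nCopacabana\nRocinha\nBarra da Tijuca\n")
print(count_distinct(sample, "neighbourhood"))  # (3, 3)
```

If both numbers come out equal on the real file, the extra row is the header; if the first is larger, some neighborhood appears more than once.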

That ends this round of looking into the data just to find out what it contains. See you in the next journal!

Douglas Rocha

Software Engineer | Working Java, React, SQL and Python | Writing Best Coding Practices, Clean Code and Software Engineering