Real Estate Market, Web Scraping and EDA Using Python

Hurmet Noka
Published in Analytics Vidhya
9 min read · Mar 7, 2021

In this guide we will have a look at the property market of Tirana, Albania. The article demonstrates web scraping with Python and exploratory data analysis, extracting valuable insights from raw data.

Data on about 5'520 apartments (for sale and for rent) were extracted from Century21's website. These data are the property of Century21 and may only be used with authorization (which I have). The purpose of this study is purely educational, so please make ethical use of this guide and always ask for authorization when scraping data from other websites.

Memories in Tirana

"You are not a drop in the ocean. You are the entire ocean, in a drop." Rumi

Data Acquisition

Before we start the web scraping process, let's have a quick look at the website from which the apartments' data will be extracted. It can be found here. We are interested in property type "Apartment" and city "Tirana". After applying the filters, the properties are shown as below: 12 properties per page, across 460 pages (at the time of writing), and we will go through each of them with a Python script to extract the data.

Century21's Website

Below is outlined the information that needs to be extracted for each apartment (where it exists). We will demonstrate the process of extracting this information for a single property and then present the code that goes through all properties on all pages (460 × 12 = 5'520 apartments).

The desired information for each Property

Imports

# Requests is an elegant and simple HTTP library for Python
import requests
# pandas is a powerful open source data analysis tool
import pandas as pd
# Beautiful Soup is a library for scraping data out of HTML
from bs4 import BeautifulSoup
# tqdm package to have a progressive bar during data extraction
from tqdm import tqdm
# Numpy is a package for scientific computing of arrays
import numpy as np
# A module which contains statistical functions
from scipy import stats
# Opens the passed URL in a browser
import webbrowser
# String Manipulation
import re
# Plotting Packages
import matplotlib.pyplot as plt
from pyecharts.charts import Bar
from pyecharts import options as opts
import seaborn as sns

Web Scraping

First, we go to the webpage and right-click > Inspect to display the source code. Moving the cursor across the page elements, we locate the property element which contains our data; it has the class "col-lg-3 col-md-4 property".

Inspect Element to Discover Class of Property Element

Now, to get the source code of this property element, we pass the link of the webpage to requests (make sure you also pass the headers shown below). Then the content (r contains the source code of the whole page) is passed to BeautifulSoup, where we specify the element "div" and class "col-lg-3 col-md-4 property", creating properties, a list which contains the source code of each property element. We can verify it by printing the length of the list and the content of the first element.

r = requests.get(
    "https://www.century21albania.com/en/properties?display=grid&type=Apartment&city=Tirana&page=1",
    headers={"User-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"})
properties = BeautifulSoup(r.content, "html.parser").find_all(
    "div", {"class": "col-lg-3 col-md-4 property"})
print(len(properties))  # 12 properties per page
Source code corresponding to the first property on the list

Above is shown part of the source code corresponding to the first property in the list, with some elements highlighted: the property URL, price, title and address of the card. Let's extract this info:

Extracts Property URL, Price, Title and Address

We first create an empty list to store the desired information, then loop through the previously created properties list and extract the information by feeding the appropriate class names to the find method. For instance, to extract the URL we find all HTML link elements ("a") and take the second one, as it corresponds to the specific property. Similarly, we locate the price element with class="text-primary mb-2", the title element with class="card-title" and the address element with class="card-subtitle mt-1 mb-0 text-muted", then extract the text of these elements. Using try-except prevents errors in case of missing information.
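The loop described above can be sketched as follows. The HTML snippet is a hypothetical, simplified card that mirrors the class names seen in the page source; the real markup (and the tag holding the title) may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup, reduced to the elements the article names
card_html = """
<div class="col-lg-3 col-md-4 property">
  <a href="#"></a>
  <a href="https://www.century21albania.com/en/property/123">More</a>
  <p class="text-primary mb-2">105,000 &euro;</p>
  <h5 class="card-title">Apartment for sale in Don Bosko</h5>
  <p class="card-subtitle mt-1 mb-0 text-muted">Don Bosko, Tirana</p>
</div>
"""

all_properties = []
for prop in BeautifulSoup(card_html, "html.parser").find_all(
        "div", {"class": "col-lg-3 col-md-4 property"}):
    record = {}
    try:
        # The second <a> tag holds the link to the property's own page
        record["url"] = prop.find_all("a")[1]["href"]
        record["price"] = prop.find("p", {"class": "text-primary mb-2"}).text.strip()
        record["title"] = prop.find("h5", {"class": "card-title"}).text.strip()
        record["address"] = prop.find(
            "p", {"class": "card-subtitle mt-1 mb-0 text-muted"}).text.strip()
    except (AttributeError, IndexError):
        pass  # leave missing fields out instead of raising
    all_properties.append(record)
```

The try-except mirrors the article's approach: a card with a missing element simply yields a record without that key.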

Now, to get the rest of the information, we have to visit each particular property's webpage using the property URL obtained earlier. For instance, for this property we will extract the agent's information (Seller name, Email and Phone number) under the card with class="card card-list".

Seller Card

And for each property we will extract the given detailed information (Gross Area, Interior Area, Bedrooms, Baths, Type, Status, Availability, Website Views, Documentation, Living Rooms, Floor and Land Area) from this section, under class="card-body col-md-12".

Detailed Property Information
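Parsing the detail page can be sketched like this. The markup below is a made-up stand-in for the agent card and details section; the real structure of those two divs, and the agent name shown, are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page markup using the class names from the article
detail_html = """
<div class="card card-list">
  <h5>Agent Name</h5>
  <a href="mailto:agent@example.com">agent@example.com</a>
  <span>+355 69 000 0000</span>
</div>
<div class="card-body col-md-12">
  <ul>
    <li>Gross Area: 110 m2</li>
    <li>Bedrooms: 2</li>
    <li>Baths: 1</li>
  </ul>
</div>
"""
soup = BeautifulSoup(detail_html, "html.parser")

record = {}
agent_card = soup.find("div", {"class": "card card-list"})
if agent_card:
    record["seller"] = agent_card.find("h5").text.strip()

# Split each "Label: value" row of the details section into a dict entry
details = soup.find("div", {"class": "card-body col-md-12"})
if details:
    for li in details.find_all("li"):
        label, _, value = li.text.partition(":")
        record[label.strip()] = value.strip()
```

Collecting the details as label/value pairs means the scraper tolerates listings that omit some fields, which is why the resulting dictionaries are uneven.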

Below is the whole script, which moves from page 1 to 460 and, for each page, extracts the information for 12 properties. First we get the property card element and the property URL, then we move to that URL and extract the agent's information and the detailed property features. Finally we extract the price, title and address from the main page. We use the tqdm package to visualize the for loop with a progress bar; it takes a bit more than one hour to finish.

Web-Scrapping Script

We put the data into a DataFrame and it looks like this:

# Convert the list of uneven dictionaries to Pandas Dataframe
df = pd.DataFrame.from_records(all_properties)
First Five Records of DF

We rename the columns according to good practices, and now we are ready to start with the cleaning and exploratory data analysis.

DF Columns

EDA and Cleaning

Property URL

Agent

We can see that there are 210 unique agents in the company. We can visualize the top 20 agents by their number of properties; the top agent has 149 properties on the platform for sale or rent.

Top 20 Agents by number of Properties

Email

The Email field, as you can see in the DataFrame, is encrypted. This is because the elements have class="cf_email", which means the emails are Cloudflare-protected. We need to apply a decryption function as follows:

Decryption Function
Email Before and After Decryption
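Cloudflare's email obfuscation is a simple XOR scheme: the hex string's first byte is the key, and every following byte is one character of the address XOR-ed with it. A sketch of such a decryption function (the article's own implementation may differ in detail):

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a Cloudflare-obfuscated email address.

    The first hex byte is the XOR key; each subsequent hex byte is a
    character of the address XOR-ed with that key.
    """
    key = int(encoded[:2], 16)
    return "".join(
        chr(int(encoded[i:i + 2], 16) ^ key)
        for i in range(2, len(encoded), 2)
    )
```

Applied column-wise, e.g. df['email'] = df['email'].apply(decode_cfemail), this turns the hex blobs back into readable addresses.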

Phone Number

Phone Number Value Counts
# To convert the data to a uniform format we remove the '+' sign
# (regex=False treats '+' as a literal character, not a regex quantifier)
df['phone'] = df['phone'].str.replace('+', '', regex=False)

Price

We need to remove the extra symbols and convert the amount from string to integer.

Cleaning Price Column
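The cleaning step can be sketched as below. The raw format "105,000 €" is an assumption about how the scraped strings look; the exact separators and currency symbol may vary.

```python
import pandas as pd

# Assumed raw price strings; the real ones may differ slightly
df = pd.DataFrame({"price": ["105,000 €", "75,500 €", "130,000 €"]})

df["price"] = (
    df["price"]
    .str.replace(r"[^\d]", "", regex=True)  # keep digits only
    .astype(int)
)
```

Stripping everything that is not a digit in one regex pass handles thousands separators, spaces and the currency symbol at once.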

Title

Using the title description, we can split the dataset into apartments for rent and apartments for sale. For this we have two functions which check whether the title contains the word "rent" or "sale" (or synonyms of each). We add two columns encoded with 0 or 1 depending on whether the apartment belongs to the category, then drop the rows which belong to both categories or to neither (there are apartments offered both for rent and for sale). We are left with 5'185 properties.

Rent — Sale
# Drop rows where the apartment is both for sale and for rent, or has an ambiguous title:
df = df.drop(df[(df['sale'] == 1) & (df['rent']== 1)].index)
df = df.drop(df[(df['sale'] == 0) & (df['rent']== 0)].index)
df = df.reset_index(drop=True)
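The keyword functions themselves might look like the sketch below. The keyword lists (including the Albanian words "qira"/rent and "shitje"/sale) are my assumption about which synonyms the article checks.

```python
import pandas as pd

# Assumed keyword lists; the article also matches further synonyms
RENT_WORDS = ("rent", "qira")    # "qira" is Albanian for rent
SALE_WORDS = ("sale", "shitje")  # "shitje" is Albanian for sale

def contains_any(title, words):
    """Return 1 if the title mentions any of the given keywords, else 0."""
    title = title.lower()
    return int(any(w in title for w in words))

df = pd.DataFrame({"title": ["Apartment for sale in Don Bosko",
                             "Apartament me qira, Fresku",
                             "Apartment for rent or sale"]})
df["rent"] = df["title"].apply(contains_any, words=RENT_WORDS)
df["sale"] = df["title"].apply(contains_any, words=SALE_WORDS)

# Keep only rows that fall in exactly one category
df = df[df["rent"] + df["sale"] == 1].reset_index(drop=True)
```

The third toy title mentions both rent and sale, so it is dropped, matching the filtering logic shown above.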

Address

We remove the special characters and strip the trailing spaces. Then we can illustrate which zones offer more apartments for sale/rent. We can see that "Don Bosko" has the highest offer of apartments, followed by "Fresku", a zone with many new residential buildings.

Number of Apartments by Zones (Top 10)
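A sketch of the address cleanup, assuming the scraped strings look like "Zone, City" and that the zone is the first comma-separated token (both assumptions on my part):

```python
import pandas as pd

# Toy addresses standing in for the scraped column
df = pd.DataFrame({"address": ["Don Bosko, Tirana  ", "Fresku , Tirana"]})

# Drop special characters and surrounding whitespace
df["address"] = (
    df["address"]
    .str.replace(r"[^\w\s,]", "", regex=True)
    .str.strip()
)
# Assumed: the zone is the part before the first comma
df["zone"] = df["address"].str.split(",").str[0].str.strip()
```

The zone column can then be fed to value_counts() to produce the "apartments by zone" chart.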

Gross Area

We see that the mean gross area of the apartments is 105.9 square meters. Below we illustrate the apartments by gross area, split into segments from 50 to 170 square meters:

Apartments by their Gross Area
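The segmentation can be sketched with pd.cut. The toy areas and the 30 m² bin width are assumptions; the article only states that the range 50–170 m² is split into segments.

```python
import pandas as pd

areas = pd.Series([52, 78, 95, 105, 118, 132, 160])

# Assumed 30 m2 segments spanning the 50-170 m2 range from the chart
bins = range(50, 200, 30)  # edges at 50, 80, 110, 140, 170
segments = pd.cut(areas, bins=bins)
counts = segments.value_counts().sort_index()
```

sort_index() orders the bars by interval rather than by count, which is what a histogram-style chart needs.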

Interior Area

We remove the non-numeric characters and convert the strings to floats. The total interior area of all apartments combined is 457'774 square meters.

Cleaning Interior Area
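A sketch of this cleanup, assuming raw strings like "88.5 m2". Extracting the leading number is safer here than stripping non-digits, because the "2" in "m2" would otherwise be glued onto the value.

```python
import pandas as pd

# Assumed raw strings; the exact unit suffix may vary
df = pd.DataFrame({"interior_area": ["88.5 m2", "110 m2", None]})

# Pull out the leading number so the "2" of "m2" is not captured
df["interior_area"] = (
    df["interior_area"].str.extract(r"([\d.]+)")[0].astype(float)
)
total = df["interior_area"].sum()  # NaN rows are skipped by sum()
```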

Baths

As shown below, counting the apartments by their number of baths, we see that 57% (2'823 apartments) have one bathroom. There is one property listed with 133 bathrooms; after checking it, we see it was entered mistakenly by the agent. Hence we drop problematic rows such as this one.
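The drop can be sketched like this; the cutoff of 6 bathrooms is my assumption, not a threshold stated in the article.

```python
import pandas as pd

df = pd.DataFrame({"baths": [1, 2, 1, 3, 133]})

# 133 baths is clearly a data-entry mistake; the cutoff of 6 is assumed
df = df[df["baths"] <= 6].reset_index(drop=True)
```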

Status

After properly arranging the apartments by their status, we encode the two main categories, "new" and "used". The majority of apartments (66%) are new, as the construction industry in Tirana has been very active in recent years.

Apartments by Status

Availability

If we split the data by availability and group by rent or sale, we can see that 17% (889) of the apartments are sold, and of the 2'212 available ones, 1'581 (71%) are offered for sale rather than for rent.
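One way to produce such a breakdown is a cross-tabulation; the toy data and column names below are stand-ins for the real scraped columns.

```python
import pandas as pd

# Toy data standing in for the scraped availability and sale/rent columns
df = pd.DataFrame({
    "availability": ["sold", "available", "available", "available", "sold"],
    "sale":         [1,      1,           0,           1,           1],
})

# Rows: availability status; columns: sale flag (1 = for sale, 0 = for rent)
table = pd.crosstab(df["availability"], df["sale"])
```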

Rent vs Sale

Website Views

As the distribution of website views shows, the public is more interested in renting an apartment: out of 2'266'995 total views, the majority (59%) are directed towards apartments offered for rent. Below is a chart showing the top agents by website views. We can see that the agent "Oltion Lulja", even though he has only 107 apartments as opposed to "Eriola Kurti" with 147, still managed to gather more views on his properties: 61'089.

Website Views by Agents

Documentation

This column gives information on the ownership certificates of the properties. There is a considerable share of apartments (26%) with no info, most probably indicating a lack of documentation for the property.

Mortgage State

Floor

This column contains the floor number of the apartment. The number of apartments above floor one is 3'602, which indicates that the offered apartments are mostly in high-rise residential buildings, and also illustrates the city's average landscape in terms of building height.

Apartments’ Floor

Price

The mean apartment sale price is 105'792 EUR, while 50% of apartments fall between 75'000 EUR and 130'000 EUR. There are some outliers which we can drop as follows:

# Remove prices more than two standard deviations from the mean (the outliers, roughly 5% of rows). 2390 left
df_sale = df_sale[(np.abs(stats.zscore(df_sale['price'])) < 2)]
# Remove The lower Outliers, 2374 Left
df_sale = df_sale[df_sale['price']>20000]

That was a simple guide to data extraction and exploratory data analysis. If you need help, have questions or want the full code, drop me a line!

THANK YOU :)
