House rental — the Data Science way Part 1: scrape it all with Python and BeautifulSoup — UPDATED

Francesco Manghi
Published in Analytics Vidhya
Dec 27, 2019 · 7 min read


Last year I moved from my original house to a new city and changed my job. Everything moved so quickly that I had just a couple of weeks to find an accommodation before starting my new job. In that rush I didn’t have enough time to understand the real estate market in the city and ended up choosing the accommodation that best balanced distance from work and access to services.
But…well…the accommodation is quite small and I guessed I was paying too much for that house. But I was just guessing!

Then…what if I could find it out?
So…let’s go back to what I studied about machine learning.

Build your own dataset

This is a pretty common and standard example of a machine learning application: regression on house prices to estimate what a house should really cost. You can find examples of it everywhere on the internet, except they usually use datasets coming from…well…wherever. I need fresh prices from the city I live in, and I want to be able to update them over the months. There is just one way to do that: scraping!

I live in Italy, Turin to be exact. In Italy the biggest website collecting listings for renting or buying houses is www.immobiliare.it

Immobiliare.it aggregates the listings that every agency in Italy can publish to show the properties it is handling, so it’s probably the best way to get an idea of the real estate market in a particular city.
And now it’s time to get our hands dirty.

Make the soup — UPDATED

UPDATE: during the summer immobiliare.it’s developers updated the website with a new layout and slightly harder HTML to scrape. I’m gonna update the code posted here to reflect the changes.

What we are gonna do is go to the website, navigate to our city’s main page, collect a list of all the areas (districts) of the city and scrape every listing published in each of them.
The toolset:

import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm_notebook as tqdm
import csv

Now we’re ready to go!

def get_pages(main):
    try:
        soup = connect(main)
        n_pages = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'pagination pagination__number'}).find_all('li')]
        last_page = int(n_pages[-1])
        pages = [main]

        for n in range(2, last_page+1):
            page_num = "/?pag={}".format(n)
            pages.append(main + page_num)
    except:
        pages = [main]

    return pages

def connect(web_addr):
    resp = requests.get(web_addr)
    return BeautifulSoup(resp.content, "html.parser")

def get_areas(website):
    data = connect(website)
    areas = []
    for ultag in data.find_all('ul', {'class': 'breadcrumb-list breadcrumb-list_list breadcrumb-list__related'}):
        for litag in ultag.find_all('li'):
            for i in range(len(litag.text.split(','))):
                areas.append(litag.text.split(',')[i])
    areas = [x.strip() for x in areas]
    urls = []

    for area in areas:
        url = website + '/' + area.replace(' ', '-').lower()
        urls.append(url)

    return areas, urls

def get_apartment_links(website):
    data = connect(website)
    links = []
    for link in data.find_all('ul', {'class': 'annunci-list'}):
        for litag in link.find_all('li'):
            try:
                links.append(litag.a.get('href'))
            except:
                continue
    return links

def scrape_link(website):
    data = connect(website)
    info = data.find_all('dl', {'class': 'im-features__list'})
    comp_info = pd.DataFrame()
    cleaned_id_text = []
    cleaned_id__attrb_text = []
    for n in range(len(info)):
        for i in info[n].find_all('dt'):
            cleaned_id_text.append(i.text)
        for i in info[n].find_all('dd'):
            cleaned_id__attrb_text.append(i.text)
    comp_info['Id'] = cleaned_id_text
    comp_info['Attribute'] = cleaned_id__attrb_text
    feature = []
    for item in comp_info['Attribute']:
        try:
            feature.append(clear_df(item))
        except:
            feature.append(ultra_clear_df(item))
    comp_info['Attribute'] = feature
    return comp_info['Id'].values, comp_info['Attribute'].values

def remove_duplicates(x):
    return list(dict.fromkeys(x))

def clear_df(the_list):
    the_list = (the_list.split('\n')[1].split(' '))
    the_list = [value for value in the_list if value != ''][0]
    return the_list

def ultra_clear_df(the_list):
    the_list = (the_list.split('\n\n')[1].split(' '))
    the_list = [value for value in the_list if value != ''][0]
    the_list = (the_list.split('\n')[0])
    return the_list

Aaaannd here we go!
We just defined our scraping functions. The 5 main ones are:

  • connect(): connects to the website and downloads the raw HTML code from it;
  • get_areas(): scrapes the raw HTML to find the districts. Every district has a unique link that filters the listings belonging to just that district;
  • get_pages(): for every district “main page”, it checks how many pages of listings are available and creates a link for every single page;
  • get_apartment_links(): for every single page found, it looks for every listing and collects its link;
  • scrape_link(): this function does the actual scraping of a single listing.

At the end of the execution we will have the link of every single listing together with the district it belongs to.
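Before running the full loop below, here is a minimal sketch of how the helpers chain together for a single district (the district URL is just a hypothetical example):

# Minimal sketch: chain the helpers for one district (example URL, it may not exist)
district_url = "https://www.immobiliare.it/affitto-case/torino/centro"
pages = get_pages(district_url)               # every paginated results page for the district
links = []
for page in pages:
    links.extend(get_apartment_links(page))   # every listing link on every page
print(len(links), "listings found in this district")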

## Get areas inside the city (districts)

website = "https://www.immobiliare.it/affitto-case/torino"
districts, district_urls = get_areas(website)
print("These are the districts' links \n")
print(district_urls)

## Now we need to find all the listings' links, in order to scrape the information inside them one by one
address = []
location = []
for url in tqdm(district_urls):
    try:
        pages = get_pages(url)
        for page in pages:
            add = get_apartment_links(page)
            address.append(add)
            for num in range(0, len(add)):
                location.append(url.rsplit('/', 1)[-1])
    except Exception as e:
        print(e)
        continue

announces_links = [item for value in address for item in value]

## Just check it makes sense and save it
print("The number of listings:\n")
print(len(announces_links))

with open('announces_list.csv', 'w') as myfile:
    wr = csv.writer(myfile)
    wr.writerow(announces_links)

Now we have the link of every listing in that particular city, so let’s go look for the treasure.

The Chef at work!

## Now we pass all the listings' links to the scrape_link function to obtain the apartments' information
df_scrape = pd.DataFrame()
to_be_dropped = []
counter = 0
for link in tqdm(list(announces_links)):
    counter = counter + 1
    try:
        names, values = scrape_link(link)
        temp_df = pd.DataFrame(columns=names)
        temp_df.loc[len(temp_df), :] = values[0:len(names)]
        df_scrape = df_scrape.append(temp_df, sort=False)
    except Exception as e:
        print(e)
        to_be_dropped.append(counter)
        print(to_be_dropped)
        continue

## Eventually save useful information obtained during the scrape process
pd.DataFrame(location).to_csv('location.csv', sep=';')
pd.DataFrame(to_be_dropped).to_csv('to_be_dropped.csv', sep=';')

This code passes through every listing and scrapes the information out of it, collecting everything in 2 lists: names and values. The first one holds the feature names, the latter the corresponding values.
At the end of the scraping process we finally have a pandas DataFrame in which every row stores a different listing with its characteristics and the district it belongs to.
Let’s just check that the final DataFrame makes sense.

print(df_scrape.shape)
df_scrape['district'] = location
df_scrape['links'] = announces_links
df_scrape.columns = map(str.lower, df_scrape.columns)
df_scrape.to_csv('dataset.csv', sep=";")

Now we have a DataFrame that contains 24 columns (our features) with which we can train our regression algorithm.
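Before cleaning anything, it’s worth taking a quick look at what we actually scraped. A minimal sanity check (just a sketch — the exact column names depend on what immobiliare.it exposes on each listing):

print(df_scrape.columns.tolist())  # the raw Italian feature names scraped from the listings
print(df_scrape.head())            # peek at the first few rows
print(df_scrape.isnull().sum())    # how many missing values per feature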

Now, before putting everything in the pot, we must clean up and slice the ingredients…

Slicing, cutting and cleaning up

Unfortunately the dataset isn’t exactly…well…ready.
What we collected is often dirty and we can’t work with it as-is. Just to mention some examples: prices are stored as strings in the form of “600 €/month”, houses with more than 5 rooms are listed as “6+”, and so on.
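Just to make the issue concrete, here is a tiny, hypothetical taste of the kind of string-to-number conversions we will need (the actual cleanup function below handles the real column formats):

# Hypothetical examples of the raw strings and how we might normalise them
raw_price = "€ 600/mese"
price = int(raw_price.replace("€ ", "").replace("/mese", "").replace(".", ""))  # -> 600

raw_rooms = "6+"
rooms = int(raw_rooms.rstrip("+"))  # -> 6, treating "6+" simply as 6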

So here we have the tool to “Clean them all” (‘Gollum, Gollum!’)

df_scrape = df_scrape[['contratto', 'zona', 'tipologia', 'superficie', 'locali', 'piano', 'tipo proprietà', 'prezzo', 'spese condominio', 'spese aggiuntive', 'anno di costruzione', 'stato', 'riscaldamento', 'climatizzazione', 'posti auto', 'links']]

def cleanup(df):
    price = []
    rooms = []
    surface = []
    bathrooms = []
    floor = []
    contract = []
    condominio = []
    heating = []
    built_in = []
    state = []
    riscaldamento = []
    cooling = []
    energy_class = []
    tipologia = []
    pr_type = []
    arredato = []

    for tipo in df['tipologia']:
        try:
            tipologia.append(tipo)
        except:
            tipologia.append(None)

    for superficie in df['superficie']:
        try:
            if "m" in superficie:
                s = superficie.replace(" m²", "")
                surface.append(s)
            else:
                surface.append(None)
        except:
            surface.append(None)

    for locali in df['locali']:
        try:
            rooms.append(locali[0:1])
        except:
            rooms.append(None)

    for prezzo in df['prezzo']:
        try:
            price.append(prezzo.replace("Affitto ", "").replace("€ ", "").replace("/mese", "").replace(".", ""))
        except:
            price.append(None)

    for contratto in df['contratto']:
        try:
            contract.append(contratto.replace("\n ", ""))
        except:
            contract.append(None)

    for piano in df['piano']:
        try:
            floor.append(piano.split(' ')[0])
        except:
            floor.append(None)

    for tipo_prop in df['tipo proprietà']:
        try:
            pr_type.append(tipo_prop.split(',')[0])
        except:
            pr_type.append(None)

    for condo in df['spese condominio']:
        try:
            if "mese" in condo:
                condominio.append(condo.replace("€ ", "").replace("/mese", ""))
            else:
                condominio.append(None)
        except:
            condominio.append(None)

    for ii in df['spese aggiuntive']:
        try:
            if "anno" in ii:
                mese = int(int(ii.replace("€ ", "").replace("/anno", "").replace(".", "")) / 12)
                heating.append(mese)
            else:
                heating.append(None)
        except:
            heating.append(None)

    for anno_costruzione in df['anno di costruzione']:
        try:
            built_in.append(anno_costruzione)
        except:
            built_in.append(None)

    for stato in df['stato']:
        try:
            stat = stato.replace(" ", "").lower()
            state.append(stat)
        except:
            state.append(None)

    for tipo_riscaldamento in df['riscaldamento']:
        try:
            if 'Centralizzato' in tipo_riscaldamento:
                riscaldamento.append('centralizzato')
            elif 'Autonomo' in tipo_riscaldamento:
                riscaldamento.append('autonomo')
            else:
                riscaldamento.append(None)
        except:
            riscaldamento.append(None)

    for clima in df['climatizzazione']:
        try:
            cooling.append(clima.lower().split(',')[0])
        except:
            cooling.append('None')

    final_df = pd.DataFrame(columns=['contract', 'district', 'renting_type', 'surface', 'locals', 'floor', 'property_type', 'price', 'spese condominio', 'other_expences', 'building_year', 'status', 'heating', 'air_conditioning', 'energy_certificate', 'parking_slots'])
    final_df['contract'] = contract
    final_df['renting_type'] = tipologia
    final_df['surface'] = surface
    final_df['locals'] = rooms
    final_df['floor'] = floor
    final_df['property_type'] = pr_type
    final_df['price'] = price
    final_df['spese condominio'] = condominio
    final_df['heating_expences'] = heating
    final_df['building_year'] = built_in
    final_df['status'] = state
    final_df['heating_system'] = riscaldamento
    final_df['air_conditioning'] = cooling
    # final_df['classe energetica'] = energy_class
    final_df['district'] = df['zona'].values
    # final_df['Arredato S/N'] = arredato
    final_df['announce_link'] = df['links'].values

    return final_df

final = cleanup(df_scrape)
final.to_csv('regression_dataset.csv', sep=";")

This function deals with dirty data in various ways, depending on the type of dirtiness. Most of the fields are cleaned with plain string tools (replace, split and the like); take a look at the script to see how it works.
PS: a few errors may still remain in the dataset. Treat them as you prefer.
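For instance, here is a quick optional pass I would run on the saved dataset before modelling — just a sketch, assuming the column names produced by cleanup() above:

final = pd.read_csv('regression_dataset.csv', sep=";", index_col=0)

# Coerce numeric-looking columns; anything unparsable becomes NaN instead of crashing
for col in ['price', 'surface', 'locals', 'spese condominio', 'heating_expences', 'building_year']:
    final[col] = pd.to_numeric(final[col], errors='coerce')

# Drop rows without a price: they are useless for a regression on rents
final = final.dropna(subset=['price'])
print(final.shape)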

Prepare the pot!

Here we are! Now we have a handmade dataset we can work on with ML and regression. In the next article I’m gonna explain how I dealt with all the ingredients.

Stay tuned!

Link to GitHub https://github.com/wonka929/house_scraping_and_regression

This article is the first part of a tutorial. You can find the second article at this link:
https://medium.com/@wonka929/house-rental-the-data-science-way-part-2-train-a-regression-model-tpot-and-auto-ml-9cdb5cb4b1b4

UPDATE: while dealing with immobiliare.it’s new website, I decided to also update the regression methodology.
This is the new, updated article you can find online:
https://wonka929.medium.com/house-rental-the-data-science-way-part-2-1-train-and-regression-model-using-pycaret-72d054e22a78
