Smart data for helping with investment choices: Part I: Collect data — scraping

ZAIDI Houda
Oct 20, 2017


For some time I have been looking for a personal project that would let me improve my expertise along the data value chain. I wanted to build a project covering different skills and tools from an end-to-end big data processing chain.

I’ve seen all those people who want to buy or sell cars or properties, and I thought I could do something fun and useful about it. My idea was to collect data from property and car sales listings. With this data I can provide many analyses: for example, an estimate of the price of your car or your home, the price per square meter by region, or some recommendations for investors.

I will divide this post into a series of articles, with the following plan (subject to change along the way):

  1. Part I: Collect data — scraping
  2. Part II: Store data — Cassandra
  3. Part III: Data cleaning & exploration
  4. Part IV: A web app that uses a simple Python model to estimate a property's price depending on its location or condition
  5. Part V: Improving the model with additional features and a better algorithm
  6. Part VI: Setting up a chatbot to talk with this app in natural language

Part I: Collect data — scraping

Web scraping (also termed screen scraping, web data extraction, web harvesting, etc.) is a technique for extracting large amounts of data from websites and saving the extracted data to a local file or to a database.

Scraping is a very useful technique for parsing and retrieving data hosted on a website. A quick look around the internet shows that Python is a very powerful tool for scraping, and that two Python packages let you do it very efficiently:

  1. BeautifulSoup
  2. Scrapy

BeautifulSoup vs Scrapy

I will talk about the features of Scrapy and BeautifulSoup, compare them, and decide which one is better for my project.

  1. BeautifulSoup

BeautifulSoup is a tool that helps programmers quickly extract valid data from web pages. Its API is very friendly to newcomers, and it handles malformed markup very well. The BeautifulSoup documentation is comprehensive; you can find many examples there and quickly learn how to use it. BeautifulSoup works fine on both Python 2 and Python 3, so compatibility will not be a problem.

We will use lxml, an extensive library for parsing XML and HTML documents very quickly; it can even handle messed-up tags. However, in most cases BeautifulSoup alone cannot get the job done: you need another package such as requests to download the web page, and then you can use BeautifulSoup to parse the HTML source code. In this article I will scrape the “leboncoin.fr” website to collect sale ads for properties and vehicles.

The first step in scraping a web page is to obtain the HTML content of the website. What actually happens when we open a page in a browser? The steps are:
USER → opens google.com in the browser → the browser sends a request to the Google server → the server receives the request and processes it → the server returns a response to the browser → the browser displays the response the server sent → the Google homepage opens.
In short, the flow is USER → Request → Server → Response → USER.

requests is a library that lets us query a website's server and return the response to the user. To install it, use pip install requests. The most commonly used request methods are GET and POST.

To download the page and display its HTML content, we can use something like this (a minimal sketch with requests):
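
import requests

url = 'https://www.leboncoin.fr/ventes_immobilieres/offres/'
response = requests.get(url)       # GET request to the listings page
print(response.status_code)        # 200 means the request succeeded
print(response.text)               # the HTML content of the page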

The status code of the response will fall into one of these classes:

  • 1xx: informational
  • 2xx: success
  • 3xx: redirection
  • 4xx: client error
  • 5xx: server error

For more details, see this article.

With requests we can also choose which browser we appear to be using, for example Google Chrome or Firefox. For that we can set a user agent in the request headers. If we do not set this header, the server knows that a robot or a program is sending the request; most servers will still return a response, but some sites have a very strict policy and refuse requests coming from programs. To get around that, we can fake a user agent to make the server believe it is talking to a real browser and not just a piece of code. So let's do that:

To install the fake_useragent Python package, simply use pip:

$ pip install fake-useragent
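
A minimal sketch of faking the user agent with fake_useragent and requests:

from fake_useragent import UserAgent
import requests

ua = UserAgent()
headers = {'User-Agent': ua.chrome}   # pretend the request comes from Chrome
url = 'https://www.leboncoin.fr/ventes_immobilieres/offres/'
response = requests.get(url, headers=headers)
print(response.status_code)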

As you can see above, we have downloaded an HTML document. BeautifulSoup is a very powerful Python package for parsing this document and extracting the text from the HTML tags. You can think of an HTML page as having a tree structure. We first have to import the library and create an instance of the BeautifulSoup class to parse our document. Before that, we need to install the bs4 and lxml Python packages, again with pip:

$ pip install bs4
$ pip install lxml
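
A minimal sketch, reusing the response downloaded above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')   # parse the downloaded HTML with the lxml parser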

So BeautifulSoup makes it simple to parse and access data on a website through operations such as the following (see the sketch after this list):

  • format the HTML data with the prettify method
  • select the elements at the top level of the page
  • get the p tags by finding the children of the body tag
  • extract every matching tag with the find_all method
  • extract the text or the href attribute from a tag
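
Continuing with the soup object created above, these operations look like this:

print(soup.prettify())                    # the HTML, nicely indented
top_level = list(soup.children)           # elements at the top level of the page
body = soup.find('body')
paragraphs = body.find_all('p')           # every <p> tag inside <body>
if paragraphs:
    print(paragraphs[0].get_text())       # text content of the first <p>
for link in soup.find_all('a'):
    print(link.get('href'))               # href attribute of each link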

To explore the different features, here is a tutorial on web scraping with BeautifulSoup and requests. Although this tool is very rich, it remains limited for complex processing and exploration of HTML data. In the next part of this section we will discuss Scrapy, which is based on XPath addresses and makes parsing more fun.

  2. Scrapy

Scrapy is a Python package for large-scale web data extraction. It provides a set of powerful tools to extract, process and structure data from websites.

Scrapy can be installed simply using pip with the following command:

$ pip install scrapy

Now, you need to create a Scrapy project. In your terminal, navigate to the folder in which you want to save your project and then run the following command (project_scrapy is the name we want to give our project):

$ scrapy startproject project_scrapy

In your terminal, navigate to the folder of the Scrapy project we created in the previous step. As we called it project_scrapy, the folder has the same name and the command is simply:

$ cd project_scrapy

After that, create the spider using the genspider command and give it any name you like; here we will call it boncoinimmo. The name should be followed by the URL you want to scrape.

$ scrapy genspider boncoinimmo https://www.leboncoin.fr/ventes_immobilieres/offres/

Now, take a look inside the Scrapy project. It is structured roughly as follows (the exact files depend on your Scrapy version):
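
project_scrapy/
    scrapy.cfg              # deploy configuration file
    project_scrapy/         # the project's Python module
        __init__.py
        items.py            # item definitions
        middlewares.py      # project middlewares
        pipelines.py        # project pipelines
        settings.py         # project settings
        spiders/            # the folder where the spiders live
            __init__.py
            boncoinimmo.py  # the spider generated in the previous step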

Let's check the parts of the main class in the file automatically generated for our boncoinimmo Scrapy spider (a sketch of the generated file follows the list):

  1. name: the name of the spider.
  2. allowed_domains: the list of domains that the spider is allowed to scrape.
  3. start_urls: the list of one or more URLs with which the spider starts crawling.
  4. parse: the main function of the spider. Do NOT change its name; however, you may add extra functions if needed.
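
For reference, the generated boncoinimmo.py looks roughly like this (you may need to adjust allowed_domains and start_urls by hand):

import scrapy

class BoncoinimmoSpider(scrapy.Spider):
    name = 'boncoinimmo'
    allowed_domains = ['www.leboncoin.fr']
    start_urls = ['https://www.leboncoin.fr/ventes_immobilieres/offres/']

    def parse(self, response):
        pass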

To parse our site, what remains is to develop the parse() function, replacing the pass statement. Now we need to focus on how to navigate the website. To find an HTML element, use the Chrome developer tools: right-click on the page and select “Inspect”. This opens a panel on the right side of the Chrome browser. Then click on the inspect icon (highlighted in blue).

Next, use the inspector cursor to click on the section of the website that you want to target. When you click, the HTML that creates that section is highlighted on the right. In the screenshot below, I have clicked on the bar listing all ads by region across France. Then right-click on the HTML element and select “Copy” -> “Copy XPath”. If I want to extract the information of the first ad, I look for the tag containing that information in the HTML elements (at the bottom of the previous page). When you select an element, its color changes to blue (as in the next screenshot). The next step is to copy its XPath, as shown in the next screenshot. That XPath is the most important piece of information!

To retrieve this information we just use this command:

ad = response.xpath(search_xpath).extract()

Then we can print this information or process it further. search_xpath is the XPath copied from Chrome; response represents the page retrieved by the spider. Actually, response is more than just the HTML source: it is a full response object, and response.body gives the whole source code. In any case, when you use XPath expressions to extract HTML nodes, you should use response.xpath() directly. You can use the following command to run your spider and store the scraped data in a CSV file:

$ scrapy crawl boncoinimmo -o result_one.csv

As we agreed, you first need to scrape all the ad wrappers from the page. So inside the parse() function, write something like the following. Note that Scrapy selectors are a powerful API for quickly selecting nested data:
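
A minimal sketch; the XPath below is only a placeholder, replace it with the one you copied from the browser inspector:

def parse(self, response):
    # select every ad wrapper on the listing page (placeholder XPath)
    ads = response.xpath('//section[@itemprop="itemListElement"]')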

Note that here you do not call extract(), because this is the wrapper from which you will extract other HTML nodes. You can now extract the ad information from the wrappers using a for loop; for example, the ad address and URL can be extracted as follows:
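
Still inside parse(), and still with placeholder XPaths for illustration:

    for ad in ads:
        # relative XPaths evaluated inside each wrapper (placeholders)
        address = ad.xpath('.//p[contains(@class, "placement")]/text()').extract_first()
        url = ad.xpath('./a/@href').extract_first()
        # follow the ad link and let parse_item extract the details
        yield scrapy.Request(response.urljoin(url), callback=self.parse_item,
                             meta={'address': address})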

The parse_item function will return the details of each ad from its page. We also notice that we need to browse all the listing pages; for that, we recursively call the same parse function on the next page, like this:
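
A sketch of the pagination, again with a placeholder selector for the “next page” link:

    # still inside parse(): follow the link to the next listing page (placeholder XPath)
    next_page = response.xpath('//a[@id="next"]/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)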

If you are using a regular browser to navigate a website, your web browser will send what is known as a “User Agent” for every page you access. So it is recommended to use the Scrapy USER_AGENT option while web scraping to make it look more natural. The option is already in Scrapy settings.py so you can enable it by deleting the # sign. You can find lists of user agents online; select a recent one for Chrome or Firefox.
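
For example, in settings.py (the user agent string below is only an illustration, pick a recent one yourself):

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'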

The option DOWNLOAD_DELAY is also already there in Scrapy settings.py, so you can enable it by deleting the # sign. According to the Scrapy documentation, “this can be used to throttle the crawling speed to avoid hitting servers too hard.” You can see that the suggested value is 3 seconds, but you can make it smaller or larger. Note that another option, RANDOMIZE_DOWNLOAD_DELAY, is enabled by default: it multiplies the delay by a random factor between 0.5 and 1.5.
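
In settings.py this looks like:

DOWNLOAD_DELAY = 3
# RANDOMIZE_DOWNLOAD_DELAY is True by default, so the actual delay varies
# between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY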

The complete code is available on my GitHub here.

If everything works, you now have a lot of data to handle. We will first store this data in the Cassandra NoSQL database. The tutorial will be available in my next post, Part II!

Thank you very much for reading, and do not hesitate to comment with anything that could improve this article.
