How to scrape a website
If you came to this page because you want to extract content from a webpage for any good reason (and a legal one: be very careful with this, as several lawsuits are currently before the courts), you have landed in the right place: here is our Web Scraper tutorial! At Geoblink, culture is one of our key values… Victor Hugo, Pedro Almodóvar, Marco Asensio… Our Geoblinkers like to scrape the list of upcoming cultural events so we don’t miss any of them, even during the World Cup. Here, we offer you the opportunity to be a Geoblink monkey and scrape some cultural data.
Ok, imagine that you want to know which cultural events are going to happen in Madrid during the next month. You can find all the cultural and entertainment events, sorted by district, on the webpage https://www.madridcultura.es/. At Geoblink, we like to use the incredible WebScraper to make our lists of cultural events.
Basically, WebScraper is a Google Chrome Extension. You can install it here: https://chrome.google.com/webstore/detail/web-scraper
Ok, let’s dive deep into the Madrid cultural experience:
1) Open WebScraper
Open the Developer tools of the Google Chrome browser from the Customize menu, or just press Ctrl+Shift+I. Then click on the Web Scraper tab at the top of the Developer tools panel.
2) Create a Sitemap
Click on the option Create new sitemap, fill in a Sitemap name and then enter a Start URL… and let’s scrape!
3) Add new selector
WebScraper works as a tree graph: you create nodes that contain more nodes with information. There are different types of selectors, each with its own characteristics, and you have to pick a specific type depending on the object that you want to capture… which may be a bit tricky at times. The most common types of selectors are Text, Link, and Element (those are the ones we are gonna use in this tutorial), but sometimes it’s necessary to use Element Click or Element Attribute.
We’re going to start this selector graph with a Link selector that matches each district of Madrid. Click on the button ‘Add new selector’ and complete the following options:
- Id: the name of the node (for this example, ‘district_links’).
- Type: the kind of node (choose ‘Link’).
- Selector: click on the Select button and choose the link that you want to follow. If there is more than one, you can select them all by clicking on several links. When you are done, click on Done selecting! to save the selection 🙂
- Multiple: if there is more than one link, you have to check the Multiple option.
To make sure you’re on track, glance at the Data preview and the Element preview. If you can’t see what you expect, well, first, it sucks, and then you should scroll back up and fix the options above 🙂
To save everything, click on the Save selector button.
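For readers who like to see the result as data, here is a minimal Python sketch of what the sitemap might look like at this point. It assumes the JSON shape you get when exporting a sitemap from the extension (field names such as `_id`, `startUrl` and `parentSelectors` should be checked against your own export). The sitemap name `madrid_culture` and the CSS selector `a.district` are made up for illustration; the real selector is whatever the Select button picks on the page.

```python
import json

# Hypothetical CSS selector: in practice the Select button generates it.
DISTRICT_LINK_CSS = "a.district"


def link_selector(selector_id, css, parents=("_root",), multiple=True):
    """Build one Link selector entry in the assumed exported-sitemap shape."""
    return {
        "id": selector_id,
        "type": "SelectorLink",
        "parentSelectors": list(parents),
        "selector": css,
        "multiple": multiple,  # the Multiple option from the form
        "delay": 0,
    }


sitemap = {
    "_id": "madrid_culture",  # made-up sitemap name from step 2
    "startUrl": ["https://www.madridcultura.es/"],
    "selectors": [link_selector("district_links", DISTRICT_LINK_CSS)],
}

# Pretty-print the sitemap as JSON text.
print(json.dumps(sitemap, indent=2))
```

The only point here is that the form fields (Id, Type, Selector, Multiple) map one to one onto a small record attached to the root of the tree.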
4) Create a selector element
Click on the saved selector. You will see that you can create more nodes inside it. If you open one of the district links, you can see that there is more than one activity on the main page. For this example, we need an Element selector that gathers all the information of the different events.
Creating it works the same way as in the previous step, but now you have to choose Element as the type. To select all the elements of the page, click on the Select button, mark all the event boxes and then check the Multiple option.
5) Define the information of the event
Again, if you click inside the previous selector, you can create more selectors. At this point, you want to define the different pieces of information about the event, like name, date and place. To do that, create one selector for each piece of data, choosing Text as the selector type. You can also create a Link selector to obtain more information about the event (such as the address and the event web page).
If you have followed the previous steps, you have built a tree graph for the main page of every district.
6) Add pagination
If you look at the top of the page, you can see that there is more than one page per district. That’s the reason why pagination is needed.
To set up pagination properly, go back up the tree to the level where you created the Element selector, create a Link selector there, and select the links to the different pages.
After that, you need to edit the event Element selector and modify its Parent Selectors so that both the previous node and the pagination selector that you have just created are selected.
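To picture the finished tree, here is a small Python sketch that lists each selector with its parents and prints the resulting graph. The ids `pagination`, `event`, `name`, `date` and `place` are names made up for this illustration (only `district_links` comes from the steps above), and the structure is our reading of the steps, not an export from the extension.

```python
from collections import defaultdict

# (id, type, parent selectors). Every id except district_links is made up.
SELECTORS = [
    ("district_links", "Link", ["_root"]),
    ("pagination", "Link", ["district_links"]),
    ("event", "Element", ["district_links", "pagination"]),
    ("name", "Text", ["event"]),
    ("date", "Text", ["event"]),
    ("place", "Text", ["event"]),
]


def build_children(selectors):
    """Map every parent id to the ids of the selectors declared under it."""
    children = defaultdict(list)
    for selector_id, _type, parents in selectors:
        for parent in parents:
            children[parent].append(selector_id)
    return children


def render(children, node="_root", depth=0):
    """Return the tree as indented lines, starting from the given node."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(render(children, child, depth + 1))
    return lines


children = build_children(SELECTORS)
print("\n".join(render(children)))
```

Because `event` has two parents, it shows up twice in the printed tree: once directly under the district page and once under the pagination links. That is exactly why the events of every page get scraped, not only those of the first one.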
7) Let’s Scrape
You have finished building the Sitemap, so let’s scrape and see what happens. Click on the Scrape button and choose a Request Interval and a Page load delay. The default values work, but sometimes it’s necessary to modify them. Then click on Start Scraping and a pop-up window should appear.
We recommend clicking on the Browser button to see how it works and to check that everything pans out well.
8) Export the results
The pop-up window will close at the end of the scrape. When that happens, you can download the data. Go to Export data as CSV, and click on Download now! A CSV will be downloaded with all the data of the cultural events in Madrid. After that, we recommend cleaning the data a little bit, because some columns with links are added in the process. You can also download the Sitemap that you have just created.
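The cleaning step can be done by hand in a spreadsheet, but here is a minimal Python sketch of one way to do it. It assumes that the export adds bookkeeping columns starting with `web-scraper-` and, for each Link selector, an extra `<id>-href` column; check the header of your own file before relying on that. The sample row below is made up and stands in for the downloaded CSV.

```python
import csv
import io


def is_helper_column(name):
    """Assumed convention: bookkeeping columns and '<id>-href' link columns."""
    return name.startswith("web-scraper-") or name.endswith("-href")


def clean_rows(reader, drop=()):
    """Return the kept column names and the rows reduced to those columns."""
    keep = [
        column
        for column in reader.fieldnames
        if not is_helper_column(column) and column not in drop
    ]
    return keep, [{column: row[column] for column in keep} for row in reader]


# Made-up sample standing in for the downloaded file.
SAMPLE = (
    "web-scraper-order,web-scraper-start-url,district_links,"
    "district_links-href,name,date,place\n"
    "1-1,https://www.madridcultura.es/,Centro,"
    "https://example.com/centro,Concert,2018-07-01,Plaza Mayor\n"
)

columns, rows = clean_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(columns)
print(rows[0])
```

To run it on the real export, replace `io.StringIO(SAMPLE)` with the file opened via `open(path, newline="", encoding="utf-8")`. The `drop` argument lets you remove any other column by name.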
In the end, you’ll get a nice CSV with almost 600 different events for the next month in Madrid!
Alejandro Cantera — Data Acquisition
Nicolas Planchon — Data Acquisition