How to scrape the web with Python

Alejandro Tagalos
Published in Geoblink Tech blog · May 5, 2022

Python is one of the most widely used languages out there, and you can find an excellent library for almost every task you want to achieve.

For those who do not know what scraping is, it is the process of collecting data from a web page. Imagine you have a list of restaurants in Madrid with their name, address and average rating, and you need all this data in a CSV on your computer for future reference, or maybe you need that info to feed your prediction algorithm. To get the data, you have two approaches:

  1. You can manually copy and paste all the data into your CSV.
  2. You can use automated software that reads the page's code and extracts the data you are interested in.

If you have 40 restaurants, it is feasible to do the task manually. However, what happens when you have 4,000? In that case, you need to automate the task in some way, and that is the purpose of a web scraper: to extract the data automatically, much faster than you could by hand.

There are plenty of scraping libraries for Python, but for the purpose of this tutorial we will use BeautifulSoup4. We chose this library because it is intuitive and easy to use.

Once the technology has been decided, we have to choose the data to scrape, so in this tutorial we will scrape the Geoblink team! You can find all the information to extract at this URL:

https://www.geoblink.com/es/sobre-geoblink/

Setting up the environment

First things first, we need to install the BeautifulSoup4 dependency. BeautifulSoup is a scraping library that allows us to extract information from webpages in a fast and easy way. You can search, navigate and modify the HTML tree using different parsers. If you do not know what a parser is, keep reading because we will explain it later.

Here at Geoblink we use Poetry to install dependencies. As you can imagine, Poetry is a Python packaging and dependency management tool. With a single command, it installs all the software your application needs to run. It also takes care of installing the exact dependency versions you pin in a config file, so that upgrades do not break your application, and it ensures compatibility between all your third-party dependencies. You can learn more about Poetry in this link.
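
For reference, adding this tutorial's dependencies with Poetry would look something like this (assuming you already have a Poetry project set up):

poetry add beautifulsoup4 requests lxml pandas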

However, for simplicity and for the sake of this tutorial, we will use pip.

pip install beautifulsoup4

We are also going to use another Python library called “requests”. Requests is an HTTP library that allows us to make HTTP requests. It gained popularity due to its simplicity, and it is the go-to library when we need to make HTTP requests. Specifically, we need it to make a GET request to the URL where the data is. Since it is a third-party library, we need to install it with pip.

pip install requests

Additionally, we need to install a parser. An HTML parser is a piece of software that can analyze, process and modify HTML code, and we will use it to extract the data we need from the HTML. In this tutorial we will use the lxml parser, so we need to install it in the usual way:

pip install lxml
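
For context, the parser is the second argument you pass to the BeautifulSoup constructor; lxml is a fast third-party option, while a built-in alternative ships with Python. A small illustrative comparison:

from bs4 import BeautifulSoup

html = "<p>hola</p>"  # any HTML string
soup_lxml = BeautifulSoup(html, "lxml")            # fast, needs the lxml package installed
soup_builtin = BeautifulSoup(html, "html.parser")  # slower, but comes with the standard library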

Lastly, to export the data to a CSV file we will use pandas, so we need to install the pandas library as well:

pip install pandas

Now we have all the required packages installed on our workstation; the next step is to import them in our script:

import requests
from bs4 import BeautifulSoup
import pandas as pd

Scraping the data

Once we have all the libraries installed and imported, we need to make a GET request to the URL where the data is and pass the contents to a BeautifulSoup constructor that will return a BeautifulSoup object.

r = requests.get('https://www.geoblink.com/es/sobre-geoblink/')
soup = BeautifulSoup(r.text, 'lxml')
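
As a small safety net, you can also check that the request succeeded before parsing; for example:

r.raise_for_status()  # raises an exception if the server returned an HTTP error (4xx/5xx)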

Once we have the source code of the webpage, we need to parse it and extract the information we are interested in. For that, we need some information about the webpage. Specifically, we need to get the selectors that identify the content to extract. For those who do not know what a selector is, it is an identifier of an HTML element. In this case we will identify the element with a CSS class, but there are other kinds of selectors that are out of the scope of this tutorial.
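
To make this concrete, here is a tiny, self-contained example of selecting elements by CSS class (the HTML below is made up for illustration; it is not the real Geoblink markup):

from bs4 import BeautifulSoup

html = """
<div class="restaurant"><span class="name">Casa Lucio</span></div>
<div class="restaurant"><span class="name">Botín</span></div>
"""
soup = BeautifulSoup(html, "lxml")
for div in soup.select(".restaurant"):           # the leading dot means "match by CSS class"
    print(div.find("span", class_="name").text)  # prints each restaurant name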

To identify the selector, we open the developer tools in Google Chrome and inspect the portion of the webpage where the data is. At a glance, we conclude that the information is stored in HTML div blocks, all having the class “col__3”.

Upon further inspection, we see that, inside each div, the information we need is inside another div block, and the selectors are as follows:

  • Employee name has the selector “.title-cn”
  • Employee position has the selector “.position-cn”
  • Employee description has the selector “.descripction-cn”

BeautifulSoup is a really easy library and provides a .select() method that returns an iterable of matching elements. Thanks to that, we can loop over all the employees on the webpage with this simple line of code:

for div in soup.select(".col__3 .Academy-item"):

We added the additional “.Academy-item” selector because the “.col__3” selector by itself was returning other containers from the webpage, so we needed to be more specific. Diving into the HTML, we realized that each employee's data sits inside a div with class “Academy-item”, and each of these divs is, in turn, nested inside a div with class “col__3”. Adding the more specific nested selector resolved the problem, and the loop now iterates over the correct data.

Next, in every iteration we extract the name, position and description of the employee using BeautifulSoup's .find() method. We cannot read the contents straight from .select() because it returns a ResultSet (a list of matches) that has no “text” attribute; .find() returns a single element, so we can get the text inside it easily.

Each employee is going to be saved as one row of data. We build a list of lists, where each row corresponds to an employee and each column to one of the data points. The empty list is created before the loop, and inside the loop we append one row per employee.

data = []
for div in soup.select(".col__3 .Academy-item"):
    data.append([
        div.find("strong", class_="title-cn").text,
        div.find("small", class_="position-cn").text,
        div.find("div", class_="descripction-cn").text,
    ])

Once we have built the list of lists, we create a pandas DataFrame from the data variable.

df = pd.DataFrame(data, columns=['Nombre', 'Posición', 'Descripción'])

Once we have all the data in a DataFrame, the next and last step is to export it to a CSV file. Thanks to pandas, this can be done with a single line of code.

df.to_csv('geoblink_employees.csv', encoding='utf-8', index=False)

We added the encoding='utf-8' parameter so that Spanish letters and accents are written correctly.

The complete code is:

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.geoblink.com/es/sobre-geoblink/')
soup = BeautifulSoup(r.text, 'lxml')

data = []
for div in soup.select(".col__3 .Academy-item"):
    data.append([
        div.find("strong", class_="title-cn").text,
        div.find("small", class_="position-cn").text,
        div.find("div", class_="descripction-cn").text,
    ])

df = pd.DataFrame(data, columns=['Nombre', 'Posición', 'Descripción'])
df.to_csv('geoblink_employees.csv', encoding='utf-8', index=False)

The result of our scraping is a CSV file with one row per employee, containing their name, position and description.

How we do it at Geoblink

This tutorial shows how you would do basic scraping on your own. However, at Geoblink we build proprietary tools to scrape data at scale.

That approach is fine when you need to do a one-time task. But when the data extraction is just one more step in your data processing pipeline, the need to build custom software arises.

After the scraping, we perform many checks on the data to ensure the quality of the extracted information, and after that we format the data so it can be consumed by the next piece of software in the pipeline. All of this has to be done in an integrated way to avoid errors. For that, our development team has built a custom scraping solution that fits all our needs and is able to incorporate new changes at the speed our clients demand.
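
To give a flavour of what such checks can look like, here is a minimal sketch of validating the DataFrame produced above before passing it on (it is illustrative only, not our actual pipeline code):

# minimal, illustrative quality checks on the scraped DataFrame
df["Nombre"] = df["Nombre"].str.strip()        # remove stray whitespace from names
df = df.dropna(subset=["Nombre", "Posición"])  # drop rows missing key fields
assert not df["Nombre"].duplicated().any(), "duplicate employees found"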

But even though we develop complex software to suit our needs, the basics explained in this article remain the same.

Conclusion

BeautifulSoup is one of the most popular Python scraping libraries, and we have shown how fast and easy it is to scrape data from any webpage on the Internet with it. With just a few lines of code, we got an entire dataset with the name, position and relevant information of every person.

In this case, since we only scraped 44 people, you could argue that the manual approach is faster, but the value of scraping shows when we need to extract huge datasets, where the manual approach is simply not feasible.

Note that scraping itself is not an illegal technique. Bear in mind that the data we are collecting is publicly accessible, and you are not engaging in any illegal practice when you collect public data.

The potential of this methodology is immense and we are just a few lines away from scraping huge datasets to feed our models and make better predictions.
