How to scrape data from publicly available sources?
A simple step-by-step guide to web scraping.
The main motivation for writing this article is to walk through web scraping 101 in detail. To make it more interesting, let's go through the process by scraping an actual website.
When you want to scrape data from a website, first decide what you want to analyze and what type of data and data elements you want to get from it.
For example, I want to extract the types of food listed on Deliveroo. Deliveroo is a food-delivery company headquartered in London, similar to other popular apps like Uber Eats and Zomato.
Now, there are two ways to scrape data:
- Scrape from the page alone (using Beautiful Soup)
- Scrape and interact with the webpage automatically (using Selenium)
These are the two popular Python packages used to scrape data, and the best way to learn about them is from their documentation. I think reading a package's documentation is, in general, the best available resource.
People do find reading documentation a bit hard initially. I forced myself to read from the documentation because it is trustworthy and has all the information you need. You can also follow articles and videos, as I find they are helpful for understanding how a package is used.
I would like to share a pro tip I got from my mentor:
Try to read and learn from package documentation and technical papers. Reading them is another skill you are developing altogether.
If you are looking to learn how to read documentation or technical/academic papers, you can refer to the attached articles.
Back to our topic! As mentioned, this article will mostly focus on using Beautiful Soup; after exploring Selenium, I hope to explain it in detail in another article.
What is Beautiful Soup?
In simple words, Beautiful Soup is a Python package used to pull data out of HTML and XML files. It's used for navigating, searching, and extracting data from web pages.
Do you need to know HTML before scraping?
You don't need to know HTML in depth at all, but it helps to know the basic syntax of an HTML page, as it makes it easier to navigate to the data you want.
Some of the basic topics you need to learn:
- Headings (h1 to h6)
- Classes and IDs (what are they and what are they used for?)
- Tables and lists (how are they represented in HTML?)
You can learn more about these in detail here.
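For instance, here is a tiny, made-up HTML snippet (not from any real site) that shows each of these in one place:
<h1>Menu</h1>                              <!-- a heading tag (h1 to h6) -->
<div id="menu-section" class="tile">       <!-- an id is unique per page; a class can repeat -->
  <ul>                                     <!-- an unordered list; tables use the <table> tag -->
    <li class="menu-item">Pizza</li>
    <li class="menu-item">Sushi</li>
  </ul>
</div>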
Difference between the find() and find_all() functions?
These are the two main functions that are vital for extracting data from any website.
find()
- It returns the first element on the page that matches the search, i.e. in a list of items, the first element that satisfies the condition is returned.
- Only the first match is given as output.
- Syntax: find(name, attrs, recursive, string, **kwargs)
find_all()
- It returns all elements that satisfy the condition, as a list of matches.
- All matches are given as output.
- Syntax: find_all(name, attrs, recursive, string, limit, **kwargs)
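To make the difference concrete, here is a minimal sketch on a small, made-up HTML snippet (not from Deliveroo):
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = "<ul><li class='item'>Pizza</li><li class='item'>Sushi</li></ul>"
snippet = BeautifulSoup(html, 'html.parser')

print(snippet.find('li', class_='item').text)                      # first match only: Pizza
print([li.text for li in snippet.find_all('li', class_='item')])   # all matches: ['Pizza', 'Sushi']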
So, initially we will be using two packages, and first we need to install them if you don't have them on your local system.
!pip install BeautifulSoup4
!pip install requests
After installing, we need to import the required libraries.
import requests
import bs4 as bs
import pandas as pd
Before going further, it’s better to understand why we need these libraries.
- requests : It sends an HTTP request to the website (server) and brings the response back to your local system (client), i.e. the system from which you made the request.
- BeautifulSoup (bs4) : As explained above, it parses the HTML received by requests. It lets you navigate a webpage in much the same way you would with your browser's developer tools.
- pandas (pd) : After extracting the required data, we can organise it into a nice table for later analysis.
So, after importing the required libraries, you need to assign the website URL (the one you want to scrape data from) to a variable; you can also pass it directly to the requests function.
I would suggest going to the website, right-clicking on the webpage, and clicking 'Inspect' to explore the HTML code of the page.
url = 'https://deliveroo.co.uk/'
results = requests.get(url)
In layman's terms, you are asking the server (via the link you provide) for the contents of the whole static webpage.
If you want, you can explore the data you received from the website.
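For example, a quick sanity check (a minimal sketch; the exact output depends on your run):
print(results.status_code)   # 200 means the request succeeded
print(results.text[:500])    # the first 500 characters of the raw HTML
Note that some sites block the default requests user agent; if the request fails, you can try sending a browser-like header, e.g. requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).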
You can see from the above output that we get the whole page, with its HTML tags and various links.
Now, to access the data we obtained, we use BeautifulSoup to parse the HTML so that we can navigate it and extract what we need.
soup = bs.BeautifulSoup(results.content, "html.parser")
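A quick way to confirm the parse worked (a small sketch; the output depends on the live page):
print(soup.title.string)        # the text of the page's <title> tag
print(len(soup.find_all('a')))  # how many links the page contains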
Here comes the pivotal step. Initially, I found it hard to work out which HTML tag I needed in order to extract the data.
My goal is to extract all the different menus available on the website and form a DataFrame from them later. As simple as it sounds.
- First, inspect the page and find the right HTML tag under which the desired list sits.
- I found that all the list items sit under a common class on a 'span' tag.
menu = soup.find_all('span', class_='HomepageFeaturedCollectionTile-dca78627621916a1')
The above lists out all the tags with the class name mentioned.
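To see what find_all() actually returned (the exact tags depend on the live page):
print(type(menu))   # bs4.element.ResultSet, which behaves like a list of tags
print(len(menu))    # how many matching tags were found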
The next step is to extract only the text from the above output, without the HTML tags.
list_of_menu = [types.text for types in menu]
print(list_of_menu)
So, I get a list of the menu types available on Deliveroo.
Now, if you want it in DataFrame format:
df = pd.DataFrame(list_of_menu)
df.columns = ['Menu_items']
print(df)
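If you want to keep the result for later analysis, you can also save it to a CSV file (the filename here is just an example):
df.to_csv('deliveroo_menu.csv', index=False)   # index=False drops the pandas row index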
That's it! This particular list may not be very useful in itself, but the steps remain the same for any type of data you want to extract from static webpages.
Now, if you want to access the page while entering input and clicking through cookie banners automatically, we can use the Selenium package. (That's a topic for another article.)
I hope you found this hands-on tutorial useful, and do feel free to comment if anything was unclear.