How to Web Scrape with Requests, Selenium, and BeautifulSoup in Python
This article is mainly for beginners at web scraping, and the example below should help you think through how to scrape something specific off a website. The best way to learn how to grab specific HTML tags is to pick a website you visit frequently and try to automate something with the text you can pull from it. The usual advice applies: read the official docs when trying to pinpoint an HTML tag, then google carefully.
I typically keep test files for specific HTML tag targets that are hard to grab, plus a main file with a rough draft of the entire scraper. Once the rough draft is done and the scrapes are outputting in the format I want, I start to abstract the code. Once it works, refactor further if you have time.
sudo pip install beautifulsoup4

BeautifulSoup is great at parsing HTML data; its Python methods are very intuitive when navigating an HTML tree, and very good for pinpointing specific HTML tags when you want their attributes. It also makes parent tags iterable, so you can think about them in loops. A soup object essentially turns each HTML tag into an object, so targeting a specific piece of text or an attribute becomes a lot easier.
Let’s work through an example: we’ll use toscrape.com as practice. Say we want to scrape all of the non-link text under “Endpoints” in the “Quotes” section. I’ll write these steps as if it were live, since this is a new website and data target for me to scrape.
Step 1, Set up your workspace: You should have a browser on the side with inspect mode open, your scraping code in one panel, and a place to run that code. I find this the most efficient way to test a target quickly.
Step 2, Get the web page data: Try Requests first; if I can’t get a 200 response with the data I want, then I go to Selenium. The main thing is getting that response data stored in a variable so we can manipulate it and output the specific text we want.
Also make sure it is the actual HTML you want:
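A minimal sketch of both steps, assuming toscrape.com is reachable; the 200 status code confirms the request succeeded, and printing the start of the body confirms we got the page we expect:

```python
import requests

# Fetch the practice site used throughout this article
r = requests.get("http://toscrape.com/")

# A 200 status code means the request succeeded
print(r.status_code)

# Peek at the start of the body to confirm it's the HTML we want
print(r.text[:200])
```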
You can also write it to a file in case you don’t want to send a new request every time you test your output:
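For example, saving the response body once and reading it back on later runs (the filename here is just an arbitrary choice):

```python
import requests

r = requests.get("http://toscrape.com/")

# Save the HTML once so repeated test runs don't re-request the page
with open("toscrape.html", "w", encoding="utf-8") as f:
    f.write(r.text)

# Later runs can read and parse the saved copy instead
with open("toscrape.html", encoding="utf-8") as f:
    html = f.read()
```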
Step 3, Decide what we want to output: We said earlier that we just want the list under “Endpoints”, and its text only (non-links). So let’s inspect the page for a way to get that. We can see there are two tables, but we only want one. We can grab all the tables with alltables = soup.find_all("table"). Next, we can see that the title of each table is fairly uniform in the markup, so we can target it by moving to the next-next element’s text with .next_element.next_element.text. We can put this in a loop so we check every table on the page, with a conditional that prints something if that next-next element’s text equals "Endpoints":
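A sketch of that loop; the two next_element hops from the table tag to the title text depend on toscrape.com’s current markup, so treat that part as an assumption:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("http://toscrape.com/")
soup = BeautifulSoup(r.text, "html.parser")

# Grab every table on the page (the tag name is "table")
alltables = soup.find_all("table")

for table in alltables:
    # Two elements after the <table> tag sits the header row's text
    title = table.next_element.next_element.text
    if title == "Endpoints":
        print("Found it:", title)
```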
Let’s run it:
Great, it found it. Now we can loop through each tag in this table to see how to extract only the text. Inspecting in the browser, we can see that each tr has 2 td. So we can use the find_next() method twice to reach the 2nd td of each tr. Let’s try that:
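A sketch of that step, collecting the second td’s text from each row of the Endpoints table (same layout assumptions as above):

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("http://toscrape.com/")
soup = BeautifulSoup(r.text, "html.parser")

texts = []
for table in soup.find_all("table"):
    if table.next_element.next_element.text == "Endpoints":
        for tr in table.find_all("tr"):
            # Calling find_next("td") twice lands on the row's second <td>
            texts.append(tr.find_next("td").find_next("td").text)

print(texts)
```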
Perfect, we now have the text for the most part. Let’s take care of the AttributeError quickly, store all the text in a list, and remove the first item from the list, since it comes from the title of the table:
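Putting it all together, a sketch that wraps the lookup in a try/except for any row that raises an AttributeError, then drops the first item, which comes from the title row:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("http://toscrape.com/")
soup = BeautifulSoup(r.text, "html.parser")

endpoint_texts = []
for table in soup.find_all("table"):
    if table.next_element.next_element.text == "Endpoints":
        for tr in table.find_all("tr"):
            try:
                endpoint_texts.append(tr.find_next("td").find_next("td").text)
            except AttributeError:
                # Rows that don't yield two <td> tags are skipped
                continue

# The first item comes from the table's title row, so drop it
endpoint_texts = endpoint_texts[1:]
print(endpoint_texts)
```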
Now we have a list of the data we want, and it will be easy to format it into anything you need. For example, you could write these items between certain lines of a README generator, or insert each item of the list after every number in a string you have. This is generally how I go about scraping data: start by grabbing the entire page, then slowly cut away the parts you don’t want.
I will discuss Selenium in a different article later, since it is a completely different topic. Requests and beautifulsoup4 are generally good enough for basic scraping.