Web scraping 101
Part one: Basic explanation, tools required, HTML structure, Python scraping libraries
One of the basic skills of a data scientist is the ability to identify and find data. In this post I’m going to provide a very simple way to get your web scraping under way. This is the foundation of creating your own data sets using the internet and very few lines of code.
Things you will need:
- Python 3 (older versions of macOS shipped with Python preinstalled, but you will want a current Python 3): Installing Anaconda is ideal, but if you just want the basic tools we’ll use, run $ pip install requests beautifulsoup4 pandas numpy to get the requests, bs4 (Beautiful Soup), pandas, and numpy libraries.
- A Jupyter notebook or other Python console is also helpful.
- Google Chrome (technically you can do it with any browser, but the way Chrome displays the underlying HTML code is great).
Before we continue, I will briefly explain a few concepts you need to be vaguely familiar with:
HTML code: Basically, the language and structure that websites use to tell your browser what to display and how to display it. HTML structure is what we will concern ourselves with, since parts of it are what we will be targeting in our scrape.
URL structure convention: The “web address” we use to navigate to our favorite website changes as we navigate to different sections of the website. This change has a structure and follows a pattern. Observing the pattern is what will allow your scraper to look through different sections of a website or different pages of a search. In fact, I’m going to take this opportunity to tell you to look for patterns everywhere in this process. Spotting those patterns will make your design more accurate with potentially less computational cost.
We begin our process by investigating the URL structure convention of our target website. For the purposes of this tutorial, we will use the job searching site Indeed.com. We will parse through the search results and capture the desired information.
I did a simple search for a job title (data scientist) and entered a city (Miami). I then compared the home page URL with the search result URL, https://www.indeed.com/jobs?q=data+scientist&l=Miami, and noted the change.
Right away you can see that the URL structure took my job input, replaced the space with a plus (‘+’) sign, and added the city name at the end.
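The same substitution can be reproduced in Python. Here is a minimal sketch, assuming the search URL follows the /jobs?q=...&l=... pattern from my search. The helper name build_search_url is made up for illustration; urlencode comes from Python’s standard library and handles the space-to-plus replacement.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"  # search address observed in the browser


def build_search_url(job, city):
    """Build a search URL (hypothetical helper, not part of any library)."""
    # urlencode escapes each value and turns spaces into '+' signs
    return BASE_URL + "?" + urlencode({"q": job, "l": city})


print(build_search_url("data scientist", "Miami"))
# https://www.indeed.com/jobs?q=data+scientist&l=Miami
```

Building the URL this way means you can swap in another job title or city without editing the string by hand.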
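Clicking through to the second and third pages of search results, I noticed the following URLs:
https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=10
https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=20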
What does this tell us?
Well, it tells us that the URL “turns pages” by adding “&start=” and an increasing multiple of 10.
Note: Some of you might’ve gotten what looked like random characters at the end of the URL. These are usually due to plug-ins and session tracking. These can be ignored without harming search results.
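If you would rather strip those extra characters than ignore them, the sketch below shows one way to do it with the standard library. It assumes the only parameters that matter are q, l, and start. The function name clean_search_url and the tracking parameter vjk=abc123 are invented for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

KEEP = ("q", "l", "start")  # the parameters this tutorial relies on


def clean_search_url(url):
    """Drop every query parameter except the ones listed in KEEP."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in KEEP]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# "vjk=abc123" is a made-up stand-in for the extra characters
messy = "https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=10&vjk=abc123"
print(clean_search_url(messy))
# https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=10
```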
Now that we have this information, we are ready to start designing the scraper!
We start in our Python console of choice. Jupyter Notebook provides input cells that display their output immediately below, which makes it very useful for explaining the steps in a process. The imports are as follows:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
Requests handles the communication between your Python console and the server hosting the target website, and BeautifulSoup parses the HTML that comes back. Pandas and NumPy will be used to store and organize the data and perform basic operations.
There are some elements we will be using often, so let’s go ahead and set them up as variables:
# Search results for "data scientist" in Miami
URL = "https://www.indeed.com/jobs?q=data+scientist&l=Miami"
# Download the page, then parse the returned HTML
html = requests.get(URL)
soup = BeautifulSoup(html.text, "html.parser")
With those lines I’m using the ‘get’ function of the ‘requests’ library to download the page at our target URL. Then I’m passing the text of the response to ‘BeautifulSoup’, which parses the HTML into an object we can search. This structure is safe to copy/paste for your own use. You’ll just need to change the web address you assign to the variable URL.
Believe it or not, we are more than halfway there. If you want to check your progress, go ahead and type the command:
print(soup.prettify())
What you see is a text rendering of the HTML code in the target website. Start taking note of the words within angle brackets “< >”.
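To get a feel for that output without contacting a live site, you can run prettify() on a tiny page of your own. The HTML below is invented for this illustration and is far simpler than a real search results page.

```python
from bs4 import BeautifulSoup

# A made-up, minimal page used only to show what prettify() does
tiny_html = '<html><head><title>Jobs</title></head><body><p class="summary">Hello</p></body></html>'

tiny_soup = BeautifulSoup(tiny_html, "html.parser")

# prettify() returns the HTML as a string, one tag per line, indented by nesting depth
pretty = tiny_soup.prettify()
print(pretty)
```

Each word inside angle brackets (html, head, title, body, p) is a tag, and the indentation shows which tags sit inside which.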
Now, how do we make heads or tails of what we’re looking at, and how do we target the information we are after? Enter Chrome’s Inspect tool. For this example, we will design the scraper to grab the job title and the summary blurb, but you will see it is easy to have it grab more with a few extra lines of code.
Using Chrome, open the website you want to scrape, in this case, the search results for data scientist in Miami. Look for the feature you want to capture (job title or summary blurb) and right-click on it. Select the option “Inspect”. At this point a panel opens in your Chrome window with the HTML code as it appeared when we printed soup.prettify(). Moreover, the element you right-clicked on will be highlighted, and the HTML in the new panel will be centered around that element. Play around with it to get the hang of it. Try selecting other parts of the website and generally note how the website is structured.
Going back to our example, you will notice that when the job title is highlighted, the panel highlights an entire paragraph’s worth of code. Find the part of the code that contains the actual text as it is displayed on the page, and note the code immediately preceding the text of the job title. That code includes a “class” attribute on the HTML tag. Classes are one of the main ways HTML elements are labeled, which makes them convenient targets for a scraper.
Conveniently, the class in this case is called “jobtitle”. I did the same for the summary blurb and found it was called, wait for it… “summary”.
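To show how class names become targets, here is a small sketch that runs on invented HTML instead of the live site. The tags and text in sample_html are assumptions that only imitate the “jobtitle” and “summary” classes I found; the real page is much larger and may be laid out differently.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the two classes found with the Inspect tool
sample_html = """
<div class="result">
  <a class="jobtitle">Data Scientist</a>
  <span class="summary">Build models for the marketing team.</span>
</div>
<div class="result">
  <a class="jobtitle">Junior Data Analyst</a>
  <span class="summary">Prepare weekly reports.</span>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with the trailing underscore) filters on the HTML class attribute
titles = [tag.get_text(strip=True) for tag in sample_soup.find_all(class_="jobtitle")]
summaries = [tag.get_text(strip=True) for tag in sample_soup.find_all(class_="summary")]

print(titles)
print(summaries)
```

The underscore in class_ is needed because class is a reserved word in Python.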
OK, we have gathered all the information we need and are ready to build the scraper that will grab the job title and summary. Going back to our Python console, we set up a couple more parameters as variables, mainly the maximum number of results and the Pandas data frame that will house our data. You could do this more efficiently by housing the data as a list of lists, but most won’t notice a difference in speed.
max_results = 1000  # upper bound on the "start" offset
jobs = pd.DataFrame()  # empty data frame that will hold the scraped data
If you remember from earlier, indeed.com “turns pages” by adding a multiple of 10. Setting max_results to 1000 makes sure that we don’t go beyond the 100th page.
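The arithmetic behind that limit can be checked with a quick sketch. The name results_per_page is introduced here for clarity and assumes each page of results advances the offset by 10, as observed in the URLs.

```python
max_results = 1000
results_per_page = 10  # assumption: each "page turn" advances the offset by 10

# Every offset that will be appended after "&start="
offsets = list(range(0, max_results, results_per_page))

print(len(offsets))   # 100
print(offsets[:3])    # [0, 10, 20]
print(offsets[-1])    # 990
```

Because range() excludes its upper bound, the last offset requested is 990, which corresponds to the 100th page.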
Like most online job posting sites, Indeed lists more than one job per page and spreads the results over many pages. A scraper that requests only the first URL would stop after the first page of hits! We need to account for that and tell it to keep looking, so that it captures all the available related information. To accomplish that, we code a simple for loop.
for start in range(0, max_results, 10):
    # Request each page of results by its offset, then parse it
    html = requests.get(URL + "&start=" + str(start))
    soup = BeautifulSoup(html.text, "html.parser")
As you can see, this is essentially the same code I used above, but I altered the html variable. I did this in order to fit the URL structure that indeed.com accepts: the web address for the basic job and city search, plus the components that allow the scraper to “turn pages”. In this case, that means the string “&start=” followed by a multiple of 10 (supplied by the for loop that encapsulates the html and soup variables), converted to a string.
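In simpler terms, this means that requests.get() is being pointed first at:
https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=0
And so on:
https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=10
https://www.indeed.com/jobs?q=data+scientist&l=Miami&start=20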
So what do we have so far? We have a few lines of code that look at up to 100 different pages of HTML code from a search result on indeed.com. Good, we’re almost done.
To improve readability, I’m going to split this tutorial into two parts. To continue with our scraper, click here.