End-to-end guide to building an automatic web scraper to track Google keyword ranking results (using Python + Selenium + Beautiful Soup)
The aim of this project is to learn web scraping techniques by building a real-life application: a Google keyword ranking tracker. This article will guide you through building the tool end-to-end, with explanations along the way. It introduces some of the techniques needed for web scraping and the use of powerful packages like Selenium and Beautiful Soup. After the scraper is built, it also demonstrates how it can be automated. Hope you enjoy it :)
- Introduction of the project
- Preparation
- Website inspection
- Coding (+full script)
- Setting up scheduled tasks to run the scraper
1. Introduction of the project
In this project, we have a few tasks to do:
— Firstly, we have to prepare a list of target keywords that we are interested in tracking. It can come from Excel, Google Sheets, or a database.
— Secondly, we have to build the scraper with Python. The coding part of a scraper is usually simpler than expected (it depends heavily on the complexity of the website); it can be just a few lines of code to iterate over the website pages and extract their sources. Most of the time goes into inspecting the structure of the website. In this project, we will scrape the search result page of https://www.google.com/. The page will be iterated by filling in different keywords from the keyword list, and we can also decide the number of search results to collect for each keyword.
— Thirdly, the scraped source has to be cleaned by extracting the useful elements. Beautiful Soup is very helpful here for pulling out information such as the search result domain and description, and for distinguishing whether a result is SEM (paid) or organic. The extracted elements are loaded into a dataframe and then stored for further analysis.
— Finally, to make it a real-life application, the scraper should be automated to run regularly. The article will demonstrate how the script is scheduled locally in a Windows environment.
2. Preparation
- A list of target keywords
The list can be created in Excel, Google Sheets, a database, or anywhere you like. In this example, we will prepare the target keyword list in Excel. Assume we are running a pet supplies store and we are interested in tracking the keywords of our products, e.g.:
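If you prefer to generate the file programmatically, a minimal sketch like the one below would do it. The column name 'keyword' and the sample keywords are just placeholders for illustration; use whatever list fits your store.

import pandas as pd

# Hypothetical example: build keywords.xlsx with a single 'keyword' column
# (writing .xlsx requires the openpyxl package)
keywords = ['dog food', 'cat litter', 'pet supplies', 'aquarium filter']
pd.DataFrame({'keyword': keywords}).to_excel('keywords.xlsx', index=False)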
- Python libraries to install
Here are the packages we need to install for this project. The first one is Selenium, which helps us automate browsers. You may also need to download a web driver to work with it (Chrome/Firefox/Edge). Alternatively, you may install the undetected-chromedriver package, an optimized Selenium Chromedriver patch that does not trigger anti-bot services (for details, see https://pypi.org/project/undetected-chromedriver/). The other package to install is Beautiful Soup, a powerful tool for navigating the source data (i.e. HTML) scraped from the website.
pip install selenium
pip install undetected-chromedriver # <- optional
pip install beautifulsoup4
3. Website inspection
After the libraries are installed, we move to the core part of web scraping: website inspection. In this part, we examine how the website works, including how our target elements are located, the structure of the web pages, the URL, and the HTTP (GET/POST) requests. This part determines the whole design of the scraper.
- URL
Back to our example, a pet supplies store. First of all, we have to pass a URL to the web driver to request the web page, so the first thing to understand is how the page changes with the URL we input. Many websites pass the query keyword and result criteria in the URL (Google Search included). In this example, we can pass the keywords (e.g. pet supplies) and the number of results (e.g. 50) in the following format to request the search result page we want:
https://www.google.com/search?num={no. of result returned}&q={keywords}
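In Python, the URL can be assembled like this. This is only a small sketch (the keyword and result count are example values); quote_plus URL-encodes the keyword so spaces and special characters are handled safely.

from urllib.parse import quote_plus

# Sketch: build the search URL for one keyword
n_result = 50
keyword = 'pet supplies'
url = 'https://www.google.com/search?num={}&q={}'.format(n_result, quote_plus(keyword))
print(url)  # https://www.google.com/search?num=50&q=pet+supplies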
- Elements
Next, we have to locate the target elements to retrieve from the web page. With the search result page open in the browser, press "F12" to launch DevTools. On the 'Elements' tab, use the pointer to select an element on the page and inspect it. In this example, by inspecting the elements that contain the search results, we find that the SEM results (paid search ads) are contained in <div class="v5yQqb">, while the organic (SEO) results are contained in <div class="yuRUbf">.
Within the SEM result element <div class="v5yQqb">, the domain name is held by the <span> whose class starts with x2VHCd. Similarly, within the organic result element <div class="yuRUbf">, the domain name is held by the <cite> whose class starts with iUh30.
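To make these selectors concrete before writing the scraper, here is a tiny self-contained sketch. The HTML below is made up to mimic the structure we just inspected; it only shows how the class-substring selectors we will use later behave.

from bs4 import BeautifulSoup

# Made-up HTML mimicking the inspected structure (not real Google markup)
sample_html = '''
<div class="v5yQqb extra"><span class="x2VHCd abc">ads.example.com</span></div>
<div class="yuRUbf"><cite class="iUh30 xyz">www.example.com</cite></div>
'''
soup = BeautifulSoup(sample_html, 'html.parser')
for block in soup.select('div[class*="v5yQqb"], div[class*="yuRUbf"]'):
    print(block['class'][0], '->', block.get_text(strip=True))
# v5yQqb -> ads.example.com
# yuRUbf -> www.example.com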
4. Coding
Finally, we can start coding the scraper. First of all, let’s import all the packages we need.
import undetected_chromedriver as uc
from selenium import webdriver
from bs4 import BeautifulSoup
from datetime import date
import time
import pandas as pd
Then initialize the variables and read the target keyword file.
# Define working directory
wd = r'C:\TommyLo\google-rank-tracker'

# Read keyword file
df_keywords = pd.read_excel(wd + r'\keywords.xlsx')

# No. of top N results to extract
n_result = 50

# Initializing
se_results = []
n_keyword = 0
Start the web driver. You may see a browser window pop up when the following script is run (or you can pass the '--headless' argument to ChromeOptions to hide the browser).
#Start webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--test-type')
# options.add_argument('--headless') #hidden browser
driver = uc.Chrome(options=options)
Now loop over the keywords: pass each keyword into the web driver to load the page, get the page source from the driver, and parse it with BeautifulSoup.
for keyword in df_keywords['keyword']:  # assumes the Excel file has a 'keyword' column
    n_rank = 0  # reset rank
    n_keyword += 1
    url = 'https://www.google.com.hk/search?num={}&q={}'.format(n_result, keyword)
    print('#{} -- {} -- {} ...'.format(n_keyword, keyword, url))
    driver.get(url)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
From the element inspection, we know that <div class="v5yQqb"> and <div class="yuRUbf"> are our target elements. Still inside the keyword loop, let's iterate over them and extract them from the source with the help of BeautifulSoup. We can also track the ranking position of each search result by counting the number of elements iterated in the loop. After extracting an element, we put the data into a temporary dictionary; later on we will transform the collected dictionaries into a pandas dataframe for output and analysis.
    # Select both SEM and organic result blocks
    results_selector = soup.select('div[class*="v5yQqb"], div[class*="yuRUbf"]')

    # Loop over the results
    for result_selector in results_selector:
        # Case when SEM / Organic result
        if result_selector['class'][0].startswith('v5yQqb'):
            domain_name_class = 'x2VHCd'
            result_type = 'SEM'
            domain_name = result_selector.select('span[class*="{}"]'.format(domain_name_class))[0].get_text()
        else:
            domain_name_class = 'iUh30'
            result_type = 'Organic'
            domain_name = result_selector.select('cite[class*="{}"]'.format(domain_name_class))[0].get_text()
        link = result_selector.select('a')[0]['href']
        n_rank += 1
        temp_dict = {
            'query_date': date.today().strftime('%Y%m%d'),
            'keyword': keyword,
            'rank': n_rank,
            'result_type': result_type,
            'domain_name': domain_name,
            'link': link
        }
        se_results.append(temp_dict)
At last, we close the driver and export the results to a CSV file for storage.
driver.close()
driver.quit()

df_se_results = pd.DataFrame(se_results)

# Export to csv
df_se_results.to_csv(wd + r'\results\se_result_{}.csv'.format(date.today().strftime('%Y%m%d')),
                     encoding='utf-8', index=False)
Here’s the full script:
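(One possible consolidated version is shown below, assembled from the snippets above; the paths, Google domain, and the 'keyword' column name follow the assumptions made earlier.)

import undetected_chromedriver as uc
from selenium import webdriver
from bs4 import BeautifulSoup
from datetime import date
import pandas as pd

# Working directory and keyword file
wd = r'C:\TommyLo\google-rank-tracker'
df_keywords = pd.read_excel(wd + r'\keywords.xlsx')  # assumes a 'keyword' column

n_result = 50   # top N results to extract per keyword
se_results = []
n_keyword = 0

# Start the web driver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--test-type')
# options.add_argument('--headless')  # hidden browser
driver = uc.Chrome(options=options)

for keyword in df_keywords['keyword']:
    n_rank = 0
    n_keyword += 1
    url = 'https://www.google.com.hk/search?num={}&q={}'.format(n_result, keyword)
    print('#{} -- {} -- {} ...'.format(n_keyword, keyword, url))
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')

    # SEM results sit in div.v5yQqb, organic results in div.yuRUbf
    for result_selector in soup.select('div[class*="v5yQqb"], div[class*="yuRUbf"]'):
        if result_selector['class'][0].startswith('v5yQqb'):
            result_type = 'SEM'
            domain_name = result_selector.select('span[class*="x2VHCd"]')[0].get_text()
        else:
            result_type = 'Organic'
            domain_name = result_selector.select('cite[class*="iUh30"]')[0].get_text()
        n_rank += 1
        se_results.append({
            'query_date': date.today().strftime('%Y%m%d'),
            'keyword': keyword,
            'rank': n_rank,
            'result_type': result_type,
            'domain_name': domain_name,
            'link': result_selector.select('a')[0]['href'],
        })

driver.close()
driver.quit()

df_se_results = pd.DataFrame(se_results)
df_se_results.to_csv(wd + r'\results\se_result_{}.csv'.format(date.today().strftime('%Y%m%d')),
                     encoding='utf-8', index=False)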
5. Setting up scheduled tasks to run the scraper
If the scraper runs properly in the console, that's great! Finally, we would like the scraper to run on a schedule. In this example, we will do this locally in a Windows environment with Task Scheduler.
Task Scheduler cannot run a Python (.py) file directly. However, it can run a batch (.bat) file which calls python.exe to run the Python file. To prepare a .bat file, open Notepad and save the file with a .bat extension. In the file, since I am using an Anaconda environment, I first call the Anaconda activate.bat so that the script runs under the same environment (with path C:\ProgramData\Anaconda3\Scripts\activate.bat). Then call python.exe under the Anaconda folder, followed by the path of the .py file to run.
i.e.
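A sketch of such a batch file might look like the following. The Anaconda paths match the ones mentioned above, and the scraper script name is a hypothetical example; adjust both to your own machine.

@echo off
rem Activate the Anaconda environment, then run the scraper
rem (paths and the .py file name below are examples)
call C:\ProgramData\Anaconda3\Scripts\activate.bat
C:\ProgramData\Anaconda3\python.exe C:\TommyLo\google-rank-tracker\google_rank_tracker.py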
Next, we will set up the task in Task Scheduler. Launch it and then click 'Create Task…'.
On the Triggers tab, we can set it to run at 9am daily.
Then, on the Actions tab, create a new action and point it to the batch file we created in the previous step.
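If you prefer the command line, the same daily task can usually be created with Windows' built-in schtasks utility instead of the GUI (the task name and batch file path below are examples):

schtasks /create /tn "GoogleRankTracker" /tr "C:\TommyLo\google-rank-tracker\run_scraper.bat" /sc DAILY /st 09:00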
Great! The task has now been set. Task Scheduler will run the scraper every morning. If you want to test the task, you can click 'Run' in the Task Scheduler interface.
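Once the scheduled runs have accumulated a few daily CSVs, a short analysis sketch like the one below can turn them into a rank-over-time view. The working directory matches the one used earlier, while the domain name is a hypothetical placeholder.

import glob
import pandas as pd

# Combine the daily CSV exports
wd = r'C:\TommyLo\google-rank-tracker'
files = glob.glob(wd + r'\results\se_result_*.csv')
df_all = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Track how our own domain's best rank moves per keyword and day
my_domain = 'www.example-petstore.com'  # hypothetical domain
trend = (df_all[df_all['domain_name'].str.contains(my_domain, regex=False, na=False)]
         .groupby(['keyword', 'query_date'])['rank'].min()
         .unstack('query_date'))
print(trend)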
Hope you enjoyed the article! If you found it helpful, please click follow; I will be publishing more articles related to data analytics. Thank you for reading :)