Unlocking the Literary Treasures: Harnessing Beautiful Soup and Selenium to Scrape the Vancouver Public Library Website

Rochita Sundar · Published in Data And Beyond · 6 min read · Jun 12, 2023

Web scraping is the process of extracting data from websites using automated tools such as crawlers or bots. It is useful for automating tasks such as filling out forms, performing market analysis, and aggregating content. The typical process involves sending a request to a website, parsing the HTML it returns, and extracting data using XPath or CSS selectors.
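To make that flow concrete, here is a minimal sketch using Python's requests library and Beautiful Soup; the URL and the selector are purely illustrative, and a static page is assumed (dynamic pages are where Selenium, below, comes in).

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")      # 1) send a request to a website
soup = BeautifulSoup(response.text, "html.parser")  # 2) parse the returned HTML

# 3) extract data, here with a CSS selector that matches every link on the page
for link in soup.select("a[href]"):
    print(link.get_text(strip=True), link["href"])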

Driven by a passion for data analysis, reading, and exploring local resources, I utilised browser-automation software to scrape the Vancouver Public Library website. My goal was to study the international language collections across different library locations and uncover meaningful patterns. In this article, I’ll walk you through the project’s journey, from data collection and preprocessing to generating insightful visualisations.

Photo Source: https://www.vpl.ca/.

Selenium WebDriver

A popular browser-automation tool, Selenium WebDriver can load web pages and interact with dynamic elements such as drop-down menus.

Beautiful Soup Library

Beautiful Soup is a Python library used to parse HTML documents and extract information using tags, attributes, and more.
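As a small illustration (the HTML snippet here is made up), Beautiful Soup can locate elements by tag name and attributes such as id:

from bs4 import BeautifulSoup

html = """
<select id="branches">
  <option>Central Library</option>
  <option>Carnegie Branch</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every element matching the given tag name and attributes
for tag in soup.find_all("select", id="branches"):
    print(tag.get_text(separator=", ", strip=True))  # Central Library, Carnegie Branch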

Here is example code that interacts with the “Choose your branch” drop-down menu on the Vancouver Public Library’s website.

Photo Source: https://www.vpl.ca/.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

# Install the necessary driver for the browser you intend to automate (e.g., ChromeDriver for Google Chrome).
s = Service('/<your-path-to-downloaded-driver>/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.vpl.ca/borrowing/world-languages")

# Return the HTML from the webpage & make a Beautiful Soup object to parse the contents
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

# The code below scrapes the list of all VPL locations from the website,
# cleans & stores the result in a list called locations
locations = []
for tag in soup.find_all('select', id='edit-field-location-term-tid'):
    locations = tag.get_text(strip=True).replace("Choose your branch", "").\
        replace("Library", "Library\t").replace("Branch", "Branch\t").strip().split("\t")

print(locations)
['Britannia Branch', 'Carnegie Branch', 'Central Library', 'Champlain Heights Branch', 'Collingwood Branch', 'Dunbar Branch', 'Firehall Branch', 'Fraserview Branch', 'Hastings Branch', 'Joe Fortes Branch', 'Kensington Branch', 'Kerrisdale Branch', 'Kitsilano Branch', 'Marpole Branch', 'Mount Pleasant Branch', 'nə́c̓aʔmat ct Strathcona Branch', 'Oakridge Branch', 'Renfrew Branch', 'South Hill Branch', 'Terry Salman Branch', 'West Point Grey Branch']

The code then scrapes, cleans, and prints the international language collections carried by each location individually.

# The code below interacts with the location drop-down menu on the VPL webpage and, for each location,
# scrapes, cleans & stores the international language collections carried by that location

import time

from selenium.webdriver.common.by import By

data = []
dist_language_collections = []

for branch_index in range(len(locations)):
    driver.get("https://www.vpl.ca/borrowing/world-languages")

    # branch_index selects different library locations from the drop-down menu (located via XPath)
    select = Select(driver.find_element(By.XPATH, '//*[@id="edit-field-location-term-tid"]'))
    select.select_by_index(branch_index + 1)
    time.sleep(5)

    soup = BeautifulSoup(driver.page_source, "html.parser")

    print(str(branch_index + 1) + ")", locations[branch_index])  # prints library location

    for tag in soup.find_all('h3', class_="field-content"):
        print("\t•", tag.get_text())  # prints all international language collections
        data.append((locations[branch_index], tag.get_text()))
        dist_language_collections.append(tag.get_text())

    print("\n")
1) Britannia Branch
• Chinese Language Collection 中文
• French Language Collection française
• Spanish Language Collection


2) Carnegie Branch
• Chinese Language Collection 中文


3) Central Library
• Arabic Language Collection العربية
• Chinese Language Collection 中文
• French Language Collection française
• German Language Collection Deutsch
• Italian Language Collection italiano
• Japanese Language Collection 日本語
• Korean Language Collection 한국어
• Persian/Farsi Language Collection فارسی
• Polish Language Collection język polski
• Portuguese Language Collection português
• Russian Language Collection Русский
• Spanish Language Collection
• Tagalog Language Collection
• Vietnamese Language Collection Tiếng Việt

...

Mapping library locations and their language collections in Vancouver offers valuable insights into the city’s international demographics. Downtown and Northern Vancouver show a dominance of European languages, Eastern Vancouver is rich in South East Asian languages, Southern Vancouver reflects Indian languages, and Chinese and French languages are spread across the entire city. The Central Library has a comprehensive collection of material in all international languages.
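As a minimal sketch of how such a comparison can start (assuming the `data` list of (location, collection) tuples built by the branch loop earlier), pandas can pivot the pairs into a branch-by-language presence matrix:

import pandas as pd

df = pd.DataFrame(data, columns=['location', 'collection'])

# crosstab counts collection occurrences per branch: 1 where a branch carries
# a collection, 0 where it does not
presence = pd.crosstab(df['location'], df['collection'])
print(presence.head())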

By utilising Selenium and Beautiful Soup, the crawler code can be extended to automate the collection of over 100,000 records from the Vancouver Public Library’s world-languages pages. This includes information such as language, title, author, rating, category, and availability for each record. The collected data is saved as Excel files for further analysis & exploration.

Photo Source: https://www.vpl.ca/.
# The code below is the crawler, which automates data collection, i.e., information on language, title,
# author, category, availability status & rating for 100K+ unique records

import math

import pandas as pd

s = Service('/<your-path-to-downloaded-driver>/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.vpl.ca/borrowing/world-languages")  # accessing webpage

select = Select(driver.find_element(By.XPATH, '//*[@id="edit-field-location-term-tid"]'))
select.select_by_index(0)
time.sleep(2)

dist_language_collections = sorted(set(dist_language_collections))

for i in range(len(dist_language_collections)):  # This loops over all languages available on the webpage
    language_collection = []
    title = []
    author = []
    category = []
    status = []
    rating = []

    # The XPath below locates the specific language identified by the loop above & 'clicks'
    # it to access all information corresponding to that language

    xpath = '//*[@id="block-system-main"]/div/div[3]/ul/li[' + str(i + 1) + ']/div[1]/div[2]/a'
    driver.find_element(By.XPATH, xpath).click()
    url_str = str(driver.current_url)
    url_str += "&sort=ugc_rating"  # This modifies the url to sort all records by descending order of rating

    driver.get(url_str)
    time.sleep(2)

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # num_results reads the number of records available. This is divided by the number of records per page,
    # i.e., 10, to calculate the number of pages to loop over (math.ceil covers a final partial page)

    num_results = int(soup.find('span', class_='cp-pagination-label').get_text().split(" ")[-2].replace(",", ""))

    # This loops over the pages of records available for the language identified by the outer loop
    for page in range(1, math.ceil(num_results / 10) + 1):
        if page == 1:
            url_str += "&page=" + str(page)
        else:
            # modify the url to move to the page identified by the inner loop
            url_str = url_str.replace("&page=" + str(page - 1), "&page=" + str(page))
        driver.get(url_str)
        time.sleep(2)

        # create a soup object for each page of records for a language
        soup = BeautifulSoup(driver.page_source, "html.parser")

        for tag in soup.find_all('div', class_='col-md-10 item-column'):  # for each record on the page

            # store language information
            language_collection.append(dist_language_collections[i])

            # store title information
            title.append(tag.find('span', class_='title-content').get_text(strip=True))

            # store author information
            if tag.find('span', class_='cp-author-link'):
                author.append(tag.find('span', class_='cp-author-link').get_text(strip=True))
            else:
                author.append("not available")

            # store category information
            category.append(tag.find('div', class_='manifestation-item cp-manifestation-list-item row').
                            find('span', class_='cp-format-indicator').get_text(strip=True))

            # store availability-status information
            if tag.find('span', class_='cp-availability-status'):
                status.append(tag.find('span', class_='cp-availability-status').get_text(strip=True))
            else:
                status.append("not available")

            # store rating information
            if tag.find('span', class_='cp-rating-stars'):
                rating.append(tag.find('span', class_='cp-rating-stars').get_text(strip=True))
            else:
                rating.append("not available")

    # For each language, collect all data in a temporary pandas dataframe
    data_temp = pd.DataFrame({'language_collection': language_collection,
                              'title': title,
                              'author': author,
                              'category': category,
                              'status': status,
                              'rating': rating},
                             columns=['language_collection', 'title', 'author', 'category', 'status', 'rating'])

    # Store the information for each language in an individual Excel file
    filename = '/web_scrapping/' + str(i) + '.xlsx'
    data_temp.to_excel(filename, index=False)

    # Go back to the homepage, so the outer loop can 'click' the next language & gather records
    # on all pages of that language

    driver.get("https://www.vpl.ca/borrowing/world-languages")
    select = Select(driver.find_element(By.XPATH, '//*[@id="edit-field-location-term-tid"]'))
    select.select_by_index(0)
    time.sleep(2)

After gathering the data, Python libraries such as pandas, seaborn, and matplotlib can be employed to extract meaningful insights; a short sketch follows the list below.

Here are a few distinctive insights to consider:

(1) Top 10 authors with the highest number of published books in Chinese, French, Japanese & Korean languages:

(2) The percentage of materials that falls under various categories for different languages:

(3) Analysing the availability status and category of materials:
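As a hedged starting point for insight (1), the sketch below combines the per-language Excel files written by the crawler above (the /web_scrapping/ path, the column names, and the "not available" sentinel all come from that code) and plots the top authors for one collection with seaborn:

import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# combine the per-language sheets (0.xlsx, 1.xlsx, ...) into one dataframe
frames = [pd.read_excel(f) for f in glob.glob('/web_scrapping/*.xlsx')]
records = pd.concat(frames, ignore_index=True)

# top 10 authors by number of records in the Chinese collection
chinese = records[records['language_collection'].str.contains('Chinese')]
top_authors = (chinese[chinese['author'] != 'not available']
               .groupby('author').size()
               .nlargest(10)
               .reset_index(name='count'))

sns.barplot(data=top_authors, x='count', y='author')
plt.title('Top 10 authors: Chinese Language Collection')
plt.tight_layout()
plt.show()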

The possibilities for visualisation are endless.

The entire code for the crawler, data pre-processing and visualisation can be found here.

In an upcoming article, I will delve deeper into the world of web scraping, exploring how to extract data using APIs and cloud resources like AWS. In addition, I’ll walk you through the process of building a machine learning model that utilises the scraped data to make accurate predictions.

I hope you enjoyed reading this article!
