Unlocking the Literary Treasures: Harnessing Beautiful Soup and Selenium to Scrape the Vancouver Public Library Website

Rochita Sundar · Published in Data And Beyond · 6 min read · Jun 12, 2023

Web scraping is the process of extracting data from websites using automated tools such as crawlers or bots. It is useful for automating tasks such as filling out forms, performing market analysis, and aggregating content. The typical process involves sending a request to a website, parsing the HTML it returns, and extracting data using XPath or CSS selectors.
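To make that flow concrete, here is a minimal sketch using Python's requests library and Beautiful Soup; the URL and the selector are purely illustrative, and a static page is assumed (dynamic pages are where Selenium, below, comes in).

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")      # 1) send a request to a website
soup = BeautifulSoup(response.text, "html.parser")  # 2) parse the returned HTML

# 3) extract data, here with a CSS selector that matches every link on the page
for link in soup.select("a[href]"):
    print(link.get_text(strip=True), link["href"])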

Driven by a passion for data analysis, reading, and exploring local resources, I utilised browser-automation software to scrape the Vancouver Public Library website. My goal was to study the international language collections across different library locations and uncover meaningful patterns. In this article, I’ll walk you through the project’s journey, from data collection and preprocessing to generating insightful visualisations.

Photo Source: https://www.vpl.ca/.

Selenium WebDriver

A popular browser-automation tool, Selenium WebDriver can load web pages and interact with dynamic elements such as drop-down menus.

Beautiful Soup Library

Beautiful Soup is a Python library used to parse HTML documents and extract information using tags, attributes, and more.
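As a small illustration (the HTML snippet here is made up), Beautiful Soup can locate elements by tag name and attributes such as id:

from bs4 import BeautifulSoup

html = """
<select id="branches">
  <option>Central Library</option>
  <option>Carnegie Branch</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every element matching the given tag name and attributes
for tag in soup.find_all("select", id="branches"):
    print(tag.get_text(separator=", ", strip=True))  # Central Library, Carnegie Branch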

Here is example code that interacts with the “Choose your branch” drop-down menu on the Vancouver Public Library’s website.

Photo Source: https://www.vpl.ca/.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup

# Install the necessary driver for the browser you intend to automate (e.g., ChromeDriver for Google Chrome).
s = Service('/<your-path-to-downloaded-driver>/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.vpl.ca/borrowing/world-languages")

# Return the HTML from the webpage & make a Beautiful Soup object to parse the contents
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

# The code below scrapes the list of all VPL locations from the website,
# cleans & stores the result in a list called locations
locations = []
for tag in soup.find_all('select', id='edit-field-location-term-tid'):
    locations = tag.get_text(strip=True).replace("Choose your branch", "").\
        replace("Library", "Library\t").replace("Branch", "Branch\t").strip().split("\t")

print(locations)
['Britannia Branch', 'Carnegie Branch', 'Central Library', 'Champlain Heights Branch', 'Collingwood Branch', 'Dunbar Branch', 'Firehall Branch', 'Fraserview Branch', 'Hastings Branch', 'Joe Fortes Branch', 'Kensington Branch', 'Kerrisdale Branch', 'Kitsilano Branch', 'Marpole Branch', 'Mount Pleasant Branch', 'nə́c̓aʔmat ct Strathcona Branch', 'Oakridge Branch', 'Renfrew Branch', 'South Hill Branch', 'Terry Salman Branch', 'West Point Grey Branch']

The code then scrapes, cleans, and prints the international language collections carried by each location individually.

# The code below interacts with the location drop-down menu on the VPL webpage and, for each location,
# scrapes, cleans & stores the international language collections carried by that location

import time

from selenium.webdriver.common.by import By

data = []
dist_language_collections = []

for branch_index in range(len(locations)):
    driver.get("https://www.vpl.ca/borrowing/world-languages")

    # branch_index selects different library locations from the drop-down menu (located via XPath)
    select = Select(driver.find_element(By.XPATH, '//*[@id="edit-field-location-term-tid"]'))
    select.select_by_index(branch_index + 1)
    time.sleep(5)

    soup = BeautifulSoup(driver.page_source, "html.parser")

    print(str(branch_index + 1) + ")", locations[branch_index])  # prints library location

    for tag in soup.find_all('h3', class_="field-content"):
        print("\t•", tag.get_text())  # prints all international language collections
        data.append((locations[branch_index], tag.get_text()))
        dist_language_collections.append(tag.get_text())

    print("\n")
1) Britannia Branch
• Chinese Language Collection 中文
• French Language Collection française
• Spanish Language Collection


2) Carnegie Branch
• Chinese Language Collection 中文


3) Central Library
• Arabic Language Collection العربية
• Chinese Language Collection 中文
• French Language Collection française
• German Language Collection Deutsch
• Italian Language Collection italiano
• Japanese Language Collection 日本語
• Korean Language Collection 한국어
• Persian/Farsi Language Collection فارسی
• Polish Language Collection język polski
• Portuguese Language Collection português
• Russian Language Collection Русский
• Spanish Language Collection
• Tagalog Language Collection
• Vietnamese Language Collection Tiếng Việt

...

Mapping library locations and their language collections in Vancouver offers valuable insights into the city’s international demographics. Downtown and Northern Vancouver show a dominance of European languages, Eastern Vancouver is rich in South East Asian languages, Southern Vancouver reflects Indian languages, and Chinese and French languages are spread across the entire city. The Central Library has a comprehensive collection of material in all international languages.
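As a minimal sketch of how such a comparison can start (assuming the `data` list of (location, collection) tuples built by the branch loop earlier), pandas can pivot the pairs into a branch-by-language presence matrix:

import pandas as pd

df = pd.DataFrame(data, columns=['location', 'collection'])

# crosstab counts collection occurrences per branch: 1 where a branch carries
# a collection, 0 where it does not
presence = pd.crosstab(df['location'], df['collection'])
print(presence.head())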

By utilising Selenium and Beautiful Soup, the crawler code can be extended to automate the collection of over 100,000 records from the Vancouver Public Library’s world-languages pages. This includes information such as language, title, author, rating, category, and availability for each record. The collected data is saved as Excel files for further analysis & exploration.

Photo Source: https://www.vpl.ca/.
# The code below is the crawler, which automates data collection, i.e., information on language, title,
# author, category, availability status & rating for 100K+ unique records

import math

import pandas as pd

s = Service('/<your-path-to-downloaded-driver>/chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://www.vpl.ca/borrowing/world-languages")  # accessing webpage

select = Select(driver.find_element(By.XPATH, '//*[@id="edit-field-location-term-tid"]'))
select.select_by_index(0)
time.sleep(2)

dist_language_collections = sorted(set(dist_language_collections))

for i in range(len(dist_language_collections)):  # This loops over all languages available on the webpage
    language_collection = []
    title = []
    author = []
    category = []
    status = []
    rating = []

    # The XPath below locates the specific language identified by the loop above & 'clicks'
    # it to access all information corresponding to that language

    xpath = '//*[@id="block-system-main"]/div/div[3]/ul/li[' + str(i + 1) + ']/div[1]/div[2]/a'
    driver.find_element(By.XPATH, xpath).click()
    url_str = str(driver.current_url)
    url_str += "&sort=ugc_rating"  # This modifies the url to sort all records by descending order of rating

    driver.get(url_str)
    time.sleep(2)

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # num_results reads the number of records available. This is divided by the number of records per page,
    # i.e., 10, to calculate the number of pages to loop over (math.ceil covers a final partial page)

    num_results = int(soup.find('span', class_='cp-pagination-label').get_text().split(" ")[-2].replace(",", ""))

    # This loops over the pages of records available for the language identified by the outer loop
    for page in range(1, math.ceil(num_results / 10) + 1):
        if page == 1:
            url_str += "&page=" + str(page)
        else:
            # modify the url to move to the page identified by the inner loop
            url_str = url_str.replace("&page=" + str(page - 1), "&page=" + str(page))
        driver.get(url_str)
        time.sleep(2)

        # create a soup object for each page of records for a language
        soup = BeautifulSoup(driver.page_source, "html.parser")

        for tag in soup.find_all('div', class_='col-md-10 item-column'):  # for each record on the page

            # store language information
            language_collection.append(dist_language_collections[i])

            # store title information
            title.append(tag.find('span', class_='title-content').get_text(strip=True))

            # store author information
            if tag.find('span', class_='cp-author-link'):
                author.append(tag.find('span', class_='cp-author-link').get_text(strip=True))
            else:
                author.append("not available")

            # store category information
            category.append(tag.find('div', class_='manifestation-item cp-manifestation-list-item row').
                            find('span', class_='cp-format-indicator').get_text(strip=True))

            # store availability-status information
            if tag.find('span', class_='cp-availability-status'):
                status.append(tag.find('span', class_='cp-availability-status').get_text(strip=True))
            else:
                status.append("not available")

            # store rating information
            if tag.find('span', class_='cp-rating-stars'):
                rating.append(tag.find('span', class_='cp-rating-stars').get_text(strip=True))
            else:
                rating.append("not available")

    # For each language, collect all data in a temporary pandas dataframe
    data_temp = pd.DataFrame({'language_collection': language_collection,
                              'title': title,
                              'author': author,
                              'category': category,
                              'status': status,
                              'rating': rating},
                             columns=['language_collection', 'title', 'author', 'category', 'status', 'rating'])

    # Store the information for each language in an individual Excel file
    filename = '/web_scrapping/' + str(i) + '.xlsx'
    data_temp.to_excel(filename, index=False)

    # Go back to the homepage, so the outer loop can 'click' the next language & gather records
    # on all pages of that language

    driver.get("https://www.vpl.ca/borrowing/world-languages")
    select = Select(driver.find_element(By.XPATH, '//*[@id="edit-field-location-term-tid"]'))
    select.select_by_index(0)
    time.sleep(2)

After gathering the data, Python libraries such as pandas, seaborn, and matplotlib can be employed to extract meaningful insights; a short sketch follows the list below.

Here are a few distinctive insights to consider:

(1) Top 10 authors with the highest number of published books in Chinese, French, Japanese & Korean languages:

(2) The percentage of materials that falls under various categories for different languages:

(3) Analysing the availability status and category of materials:
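As a hedged starting point for insight (1), the sketch below combines the per-language Excel files written by the crawler above (the /web_scrapping/ path, the column names, and the "not available" sentinel all come from that code) and plots the top authors for one collection with seaborn:

import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# combine the per-language sheets (0.xlsx, 1.xlsx, ...) into one dataframe
frames = [pd.read_excel(f) for f in glob.glob('/web_scrapping/*.xlsx')]
records = pd.concat(frames, ignore_index=True)

# top 10 authors by number of records in the Chinese collection
chinese = records[records['language_collection'].str.contains('Chinese')]
top_authors = (chinese[chinese['author'] != 'not available']
               .groupby('author').size()
               .nlargest(10)
               .reset_index(name='count'))

sns.barplot(data=top_authors, x='count', y='author')
plt.title('Top 10 authors: Chinese Language Collection')
plt.tight_layout()
plt.show()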

The possibilities for visualisation are endless.

The entire code for the crawler, data pre-processing and visualisation can be found here.

In an upcoming article, I will delve deeper into the world of web scraping, exploring how to extract data using APIs and cloud resources like AWS. In addition, I’ll walk you through the process of building a machine learning model that utilises the scraped data to make accurate predictions.

I hope you enjoyed reading this article!
