An easy introduction to web scraping using Python + Selenium.

Fernando Luna
MCD-UNISON
Nov 30, 2022

Selenium is an open-source tool that automates web browser actions. Used together with Python, it lets us write scripts that reproduce what a human would do on a web page: when there is no API to download the information we need, we have to open the page ourselves, type text into a textbox, drag the scrollbar, or click buttons to reach it. All of this can be automated with Selenium! In this post we cover the basics of the tool.

Starting with Selenium.

First we install the packages we need; in a notebook, the ! prefix runs each line as a shell command.

!pip install selenium
!pip install webdriver-manager
!pip install beautifulsoup4

Now we have to import the needed libraries.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
import requests
import pandas as pd

We have to configure our browser.

options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')    # open the browser maximized
options.add_argument('--disable-extensions') # run without browser extensions

# webdriver-manager downloads a ChromeDriver matching the installed Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.set_window_position(2000, 0)  # move the window to a second monitor (optional)
driver.maximize_window()
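Optionally, if we do not need to watch the automation, Chrome can also run headless, with no visible window. Adding this flag to the options before creating the driver is enough:

# optional: run Chrome without opening a visible window
options.add_argument('--headless')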

The next line is optional; it makes the script wait (here, one second) before performing the next action.

time.sleep(1)
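A fixed sleep is simple but fragile: too short and the page has not finished loading, too long and the script wastes time. Selenium also provides explicit waits, which block only until a condition is met. A minimal sketch (the XPath below is a placeholder, not taken from the page):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the element to appear in the DOM
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.XPATH, '//nav')))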

The next call navigates the browser to whatever page we want; here we open the MCD-UNISON site.

driver.get('https://mcd.unison.mx/')

We need to inspect the page's HTML (right-click → Inspect in the browser) and identify the element we need.

The XPath gives us the route to our element; it is unique for each element on the page, so we copy the full XPath from the inspector.

We get the element with find_element, using the XPath as the locator.

element = driver.find_element(By.XPATH, '/html/body/div/header/div[2]/div/div/nav/div/ul/li[2]/a')
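A full XPath like this breaks as soon as the page layout changes. When the element has a stable id, class, or link text, the other By strategies are more robust; the selectors below are illustrative, not taken from the MCD page:

# illustrative selectors; adapt them to the page you inspect
element = driver.find_element(By.ID, 'main-menu')
element = driver.find_element(By.CSS_SELECTOR, 'nav ul li a')
element = driver.find_element(By.LINK_TEXT, 'Posgrado')
elements = driver.find_elements(By.TAG_NAME, 'a')  # plural form returns a list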

Now we need to create an ActionChain. This is a way to automate low-level interactions such as mouse movements, mouse button actions, key presses, and context-menu interactions.

action = ActionChains(driver)

We perform our action chain to move the cursor over the previously selected element; on this page, hovering over the menu item opens its dropdown.

action.move_to_element(element).perform()

The next step is similar, but now we change the element (a link inside the dropdown we just opened) and the action to perform, which will be a click.

element = driver.find_element(By.XPATH, '/html/body/div/header/div[2]/div/div/nav/div/ul/li[2]/ul/li[1]/a')
action = ActionChains(driver)
action.click(element).perform()
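ActionChains calls can also be chained, so instead of the two separate chains above, the hover and the click can be queued and performed in a single statement:

menu = driver.find_element(By.XPATH, '/html/body/div/header/div[2]/div/div/nav/div/ul/li[2]/a')
submenu = driver.find_element(By.XPATH, '/html/body/div/header/div[2]/div/div/nav/div/ul/li[2]/ul/li[1]/a')

# queue both interactions, then run them in order
ActionChains(driver).move_to_element(menu).click(submenu).perform()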

The next line scrolls the page down 2,400 pixels, as if we dragged the scrollbar toward the bottom of the page.

driver.execute_script("window.scrollTo(0,2400)")
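The 2400 here is specific to this page. If we do not know how tall the page is, we can ask the browser for its full height, or scroll a particular element into view:

# scroll to the bottom, whatever the page height is
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

# or bring a specific element into view
driver.execute_script("arguments[0].scrollIntoView();", element)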

To get the content, we read the HTML of the current page with the BeautifulSoup library. We use find to locate the element that contains the information we need, and find_all to collect every list item inside it.

page = requests.get(driver.current_url)            # download the page we navigated to
soup = BeautifulSoup(page.content, 'html.parser')  # parse the HTML
tabla = soup.find('div', {"class": "entry-content"}).find_all('li')
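Note that requests downloads the page again in a separate session. Since Selenium already has the page loaded, we can instead parse the browser's own copy of the HTML, which matters when the content is generated by JavaScript after the page loads:

# parse the HTML exactly as the browser currently sees it
soup = BeautifulSoup(driver.page_source, 'html.parser')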

Using a for loop, we extract the information one item at a time.

listaEstudiantes = list()
for i in range(len(tabla)):
    estudiante = tabla[i].string
    listaEstudiantes.append(estudiante)
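The same loop fits in one line as a list comprehension; get_text(strip=True) is also slightly more forgiving than .string, which returns None when a <li> contains nested tags:

listaEstudiantes = [li.get_text(strip=True) for li in tabla]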

We store the information in a pandas DataFrame, and then save it to a CSV file.

df = pd.DataFrame({'ASPIRANTES ACEPTADOS 2022': listaEstudiantes})
df.to_csv('Lista_Estudiantes.csv', index=False, encoding='UTF-16le')
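A quick way to confirm the file was written correctly is to read it back with the same encoding:

check = pd.read_csv('Lista_Estudiantes.csv', encoding='UTF-16le')
print(check.head())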

The final step is to close the browser.

driver.quit()
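If the script fails halfway through, driver.quit() never runs and the browser stays open. A common pattern is to wrap the scraping steps in try/finally so the browser always closes:

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
try:
    driver.get('https://mcd.unison.mx/')
    # ... scraping steps ...
finally:
    driver.quit()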

Selenium is a powerful web scraping tool for downloading information from web pages; this basic example should be enough to understand how it works.

The complete code is here.
