Web Scraping Using Selenium and BeautifulSoup

Oscar Rojo
Published in The Startup · 7 min read · Aug 16, 2020


Today we are going to take a look at Selenium and BeautifulSoup (with Python ❤️) in a step-by-step tutorial.

Photo by Anna Jiménez Calaf on Unsplash

It’s time to use Selenium

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser end-to-end testing (acceptance tests).

Now it is still used for testing, but also as a general browser automation platform and of course, web scraping!

Selenium is really useful when you have to perform actions on a website, such as:

* clicking on buttons
* filling forms
* scrolling
* taking a screenshot

It is also very useful for executing JavaScript code. Let’s say you want to scrape a Single Page Application and you can’t find an easy way to call the underlying APIs directly; then Selenium might be what you need.
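For instance, JavaScript execution goes through driver.execute_script. Here is a minimal sketch (the helper name and the scrolling use case are just illustrative):

```python
# Sketch: run JavaScript through Selenium to scroll a page and read back
# a value computed in the browser. `driver` is any Selenium WebDriver.
def scroll_to_bottom(driver):
    # execute_script returns whatever the JavaScript `return` statement yields
    return driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
        "return document.body.scrollHeight;"
    )
```

This pattern is handy on infinite-scroll pages, where more content only loads after scrolling.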

Installation

We will use Chrome in our example, so make sure you have it installed on your local machine:

* Chrome download page
* Chrome driver binary
* selenium package

In order to install the Selenium package, as always, I recommend that you create a virtual environment, using virtualenv for example, and then:

# !pip install selenium

Quickstart

Once you have downloaded both Chrome and ChromeDriver, and installed the selenium package, you should be ready to start the browser:

from selenium import webdriver

DRIVER_PATH = './chromedriver'  # the path to the downloaded "chromedriver" binary
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://google.com')

This will launch Chrome in headful mode (like a regular Chrome window, controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

In order to run Chrome in headless mode (without any graphical user interface), to run it on a server for example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.thewindpower.net/country_media_es_3_espana.php")
#print(driver.page_source)
driver.quit()

driver.page_source will return the full HTML code of the page.

Here are two other interesting webdriver properties:

* driver.title to get the page's title
* driver.current_url to get the current URL (useful when the website redirects and you need the final URL)
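As a sketch, the two properties combine naturally when chasing redirects (the helper below is hypothetical; driver stands for any WebDriver instance):

```python
# Sketch: load a URL and report where the browser actually ended up,
# which may differ from the requested URL when the site redirects.
def final_destination(driver, url):
    driver.get(url)
    return driver.title, driver.current_url
```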

Locating elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract the data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

* Tag name
* Class name
* IDs
* XPath
* CSS selectors

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut is to highlight the element with your mouse and press Ctrl + Shift + C (Cmd + Shift + C on macOS), instead of right-clicking and choosing Inspect each time.
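As an illustration, here is one call per locator strategy, using the Selenium 3 method names this article relies on (all the selectors are made-up examples; use the ones you find in dev tools):

```python
# Sketch: the five locator strategies listed above, on a hypothetical page.
def locate_examples(driver):
    by_tag = driver.find_elements_by_tag_name("a")                         # tag name
    by_class = driver.find_elements_by_class_name("lien_standard")         # class name
    by_id = driver.find_element_by_id("header")                            # ID (single element)
    by_xpath = driver.find_elements_by_xpath("//li[@class='puce_texte']")  # XPath
    by_css = driver.find_elements_by_css_selector("li.puce_texte")         # CSS selector
    return by_tag, by_class, by_id, by_xpath, by_css
```

Note the singular/plural pairs: find_element_by_* returns the first match (and raises if none exists), while find_elements_by_* returns a possibly empty list.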

Let's work

In this tutorial we will build a web scraping program that collects the URLs of the Spanish wind farms listed on thewindpower.net with Selenium, and then extracts each farm's details with BeautifulSoup.

What will we require?

For this project we will use Python 3.x.

We will also use the following packages and driver:

* selenium package — used to automate web browser interaction from Python
* ChromeDriver — provides a platform to launch and perform tasks in the specified browser.
* Virtualenv — to create an isolated Python environment for our project.
* Extras: Selenium-Python ReadTheDocs Resource.

Project SetUp

Create a new project folder. Within that folder create a setup.py file (we will use it as a plain requirements file) and type in our dependency, selenium.

# Create the file using "shell-terminal"
! touch setup.py
# Type the dependency selenium
! echo "selenium" > setup.py

Open up your command line & create a virtual environment using the basic command:

# Install virtualenv if needed
#! pip install virtualenv

# Create the virtualenv
! virtualenv webscraping_example
created virtual environment CPython3.7.6.final.0-64 in 424ms
creator CPython3Posix(dest=/home/oscar/Documentos/Medium/Selenium/webscraping_example, clear=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/oscar/.local/share/virtualenv)
added seed packages: pip==20.2.1, setuptools==49.2.1, wheel==0.34.2
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator

Activate virtualenv

Activate the environment before installing anything into it (on Linux/macOS: source webscraping_example/bin/activate).

Next, install the dependency into your virtualenv by running the following command in the terminal:

#! pip install -r setup.py

Import Required Modules

Within the folder we created earlier, create a webscraping_example.py file and include the following code snippets.

# Create the file using "shell-terminal"

! touch webscraping_example/webscraping_example.py
# Use shell-terminal
#! vim webscraping_example/webscraping_example.py
* webdriver: allows you to launch/initialise a browser.
* Options: lets you configure that browser (we use it below for incognito mode).
* By: allows you to search for things using specific parameters.
* WebDriverWait: allows you to wait for a page to load.
* expected_conditions: specifies what to look for on a specific page in order to determine that it has loaded.
* TimeoutException: handles a timeout situation.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

Create new instance of Chrome in Incognito mode

First we start by adding the incognito argument to our webdriver.

options = Options()
options.add_argument("--incognito")

Next we create a new instance of Chrome.

DRIVER_PATH = './chromedriver'
browser = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)

One thing to note is that the executable_path is the path that points to where you downloaded and saved your ChromeDriver.

Make The Request

When making the request we need to consider the following:

* Pass in the desired website url.
* Implement a try/except to handle a timeout, should one occur.

In our case we are using “thewindpower.net” as the desired website url:

browser.get("https://www.thewindpower.net/country_media_es_3_espana.php")
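The try/except described above could look like this sketch (the 20-second timeout and the lien_standard wait condition are assumptions, chosen to match the link class we scrape below):

```python
# Sketch: load a page but give up cleanly if it takes too long to render.
def get_with_timeout(browser, url, timeout=20):
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException
    try:
        browser.get(url)
        # wait until at least one link with the target class is present
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "lien_standard"))
        )
        return True
    except TimeoutException:
        return False
```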

Locate the elements and find the target ones by inspecting the page:

elems = browser.find_elements_by_class_name("lien_standard")
items = len(elems)
items

1232

Get the URLs

links = [elem.get_attribute('href') for elem in elems]

Show the first five links:

links[:5]

['https://www.thewindpower.net/windfarm_es_4418_cortijo-de-guerra-ii.php',
'https://www.thewindpower.net/windfarm_es_695_es-mila.php',
'https://www.thewindpower.net/windfarm_es_5281_la-castellana-(spain).php',
'https://www.thewindpower.net/windfarm_es_7024_valdeperondo.php',
'http://zigwen.free.fr/']
# keep only the links that start with 'https://'
start_letter = 'https://'
result = [x for x in links if x.startswith(start_letter)]
result[0:20]

['https://www.thewindpower.net/windfarm_es_4418_cortijo-de-guerra-ii.php',
'https://www.thewindpower.net/windfarm_es_695_es-mila.php',
'https://www.thewindpower.net/windfarm_es_5281_la-castellana-(spain).php',
'https://www.thewindpower.net/windfarm_es_7024_valdeperondo.php',
'https://www.thewindpower.net/windfarm_es_1922_a-capelada-i.php',
'https://www.thewindpower.net/windfarm_es_1919_a-capelada-ii.php',
'https://www.thewindpower.net/windfarm_es_10582_abuela-santa-ana.php',
'https://www.thewindpower.net/windfarm_es_19392_abuela-santa-ana-modificacion.php',
'https://www.thewindpower.net/windfarm_es_1920_adrano.php',
'https://www.thewindpower.net/windfarm_es_13465_aeropuerto-la-palma.php',
'https://www.thewindpower.net/windfarm_es_3865_aguatona.php',
'https://www.thewindpower.net/windfarm_es_2077_aibar.php',
'https://www.thewindpower.net/windfarm_es_9722_aibar.php',
'https://www.thewindpower.net/windfarm_es_2062_aizkibel.php',
'https://www.thewindpower.net/windfarm_es_2063_aizkibel.php',
'https://www.thewindpower.net/windfarm_es_21500_alaiz.php',
'https://www.thewindpower.net/windfarm_es_2083_alaiz.php',
'https://www.thewindpower.net/windfarm_es_2084_alaiz.php',
'https://www.thewindpower.net/windfarm_es_2106_alcarama-i.php',
'https://www.thewindpower.net/windfarm_es_2105_alcarama-i.php']
len(result)

603

It’s time to use BeautifulSoup

Once we have obtained the URLs where the data is stored, we will use the BeautifulSoup library.

Load the libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

Create functions

First, we create a function to get the text from each website, and then a second function to convert the resulting list into a DataFrame.

def obtener(url):
    # download one wind-farm page and return its details as a list of
    # 'label : value' strings
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    lista = []
    resulta = soup.find_all('li', {'class': 'puce_texte'})
    for item in resulta:
        text = item.text
        # strip accents so labels and values are plain ASCII
        text = text.replace("ó", "o")
        text = text.replace("í", "i")
        text = text.replace("ñ", "n")
        text = text.replace("é", "e")
        text = text.replace("á", "a")
        text = text.replace("<br/>", "")
        text = text.replace("\n", "")
        # normalise items that carry no value into 'label : si' form
        text = text.replace("Operativo", "Operativo : si")
        text = text.replace("Parque eolico onshore", "Parque eolico onshore : si")
        text = text.replace("Imagenes de Google Maps", "Imagenes de Google Maps : si")
        lista.append(text)
    return lista

def dataframe(lista):
    # turn a list of 'label : value' strings into a one-row DataFrame
    datos = [i.split(': ', 1)[-1] for i in lista]
    columna = [i.split(' :', 1)[0] for i in lista]
    columnas = [career.lstrip('0123456789) ') for career in columna]
    df = pd.DataFrame(datos).transpose()
    df.columns = columnas
    df = df.loc[:, ~df.columns.duplicated()]
    #df = df[['Nombre del parque eolico', 'Ciudad', 'Latitud', 'Longitud', 'Potencia nominal total']]
    return df
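To see what dataframe() does to each scraped line, here is its split logic applied to a few made-up items (pure Python, no scraping needed):

```python
# Made-up sample items in the 'label : value' shape that obtener() produces
lista = [
    "Nombre del parque eolico : Es Mila",
    "Ciudad : Mahon",
    "Potencia nominal total : 3,200 kW",
]
datos = [i.split(': ', 1)[-1] for i in lista]    # value = text after the first ': '
columna = [i.split(' :', 1)[0] for i in lista]   # label = text before the first ' :'
print(datos)     # ['Es Mila', 'Mahon', '3,200 kW']
print(columna)   # ['Nombre del parque eolico', 'Ciudad', 'Potencia nominal total']
```

Splitting with maxsplit=1 keeps any later colons inside the value intact.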

Let's look at the first 10 URLs:

result_first = result[:10]
result_first
['https://www.thewindpower.net/windfarm_es_4418_cortijo-de-guerra-ii.php',
'https://www.thewindpower.net/windfarm_es_695_es-mila.php',
'https://www.thewindpower.net/windfarm_es_5281_la-castellana-(spain).php',
'https://www.thewindpower.net/windfarm_es_7024_valdeperondo.php',
'https://www.thewindpower.net/windfarm_es_1922_a-capelada-i.php',
'https://www.thewindpower.net/windfarm_es_1919_a-capelada-ii.php',
'https://www.thewindpower.net/windfarm_es_10582_abuela-santa-ana.php',
'https://www.thewindpower.net/windfarm_es_19392_abuela-santa-ana-modificacion.php',
'https://www.thewindpower.net/windfarm_es_1920_adrano.php',
'https://www.thewindpower.net/windfarm_es_13465_aeropuerto-la-palma.php']

Get data in a Dataframe

Finally, looping over the list of URLs obtained with Selenium, we build our dataset with all the scraped data:

DATA = pd.DataFrame()
for i in result:
    lista = obtener(i)
    DF = dataframe(lista)
    # note: in pandas >= 2.0, DataFrame.append was removed; use pd.concat instead
    DATA = DATA.append(DF, ignore_index=True)

This is the result

DATA

Save data

And finally we save the data

DATA.to_csv(r'datos_eolicos.csv')

Conclusion

As you can see, with a couple of libraries we have been able to obtain the URLs and the data of the wind farms located in Spain.

I hope you enjoyed this project.

No matter which books, blogs, courses or videos you learn from, when it comes to implementation everything can look like it is “out of syllabus”.

Best way to learn is by doing!
Best way to learn is by teaching what you have learned!

Never give up!

See you on LinkedIn!


Master in Data Science. Passionate about learning new skills. Former branch risk analyst. https://www.linkedin.com/in/oscar-rojo-martin/. www.oscarrojo.es