Web Scraping Using Selenium and BeautifulSoup
You can use the Scrapy framework to solve lots of common web scraping problems. Today we are going to take a look at Selenium and BeautifulSoup (with Python ❤️ ) in a step-by-step tutorial.
It’s time to use Selenium
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser end-to-end testing (acceptance tests).
Now it is still used for testing, but also as a general browser automation platform and of course, web scraping!
Selenium is really useful when you have to perform actions on a website, such as:
* clicking on buttons
* filling forms
* scrolling
* taking a screenshot
It is also very useful for executing JavaScript code. Let's say that you want to scrape a Single Page Application, and you can't find an easy way to call the underlying APIs directly; then Selenium might be what you need.
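For instance, scrolling to the bottom of a page is typically done by injecting a small piece of JavaScript. The execute_script call below is commented out because it requires a live driver; the snippet itself is just a string:

```python
# JavaScript snippet that scrolls to the bottom of the page
scroll_to_bottom = "window.scrollTo(0, document.body.scrollHeight);"

# With a running Selenium session you would execute it like this:
# driver.execute_script(scroll_to_bottom)
print(scroll_to_bottom)
```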
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine:
* Chrome download page
* Chrome driver binary
* selenium package
In order to install the Selenium package, as always, I recommend that you create a virtual environment, using virtualenv for example, and then:
# !pip install selenium
Quickstart
Once you have downloaded both Chrome and Chromedriver, and installed the selenium package you should be ready to start the browser:
from selenium import webdriver
DRIVER_PATH = './chromedriver'  # path to the "chromedriver" file you downloaded
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://google.com')
This will launch Chrome in headful mode (like a regular Chrome instance, but controlled by your Python code). You should see a message stating that Chrome is being controlled by automated test software.
In order to run Chrome in headless mode (without any graphical user interface), for example on a server:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.thewindpower.net/country_media_es_3_espana.php")
#print(driver.page_source)
driver.quit()
The driver.page_source will return the full page HTML code.
Here are two other interesting webdriver properties:
* driver.title to get the page's title
* driver.current_url to get the current URL (useful when there are redirections on the website and you need the final URL)
Locating elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract the data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
* Tag name
* Class name
* IDs
* XPath
* CSS selectors
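As a concrete illustration, the links we will extract later in this tutorial carry the class lien_standard, so the same element can be described in several locator styles. The selector strings below are shown for comparison; the actual find_elements calls need a live driver:

```python
# Locator strings for a hypothetical element <a class="lien_standard" href="...">
by_class_name = "lien_standard"            # Class name
by_css = "a.lien_standard"                 # CSS selector
by_xpath = "//a[@class='lien_standard']"   # XPath
# e.g. browser.find_elements_by_class_name(by_class_name)  # needs a live driver
print(by_class_name, by_css, by_xpath)
```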
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A handy shortcut is to highlight the element you want with your mouse and press Ctrl + Shift + C (Cmd + Shift + C on macOS), instead of having to right-click and choose Inspect each time.
Let's work
In this tutorial we will build a web scraping program that collects the list of wind farms in Spain from thewindpower.net and extracts the data of each one.
What will we require?
For this project we will use Python 3.x.
We will also use the following packages and driver:
* selenium package — used to automate web browser interaction from Python
* ChromeDriver — provides a platform to launch and perform tasks in a specified browser.
* Virtualenv — to create an isolated Python environment for our project.
* Extras: Selenium-Python ReadTheDocs Resource.
Project SetUp
Create a new project folder. Within that folder create a setup.py file, and type in our dependency, selenium.
# Create the file using "shell-terminal"
! touch setup.py
# Type the dependency selenium
! echo "selenium" > setup.py
Open up your command line & create a virtual environment using the basic command:
#! pip install virtualenv
# Create virtualenv
! virtualenv webscraping_example
created virtual environment CPython3.7.6.final.0-64 in 424ms
  creator CPython3Posix(dest=/home/oscar/Documentos/Medium/Selenium/webscraping_example, clear=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/oscar/.local/share/virtualenv)
    added seed packages: pip==20.2.1, setuptools==49.2.1, wheel==0.34.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
Activate the virtualenv (on Linux/macOS: source webscraping_example/bin/activate).
Next, install the dependency into your virtualenv by running the following command in the terminal:
#! pip install -r setup.py
Import Required Modules
Within the folder we created earlier, create a webscraping_example.py file and include the following code snippets.
# Create the file using "shell-terminal"
! touch webscraping_example/webscraping_example.py
# Use shell-terminal
#! vim webscraping_example/webscraping_example.py
1st import: Allows you to launch/initialise a browser.
2nd import: Allows you to set browser options (incognito, headless, etc.).
3rd import: Allows you to search for things using specific parameters.
4th import: Allows you to wait for a page to load.
5th import: Lets you specify what to look for on a specific page in order to determine that the webpage has loaded.
6th import: Handles a timeout situation.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
Create new instance of Chrome in Incognito mode
First we start by adding the incognito argument to our webdriver.
options = Options()
options.add_argument("--incognito")
Next we create a new instance of Chrome.
DRIVER_PATH = './chromedriver'
browser = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
One thing to note is that the executable_path is the path that points to where you downloaded and saved your ChromeDriver.
Make The Request
When making the request we need to consider the following:
* Pass in the desired website url.
* Implement a Try/Except for handling a timeout situation should it occur.
In our case we are using “thewindpower.net” as the desired website url:
browser.get("https://www.thewindpower.net/country_media_es_3_espana.php")
Locating elements & finding the target one by page inspection:
items = len(browser.find_elements_by_class_name("lien_standard"))
items
1232
elems = browser.find_elements_by_class_name("lien_standard")
Get URL
links = [elem.get_attribute('href') for elem in elems]
Show the first lines:
links[:5]
['https://www.thewindpower.net/windfarm_es_4418_cortijo-de-guerra-ii.php',
 'https://www.thewindpower.net/windfarm_es_695_es-mila.php',
 'https://www.thewindpower.net/windfarm_es_5281_la-castellana-(spain).php',
 'https://www.thewindpower.net/windfarm_es_7024_valdeperondo.php',
 'http://zigwen.free.fr/']
# initializing start prefix
start_letter = 'https://'
result = [x for x in links if x.startswith(start_letter)]
result[0:20]
['https://www.thewindpower.net/windfarm_es_4418_cortijo-de-guerra-ii.php',
 'https://www.thewindpower.net/windfarm_es_695_es-mila.php',
 'https://www.thewindpower.net/windfarm_es_5281_la-castellana-(spain).php',
 'https://www.thewindpower.net/windfarm_es_7024_valdeperondo.php',
 'https://www.thewindpower.net/windfarm_es_1922_a-capelada-i.php',
 'https://www.thewindpower.net/windfarm_es_1919_a-capelada-ii.php',
 'https://www.thewindpower.net/windfarm_es_10582_abuela-santa-ana.php',
 'https://www.thewindpower.net/windfarm_es_19392_abuela-santa-ana-modificacion.php',
 'https://www.thewindpower.net/windfarm_es_1920_adrano.php',
 'https://www.thewindpower.net/windfarm_es_13465_aeropuerto-la-palma.php',
 'https://www.thewindpower.net/windfarm_es_3865_aguatona.php',
 'https://www.thewindpower.net/windfarm_es_2077_aibar.php',
 'https://www.thewindpower.net/windfarm_es_9722_aibar.php',
 'https://www.thewindpower.net/windfarm_es_2062_aizkibel.php',
 'https://www.thewindpower.net/windfarm_es_2063_aizkibel.php',
 'https://www.thewindpower.net/windfarm_es_21500_alaiz.php',
 'https://www.thewindpower.net/windfarm_es_2083_alaiz.php',
 'https://www.thewindpower.net/windfarm_es_2084_alaiz.php',
 'https://www.thewindpower.net/windfarm_es_2106_alcarama-i.php',
 'https://www.thewindpower.net/windfarm_es_2105_alcarama-i.php']
len(result)
603
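The list comprehension above simply keeps the links with the https:// prefix, which drops stray external links such as the zigwen.free.fr one. The same filter, applied to a two-element sample taken from the output above:

```python
# Sample of the scraped links, including one stray external link
links = ['https://www.thewindpower.net/windfarm_es_695_es-mila.php',
         'http://zigwen.free.fr/']

# Keep only the wind-farm pages served over https
result = [x for x in links if x.startswith('https://')]
print(result)  # -> ['https://www.thewindpower.net/windfarm_es_695_es-mila.php']
```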
It’s time to use BeautifulSoup
Once we have obtained the URLs where the data is stored, we will use the BeautifulSoup library.
Load the libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
Create functions
First, we create a function to get the text from each website, and second, another function to convert the resulting list into a dataframe:
def obtener(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    lista = []
    resulta = soup.find_all('li', {'class': 'puce_texte'})
    for item in resulta:
        text = item.text
        # Normalize accented/mojibake characters so keys and values are plain ASCII
        text = text.replace("ó", "o")
        text = text.replace("Ã", "i")
        text = text.replace("ñ", "n")
        text = text.replace("í", "i")
        text = text.replace("Â°", "°")
        text = text.replace("é", "e")
        text = text.replace("á", "a")
        text = text.replace("\' ", "' ")
        text = text.replace("<br/>", "")
        text = text.replace("\n", "")
        # Turn bare labels into "label : value" pairs
        text = text.replace("Operativo", "Operativo : si")
        text = text.replace("Parque eolico onshore", "Parque eolico onshore : si")
        # Accents were already stripped above, so match the plain form here
        text = text.replace("Imagenes de Google Maps", "Imagenes de Google Maps : si")
        lista.append(text)
    return lista
def dataframe(lista):
    datos = [i.split(': ', 1)[-1] for i in lista]
    columna = [i.split(' :', 1)[0] for i in lista]
    columnas = [career.lstrip('0123456789) ') for career in columna]
    df = pd.DataFrame(datos).transpose()
    df.columns = columnas
    df = df.loc[:, ~df.columns.duplicated()]
    # df = df[['Nombre del parque eolico', 'Ciudad', 'Latitud', 'Longitud', 'Potencia nominal total']]
    return df
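Each scraped line has the shape "clave : valor", so the two split calls separate the value from the column name, and lstrip drops any leading numbering. A quick check on a hypothetical sample line:

```python
# Hypothetical sample line in the "clave : valor" shape the function expects
line = "2) Potencia nominal total : 20 000 kW"

value = line.split(': ', 1)[-1]                         # part after the colon
column = line.split(' :', 1)[0].lstrip('0123456789) ')  # cleaned column name
print(column, '->', value)  # -> Potencia nominal total -> 20 000 kW
```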
Let’s see the result of the first 10 lines
result_first = result[:10]
result_first
['https://www.thewindpower.net/windfarm_es_4418_cortijo-de-guerra-ii.php',
 'https://www.thewindpower.net/windfarm_es_695_es-mila.php',
 'https://www.thewindpower.net/windfarm_es_5281_la-castellana-(spain).php',
 'https://www.thewindpower.net/windfarm_es_7024_valdeperondo.php',
 'https://www.thewindpower.net/windfarm_es_1922_a-capelada-i.php',
 'https://www.thewindpower.net/windfarm_es_1919_a-capelada-ii.php',
 'https://www.thewindpower.net/windfarm_es_10582_abuela-santa-ana.php',
 'https://www.thewindpower.net/windfarm_es_19392_abuela-santa-ana-modificacion.php',
 'https://www.thewindpower.net/windfarm_es_1920_adrano.php',
 'https://www.thewindpower.net/windfarm_es_13465_aeropuerto-la-palma.php']
Get data in a Dataframe
Finally, using the list of URLs obtained with the selenium library, we generate our dataset with all the data obtained:
DATA = pd.DataFrame()
for i in result:
    lista = obtener(i)
    DF = dataframe(lista)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    DATA = pd.concat([DATA, DF], ignore_index=True)
This is the result
DATA
Save data
And finally we save the data
DATA.to_csv(r'datos_eolicos.csv')
Conclusion
As you can see, with a couple of libraries we have been able to obtain the URLs and the data of the wind farms located in Spain.
I hope you enjoyed this project!
No matter what books, blogs, courses or videos you learn from, when it comes to implementation everything can look "out of syllabus".
The best way to learn is by doing!
The best way to learn is by teaching what you have learned!
Never give up!
See you on LinkedIn!