Web Scraping Using Selenium and Python

Siddhartha

Published in

ML Book

8 min read · Oct 10, 2019

Content:

What is web scraping?
What is Selenium Python?
Setup and tools
Introduction of Selenium
Basic: Data extraction from http://www.gutenberg.org
Next Tutorial

1.0 What is Web Scraping?

Web scraping is the extraction of publicly available, unstructured data from web pages into a structured form.

2.0 What is Selenium Python?

Selenium is an open-source tool for automating web browsers. It can be driven from Python and several other languages, and it is used for testing as well as for web scraping. Here we will use Firefox, but you can follow along in any other browser, since the steps are almost the same.

3.0 Setup and tools

3.1 install selenium using pip

pip install selenium

3.2 or install with conda

conda install -c conda-forge selenium

3.3 Download web drivers, you can choose any of these drivers

3.3.1 follow this link for Chrome driver

Downloads - ChromeDriver - WebDriver for Chrome

chromedriver.chromium.org

3.3.2 follow this link for Firefox driver(geckodriver)

mozilla/geckodriver

github.com

3.3.3 For Edge

Microsoft WebDriver

Microsoft Edge (EdgeHTML) Go to Settings > Update and Security > For Developer and then select "Developer mode". For…

developer.microsoft.com

3.3.4 For Safari

WebDriver Support in Safari 10

Starting with Safari 10 on OS X El Capitan and macOS Sierra, Safari comes bundled with a WebDriver implementation…

webkit.org

Throughout these tutorials I will use Firefox; you are free to choose another browser such as Chrome, Safari, Microsoft Edge, or Opera.

4.0 Introduction of Selenium

You can find proper documentation on selenium here

Following methods will help to find elements in a webpage (these methods will return a list):

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
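
As a rough illustration of how these helpers relate to the generic lookup call, the sketch below maps each name in the list above to a locator string. It assumes the driver or element object exposes find_elements(by, value), and that the strings match the constants on Selenium's By class (for example, By.CLASS_NAME is "class name"). LOCATOR_FOR_HELPER and find_all are hypothetical names introduced here; they are not part of Selenium.

```python
# Maps each helper listed above to the locator string accepted by the
# generic find_elements(by, value) call. The strings are the values of the
# constants on selenium.webdriver.common.by.By.
LOCATOR_FOR_HELPER = {
    "find_elements_by_name": "name",
    "find_elements_by_xpath": "xpath",
    "find_elements_by_link_text": "link text",
    "find_elements_by_partial_link_text": "partial link text",
    "find_elements_by_tag_name": "tag name",
    "find_elements_by_class_name": "class name",
    "find_elements_by_css_selector": "css selector",
}


def find_all(root, helper_name, value):
    """Use the named helper if `root` has it, else fall back to find_elements."""
    helper = getattr(root, helper_name, None)
    if helper is not None:
        return helper(value)
    return root.find_elements(LOCATOR_FOR_HELPER[helper_name], value)


# Usage with a real driver (hypothetical):
# books = find_all(driver, "find_elements_by_class_name", "booklink")
```

The fallback is there so the same call keeps working on Selenium releases that no longer ship the find_elements_by_* helpers.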

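In this tutorial we will use only 'find_elements_by_class_name' and 'find_elements_by_tag_name'; the other methods will appear in upcoming tutorials. These helpers belong to Selenium 3; recent Selenium 4 releases removed them in favour of find_elements(By.CLASS_NAME, value) and find_elements(By.TAG_NAME, value).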

You can find complete documentation of these methods here

5.0 Basic: Data extraction from http://www.gutenberg.org

5.1 Setup
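
A minimal sketch of starting Firefox from Python, assuming Selenium 4, where the driver path is wrapped in a Service object (in Selenium 3 the path was passed to webdriver.Firefox as executable_path instead). The function name start_firefox is a hypothetical helper, and the path is the one used in this tutorial, so replace it with your own.

```python
# Path used in this tutorial; point it at your own geckodriver download.
GECKODRIVER_PATH = r"C:\Users\siddhartha\Downloads\geckodriver-v0.25.0-win64\geckodriver.exe"


def start_firefox(driver_path=GECKODRIVER_PATH):
    """Open a Firefox window controlled by Selenium and return the driver."""
    # Imported here so nothing touches Selenium until a browser is requested.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    return webdriver.Firefox(service=Service(executable_path=driver_path))


# Usage (opens a real browser window):
# driver = start_firefox()
```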

Here "C:\Users\siddhartha\Downloads\geckodriver-v0.25.0-win64\geckodriver.exe" is the path where the Firefox driver (geckodriver) was downloaded.

Or you can do the same with Chrome:

from selenium import webdriver
driver = webdriver.Chrome(r"C:\Users\siddhartha\Downloads\chromedriver_win32\chromedriver.exe")

After running this code, a new browser window will open.

5.2 Go to target webpage

'http://www.gutenberg.org/ebooks/search/%3Fsort_order%3Drelease_date' is our target page. After loading it with driver.get(), you will see the target page in the browser.
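
The navigation step can be sketched as below. It only assumes a driver object with Selenium's get() method and title attribute; open_target_page is a hypothetical helper name, and the URL is copied as it appears in this tutorial.

```python
# URL exactly as given in this tutorial; "%3F" and "%3D" are the
# percent-encoded forms of "?" and "=".
TARGET_URL = "http://www.gutenberg.org/ebooks/search/%3Fsort_order%3Drelease_date"


def open_target_page(driver, url=TARGET_URL):
    """Load the search page in the browser and return the page title."""
    driver.get(url)  # by default this waits for the page to finish loading
    return driver.title


# Usage with a real driver:
# print(open_target_page(driver))
```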

5.3 Our objective

In this tutorial our objective is to extract data from this page. The page lists 25 books with their names, authors and release dates. We will extract these fields for all 25 books, then go to the next page, extract its books, and so on.

5.4 Inspect elements

Right-click the page and choose 'Inspect Element', or press F12.

This opens the inspector at the bottom of the window. To move it to the right, click '…' on the right side of the inspector and then click 'Dock to right'.

Click the element-picker button in the inspector toolbar, then click an item on the page to inspect it.

Try inspecting the first item, i.e. the first book.

You will see that this item (a book) belongs to the class 'booklink', and so do the other books. This means we can use this class to find our target elements, the books, with the following code:

books = driver.find_elements_by_class_name('booklink')
len(books)
>> 25

This books list contains one element per book. You can verify what these elements contain: the first item of the list holds the first book's data, and the last item holds the last book's data.

element.text returns the text inside an element:

print(books[0].text)
>>>
Lady Patricia
Rudolf Besier
Oct 10, 2019

print(books[-1].text)
>>>
Class Book for the School of Musketry, Hythe
Ernest Christian Wilford
Oct 6, 2019

Now inspect the name, author and release date of a book.

We will look at the structure of only one book, since the others share it. We will write the extraction code for one book and then generalize it to all books.

You can see that the name belongs to the class 'title', the author to the class 'subtitle' and the release date to the class 'extra'. Using these class names we can find these elements inside our book element with the following code:

name = books[0].find_elements_by_class_name('title')[0].text
author = books[0].find_elements_by_class_name('subtitle')[0].text
date = books[0].find_elements_by_class_name('extra')[0].text
print(name)
print(author)
print(date)
>>
Lady Patricia
Rudolf Besier
Oct 10, 2019

You can do the same for the last book:

name = books[-1].find_elements_by_class_name('title')[0].text
author = books[-1].find_elements_by_class_name('subtitle')[0].text
date = books[-1].find_elements_by_class_name('extra')[0].text
print(name)
print(author)
print(date)
>>
Class Book for the School of Musketry, Hythe
Ernest Christian Wilford
Oct 6, 2019

Now you can iterate over the books list to get the data of all books.

Some data may be absent or structured differently, which raises an error and stops the code, so wrapping the extraction in try and except is very useful.

For demonstration purposes I will run over only the first 5 items of the list here:

for book in books[:5]:
    name = book.find_elements_by_class_name('title')[0].text
    author = book.find_elements_by_class_name('subtitle')[0].text
    date = book.find_elements_by_class_name('extra')[0].text
    print('name:', name)
    print('author :', author)
    print('date :', date)
    print('_'*100)
>>
name: Lady Patricia
author : Rudolf Besier
date : Oct 10, 2019
____________________________________________________________________
name: El arbol de la ciencia (Spanish)
author : Pío Baroja
date : Oct 10, 2019
____________________________________________________________________
name: Frank Merriwell's Diamond Foes
author : Burt L. Standish
date : Oct 9, 2019
____________________________________________________________________
name: Conservation
author : Charles L. Fontenay
date : Oct 9, 2019
____________________________________________________________________
name: Teddy and the Mystery Deer
author : Howard Roger Garis
date : Oct 8, 2019
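
A minimal sketch of the try and except handling mentioned in this section, so that a book with a missing field yields None instead of stopping the loop. It assumes each book element supports the generic find_elements(by, value) call with the locator string "class name" (the value of Selenium's By.CLASS_NAME). first_text and extract_book are hypothetical helper names.

```python
CLASS_NAME = "class name"  # same value as selenium's By.CLASS_NAME


def first_text(element, class_name):
    """Return the text of the first child with this class, or None if absent."""
    try:
        return element.find_elements(CLASS_NAME, class_name)[0].text
    except IndexError:
        # No child with this class: the field is missing for this book.
        return None


def extract_book(book):
    """Collect name, author and release date of one 'booklink' element."""
    return {
        "name": first_text(book, "title"),
        "author": first_text(book, "subtitle"),
        "date": first_text(book, "extra"),
    }


# Usage with a real driver:
# rows = [extract_book(book) for book in books]
```

Catching only IndexError keeps unrelated problems, such as a closed browser, visible instead of silently hiding them.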

You can store this data in a CSV file or any other format.
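
One possible way to write the extracted rows to CSV uses Python's standard csv module. The file name books.csv, the function save_books and the dictionary layout are assumptions made for this sketch; the two sample rows are taken from the output shown above.

```python
import csv

FIELDS = ["name", "author", "date"]


def save_books(rows, path):
    """Write a list of {'name', 'author', 'date'} dicts to a CSV file."""
    # newline="" stops the csv module from adding blank lines on Windows.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


rows = [
    {"name": "Lady Patricia", "author": "Rudolf Besier", "date": "Oct 10, 2019"},
    {"name": "Conservation", "author": "Charles L. Fontenay", "date": "Oct 9, 2019"},
]
save_books(rows, "books.csv")
```

The dates contain commas, so the csv module quotes them automatically when writing.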

5.5 Go to the next pages

Find the element of the Next page button.

You can see that the Next button sits inside an element of class 'statusline', in an <a> tag, which is a link. This link leads to the next page, so we will use the element.click() method to go to the next page.

See the elements of class 'statusline':

driver.find_elements_by_class_name('statusline')
>>
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="bd690ebc-acca-4ce3-9601-8479e4097fcd")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="0ee2ee11-de6f-4a89-95ef-676b9f02bcce")>]

This list contains two elements; check both with .text:

driver.find_elements_by_class_name('statusline')[0].text
>>'Displaying results 1–25 | Next'

driver.find_elements_by_class_name('statusline')[1].text
>>'Displaying results 1–25 | Next'

Both elements hold the same data, so we can use either of them.

As you saw, the Next button link is in an <a> tag, so we can find that element using the tag name 'a':

statusline = driver.find_elements_by_class_name('statusline')[0]
next_button = statusline.find_elements_by_tag_name('a')
next_button
>>
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="d827fb4b-c0c6-40a2-a0cf-3c46370a0360")>]

you can check text of that element

print(next_button[0].text)
>>
Next

Now we have to click this button. next_button is a list, so we click its first element:

next_button[0].click()

After running this code, your browser will open the next page.

But on the next page there are more links in the 'statusline' class, so when you find next_button again and run

print(next_button[0].text)
>>
First

your button will be First instead of Next.

When you check the elements of next_button, there are 3 of them:

statusline = driver.find_elements_by_class_name('statusline')[0]
next_button = statusline.find_elements_by_tag_name('a')
next_button
>>
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="1883710a-209f-413a-ac2c-93cb18830e3d")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="4df3709e-20a3-4041-a040-9f97c3e07d8e")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="14e60fca-8feb-4c00-a555-bd4692a126af")>]

check first and last one

next_button[0].text
>> 'First'
next_button[-1].text
>>'Next'

hence use next_button[-1].click() instead of next_button[0].click()

On the new page you can repeat the same process as on the previous page, or we can loop over the pages to extract the data. In this case we don't know how many pages there are, so we can use a while loop.

The following code extracts data from 5 pages: it collects the data from one page, clicks Next, collects the data of the next page, and repeats this process 5 times.

As before, wrapping the extraction in try and except is useful here, because some data may be absent or structured differently and would otherwise stop the loop with an error.

For demonstration purposes I have extracted the data of only 2 books from each page.

count = 0
while True:
    if count==5:
        break
    count +=1
    print('page ',count)
    books = driver.find_elements_by_class_name('booklink')
    
    for book in books[:2]:
        name = book.find_elements_by_class_name('title')[0].text
        author = book.find_elements_by_class_name('subtitle')[0].text
        date = book.find_elements_by_class_name('extra')[0].text
        print('name:', name)
        print('author :', author)
        print('date :', date)
        print('_'*100)
        
        
    driver.find_elements_by_class_name('statusline')[0].find_elements_by_tag_name('a')[-1].click()
    print('|'*100)
>>
page  1
name: Lady Patricia
author : Rudolf Besier
date : Oct 10, 2019
____________________________________________________________________
name: El arbol de la ciencia (Spanish)
author : Pío Baroja
date : Oct 10, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page  2
name: London Labour and the London Poor (Vol. 2 of 4)
author : Henry Mayhew
date : Oct 6, 2019
____________________________________________________________________
name: Kotkat (Finnish)
author : Hilja Haahti
date : Oct 6, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page  3
name: When William IV. Was King
author : John Ashton
date : Oct 3, 2019
____________________________________________________________________
name: The Viking's Skull
author : John R. Carling
date : Oct 3, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page  4
name: History of the Peninsular War, Volume 5 (of 6)
author : Robert Southey
date : Sep 30, 2019
____________________________________________________________________
name: History of the Peninsular War, Volume 4 (of 6)
author : Robert Southey
date : Sep 30, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page  5
name: The Collected Writings of Dougal Graham, "Skellat" Bellman of Glasgow, Vol. 1 of 2
author : Dougal Graham
date : Sep 26, 2019
____________________________________________________________________
name: The German Fury in Belgium
author : L. Mokveld
date : Sep 26, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
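
Since the number of pages is not known in advance, a variation on the loop above is to stop as soon as no link labelled Next is left, with a page limit as a safety net. This is a sketch under assumptions: it selects the link by its text rather than by position, it uses the generic find_elements(by, value) call with the locator strings "class name" and "tag name" (the values of Selenium's By.CLASS_NAME and By.TAG_NAME), and find_next_link and scrape_pages are hypothetical names.

```python
CLASS_NAME = "class name"  # same value as selenium's By.CLASS_NAME
TAG_NAME = "tag name"      # same value as selenium's By.TAG_NAME


def find_next_link(driver):
    """Return the 'Next' link of the first status line, or None if there is none."""
    status_lines = driver.find_elements(CLASS_NAME, "statusline")
    if not status_lines:
        return None
    for link in status_lines[0].find_elements(TAG_NAME, "a"):
        if link.text.strip() == "Next":
            return link
    return None


def scrape_pages(driver, max_pages=5):
    """Collect the text of every book on up to `max_pages` pages."""
    collected = []
    for _ in range(max_pages):
        for book in driver.find_elements(CLASS_NAME, "booklink"):
            collected.append(book.text)
        next_link = find_next_link(driver)
        if next_link is None:  # no Next link: this was the last page
            break
        next_link.click()
    return collected


# Usage with a real driver:
# all_books = scrape_pages(driver, max_pages=5)
```

Matching on the link text avoids clicking a different link on a page where Next is not the last one in the status line.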

Now you have finally extracted the data 😀
One additional thing: once your code works, the visible browser window is no longer needed. You can collect the data without it by running the browser in headless mode; to do so, replace the earlier driver setup code with one of the following.

Headless Firefox

Headless Chrome
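
Both headless variants can be sketched with one helper. It assumes the matching driver executable (geckodriver or chromedriver) can be found on your PATH, and it passes the headless command-line flag through Selenium's options objects. HEADLESS_FLAG and start_headless are hypothetical names introduced for this sketch.

```python
# Command-line flags that ask each browser to run without a window.
HEADLESS_FLAG = {"firefox": "-headless", "chrome": "--headless"}


def start_headless(browser="firefox"):
    """Start Firefox or Chrome without a visible window and return the driver."""
    if browser not in HEADLESS_FLAG:
        raise ValueError(f"unsupported browser: {browser!r}")

    # Imported here so nothing touches Selenium until a browser is requested.
    from selenium import webdriver

    if browser == "firefox":
        options = webdriver.FirefoxOptions()
        options.add_argument(HEADLESS_FLAG[browser])
        return webdriver.Firefox(options=options)

    options = webdriver.ChromeOptions()
    options.add_argument(HEADLESS_FLAG[browser])
    return webdriver.Chrome(options=options)


# Usage (no window appears, but the scraping code stays the same):
# driver = start_headless("firefox")
```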

In this case the browser runs in the background without opening a window, which is very helpful.

This tutorial ends here. Though it is very simple, I hope you have learned important things 🙂

6.0 Next Tutorial

You can read next tutorial here
