Web Scraping Using Selenium with Python
Content:
- What is web scraping?
- What is Selenium Python?
- Setup and tools
- Introduction to Selenium
- Basic: Data extraction from http://www.gutenberg.org
- Next Tutorial
1.0 What is Web Scraping?
Web scraping is the extraction of publicly available, unstructured data from webpages into a structured form.
2.0 What is Selenium Python?
Selenium is an open-source tool for automating web browsers. It can be driven from Python and several other languages, and is used for testing as well as web scraping. Here we will use Firefox, but you can try any other browser, as the steps are almost the same.
3.0 Setup and tools
3.1 Install Selenium using pip
pip install selenium
3.2 Or install with conda
conda install -c conda-forge selenium
3.3 Download a web driver; you can choose any of these drivers
3.3.1 Follow this link for the Chrome driver
3.3.2 Follow this link for the Firefox driver (geckodriver)
3.3.3 For Edge
3.3.4 For Safari
Throughout these tutorials I will use Firefox, but you are free to choose another browser, such as Chrome, Safari, Microsoft Edge, or Opera.
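4.0 Introduction to Selenium
You can find the full Selenium documentation here
The following methods find elements in a webpage, and each returns a list. (These are the Selenium 3 names; from Selenium 4.3 on they are removed, and the same lookups are written as driver.find_elements(By.NAME, value) with the matching By constant.)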
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
In this tutorial we will use only 'find_elements_by_class_name' and 'find_elements_by_tag_name'; the other methods will appear in upcoming tutorials.
You can find the complete documentation of these methods here
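The two lookups this tutorial relies on can also be written in the find_elements(by, value) form, which both the driver and every element offer. Here is a minimal sketch of that form; find_by_class and find_by_tag are hypothetical helper names of mine, not part of Selenium, and the strategy strings are the values of By.CLASS_NAME and By.TAG_NAME from selenium.webdriver.common.by.

```python
# Strategy strings understood by find_elements(by, value). They are the
# values of By.CLASS_NAME and By.TAG_NAME in selenium.webdriver.common.by.
CLASS_NAME = "class name"
TAG_NAME = "tag name"


def find_by_class(parent, class_name):
    """Hypothetical helper: same lookup as find_elements_by_class_name."""
    return parent.find_elements(CLASS_NAME, class_name)


def find_by_tag(parent, tag_name):
    """Hypothetical helper: same lookup as find_elements_by_tag_name."""
    return parent.find_elements(TAG_NAME, tag_name)
```

Here parent can be the driver itself (search the whole page) or a single element (search only inside it), so find_by_class(driver, 'booklink') would return the same list as the class-name helper used in the rest of this tutorial.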
5.0 Basic: Data extraction from http://www.gutenberg.org
5.1 Setup
Create the Firefox driver by passing webdriver.Firefox the path of the downloaded driver. Here "C:\Users\siddhartha\Downloads\geckodriver-v0.25.0-win64\geckodriver.exe" is the path where the driver was downloaded.
Or you can do the same with Chrome:
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Users\siddhartha\Downloads\chromedriver_win32\chromedriver.exe")
After running this code, a new browser window will open.
5.2 Go to target webpage
'http://www.gutenberg.org/ebooks/search/%3Fsort_order%3Drelease_date' is our target page. Pass this URL to driver.get, and the browser will load the target page.
5.3 Our objective
In this tutorial our objective is to extract data from this page. The page lists book names, their authors, and release dates. We will extract this data for all 25 books on the page, then go to the next page to extract its books' data, and so on.
5.4 Inspect elements
Click on 'Inspect Element' or press F12.
This opens the inspector window at the bottom. You can move it to the right: click on '…' on the right side, then click on 'Dock to Right'.
Click on the inspector's element-selection button, then click an element on the page to inspect it.
Try to inspect the first item, i.e. the first book.
You will see that this item (book) belongs to the class 'booklink', and the other books also belong to this class. This means you can use this class to find our target elements, i.e. the books, with the following code:
books = driver.find_elements_by_class_name('booklink')
len(books)
>> 25
The books list contains the elements of all the books. You can verify what these elements contain: the first item of the list holds the first book's data, and the last item holds the last book's data.
element.text shows the text within an element:
print(books[0].text)
>>>
Lady Patricia
Rudolf Besier
Oct 10, 2019
print(books[-1].text)
>>>
Class Book for the School of Musketry, Hythe
Ernest Christian Wilford
Oct 6, 2019
Now inspect the name, author, and release date of a book.
We will look at the structure of only one book, since the others share it. We will write code to extract the data of one book, then generalize it to extract the data of all books.
You can see that the name belongs to the class 'title', the author belongs to the class 'subtitle', and the release date belongs to the class 'extra'. Using these class names, we can find these elements within our book element with the following code:
name = books[0].find_elements_by_class_name('title')[0].text
author = books[0].find_elements_by_class_name('subtitle')[0].text
date = books[0].find_elements_by_class_name('extra')[0].text
print(name)
print(author)
print(date)
>>
Lady Patricia
Rudolf Besier
Oct 10, 2019
You can do the same for the last book:
name = books[-1].find_elements_by_class_name('title')[0].text
author = books[-1].find_elements_by_class_name('subtitle')[0].text
date = books[-1].find_elements_by_class_name('extra')[0].text
print(name)
print(author)
print(date)
>>
Class Book for the School of Musketry, Hythe
Ernest Christian Wilford
Oct 6, 2019
Now you can iterate over the books list to get the data of all books.
When you do, wrap the extraction in try and except: some data may be absent or structured differently, which raises an error and stops the code, so this error handling is very useful.
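The error handling described above can be sketched as follows. This is only one way to do it, under my own assumptions: first_text, extract_book and extract_books are hypothetical helper names, and the lookup uses the find_elements(by, value) form, where "class name" is the value of Selenium's By.CLASS_NAME. A missing field becomes None instead of stopping the loop.

```python
def first_text(element, class_name):
    """Return the text of the first child with this class, or None."""
    try:
        # "class name" is the value of Selenium's By.CLASS_NAME.
        return element.find_elements("class name", class_name)[0].text
    except IndexError:
        # No child carries this class, e.g. a book listed without an author.
        return None


def extract_book(book):
    """Turn one 'booklink' element into a plain dictionary."""
    return {
        "name": first_text(book, "title"),
        "author": first_text(book, "subtitle"),
        "date": first_text(book, "extra"),
    }


def extract_books(books):
    """Extract every book element of a page, skipping none."""
    return [extract_book(book) for book in books]
```

With the books list from above, extract_books(books) would give one dictionary per book, which is also a convenient shape for saving the data later.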
For demonstration purposes, I will loop over only the first 5 items of the list here:
for book in books[:5]:
    name = book.find_elements_by_class_name('title')[0].text
    author = book.find_elements_by_class_name('subtitle')[0].text
    date = book.find_elements_by_class_name('extra')[0].text
    print('name:', name)
    print('author :', author)
    print('date :', date)
    print('_'*100)
>>
name: Lady Patricia
author : Rudolf Besier
date : Oct 10, 2019
____________________________________________________________________
name: El arbol de la ciencia (Spanish)
author : Pío Baroja
date : Oct 10, 2019
____________________________________________________________________
name: Frank Merriwell's Diamond Foes
author : Burt L. Standish
date : Oct 9, 2019
____________________________________________________________________
name: Conservation
author : Charles L. Fontenay
date : Oct 9, 2019
____________________________________________________________________
name: Teddy and the Mystery Deer
author : Howard Roger Garis
date : Oct 8, 2019
You can store this data in a CSV file or any other format.
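As a rough sketch of the CSV option, the standard csv module is enough. The function name save_books and the column names name, author and date are my own choices for illustration; each row is assumed to be a dictionary holding those three keys.

```python
import csv


def save_books(rows, path):
    """Write a list of book dictionaries to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself;
    # utf-8 keeps accented author names intact.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "author", "date"])
        writer.writeheader()
        writer.writerows(rows)
```

For example, save_books([{'name': name, 'author': author, 'date': date}], 'books.csv') would write a header line followed by one line for that book.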
5.5 Go to next pages
Find the element of the Next page button.
You can see that this Next button belongs to the class 'statusline', inside an <a> tag, which is a link. This link leads to the next page; we will use the element.click() method to go to the next page.
Look at the elements of class 'statusline':
driver.find_elements_by_class_name('statusline')
>>
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="bd690ebc-acca-4ce3-9601-8479e4097fcd")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="0ee2ee11-de6f-4a89-95ef-676b9f02bcce")>]
This list contains two elements; check both with .text:
driver.find_elements_by_class_name('statusline')[0].text
>>'Displaying results 1–25 | Next'
driver.find_elements_by_class_name('statusline')[1].text
>>'Displaying results 1–25 | Next'
Both elements hold the same data, so we can use either of them.
As you saw, the Next button link is in an <a> tag, so we can find that element using the tag name 'a':
statusline = driver.find_elements_by_class_name('statusline')[0]
next_button = statusline.find_elements_by_tag_name('a')
next_button
>>
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="d827fb4b-c0c6-40a2-a0cf-3c46370a0360")>]
You can check the text of that element:
print(next_button[0].text)
>>
Next
Now we have to click on this button. next_button is a list, so click its first element:
next_button[0].click()
After running this code, your browser will open the next page.
But on the next page there are more buttons in the 'statusline' class, so when you find next_button again and run
print(next_button[0].text)
>>
First
your first button will be First instead of Next.
When you check the elements of next_button, there are 3 elements:
statusline = driver.find_elements_by_class_name('statusline')[0]
next_button = statusline.find_elements_by_tag_name('a')
next_button
>>
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="1883710a-209f-413a-ac2c-93cb18830e3d")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="4df3709e-20a3-4041-a040-9f97c3e07d8e")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="f228f830-9c92-4e5f-97c2-6300cc2962dc", element="14e60fca-8feb-4c00-a555-bd4692a126af")>]
Check the first and the last one:
next_button[0].text
>> 'First'
next_button[-1].text
>>'Next'
Hence use next_button[-1].click() instead of next_button[0].click().
On the new page you can repeat the same process as on the previous page, or loop over the pages to extract the data. Since we don't know how many pages there are, we can use a while loop.
The following code extracts data from 5 pages: it collects the data from one page, clicks Next, collects the data from the next page, and repeats this process 5 times.
For a full run, wrap the extraction in try and except as before, because some data may be absent or structured differently.
For demonstration purposes, I have extracted the data of only 2 books from each page.
count = 0
while True:
    # stop after 5 pages
    if count == 5:
        break
    count += 1
    print('page ', count)
    books = driver.find_elements_by_class_name('booklink')
    for book in books[:2]:
        name = book.find_elements_by_class_name('title')[0].text
        author = book.find_elements_by_class_name('subtitle')[0].text
        date = book.find_elements_by_class_name('extra')[0].text
        print('name:', name)
        print('author :', author)
        print('date :', date)
        print('_'*100)
    # the last link in the status line is the Next button
    driver.find_elements_by_class_name('statusline')[0].find_elements_by_tag_name('a')[-1].click()
    print('|'*100)
>>
page 1
name: Lady Patricia
author : Rudolf Besier
date : Oct 10, 2019
____________________________________________________________________
name: El arbol de la ciencia (Spanish)
author : Pío Baroja
date : Oct 10, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page 2
name: London Labour and the London Poor (Vol. 2 of 4)
author : Henry Mayhew
date : Oct 6, 2019
____________________________________________________________________
name: Kotkat (Finnish)
author : Hilja Haahti
date : Oct 6, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page 3
name: When William IV. Was King
author : John Ashton
date : Oct 3, 2019
____________________________________________________________________
name: The Viking's Skull
author : John R. Carling
date : Oct 3, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page 4
name: History of the Peninsular War, Volume 5 (of 6)
author : Robert Southey
date : Sep 30, 2019
____________________________________________________________________
name: History of the Peninsular War, Volume 4 (of 6)
author : Robert Southey
date : Sep 30, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
page 5
name: The Collected Writings of Dougal Graham, "Skellat" Bellman of Glasgow, Vol. 1 of 2
author : Dougal Graham
date : Sep 26, 2019
____________________________________________________________________
name: The German Fury in Belgium
author : L. Mokveld
date : Sep 26, 2019
____________________________________________________________________
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
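The loop above always clicks the last link in the status line, which assumes a Next link is present on every page. A slightly more defensive sketch looks the link up by its text and stops when it is missing. The names find_next_link and scrape_titles are hypothetical, the lookups use the find_elements(by, value) form with the By.CLASS_NAME and By.TAG_NAME strings, and I am assuming that the last results page simply has no link labelled Next.

```python
def find_next_link(driver):
    """Return the 'Next' link of the first status line, or None if absent."""
    statuslines = driver.find_elements("class name", "statusline")
    if not statuslines:
        return None
    for link in statuslines[0].find_elements("tag name", "a"):
        if link.text.strip() == "Next":
            return link
    return None


def scrape_titles(driver, max_pages=5):
    """Collect book titles page by page, for at most max_pages pages."""
    titles = []
    for _ in range(max_pages):
        for book in driver.find_elements("class name", "booklink"):
            found = book.find_elements("class name", "title")
            if found:  # skip entries without a title element
                titles.append(found[0].text)
        next_link = find_next_link(driver)
        if next_link is None:
            break  # no Next link: assumed to be the last page
        next_link.click()
    return titles
```

Calling scrape_titles(driver, max_pages=5) on the page loaded earlier would walk the same 5 pages as the loop above, but it would end early instead of clicking the wrong link if a page had no Next button.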
Now you have finally extracted the data 😀
One more thing: once your code works, you no longer need a visible browser window. You can collect the data with a headless browser instead. To do so, replace the earlier driver setup with one of the following.
Headless Firefox
Headless Chrome
In this case the browser runs in the background without opening a window, which is very helpful.
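A headless setup for either browser can be sketched like this. It is a sketch under my own assumptions, not the exact code of this tutorial: headless_options is a hypothetical helper name, and it relies on the Options classes that Selenium ships for Firefox and Chrome and on the browsers' own headless command-line flags.

```python
def headless_options(browser="firefox"):
    """Build browser options that start the browser without a window."""
    # Selenium is imported lazily, only for the browser that was asked for.
    if browser == "firefox":
        from selenium.webdriver.firefox.options import Options
        flag = "-headless"
    elif browser == "chrome":
        from selenium.webdriver.chrome.options import Options
        flag = "--headless"
    else:
        raise ValueError("unsupported browser: " + browser)
    options = Options()
    options.add_argument(flag)
    return options
```

Pass the result when creating the driver, for example webdriver.Firefox(options=headless_options('firefox')) or webdriver.Chrome(options=headless_options('chrome')), together with the driver path used in the setup step. The rest of the scraping code stays the same.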
This tutorial ends here. Although it is very simple, I hope you have learned some important things 🙂
6.0 Next Tutorial
You can read next tutorial here