Getting started with Playwright | Scraping a basic webpage with Playwright in Python.

Animesh Singh
3 min read · Apr 27, 2022


Playwright is a fairly new web testing tool from Microsoft that lets users automate web pages more efficiently, with fewer initial requirements than the long-established Selenium. Although Playwright is significantly better than Selenium in terms of speed, usability and reliability, it is still young and has less mature support in areas such as community, browsers, real devices, language options, and integrations.

Overview:

After reading this article, you will be able to:

  • Install and set up Playwright
  • Automate a webpage and extract the text from a specific class
  • Click buttons and fill out basic HTML forms.

Creating a Python virtual environment:

It is always advisable to work in a separate virtual environment, especially when your project depends on a particular library. I will create a virtual environment named “venv” and activate it.

Creating a virtual environment “venv”.

virtualenv venv

Activating it (on Windows):

venv/Scripts/activate
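
On macOS or Linux the activation script lives in venv/bin instead, so the equivalent command is:

source venv/bin/activate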

Installing and setting up Playwright:

Installing Playwright is a fairly easy two-step job with pip. It will take some time, as it also downloads the Chromium, Firefox and WebKit (Safari) browsers.

pip install playwright
playwright install
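
If you only need Chromium (which is all this article uses), you can skip the Firefox and WebKit downloads by naming the browser explicitly:

playwright install chromium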

Automating and scraping data from a webpage:

With the Playwright library installed, it's time to write some code to automate a web page. For this article, I am using quotes.toscrape.com.

First, we will import some necessary packages and set up the main function.

from playwright.sync_api import sync_playwright

def main():
    pass

if __name__ == '__main__':
    main()

Now we will write all our code inside the ‘main’ function. This will make the script a lot easier to read and debug.

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headless=False opens a visible browser window
    page = browser.new_page()
    page.goto('https://quotes.toscrape.com/')
    page.wait_for_timeout(10000)  # pause for 10 seconds so you can see the page
    browser.close()

This code will open the above webpage, wait for 10,000 milliseconds (10 seconds) and then close the browser.
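A fixed timeout is fine for a demo, but if you would rather wait only as long as necessary, Playwright's wait_for_selector lets you block until a specific element appears. A minimal sketch that could stand in for the wait_for_timeout line above:

# wait until at least one element with the 'quote' class is present on the page
page.wait_for_selector('.quote')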

Scraping data:

Now it's time to write the code to scrape two text fields: the quotes and their authors.

all_quotes = page.query_selector_all('.quote')  # every quote box on the page
for quote in all_quotes:
    text = quote.query_selector('.text').inner_text()
    author = quote.query_selector('.author').inner_text()
    print({'Author': author, 'Quote': text})
page.wait_for_timeout(10000)
browser.close()

The above code selects all boxes with the ‘quote’ class. With a for loop we iterate through the elements and extract each quote and its author's name. It always makes sense to use a Python dictionary to store different data fields as key-value pairs. Here we simply print each dictionary to the terminal.

Complete code to scrape quotes and their authors:

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/')
        all_quotes = page.query_selector_all('.quote')
        for quote in all_quotes:
            text = quote.query_selector('.text').inner_text()
            author = quote.query_selector('.author').inner_text()
            print({'Author': author, 'Quote': text})
        page.wait_for_timeout(10000)
        browser.close()

if __name__ == '__main__':
    main()
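
Printing to the terminal is enough for a demo, but in practice you will usually want to keep the results. Below is a minimal sketch, assuming the same all_quotes list as above, that collects the fields into a list of dictionaries and writes them to a CSV file with Python's standard csv module (the quotes.csv filename is just an example); it would replace the print loop inside main().

import csv

rows = []
for quote in all_quotes:
    rows.append({
        'Author': quote.query_selector('.author').inner_text(),
        'Quote': quote.query_selector('.text').inner_text(),
    })

# write the collected rows to a CSV file with a header line
with open('quotes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['Author', 'Quote'])
    writer.writeheader()
    writer.writerows(rows)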

Clicking a link:

We will now try to click a link. On this website you will see a list of tags on the right side. Let's see how we can click one of those tags, which are simply anchor elements with an ‘href’ attribute.

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/')
        love_tag = page.query_selector("a[href='/tag/love/']")  # the 'love' tag link in the sidebar
        love_tag.click()
        page.wait_for_selector('.quote')  # wait for the filtered page to load before scraping
        all_quotes = page.query_selector_all('.quote')
        for quote in all_quotes:
            text = quote.query_selector('.text').inner_text()
            author = quote.query_selector('.author').inner_text()
            print({'Author': author, 'Quote': text})
        page.wait_for_timeout(10000)
        browser.close()

if __name__ == '__main__':
    main()

The above code is similar to the previous one. We just added a few lines to select the ‘love’ tag link, click it, and wait for the filtered quotes to load before scraping.
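
The overview also promised filling out a basic HTML form. quotes.toscrape.com has a simple login form at /login, so here is a minimal sketch using page.fill and page.click. The #username and #password selectors and the dummy credentials are assumptions; check the form's HTML before relying on them.

from playwright.sync_api import sync_playwright

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto('https://quotes.toscrape.com/login')
        # the #username / #password ids are assumed from the form's markup
        page.fill('#username', 'test_user')
        page.fill('#password', 'test_password')
        page.click('input[type="submit"]')  # submit the form
        page.wait_for_timeout(5000)
        browser.close()

if __name__ == '__main__':
    main()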

This is all you have to do to navigate through a webpage and scrape the data you want. There is a lot more you can do with Playwright. Now that you have the basic idea, I advise you to try automating more complex websites.

Thanks and goodbye for now. And yes, follow if you want more exciting and fresh stories from me.
