Pyppeteer, the snake charmer
Or how to remotely control a browser from python.
You can also read this article in Spanish here.
After years of developing software, one of the tasks I enjoy most when joining a new project is investigating the possible solutions to use. Thanks to the enormous amount of free software available today, doing so can help you find the most appropriate approach to a problem and sometimes, with luck, the solution itself. And even when you are not that lucky, along the way you will find utilities, libraries, and software that may prove useful in the future, or at least interesting.
That is how I came across pyppeteer, a port of puppeteer to Python, while looking for a way to satisfy part of the requirements of one of the latest projects I have worked on at Commite Inc., which consists of extracting and analyzing data from different web pages and apps.
The truth is that this is hardly unexplored territory in web development, especially with the languages we usually use in the Commite stack: Python and JavaScript. Both have a massive number of projects and libraries related to web scraping and web application testing, so many that it becomes difficult to decide which ones to use.
But back to the requirements of the project, and in particular to extracting information from different websites: some are very dynamic, some are built with React, others with Angular, and others have parts written in JavaScript with our old friend jQuery. A priori, none of this should be a problem. The problem is that ‘the web’ today is quite complicated, and the data we need may only appear after interacting with the interface, for example by clicking a button that triggers an AJAX call to the server. A section may only show up when the cursor hovers over an element or, worse, the whole structure of the page may be dynamic, as in an SPA.
What if we could directly extract the information from a browser and manage it in an automated way? What if we could control the pointer or simulate the entry of data by keyboard?
The solution: a browser! Enter pyppeteer
Pyppeteer, written in Python, is a port of puppeteer, a JavaScript library for controlling and automating Chrome/Chromium developed by Google. It is a modern snake charmer for our browser. Pyppeteer gives us almost total control over Chromium/Chrome: opening tabs, analyzing the DOM in real time, executing JavaScript, connecting to a running browser, and even downloading a Chromium build.
Until relatively recently, using a browser for this kind of task required projects such as PhantomJS or "trimmed-down" browsers, usually built from the Chromium project code. With the addition of "headless" modes to Firefox and Chrome, even that is no longer necessary. Headless mode renders and parses a web page without the user interface, producing the same result as the traditional mode. This means browsers can be run remotely on a server, without a desktop environment, and even inside a Docker container.
What are the available alternatives?
The idea of controlling a browser goes back to the venerable Selenium. Without going into too much detail, Selenium is a set of technologies for controlling a browser remotely, and for quite some time it has been the de facto standard for the task. Developed in Java, it works with practically any browser and has client libraries for almost any language. Meanwhile, the W3C is in the process of standardizing WebDriver, a protocol for the remote control of browsers, with GeckoDriver and ChromeDriver as its respective implementations for Firefox and Chrome.
- In particular, Firefox has Marionette, which is quite simple to use and decently documented. In fact, it was my initial choice for the project. However, it has some drawbacks: at the moment the client library only supports Python 2.7 (come on, Mozilla!) because of its base dependencies, and it is not asynchronous, so it feels a bit awkward to work with.
- In the case of Chromium, there is the DevTools Protocol as a low-level communication protocol, offering a lot of functionality, and on top of it the better-known Puppeteer in JavaScript, widely used, well documented, and used as a base for other libraries.
- And on the Python side there is, of course, Scrapy, and I have also found this little gem, http://html.python-requests.org/, from the creator of requests and pipenv among others (digging into its code was how I discovered pyppeteer; see the sketch after this list).
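As a taste of that last library, requests-html, here is a minimal sketch (the URL is just an example) of how it can fetch a page and even render its JavaScript, using pyppeteer under the hood:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://example.com")
r.html.render()  # renders JavaScript; downloads Chromium on first use
print(r.html.find("title", first=True).text)
print(r.html.links)  # all links found in the page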
Now, if you already know what scraping is and want to see pyppeteer in action, you can skip straight to the tutorial after the next section.
A brief introduction to web scraping
For those who don't know what web scraping is, here is a little demonstration.
The basic idea is to download the HTML ‘document’, as a browser would, and extract the information we want from it. What we obtain will be a more complicated version of the following scheme.
<html>
  <head>
    <title>PAGE TITLE</title>
    ...
  </head>
  <body>
    <div>
      <a href='http://example.com'>A LINK</a>
    </div>
    ...
  </body>
</html>
The next step is to “parse” the document in order to analyze the different elements of its structure, distinguish them, and keep only the ones we are interested in. Using the previous scheme, for example, we could extract the title of the page, “PAGE TITLE”, or the “href” attribute of the link element, “http://example.com”.
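As a minimal sketch of that idea (using lxml, which the example below also relies on), parsing the scheme above and pulling out those two pieces could look like this:

import lxml.html

html = """<html><head><title>PAGE TITLE</title></head>
<body><div><a href='http://example.com'>A LINK</a></div></body></html>"""

tree = lxml.html.fromstring(html)
print(tree.findtext(".//title"))    # PAGE TITLE
print(tree.xpath("//a/@href")[0])   # http://example.com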
Let's use Python to extract some information from Wikipedia. We'll request the pages of a few programming languages and obtain the data from their summary (infobox) tables.
languages = {
    "python": "https://es.wikipedia.org/wiki/Python",
    ...
}

result = {}
for name, url in languages.items():
    response = get_page(url)
    document = read_document(response)
    result.update({name: extract_data(document)})
This is the central part of the program. First, we have a dictionary with the URLs of the target pages. For each of them, we request the page with get_page(url), which returns the server's response, and then read that response with read_document(response), which returns the document ready to be parsed.
from urllib import request


def get_page(url):
    return request.urlopen(url)


def read_document(response):
    return response.read()
Now, with the extract_data() function, we parse the document and extract the interesting information.
import lxml.html


def extract_data(document):
    # Generate document tree
    tree = lxml.html.fromstring(document)
    # Select tr with a th and td descendant from table
    elements = tree.xpath('//table[@class="infobox"]/tr[th and td]')
    # Extract data
    result = {}
    for element in elements:
        th, td = element.iterchildren()
        result.update({
            th.text_content(): td.text_content()
        })
    return result
With lxml.html.fromstring() we parse the document and obtain an element tree. With XPath we select the tr nodes of the table that have both a th and a td node as children, and from those we obtain the text they contain with the text_content() method. The data extracted for each of the URLs will look something like the following:
...
'python': {'Apareció en': '1991',
'Dialectos': 'Stackless Python, RPython',
'Diseñado por': 'Guido van Rossum',
'Extensiones comunes': '.py, .pyc, .pyd, .pyo, .pyw',
'Ha influido a': 'Boo, Cobra, D, Falcon, Genie, Groovy, Ruby, '
'JavaScript, Cython, Go',
'Implementaciones': 'CPython, IronPython, Jython, Python for S60, '
'PyPy, Pygame, ActivePython, Unladen Swallow',
'Influido por': 'ABC, ALGOL 68, C, Haskell, Icon, Lisp, Modula-3, '
'Perl, Smalltalk, Java',
'Licencia': 'Python Software Foundation License',
'Paradigma': 'Multiparadigma: orientado a objetos
...
Here is the full script.
Pyppeteer
Let's see how to install and use pyppeteer to do the same Wikipedia scraping. First, we will create a Python virtualenv with pipenv and install the library.
$ pipenv --three
$ pipenv shell
$ pipenv install pyppeteer
With this we have the basics to start using pyppeteer. Note that the first time it runs (unless you specify a Chrome/Chromium executable path), the library will download a Chromium build of approximately 100 MB.
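If you would rather use a browser you already have installed, launch() accepts the path of the executable; a minimal sketch, where the path is only an example for your own system:

import asyncio
from pyppeteer import launch


async def get_local_browser():
    # executablePath is an assumption: point it at your local Chrome/Chromium binary
    return await launch({"executablePath": "/usr/bin/chromium-browser"})

browser = asyncio.get_event_loop().run_until_complete(get_local_browser())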
import pprint
import asyncio
from pyppeteer import launch


async def get_browser():
    return await launch({"headless": False})

...

async def extract_all(languages):
    browser = await get_browser()
    result = {}
    for name, url in languages.items():
        result.update(await extract(browser, name, url))
    return result


if __name__ == "__main__":
    languages = {
        "python": "https://es.wikipedia.org/wiki/Python",
        ...
    }
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(extract_all(languages))
    pprint.pprint(result)
This is the skeleton of our program, very similar to the previous one except for the use of asyncio and the async/await syntax. The extract_all(languages) function is the entry point of the application: it receives the target URL dictionary and invokes the get_browser() function, which launches a browser. Since we pass the parameter {'headless': False} to launch, we can watch the browser start and load the URLs automatically.
The next step is to iterate over the URL dictionary invoking the extract function, passing it the URL; extract in turn invokes get_page, which opens a new tab in the browser and loads the URL.
async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page


async def extract(browser, name, url):
    page = await get_page(browser, url)
    return {name: await extract_data(page)}
Finally, extract_data performs the data extraction. We use the XPath selector //table[@class="infobox"]/tbody/tr[th and td] to select the tr nodes that are descendants of the table and have both a th and a td child. For each of them, we extract the text of the node. And here comes the strangest part, the one purists will probably like the least: to extract the text we pass a function written in JavaScript that is executed in the browser, and its result is returned to Python.
async def extract_data(page):
    # Select tr with a th and td descendant from table
    elements = await page.xpath(
        '//table[@class="infobox"]/tbody/tr[th and td]')
    # Extract data
    result = {}
    for element in elements:
        title, content = await page.evaluate(
            '''(element) =>
                [...element.children].map(child => child.textContent)''',
            element)
        result.update({title: content})
    return result
The result will be exactly the same as in the previous section. Here is an excerpt:
...
'python': {'Apareció en': '1991',
'Dialectos': 'Stackless Python, RPython',
'Diseñado por': 'Guido van Rossum',
'Extensiones comunes': '.py, .pyc, .pyd, .pyo, .pyw',
'Ha influido a': 'Boo, Cobra, D, Falcon, Genie, Groovy, Ruby, '
'JavaScript, Cython, Go',
'Implementaciones': 'CPython, IronPython, Jython, Python for S60, '
'PyPy, Pygame, ActivePython, Unladen Swallow',
'Influido por': 'ABC, ALGOL 68, C, Haskell, Icon, Lisp, Modula-3, '
'Perl, Smalltalk, Java',
'Licencia': 'Python Software Foundation License',
'Paradigma': 'Multiparadigma: orientado a objetos
...
And now the full script:
Scraping something more complex
We can see the real potential of the library when extracting data from a dynamic page. For this, I asked the developers of coinmarketcap.io for permission to use it as our scraping target. Thank you very much, and congratulations on the excellent app!
Coinmarketcap is an SPA: once it is loaded in the browser, the different features of the application add, modify, and remove DOM nodes depending on the user's interactions, so downloading a copy of the HTML and parsing it will not get us very far.
The objective of our scraper will be to open the detail view of the first 30 cryptocurrencies ordered by total market capitalization and obtain their data for the last 24 hours, in euros.
Hands on: we will have a scrape_cmc_io() entry point function that executes the different tasks and collects the information obtained, get_browser, which launches the Chromium browser, and get_page, which loads the app into a new tab.
import asyncio
from pyppeteer import launch


async def get_browser():
    return await launch()


async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page

...

async def scrape_cmc_io(url):
    browser = await get_browser()
    page = await get_page(browser, url)
    await create_account(page)
    await select_top30(page)
    await add_eur(page)
    currencies_data = await navigate_top30_detail(page)
    show_biggest_24h_winners(currencies_data)

...

if __name__ == "__main__":
    url = "http://coinmarketcap.io"
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(scrape_cmc_io(url))
The first thing we see when accessing the app for the first time is a prompt to create an account and an invitation to log in. Let's go there: we create an account by calling create_account, which, as we can see, is just a click. We click on the button using the .click(selector) method, passing the id of the button as the selector.
async def create_account(page):
    # Click on create account to access app
    selector = "#createAccountBt"
    await page.click(selector)
After this, we reach the main screen. By default the amounts appear in dollars, but our requirement was to extract the information in euros. To see euros, we have to open the currency search, find the one we want, add it, and select it as the currency used to display amounts.
To do this, we create the add_eur function, which selects the different elements and clicks them. The novelty is that we enter text in the search box with the page.type(selector, 'eur') method, which simulates keyboard input.
Another peculiarity is the use of the .waitForSelector method, which waits a configurable number of milliseconds for the desired node to appear in the DOM; if it does not, an exception is thrown (see the sketch after the code below).
async def add_eur(page):
    # Select EUR fiat currency for the whole app
    selector_currency = "#nHp_currencyBt"
    await page.click(selector_currency)
    selector_add_currency = "#currencyAddBt"
    await page.click(selector_add_currency)
    selector_search = "input#addCurrencySearchTf"
    await page.type(selector_search, 'eur')
    selector_euro = "#addCurrencySearchResults > #add_currency_EUR"
    await page.waitForSelector(selector_euro)
    selector_euro_add = "#add_currency_EUR > .addRemCurrencyBt"
    await page.click(selector_euro_add)
    selector_use_euro = "#currencyBox > div[data-symbol='EUR']"
    await page.click(selector_use_euro)
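Regarding that exception: as a minimal, hedged sketch (the 5000 ms timeout is just an example), waitForSelector accepts a timeout option and raises pyppeteer's TimeoutError when the node never shows up, which we could handle like this:

from pyppeteer.errors import TimeoutError


async def wait_for_euro(page):
    selector_euro = "#addCurrencySearchResults > #add_currency_EUR"
    try:
        # Wait at most 5 seconds for the node to appear in the DOM
        await page.waitForSelector(selector_euro, {"timeout": 5000})
    except TimeoutError:
        print("The EUR entry never appeared in the search results")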
The next requirement is to collect the information of the first 30 coins. By default the application shows 25, so we have to open the corresponding menu and select the desired number.
This task is carried out by the select_top30 function, which behaves much like the previous ones: click on the desired selector, wait, and click again.
async def select_top30(page):
    # Show top 30 currencies by market capitalization
    selector_top_list = "#navSubTop"
    await page.waitForSelector(selector_top_list)
    await page.click(selector_top_list)
    selector_top_30 = ".setCoinLimitBt[data-v='30']"
    await page.click(selector_top_30)
Now we only have to open the detail view of each coin and extract the information. To open it, we select the container node of the coin list and iterate over its children, clicking on each one.
async def navigate_top30_detail(page):
    # Iterate over the displayed currencies and extract data
    select_all_displayed_currencies = "#fullCoinList > [data-arr-nr]"
    select_currency = "#fullCoinList > [data-arr-nr='{}'] .L1S1"
    currencies = await page.querySelectorAll(select_all_displayed_currencies)
    total = len(currencies)
    datas = []
    for num in range(total):
        currency = await page.querySelectorEval(
            select_currency.format(num),
            "(elem) => elem.scrollIntoView()"
        )
        currency = await page.querySelector(select_currency.format(num))
        datas.append(await extract_currency(page, currency))
    return datas
For the browser to be able to click on a node, it has to be visible in the “viewport”, the part of the page we would actually see, so we execute a JavaScript function that scrolls as we advance through the nodes.
currency = await page.querySelectorEval(
    select_currency.format(num),
    "(elem) => elem.scrollIntoView()"
)
As a curiosity, while developing the program I discovered that when a coin sat underneath the advertising banner, the click landed on the banner instead.
Within the detail view, we extract the information following the same pattern as before: select the desired node and act on it, this time in the extract_currency function. We also clean the data, stripping the unwanted currency symbol, converting the amounts into numbers, and removing leftover spaces and line breaks. From each currency's detail we extract the name, the symbol, the current price, the price variation over the last 24 hours, the percentage change over 24 hours, and the position in the ranking by total capitalization.
async def extract_currency(page, currency):
    # Extract currency symbol
    symbol = await page.evaluate(
        "currency => currency.textContent",
        currency
    )
    symbol = symbol.strip()
    # Click on current currency
    await currency.click()
    selector_name = ".popUpItTitle"
    await page.waitForSelector(selector_name)
    # Extract currency name
    name = await page.querySelectorEval(
        selector_name,
        "elem => elem.textContent"
    )
    name = name.strip()
    # Extract currency actual price
    selector_price = "#highLowBox"
    price = await page.querySelectorEval(
        selector_price,
        "elem => elem.textContent"
    )
    _price = [
        line.strip() for line in price.splitlines() if len(line.strip())]
    price = parse_number(_price[1])
    # Extract currency 24h difference and percentage
    selector_24h = "#profitLossBox"
    price_24h = await page.querySelectorEval(
        selector_24h,
        "elem => elem.textContent"
    )
    _price_24h = [
        line.strip() for line in price_24h.splitlines() if len(line.strip())]
    perce_24h = parse_number(_price_24h[6])
    price_24h = parse_number(_price_24h[-2])
    # Extract currency capitalization rank
    selector_rank = "#profitLossBox ~ div.BG2.BOR_down"
    rank = await page.querySelectorEval(
        selector_rank,
        "elem => elem.textContent"
    )
    rank = int(rank.strip("Rank"))
    selector_close = ".popUpItCloseBt"
    await page.click(selector_close)
    return {
        "name": name,
        "symbol": symbol,
        "price": price,
        "price24h": price_24h,
        "percentage24h": perce_24h,
        "rank": rank
    }
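The parse_number helper is not shown in this excerpt; a minimal sketch of what such a function could look like, assuming European-formatted amounts such as '€ 1.234,56' or '-0,52 %':

def parse_number(text):
    # Drop currency symbols, percent signs and surrounding whitespace,
    # then normalize a '1.234,56' style amount into a float
    cleaned = text.strip().strip("€$%").strip()
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)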
To finish, and as the icing on the cake, we use the terminaltables and colorclass Python packages to improve the output in the terminal: coins that have gone down in the last 24 hours are shown in red, and those that have not are shown in green.
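The show_biggest_24h_winners function called from the entry point is likewise only in the complete program; a minimal sketch of how it might be written with those two packages (the column choice is an assumption):

from colorclass import Color
from terminaltables import AsciiTable


def show_biggest_24h_winners(currencies_data):
    # Sort by 24h percentage change and build a colored table
    rows = [["Rank", "Name", "Symbol", "Price", "24h %"]]
    ordered = sorted(
        currencies_data, key=lambda c: c["percentage24h"], reverse=True)
    for c in ordered:
        color = "green" if c["percentage24h"] >= 0 else "red"
        rows.append([
            c["rank"], c["name"], c["symbol"], c["price"],
            Color("{%s}%s{/%s}" % (color, c["percentage24h"], color)),
        ])
    print(AsciiTable(rows).table)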
Complete program.
Conclusion
Pyppeteer lets you control a modern browser from Python code with a relatively simple, high-level API, and it can become an alternative to traditional Selenium. The author's goal is to emulate the puppeteer API completely. It is actively developed: at the time of writing, the library has just reached version 0.0.17 and, although it is marked as 'alpha', it is stable enough to be used.
I would like to thank the author of pyppeteer for his dedication and the time spent developing the library, the coinmarketcap.io team for letting me use their application for the tutorial, and Commite for allowing me to improve my asyncio knowledge while writing this article.