Published in Analytics Vidhya

Web scraping (playstore permission) using pyppeteer

“Get Exactly what you want. 100% satisfaction guaranteed. Crawl and extract data from any website.”

As data geeks, we sometimes need a dataset whose information lives on a web page so that we can analyze it. One answer to this need is web scraping. Web scraping takes some creativity: the goal is a script that retrieves all of the information you need from the website you are interested in.

I usually work with Python to transform and analyze data, so it is the language I am most comfortable scraping with. BeautifulSoup is one of the libraries I use for web scraping when all the information is contained in the page's HTML. Sadly, it only parses the HTML it is given and cannot click buttons or run the page's JavaScript, so it can't cover the case where the information only appears after a button click opens a modal, for example the permission modal on a Play Store app page.

The Crystal Guardian (Play Store app): permission modal

In this article, I will give a simple tutorial on how to scrape the permission modal of a Play Store app (The Crystal Guardian) using pyppeteer (https://github.com/miyakogi/pyppeteer).

Pyppeteer is an unofficial Python port of Puppeteer, a JavaScript library developed by Google for controlling and automating Chrome / Chromium. Pyppeteer gives us almost total control over a Chrome / Chromium browser: analyzing the DOM in real time, opening tabs, connecting to a running browser, executing JavaScript, and downloading Chromium.

Installation

Using pip:

python3 -m pip install pyppeteer

Install the latest version from GitHub:

python3 -m pip install -U git+https://github.com/miyakogi/pyppeteer.git@dev

Library import

import asyncio
from pyppeteer import launch
import pandas as pd

Scraping Function

Scraping function code in Python

This code opens the Play Store app page at the URL we have already defined. It then goes through the elements with the ‘hrTbp’ class, which are the boxes in the Play Store's additional information section, and when an element's text content is ‘View Details’ it clicks that element to open the modal. It waits until the modal is fully open by looking for the title element, and only then runs the rest of the code.
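A minimal sketch of such a function, reconstructed from the description above, might look like this. The ‘hrTbp’ class and the div.miUA7 title selector come from this article (and may differ on the live page); APP_URL, PERMISSION_SELECTOR, the helper names, and the keys of the returned dictionary are assumptions made for illustration.

```python
import asyncio

# Placeholder: put the Play Store URL of the app you want to scrape here.
APP_URL = "https://play.google.com/store/apps/details?id=YOUR_APP_ID"

INFO_BOX_SELECTOR = ".hrTbp"         # additional-information boxes (from the article)
MODAL_TITLE_SELECTOR = "div.miUA7"   # app title inside the modal (from the article)
PERMISSION_SELECTOR = "li.permission-item"  # hypothetical: inspect the modal for the real one


async def text_of(page, element):
    """Return the trimmed textContent of an element handle."""
    text = await page.evaluate("(el) => el.textContent", element)
    return (text or "").strip()


async def scrape_permissions(page):
    """Open the permission modal on an already-loaded app page and read it."""
    # Look through the additional-information boxes for the 'View Details' link.
    for box in await page.querySelectorAll(INFO_BOX_SELECTOR):
        if await text_of(page, box) == "View Details":
            await box.click()
            break
    else:
        raise RuntimeError("'View Details' link not found")

    # The modal counts as fully open once its title element is in the DOM.
    await page.waitForSelector(MODAL_TITLE_SELECTOR)
    title = await page.querySelector(MODAL_TITLE_SELECTOR)

    permissions = [
        await text_of(page, item)
        for item in await page.querySelectorAll(PERMISSION_SELECTOR)
    ]
    return {"app": await text_of(page, title), "permissions": permissions}


async def main(url=APP_URL):
    # Imported here so the helpers above can be exercised without a browser.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await scrape_permissions(page)
    finally:
        # Always close the browser, even if scraping fails.
        await browser.close()
```

Keeping the page logic in scrape_permissions(), separate from the browser start-up in main(), makes it possible to reuse one open page for several steps and to close the browser reliably when something goes wrong.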

The querySelector() and querySelectorAll() functions are part of pyppeteer's Page class and retrieve elements from the page. To get the text of an element we read its textContent property (element.textContent), which is a JavaScript property and is therefore evaluated in the browser. I usually inspect the elements of the page I want to scrape to find the element and class names; for example, div.miUA7 is the “The Crystal Guardian” title element.

HTML element for the app title in the permission modal
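As a complement to querySelector() and querySelectorAll(), pyppeteer's Page class also offers querySelectorEval() and querySelectorAllEval(), which select elements and run a JavaScript function on them in a single call. The sketch below uses them to read textContent; it is a shortcut shown for illustration, not necessarily what the scraping function above does, and the helper names first_text and all_texts are made up for this example.

```python
# JavaScript snippets that run inside the browser.
TEXT_OF_ONE = "(el) => el.textContent"
TEXT_OF_ALL = "(els) => els.map((el) => el.textContent)"


async def first_text(page, selector):
    """Trimmed text of the first element matching the selector."""
    text = await page.querySelectorEval(selector, TEXT_OF_ONE)
    return (text or "").strip()


async def all_texts(page, selector):
    """Trimmed, non-empty texts of every element matching the selector."""
    texts = await page.querySelectorAllEval(selector, TEXT_OF_ALL)
    return [t.strip() for t in texts if t and t.strip()]
```

With an open page, the modal title could then be read with `title = await first_text(page, "div.miUA7")`.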

Call the function using this code and it will return a dictionary. You can also convert the result to a DataFrame afterwards.

result = await main()
Dataframe from the result
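A bare `await main()` like the one above works in a Jupyter notebook; in a plain script the coroutine has to be run through the event loop instead, for example with `asyncio.get_event_loop().run_until_complete(main())`. The conversion to a DataFrame can be sketched as follows. The dictionary here is a made-up stand-in with placeholder values, assuming the result holds one app name and a list of permissions; the real keys depend on how the scraping function builds its result.

```python
import pandas as pd

# Hypothetical result with placeholder values, shaped like a scraped dictionary.
result = {
    "app": "The Crystal Guardian",
    "permissions": ["example permission A", "example permission B"],
}

# The single app name is repeated for every entry of the list,
# so the DataFrame gets one row per permission.
df = pd.DataFrame(result)
print(df)
```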

Conclusion

Web scraping is not hard. It is a creative activity in which you mix and match the right code and dig into a page's HTML to find the elements and classes that hold the information you want. It is really helpful when you need a lot of information and want to automate collecting it. Pyppeteer has an API reference at https://miyakogi.github.io/pyppeteer/reference.html for further information beyond this tutorial. This tutorial only shows how to scrape with one open browser, but pyppeteer can also be combined with pydash.chunk and asyncio.gather to split the work into chunks and scrape in parallel with many browsers open at the same time.
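The chunk-and-gather idea can be sketched roughly as below. To keep the example free of extra dependencies, a small chunk() helper stands in for pydash.chunk, and scrape_one is a hypothetical coroutine that scrapes a single URL (for instance a function that launches a browser, scrapes one app page, and returns a dictionary). Note that asyncio.gather runs the coroutines concurrently on one event loop rather than in separate processes.

```python
import asyncio


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def scrape_all(urls, scrape_one, batch_size=3):
    """Run scrape_one over every URL, at most batch_size at a time."""
    results = []
    for batch in chunk(urls, batch_size):
        # gather runs one batch concurrently and returns results in input order.
        results.extend(await asyncio.gather(*(scrape_one(url) for url in batch)))
    return results
```

Working in batches keeps the number of browsers open at the same time bounded by batch_size, which matters because every Chromium instance uses a noticeable amount of memory.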

I would like to thank the pyppeteer author for making this kind of scraping possible in Python, and my junior for permitting me to use their app (The Crystal Guardian) as an example.
