Web scraping Google Play Store reviews

So, what is the easiest way to scrape reviews for an app from the Google Play Store?

Figure 1: Screenshot of app reviews in Google Play store.

By the end of this article, you will be able to extract these reviews as a CSV file, as shown in Figure 2.

Figure 2: Screenshot of app reviews in Google Play store extracted as CSV file.

Option 1: Hire a fully managed web scraping service.

You can contact us for our fully managed web scraping service to get Google Play Store app review data as a CSV or Excel file without dealing with any coding.

Our pricing starts at $99 for fully managed Google Play store scraping.

You can simply sit back and let us handle all the complexities of scraping a site like Google, which has plenty of anti-scraping protections built in to dissuade people from scraping it in bulk.

We can also create a REST API endpoint for you if you want structured data on demand.

Option 2: Scrape the Google Play Store on your own

We will use a browser automation library called Selenium to extract the reviews for a particular app in the Play Store.

Selenium has bindings available in all major programming languages, so you can use whichever language you like; we will use Python here.

```python
# Using Selenium to extract Google Play store reviews
from selenium import webdriver
from bs4 import BeautifulSoup

test_url = 'https://play.google.com/store/apps/details?id=itsolutionever.karanponda.dmbi'

option = webdriver.ChromeOptions()
option.add_argument("--incognito")

chromedriver = r'chromedriver.exe'
browser = webdriver.Chrome(chromedriver, options=option)
browser.get(test_url)
html_source = browser.page_source
browser.close()
```

Using BeautifulSoup to extract Google Play reviews

Once we have the raw HTML source, we can use a Python library called BeautifulSoup to parse it.
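The pattern is the same for every field we extract: find all tags of a given type and class, then pull out their text. A toy illustration of that pattern (the HTML snippet and the class name here are made up; the real class names follow below):

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the real page source.
html = '<div><span class="author">Alice</span><span class="author">Bob</span></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; get_text() strips the markup.
names = [tag.get_text() for tag in soup.find_all('span', {'class': 'author'})]
print(names)  # ['Alice', 'Bob']
```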

Extracting review author

From inspecting the HTML source, we see that review authors are inside span tags with the class ‘X43Kjb’. (Google changes these obfuscated class names periodically, so verify them in your own browser’s inspector before running the code.)

```python
# extracting authors
soup = BeautifulSoup(html_source, "html.parser")
review_author_list_src = soup.find_all('span', {'class': 'X43Kjb'})

review_author_name_list = []
for val in review_author_list_src:
    try:
        review_author_name_list.append(val.get_text())
    except AttributeError:
        pass

review_author_name_list[:3]
# Output
# ['Baraka Mark Bright', 'Raj Kamdiya', 'Incognito Inventions']
```

The next step is extracting the date of each review.

```python
# extracting review dates
date_src = soup.find_all('span', {'class': 'p2TkOb'})

date_list = []
for val in date_src:
    date_list.append(val.get_text())

date_list[:3]
# Output
# ['September 24, 2019', 'March 25, 2019', 'April 9, 2019']
```

Extracting review contents

For brevity, we will only show the first three reviews; you can verify that the first one matches the text in Figure 2 above.

```python
# extracting review content
review_content_src = soup.find_all('div', {'class': 'UD7Dzf'})

review_content_list = []
for val in review_content_src:
    review_content_list.append(val.get_text())

review_content_list[:3]
# Output
# [" Everything is perfect. The UI and the content itself are all fantastic. Thanks so much. This app deserves a five 🌟 but I can't give you 100%",
#  ' This is really useful app for me to learn all data mining concepts as well as data warehousing concept in this step by step explanation of all data mining concept with well n good figures.',
#  ' this application is really useful for me to learn data mining tutorial with well n good examples and covered all data mining concepts really useful for me.']
```

Converting into a CSV file

You can take the lists above and load them into a pandas DataFrame. Once you have the DataFrame, you can convert it to CSV, Excel, or JSON easily.
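For example, the three lists built above can be combined into one table and written out (a minimal sketch; the column names and output filenames are our own choices, and short stand-in values are shown here in place of the full lists):

```python
import pandas as pd

# Stand-in values; in practice, use the full lists from the scraper above.
review_author_name_list = ['Baraka Mark Bright', 'Raj Kamdiya']
date_list = ['September 24, 2019', 'March 25, 2019']
review_content_list = ['Everything is perfect...', 'This is really useful app...']

# Each list becomes one column of the DataFrame.
df = pd.DataFrame({
    'author': review_author_name_list,
    'date': date_list,
    'review': review_content_list,
})

df.to_csv('google_play_reviews.csv', index=False, encoding='utf-8')
# df.to_excel('google_play_reviews.xlsx', index=False)  # requires openpyxl
# df.to_json('google_play_reviews.json', orient='records')
```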

Scaling up to a full crawler for extracting all Google play reviews of an app

Pagination

  • To fetch all the reviews, you will have to paginate through the results.
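On the Play Store page of that era, more reviews loaded as you scrolled down, so pagination amounts to scrolling until the page stops growing. A minimal sketch of that loop (the function name and stop condition are our own; it assumes a Selenium `browser` object like the one created earlier):

```python
import time

def load_all_reviews(browser, max_scrolls=20, pause=1.5):
    """Scroll to the bottom repeatedly so the page's lazy loader fetches
    more reviews; stop once the page height stops growing."""
    last_height = browser.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new batch of reviews time to load
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we have reached the end
        last_height = new_height
    return browser.page_source
```

You would call this in place of the single `browser.page_source` read in the snippet above, then parse the returned HTML with BeautifulSoup exactly as before.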

Implementing anti-CAPTCHA measures

  • After a few dozen requests, Google’s servers will start blocking your IP address outright, or you will be flagged and start getting CAPTCHAs.
  • To fetch data successfully at scale, you will have to:
  • rotate proxy IP addresses, preferably using residential proxies
  • rotate user agents
  • use an external CAPTCHA solving service such as 2captcha or anticaptcha.com
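A minimal sketch of the first two measures, shaped for the requests library (the proxy URLs here are placeholders, and you would substitute your own residential proxies and a much larger, up-to-date user-agent pool):

```python
import random

# Placeholder pools -- replace with your own residential proxies and a
# larger, current user-agent list.
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

def rotating_request_kwargs():
    """Pick a random proxy and user agent for each request, in the keyword
    shape that requests.get(url, **kwargs) expects."""
    proxy = random.choice(PROXIES)
    return {
        'headers': {'User-Agent': random.choice(USER_AGENTS)},
        'proxies': {'http': proxy, 'https': proxy},
    }
```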

After you follow all the steps above, you will realize that our pricing for managed web scraping is one of the most competitive in the market.

Originally published at https://www.specrom.com.

We cover all the cutting-edge natural language processing, machine learning, and AI-powered strategies to extract web data at big data scale.

Jay M. Patel


Cofounder/principal data scientist at Specrom Analytics (specrom.com) natural language processing and web crawling/scraping expert. Personal site: JayMPatel.com
