Web scraping Trustpilot reviews

Figure 1: Screenshot of reviews page in Trustpilot.com.

So, what is the easiest way to scrape reviews from Trustpilot.com?

In this article we will scrape user reviews for a low-code web scraping tool, Octoparse (octoparse.com).

By the end of this article, you will be able to extract reviews like the one shown in figure 2 into a CSV file.

Figure 2: Screenshot of individual review of a product at Trustpilot.com.

Option 1: Hire a fully managed web scraping service.

You can contact us for our fully managed web scraping service to get Trustpilot reviews data as a CSV or Excel file without dealing with any coding.

Our pricing starts at $99 for fully managed Trustpilot scraping with up to 20,000 rows of data.

You can simply sit back and let us handle all the complexities of scraping a site like Trustpilot, which has plenty of anti-scraping protections built in to dissuade people from scraping it in bulk.

We can also create a REST API endpoint for you if you want structured data on demand.

Option 2: Scrape Trustpilot.com on your own

We will use a browser automation library called Selenium to extract data.

Selenium has bindings available in all major programming languages, so you can use whichever language you like; we will use Python here.

# Extracting Trustpilot.com reviews for octoparse.com
from selenium import webdriver
import time
from bs4 import BeautifulSoup

test_url = 'https://www.trustpilot.com/review/www.octoparse.com'

option = webdriver.ChromeOptions()
chromedriver = r'chromedriver.exe'  # path to your ChromeDriver executable
browser = webdriver.Chrome(chromedriver, options=option)
browser.get(test_url)
html_source = browser.page_source
browser.close()

Extracting Trustpilot reviews

Once we have the raw HTML source, we use a Python library called BeautifulSoup to parse it.

Extracting review author

From inspecting the HTML source, we see that review author names sit in div tags with the class ‘consumer-information__name’.

# extracting authors
soup = BeautifulSoup(html_source, "html.parser")
review_author_list_src = soup.find_all('div', {'class': 'consumer-information__name'})

review_author_name_list = []
for val in review_author_list_src:
    try:
        review_author_name_list.append(val.get_text())
    except AttributeError:
        pass

review_author_name_list
# Output
# ['flavio', 'Cici Wu', 'T L', 'Phil Caulkins', 'Putra', 'sendak',
#  'Aisha Mohamed', 'Milly Fang', 'Chris MeatStream Walker',
#  'Ashley Han', 'Jiahao Wu', 'Frank Parker']

The next step is extracting review dates of each review.

# extracting review dates
date_src = soup.find_all('time')

date_list = []
for val in date_src:
    date_list.append(val.get_text())

date_list
# Output
# ['2021-04-16', '2021-03-22', '2021-01-10', '2020-12-02',
#  '2020-11-15', '2020-10-10', '2020-07-30', '2020-05-28',
#  '2020-03-12', '2019-11-25', '2019-08-27', '2019-08-22']
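Since the time tags yield ISO-formatted date strings, you can optionally parse them into real date objects for sorting or filtering. A minimal sketch, using a few sample values from the output above:

```python
from datetime import date

# Sample values copied from the date_list output above.
date_list = ['2021-04-16', '2021-03-22', '2021-01-10']

# ISO 8601 strings parse directly with date.fromisoformat.
parsed_dates = [date.fromisoformat(d) for d in date_list]
```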

Extracting review title and review contents

For brevity we will only show the first five results; you can verify that the first result matches the text in figure 2 above.

# extracting review titles
review_title_src = soup.find_all('h2', {'class': 'review-content__title'})
review_title_list = []
for val in review_title_src:
    review_title_list.append(val.get_text())
print(review_title_list[:5])

# extracting review contents
review_content_src = soup.find_all('div', {'class': 'UD7Dzf'})
review_content_list = []
for val in review_content_src:
    review_content_list.append(val.get_text())
review_content_list[:5]

# Output (titles)
# ['the software is janky and takes some',
#  'Nice tool to get great amount of web data',
#  'Complicated',
#  'the programs are written in chinese-',
#  'Impossible to Refund & One of the Worst Software']

# Output (contents, shown verbatim)
# ['the software is janky and takes some time to master. however, the customer support was super helpful. they were responsive and even troubleshoot my task in order to make it work properly.',
#  'There are transparent pricing and reminder emails so I am not confused about the payment issue. I want to recommend it if you are not a programmer, and want to gather a great number of data for analysis and research. This software is easier and more friendly to people who cannot code, and you can collect info from multiple websites, into a structured spreadsheet. This tool just tremendously improves my efficiency while conducting my research. Thanks a lot!',
#  'If you want the software to do more than the basic stuff, it gets hard to impossible, even if it looks simple to do, maybe you have to think chinese... Though it`s possible, and then works well.',
#  'the programs are written in chinese- the reviews are fake- they dont return the funs and Paypal is closing their account but they keep on opening new ones with new email address. CHINESE SCAM',
#  'First they will make your refund almost impossible to get, if you pay through paypal. they refund you very small amount to make paypal detects that you already got refunded. I mean what kind of software cost this much per month\n\nSecond, I have use their software, they ONLY can extract website with consistent amount of div, (but most data heavy website is an ugly ones). So it is very useless software. \n\nIm still unable to extract good data from the government.. so with that prices, better not']

Converting into CSV file

You can take the lists above and load them into a pandas DataFrame. Once you have the DataFrame, you can convert it to CSV, Excel, or JSON easily.
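A minimal sketch of that step, with a few hypothetical sample values standing in for the full lists built above:

```python
import pandas as pd

# Hypothetical sample values standing in for the lists built above.
review_author_name_list = ['flavio', 'Cici Wu', 'T L']
date_list = ['2021-04-16', '2021-03-22', '2021-01-10']
review_title_list = ['the software is janky and takes some',
                     'Nice tool to get great amount of web data',
                     'Complicated']
review_content_list = ['review text 1', 'review text 2', 'review text 3']

# One column per list; rows line up by position.
df = pd.DataFrame({
    'author': review_author_name_list,
    'date': date_list,
    'title': review_title_list,
    'content': review_content_list,
})

df.to_csv('trustpilot_reviews.csv', index=False)
# df.to_excel('trustpilot_reviews.xlsx', index=False)  # requires openpyxl
# df.to_json('trustpilot_reviews.json', orient='records')
```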

Figure 3: Screenshot of Trustpilot.com reviews saved as a CSV file.

Scaling up to a full crawler for extracting all Trustpilot reviews of an app

Pagination

  • To fetch all the reviews, you will have to paginate through the results.
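Trustpilot exposes further review pages through a page query parameter, a pattern you can confirm by clicking through the site's pagination links. A sketch for generating the page URLs to feed back into the Selenium fetch shown earlier (the stop count is illustrative; use the real page count you observe):

```python
# Sketch: build paginated review URLs for a Trustpilot listing.
# The ?page=N parameter is assumed from inspecting the site's URL
# structure; adjust max_pages to the page count you actually see.
def build_review_urls(base_url, max_pages):
    urls = [base_url]  # page 1 has no query parameter
    for page in range(2, max_pages + 1):
        urls.append(f'{base_url}?page={page}')
    return urls

urls = build_review_urls('https://www.trustpilot.com/review/www.octoparse.com', 3)
# Each URL would then be fetched with browser.get(url) and parsed
# with BeautifulSoup exactly as above, with a polite delay between
# requests (e.g. time.sleep(5)).
```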

Implementing anti-CAPTCHA measures

  • After a few dozen requests, Trustpilot's servers will start blocking your IP address outright, or you will be flagged and start receiving CAPTCHAs.
  • To keep fetching data successfully, you will have to:
      • rotate proxy IP addresses, preferably using residential proxies
      • rotate user agents
      • use an external CAPTCHA solving service like 2captcha or anticaptcha.com
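A minimal sketch of rotating user agents and proxies per browser session; the UA strings and proxy endpoints below are placeholders, not real values — in practice you would use current browser UA strings and the endpoints supplied by your proxy provider:

```python
import random

# Placeholder pools: substitute current real browser user-agent
# strings and your proxy provider's (preferably residential) endpoints.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]
PROXIES = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
]

def session_arguments():
    """Chrome command-line flags for one scraping session,
    picking a fresh user agent and proxy at random."""
    return [
        f'--user-agent={random.choice(USER_AGENTS)}',
        f'--proxy-server={random.choice(PROXIES)}',
    ]

args = session_arguments()
# Pass each flag to ChromeOptions before creating the browser:
#   option = webdriver.ChromeOptions()
#   for arg in args:
#       option.add_argument(arg)
```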

After you follow all the steps above, you will realize that our pricing for managed web scraping is one of the most competitive in the market.

Originally published at https://www.specrom.com.

We cover cutting-edge natural language processing, machine learning, and AI-powered strategies to extract web data at big data scale.

Jay M. Patel


Cofounder/principal data scientist at Specrom Analytics (specrom.com), natural language processing and web crawling/scraping expert. Personal site: JayMPatel.com
