5 Quick Tips for Web Scraping in Python

Ways to avoid the painful pitfalls of web scraping

keanshern kok
May 6 · 4 min read
Photo by Nathan Dumlao on Unsplash

Introduction

Using these tips, I’ve gone from scraping ~1,000 Amazon reviews for sentiment analysis to scraping ~30,000 COVID-19 news articles for a study on government-citizen engagement during a pandemic. I do my scraping with the Selenium library in Python.

Tip 1-Define Your Requirements

Tip 2-Analyze the Website Structure

I’ll give you a practical example below:

Illustration by me

Now you need to figure out how to get to each webpage. Once you figure out how to link the pages, you’ve got yourself a plan. You’ll often encounter intermediary pages between the relevant ones. My scraping plan looked like this:

Step 1: Go to the Amazon homepage.

Step 2: Search for the product brand (leads to the results page).

Step 3: The results page holds links to multiple product pages. Scrape all these links and store them in a file.

Step 4: Iteratively open each link in the file. Scrape the product name and price. Click the “see all reviews” button (opens the reviews page).

Step 5: Scrape all reviews.
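
Steps 3 and 4 of the plan can be sketched without any browser code. This is a minimal sketch, not my exact script: the function names are made up for illustration, and `scrape_product` is a hypothetical callback where your own Selenium code for a single product page would go.

```python
from pathlib import Path


def save_links(links, path):
    """Step 3: write unique links to a file, one per line, keeping their order."""
    cleaned = (link.strip() for link in links)
    unique = list(dict.fromkeys(link for link in cleaned if link))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


def load_links(path):
    """Step 4: read the stored links back, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def scrape_all(path, scrape_product):
    """Step 4: call scrape_product(url) for every stored link.

    scrape_product is a hypothetical callback supplied by you.
    A failure on one page is recorded and does not stop the run.
    """
    results, failed = [], []
    for url in load_links(path):
        try:
            results.append(scrape_product(url))
        except Exception as exc:  # broad on purpose: keep going, inspect later
            failed.append((url, repr(exc)))
    return results, failed
```

Keeping the links in a file means a crash halfway through Step 4 doesn't force you to redo the search and link collection.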

Tip 3-Understand URLs

Besides that, URLs provide an alternative to button clicks for moving between webpages. If you ever get stumped trying to locate or click the button that leads to the next page, try comparing the URLs of the two pages. On many websites, going from the first results page to the next can be done by appending something like “&page=n” to the end of the URL, where n is the page number (2, 3, and so on).
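
Here is a small sketch of both ideas using only the standard library. The parameter name `page` and the example.com URLs are assumptions for illustration; check what the site you are scraping actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def query_diff(url_a, url_b):
    """Return the query parameters that differ between two URLs.

    Each entry maps a parameter name to (value in url_a, value in url_b);
    None means the parameter is absent from that URL.
    """
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


def with_page(url, n, param="page"):
    """Return url with its page parameter set to n, replacing any existing value."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    query.append((param, str(n)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Comparing page 1 and page 2 of a (made-up) results page shows what changed.
print(query_diff("https://example.com/s?k=laptop",
                 "https://example.com/s?k=laptop&page=2"))
# {'page': (None, '2')}

# Once the pattern is known, later pages can be built directly.
print(with_page("https://example.com/s?k=laptop", 3))
# https://example.com/s?k=laptop&page=3
```

Building the URL through `urllib.parse` rather than string concatenation also handles a URL that has no query string yet, where the separator has to be “?” instead of “&”.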

Tip 4-Don’t Jumble Your “Waits”

One of the trickiest issues to troubleshoot is when your program fails to time out even after exceeding your specified wait duration. This is usually caused by unknowingly mixing implicit and explicit waits.
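
To see why the two can interfere, here is a library-free model of the situation. It is a simplified sketch, not Selenium's actual code: `wait_until` stands in for an explicit wait that polls a condition until a deadline, and `blocking_lookup` stands in for an element lookup that itself blocks for a while before giving up, which is roughly what a non-zero implicit wait does. All names are hypothetical.

```python
import time


def wait_until(condition, timeout, poll=0.05):
    """Simplified explicit wait: poll condition() until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        # The deadline is only checked between calls, so a slow
        # condition() can push the total time past the timeout.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll)


def blocking_lookup(inner_wait):
    """Stand-in for a lookup that blocks for inner_wait seconds and finds nothing."""
    def lookup():
        time.sleep(inner_wait)
        return None
    return lookup


def timed_failure(timeout, inner_wait):
    """Run a wait that is bound to fail and return how long it really took."""
    start = time.monotonic()
    try:
        wait_until(blocking_lookup(inner_wait), timeout)
    except TimeoutError:
        pass
    return time.monotonic() - start


if __name__ == "__main__":
    # Explicit timeout of 0.1s, but every lookup blocks for 0.3s.
    print(f"took {timed_failure(timeout=0.1, inner_wait=0.3):.2f}s")
```

In this model the wait cannot finish before the first lookup returns, so the run takes at least 0.3 seconds despite the 0.1 second timeout. The usual way out in Selenium is to pick one kind of wait and stick with it, for example leaving the implicit wait at zero and using explicit waits everywhere.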

The following resources are good if you intend to use Selenium:

  1. Selenium in Python Waits Documentation
  2. Implicit-Explicit Wait Interaction Issues

Tip 5-Choose Strategic Locators

  1. CSS selectors tend to be faster than XPath in most browsers.
  2. If ID is one of the options, use it.
  3. Some locators can point to multiple elements if not written completely. Make sure your locator points to the exact element.
  4. Use a browser extension such as Ranorex Selocity to help locate elements.
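
Points 2 and 3 above can be checked in code. The helpers below are a minimal sketch with made-up names; they assume only that `driver` has a `find_elements(by, value)` method, as Selenium's WebDriver does, and the locator values in the usage comment are hypothetical.

```python
def find_exactly_one(driver, by, value):
    """Return the single element matching (by, value).

    Raises LookupError if the locator matches nothing or more than one
    element, so an incomplete locator fails loudly instead of silently
    returning the first match.
    """
    matches = driver.find_elements(by, value)
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly 1 element for {by}={value!r}, found {len(matches)}"
        )
    return matches[0]


def first_unique(driver, candidates):
    """Try (by, value) pairs in priority order; return the first unique match."""
    for by, value in candidates:
        try:
            return find_exactly_one(driver, by, value)
        except LookupError:
            continue
    raise LookupError("no candidate locator matched exactly one element")


# Usage with a real Selenium driver (locator values are hypothetical):
#   from selenium.webdriver.common.by import By
#   price = first_unique(driver, [
#       (By.ID, "price"),                  # prefer an ID when one exists
#       (By.CSS_SELECTOR, "span.price"),   # fall back to a CSS selector
#   ])
```

Listing the ID first and the CSS selector second follows the priority suggested in the list above.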

*Bonus Tip-Follow My Tutorial

Conclusion

Geek Culture
