5 Quick Tips for Web Scraping in Python
Whether for work or for fun, web scraping is a critical skill for every data scientist's arsenal. It opens the door to far more data than you can find on Kaggle or other open data websites. Best of all, once you can scrape efficiently and autonomously at a small scale, the same skills translate directly to larger scraping projects.
Using these tips, I’ve gone from scraping ~1,000 Amazon reviews for sentiment analysis to scraping ~30,000 COVID-19 news articles for a study on government-citizen engagement during a pandemic. Note that I do my scraping using the Selenium library in Python.
Tip 1-Define Your Requirements
Stop. Don’t open your IDE. Close all your open tabs on web scraping tutorials. First and foremost, ask yourself: “What information do I want to scrape?” For instance, when I scraped Amazon reviews, I decided early on to collect three variables: (1) product name, (2) product price, and (3) reviews. You won’t always need many variables to do meaningful analysis. With some basic feature transformations, you can turn “Reviews” into “Number of Reviews” (per product), “positivity rate” (via VADER sentiment analysis), and so on.
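To illustrate, here is a minimal sketch of the “Number of Reviews” transformation on some hypothetical scraped data; the VADER positivity step is only noted in a comment, since it needs a third-party package:

```python
from collections import Counter

# Hypothetical scraped output: one (product_name, review_text) pair per review.
reviews = [
    ("Widget A", "Great product, works as advertised."),
    ("Widget A", "Broke after a week."),
    ("Widget B", "Decent value for the price."),
]

# "Reviews" -> "Number of Reviews" per product.
reviews_per_product = Counter(product for product, _ in reviews)
print(reviews_per_product)  # Counter({'Widget A': 2, 'Widget B': 1})

# "Reviews" -> "positivity rate" would follow the same shape: score each
# review (e.g. with VADER) and average the scores per product.
```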
Tip 2-Analyze the Website Structure
Now that you’ve defined the information you need, it’s time to open the website. Start at the homepage. Examine the buttons and features of the site, and note down how navigation works (i.e. which page does each button take me to?). Next, try to identify where the information you need lives. Usually, you’ll find that it is scattered across different webpages. In my case, product names and prices were found on their respective product pages, while reviews were found on a separate reviews page linked from each product page.
Now you need to figure out how to get to each of those webpages. Once you know how the pages link together, you’ve got yourself a plan. Expect plenty of intermediary pages between each relevant one. As a practical example, my scraping plan looked like this:
Step 1: Go to the Amazon homepage.
Step 2: Search for product brand (leads to results page).
Step 3: Results page holds links to multiple product pages. Scrape all these links and store in a file.
Step 4: Iteratively open each link in the file. Scrape product name and price. Click the “see all reviews” button (opens reviews page).
Step 5: Scrape all reviews.
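The steps above can be sketched roughly as follows. The selectors and page markup are assumptions (inspect the live pages for the real ones), each function takes an already-created Selenium driver, and the string locator strategies (“css selector”, “id”) are the values behind Selenium’s `By` constants:

```python
from urllib.parse import quote_plus

def search_url(brand):
    # Steps 1-2: rather than clicking through the homepage search box,
    # build the results-page URL directly (hypothetical but typical shape).
    return "https://www.amazon.com/s?k=" + quote_plus(brand)

def collect_product_links(driver, brand):
    # Step 3: gather every product link on the results page and persist them.
    driver.get(search_url(brand))
    anchors = driver.find_elements("css selector", "a.a-link-normal")  # == By.CSS_SELECTOR
    links = {a.get_attribute("href") for a in anchors}
    with open("product_links.txt", "w") as f:
        f.write("\n".join(sorted(links)))
    return links

def scrape_product(driver, link):
    # Step 4: product name and price live on the product page (assumed element IDs).
    driver.get(link)
    name = driver.find_element("id", "productTitle").text
    price = driver.find_element("css selector", "span.a-price").text
    driver.find_element("partial link text", "see all reviews").click()
    # Step 5: scrape every review on the reviews page.
    return name, price, [r.text for r in driver.find_elements("css selector", "div.review")]
```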
Tip 3-Understand URLs
URLs hold a wealth of information. For example, I managed to filter out duplicate entries in my data because I knew that one component of an Amazon product URL was the product’s unique identifier, known as ASIN (Amazon Standard Identification Number).
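For instance, the ASIN can be pulled straight out of a product URL with a regular expression and used as a deduplication key. A sketch, assuming the two common URL shapes (/dp/ASIN and /gp/product/ASIN) and made-up example ASINs:

```python
import re

# ASINs are 10 alphanumeric characters embedded in the product URL.
ASIN_RE = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})")

def extract_asin(url):
    """Return the ASIN embedded in an Amazon product URL, or None."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None

urls = [
    "https://www.amazon.com/Some-Product-Name/dp/B08N5WRWNW/ref=sr_1_1",
    "https://www.amazon.com/gp/product/B08N5WRWNW?th=1",  # same product, different URL
    "https://www.amazon.com/Other-Product/dp/B07XJ8C8F5",
]

# Two distinct URLs for the same product collapse to a single ASIN.
unique_asins = {extract_asin(u) for u in urls}
print(len(unique_asins))  # 2
```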
Besides that, URLs offer an alternative to button clicks for moving between webpages. If you ever get stumped trying to locate or click the button that links two pages, compare their URLs instead. A pattern common to many websites is that you can go from the first results page to the next by appending something like “&page=n” to the end of the URL, where n is the page number (2, 3, and so on).
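A minimal stdlib sketch of that URL trick; the base URL and parameter names are assumptions that vary from site to site:

```python
from urllib.parse import urlencode

def results_page_url(base_url, query, page):
    # Append the query and page number the way many sites expect:
    # .../s?k=<query>&page=<n>
    return base_url + "?" + urlencode({"k": query, "page": page})

# Iterate over the first three results pages.
for page in range(1, 4):
    print(results_page_url("https://www.amazon.com/s", "usb cable", page))
# first line printed: https://www.amazon.com/s?k=usb+cable&page=1
```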
Tip 4-Don’t Jumble Your “Waits”
In general, Selenium gives you implicit waits and explicit waits. When you code an explicit wait, you tell the program to pause until a specific condition is met, up to a maximum duration. An implicit wait, on the other hand, tells the web driver to keep polling the page for a fixed length of time whenever it looks up an element, before throwing an exception.
One of the trickiest issues to troubleshoot is a program that fails to time out even after exceeding your specified wait duration. This is usually caused by unknowingly mixing implicit and explicit waits, which the Selenium documentation warns can lead to unpredictable wait times.
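To see why the two interact badly, here is roughly what an explicit wait does under the hood, sketched in plain Python (Selenium’s `WebDriverWait` follows the same poll-until-timeout pattern; the Selenium usage in the comments is illustrative, not executed here):

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# With Selenium this would look like (not run here):
#   wait_until(lambda: driver.find_elements("id", "reviews"), timeout=15)
# An implicit wait, by contrast, is set once (driver.implicitly_wait(10)) and
# silently applies polling to *every* element lookup, so stacking it on top of
# an explicit wait compounds the two timeouts unpredictably.
```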
Tip 5-Choose Strategic Locators
Although there aren’t any hard and fast rules for which element locator type to choose (e.g. XPath, CSS, ID), you should bear in mind a few key points:
- CSS selectors tend to be faster than XPath in most browsers.
- If ID is one of the options, use it.
- Some locators match multiple elements if they aren’t specific enough. Make sure your locator resolves to the exact element you want.
- Use a browser extension such as Ranorex Selocity to help locate elements.
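One way to enforce the “exact element” rule is to centralize your locators and fail loudly when a selector is ambiguous. A sketch with hypothetical selector strings; the strategy strings (“id”, “css selector”) are the values behind Selenium’s `By` constants, and the fake driver in the usage note stands in for a real one:

```python
# Named (strategy, selector) pairs; the selectors here are hypothetical examples.
LOCATORS = {
    "product_title": ("id", "productTitle"),  # prefer ID when it exists
    "review_body": ("css selector", "div.review span.review-text"),  # CSS over XPath
}

def find_one(driver, name):
    """Return the single element for a named locator; fail if it's ambiguous."""
    strategy, selector = LOCATORS[name]
    matches = driver.find_elements(strategy, selector)
    if len(matches) != 1:
        raise LookupError(f"{name!r} matched {len(matches)} elements; tighten the selector")
    return matches[0]
```

Because `find_one` only calls `driver.find_elements`, you can exercise it with any object that implements that method before pointing it at a live browser.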
Bonus Tip-Follow My Tutorial
If all else fails, read my tutorial.
Now go ahead, reopen all those closed tabs, and get coding. Once you get the hang of it, you’ll be extracting customized data to your heart’s content. If you found this article useful, it would be a great help if you could leave some claps. I’d love to hear your thoughts for my own learning, so feel free to respond with some opinions and tips! Thank you for reading!