Web scraping is an art, not a science

Joe Osborne
5 min read · Mar 4, 2023


Refining web scraping skills and processes requires a degree of artistry, not just raw programming knowledge.

I started web scraping in 2019 working at Dexi.io. The company had built a UI to create processes to pull data from web pages. My job was to use the UI to create those processes and deliver data to clients. We were hitting millions of web pages per day, scraping terabytes of data. As you can imagine things often got messy.

Feeling extra frustrated one morning by the number of fires I was greeted with in the data pipeline, I sought solace from our VP of Data, Kaleb. Kaleb shared with me the adage that had guided him through the depths of our data ocean: “Web scraping is an art, not a science.” That phrase instantly resonated with me, and it has proven itself true time and again.

There are three truths to understand about the internet in regards to programmatically scraping data:

  • Every website is different
  • The internet is in a constant state of change
  • It is inevitable that scraping processes will break given enough time

If you understand those three concepts, your approach to web scraping and your view of web data will be much healthier. Let’s take a look at some examples of common web scraping scenarios.

Yelp.com is a very high traffic website that is scraped by thousands of automated bots every day. Let’s say we wanted to scrape restaurant listings and reviews on Yelp. We want to capture the business names first. Here’s what their HTML looks like:
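A simplified, illustrative snippet (the hashed suffixes here are examples of the kind Yelp generates, not real values) might look like:

```html
<!-- Illustrative reconstruction; Yelp's real markup and hashed
     class suffixes change over time. -->
<div class="businessName__09f24__EYSZE display--inline-block__09f24__fEDiJ border-color--default__09f24__NPAKY">
  <a href="/biz/example-restaurant">Example Restaurant</a>
</div>
```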

Every class on every tag has a long string of color names and various codes that likely denote different CSS properties. Things like border-color--default and arrange-unit are not very helpful when it comes to the data we want. A very common mistake is to take the name of the class as you see it, and use that in your capture. A CSS selector that might be used could look like this:

div.businessName__09f24__EYSZE.display--inline-block__09f24__fEDiJ.border-color--default__09f24__NPAKY

If any of those various CSS properties in the class change, then that capture will break and potentially error out your entire scraper. Instead, we would want something that captures the div tag whose class contains the string businessName. Here’s what that looks like:

div[class*="businessName"]

Yelp might change a color or spacing value, but our scraper has a better chance of longevity if we only look for div tags whose class contains businessName. If we are thoughtful about the patterns we see and make our scrapers intuitive, our success rate will go up vastly.
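That substring-matching idea can be sketched with nothing but the standard library. The parser below treats any div whose class attribute contains businessName as a match, mirroring the div[class*="businessName"] selector above. (The sample HTML and business name are made up for illustration.)

```python
from html.parser import HTMLParser

# Minimal stdlib-only sketch: emulate the div[class*="businessName"]
# selector by matching any <div> whose class attribute contains the
# substring "businessName", ignoring the hashed suffixes entirely.
class BusinessNameExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div" and "businessName" in cls:
            self.in_target = True

    def handle_data(self, data):
        # Capture the first non-empty text found inside a matching div.
        if self.in_target and data.strip():
            self.names.append(data.strip())
            self.in_target = False

# Illustrative markup with a hashed, Yelp-style class string.
html = (
    '<div class="businessName__09f24__EYSZE '
    'border-color--default__09f24__NPAKY">'
    '<a href="/biz/example">Example Diner</a></div>'
)

parser = BusinessNameExtractor()
parser.feed(html)
print(parser.names)  # -> ['Example Diner']
```

A real project would likely use a library like BeautifulSoup and pass the selector string directly, but the principle is the same: match on the stable substring, not the full hashed class.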

Aside from being thoughtful about the code you write, you need to consider that the websites you are trying to extract data from might have blocking mechanisms. Bot detection is very common, especially for high traffic sites. Part of the art of web scraping is learning how to make your bot blend in with humans. To understand how to do this, it’s important to understand the basics of how websites work.

Every time you visit a website, you are sending a request to the server where the source code of the website is hosted, asking to be allowed to interact with it. The server then responds and either grants you access or denies you. This request/response interaction is what I call the “100 fingered handshake.” When the client makes a request, it is essentially “shaking hands” with the server. When the server receives that handshake, there are hundreds of flags that it feels for to determine if it will let you in or not:

  • Where is this handshake coming from? Where is their IP address located?
  • Is this request from a browser? (Chrome, Firefox, Edge, etc.)
  • Is the browser rendering graphics, or is it headless?
  • Have I received requests from this client before?
  • How many requests have I received from this client in the past week, day, hour, minute or second?
  • Does this client want me to serve up pictures, videos, and other media?
  • Does this client allow JavaScript to be executed after the initial request?

The list could go on and on.
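Many of those flags are carried in the request headers themselves. As a minimal sketch, a script can at least stop announcing itself as a script by sending browser-like headers. (The header values below are illustrative examples, not a guaranteed way past any particular site’s checks.)

```python
import urllib.request

# A bare urllib request advertises a "Python-urllib" User-Agent,
# which many servers flag immediately. Sending browser-like headers
# is the simplest first step toward blending in.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/110.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

req = urllib.request.Request("https://example.com/", headers=headers)

# urllib normalizes header names with str.capitalize().
print(req.get_header("User-agent"))
```

Headers alone rarely fool a serious bot-detection system, which also looks at IP reputation, TLS fingerprints, and JavaScript execution, but they are the cheapest flag to fix.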

If you find that you are often getting blocked trying to access a certain site, this is when you need to get the brush out. You need to make your request look as much like a normal human request as possible. This is often done through proxy services.

There are plenty of out-of-the-box proxy services you can use: Scraper API, ScrapingBee, and Oxylabs are just a few popular ones. They will take your request and do their best to get back a good response from the server you are trying to access. Most of them allow you to configure certain flags so you can design custom solutions for difficult websites. It is important to remember that just because a request succeeds with certain configurations does not mean it will always succeed. You might try to access the same website with the same configurations ten times in the same hour and only get back five successful responses. Part of the art of scraping is figuring out how to constantly adjust your bot in various ways to increase your success rate.
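One small piece of that constant adjustment can be sketched as a retry loop with jittered backoff. Here, fetch_page is a hypothetical stand-in for whatever proxy-service call you are making; the five-in-ten scenario above is exactly the situation this pattern handles.

```python
import random
import time

def fetch_with_retries(fetch_page, url, max_attempts=5, sleep=time.sleep):
    """Retry a flaky fetch with jittered exponential backoff.

    fetch_page is a hypothetical callable (e.g. a proxy-service
    request) that returns a response on success and None when the
    target site blocks or fails the request.
    """
    for attempt in range(max_attempts):
        response = fetch_page(url)
        if response is not None:
            return response
        # Back off 1s, 2s, 4s, ... plus random jitter, so retries
        # don't arrive in a mechanical, bot-like rhythm.
        sleep(2 ** attempt + random.random())
    return None
```

The sleep parameter is injected so the delay can be swapped out in tests; in production you would leave it as the default.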

When I started web scraping, I was amazed by the vast amount of data I could extract in such a short amount of time at a very low cost. I think many people have similar experiences. I believed, however, that web scraping was a science: that it was as reliable as anything else that ran on code. After all, I never had trouble reading Yelp listings when browsing restaurant reviews on my computer, so why would my web scraper? Once I learned more about the nature of the internet, I came to accept that web scraping is more akin to an art form than a scientific discipline. Since then, I have been much less frustrated and vastly more successful in my scraping endeavors.



Joe Osborne

Hi! I'm a software engineer with early stage startup experience. Check out some of my work at https://joeosborne.me :)