Improve Your Web Scraper With Limited Retry-Loops — Python

Kristof Boghe
Published in The Startup
9 min read · Aug 11, 2020

Mastering exception-handling is of pivotal importance for producing clean and stable Python code. Chances are high you’re already aware of that, as most Python books geared towards newcomers to the language — and often to coding in general — make sure to spend a few paragraphs on the subject.

In essence, exception-handling means you’re providing an alternative action when a specific piece of code (usually a single line) throws an error for whatever reason. The attentive coder incorporates exceptions in their script for the same reason they use regex patterns for string matching: your code should harbor as few assumptions as possible. An operation on one type of dataset might not work for a slightly different kind of dataset, and what works today might not work tomorrow. Exception statements are therefore the last safeguard against breaking your code.

The try-except-else-finally clause is a classic and is drilled into every aspiring Python aficionado from day one:

try:
    # Whatever you wish to execute
except:
    # If the try-block throws an error, do the following. Usually the user
    # implements one of these three steps:
    # (a) raise an exception (which breaks the code),
    # (b) perform an alternative action (e.g. record that the try-statement
    #     produced an error, but continue with the code anyway), or
    # (c) pass (simply execute the rest of the script)
else:
    # If the try-block doesn't throw an error, do the following
finally:
    # The finally-statement is optional. Whether the try-statement produced
    # an error or not, do the following.
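To make the skeleton concrete, here is a small runnable illustration; the price-parsing scenario is made up purely for demonstration purposes:

```python
# Hypothetical example: parsing price strings that may be malformed.
raw_prices = ["19.99", "free", "4.50"]
parsed = []

for raw in raw_prices:
    try:
        price = float(raw)       # may raise a ValueError
    except ValueError:
        parsed.append(None)      # option (b): record the failure, continue anyway
    else:
        parsed.append(price)     # only runs when float() succeeded
    finally:
        pass                     # runs either way; often used for cleanup

print(parsed)  # [19.99, None, 4.5]
```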

Exception-handling is especially important for web scraping. After all, there are plenty of reasons why your scraper might break unexpectedly, such as:

  • A particular page does not exist.
    Many scrapers have built-in assumptions revolved around URL-structure. For example, for a scraper I wrote on collecting info on Android apps, the code assumes that filling in the app-name in the following URL-structure will result in a 201 response code (i.e. connection succeeded): ‘https://play.google.com/
    store/apps/details?id=appname’. Although this is true in most cases (go ahead and fill in ‘com.nianticlabs.pokemongo’), some apps might get deleted from the app store or are downloaded from different repositories in the first place. If that’s the case, I instructed the script to try out an alternative platform (e.g. APKMonk).
  • A particular web element does not (always) exist.
    For example, while scraping some news websites, it’s possible that only some articles will include a subheading, hidden in a particular tag-attribute combination. One could simply check whether a subheading is present (try), skip the scraping of the subheading or add an error message if it’s not available (except), and scrape the subheading if possible (else).
  • Lost internet connection or website outages.
    Although this can be due to a lousy internet connection, it’s also a factor to consider when you’re performing IP rotation in one way or another, by using proxies or rotating between different VPN servers, for example. In these cases, it sometimes takes a little while for the connection to be established. However, sometimes the problem lies not with the client but with the server itself. For instance, some websites are notorious for their frequent outages due to heavy server loads (e.g. Reddit).
  • IP-block.
    Many websites want to ward off scrapers since they can increase server load (if used aggressively). Other platforms consider scraping as theft and are therefore outright against automated data collection no matter how slow the scraping process. For this reason, many platforms have some kind of bot-detection in place, giving them the ability to block your IP — at least for a little while.

The first two instances call for a classic try-except clause. For instance, while scraping the Metacritic platform using the well-known BeautifulSoup package, I noticed that most outlets on their review pages are represented by logos (usually the more prestigious ones, like The New York Times), while others are represented by a simple text attribute.
The following piece of code tries to catch the alt-title of the image (e.g. “The New York Times”) for each available review of a specific product (try). If this results in an error, the script reasons that the outlet must be represented as simple text and subsequently tries to catch the first ‘a’ tag within the review (except). In the end, the outlet is added to our outlet-list, no matter whether the outlet is represented by an image or text (finally).
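The original snippet is not embedded in this copy, but a minimal reconstruction might look like the following. The markup is heavily simplified and the class names are hypothetical; on the real site you would fetch the page first and inspect the actual tag structure:

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup: one review shows the outlet as a logo
# (an img with an alt-title), the other as a plain text link.
html = """
<div class="review">
  <div class="source"><img alt="The New York Times" src="nyt.png"></div>
</div>
<div class="review">
  <div class="source"><a href="/publication/some-blog">Some Blog</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
outlets = []
for review in soup.find_all("div", class_="review"):
    try:
        # Most prestigious outlets are represented by a logo image
        outlet = review.find("img")["alt"]
    except TypeError:
        # review.find("img") returned None, so the outlet must be
        # plain text: grab the first 'a' tag within the review instead
        outlet = review.find("a").get_text()
    finally:
        outlets.append(outlet)

print(outlets)
```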

The latter two causes are a different story. They do not throw an error because you wrongfully assumed a particular page or element exists, but simply because some external factor is influencing your ability to connect to the web. Most importantly, they represent (possibly) temporary outages: it might be a matter of seconds before your internet connection or the website itself is back up and running, and a platform usually lifts an IP-block after you patiently wait for fifteen minutes or so. This calls for a different approach, since throwing an except right away might be too drastic; it’s worth trying again a couple of times before discarding the whole process. In other words: you need a finite or limited retry-loop.

In this kind of loop, the except will trigger a short pause before executing the try-part again. It will do this until it has exhausted its number of retries. After that, the coder can still choose to either pass to the following chunk of code, throw an error or append some kind of error message and continue with executing the remainder of the script. This last part is crucial, as it differs from a (possibly) infinite while-loop as in the following piece of code:

output = None
while output is None:
    try:
        # Do the following
    except:
        pass

This will let your script run in an infinite loop if it’s unable to execute the try-part, which — in a sense — is just as script-breaking as not incorporating the try-part at all.

The most elegant and efficient way of constructing a limited retry-loop is by combining a for-loop, a try-except and the continue statement all in one go. The template looks something like this:

import time

for attempt in range(n):
    try:
        # Do the following
    except:
        time.sleep(n2)
        continue
    else:
        break
else:
    # Do the following if after n tries the try-part still throws an error

The continue-statement combined with the else-clause outside the loop are key here. The continue-statement sends the code back to the beginning of the loop (i.e. the try-part). Before it does that, though, it pauses the script for n2 seconds, giving your internet connection and/or the website a chance to go back online. Let’s call this template the for-catch loop, since the for-try-except-else-else clause doesn’t sound all that catchy.

Using the for-catch loop was a real game-changer for me, especially for avoiding temporary IP-blocks. Let’s go back to the Metacritic (MC)-scraper I mentioned earlier. I collected the review overview pages of around 1000 movies (as an example, here’s the review page of Son of Saul). The aim was to scrape all the reviews (score+outlet) from the 1000 URLs. However, MC isn’t exactly keen on tolerating my scraper on its platform, even though I made sure to include plenty of pauses into my script to avoid sending a thousand requests in a minute or so. But alas, MC ruthlessly blocks your ip from time to time, and it’s impossible to predict when and why a block will be triggered. I tried to use some proxies, but to no avail. So I came up with the following for-catch loop to avoid breaking my script:
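The gist itself is not embedded in this copy, but the loop I used was essentially the following reconstructed sketch. The retry counts and sleep ranges match the steps described below; the headers and URL handling are simplified:

```python
import random
import time

import requests
from bs4 import BeautifulSoup

def fetch_review_page(url, retries=4, pause=(10, 15), backoff=(240, 360)):
    """Fetch a review page with a limited retry-loop (simplified sketch)."""
    for attempt in range(retries):
        try:
            time.sleep(random.uniform(*pause))    # polite pause before each request
            response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
            response.raise_for_status()           # raises on 4xx/5xx, e.g. an IP-block
        except requests.RequestException:
            print(f"Warning: attempt {attempt + 1} failed, sleeping before retry...")
            time.sleep(random.uniform(*backoff))  # wait 4-6 minutes for the block to lift
            continue                              # back to the top of the loop
        else:
            break                                 # connection succeeded: leave the loop
    else:
        # All retries exhausted without a successful connection
        raise Exception("Something really went wrong here... I'm sorry.")
    return BeautifulSoup(response.text, "html.parser")
```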

So this is what the for-catch loop does in this instance:

  1. Start the first try from range 0–3 (for).
  2. Sleep for 10–15 seconds and try to access the Metacritic URL (try).
  3. If this doesn’t work (except), display a warning message and let the script sleep somewhere between 4 and 6 minutes. After that, execute the try-part again. Do this until the loop is finished (from range 0–3, so a maximum of four tries or three retries, whatever you want to call it) or until the try-part works.
  4. If the connection is successful (else), break the loop. This means the code will execute the last line (i.e. creating a BeautifulSoup object, the end goal of this loop).
  5. If the script was able to complete the entire loop — which is bad news in this case — execute the else-clause outside the loop and throw an exception (“Something really went wrong here…I’m sorry.”)

Since I scraped the reviews of around a thousand movies in total from an equal number of pages, I really needed a script that I could just execute and go about my day without worrying about the ip-blocks it would surely bump into. And this is what the for-catch loop afforded me: peace of mind. It worked its magic for my MC scraper: the ip-blocks were always lifted after a couple of minutes. For some other platforms I had to experiment and come up with more extreme sleep ranges (e.g. between 10 and 15 minutes), but the for-catch always did the trick.

When I showed this to a friend of mine, he was bewildered by the outer else-clause, which actually functions as an except here. The confusion is understandable, since we’re used to interpreting else-clauses as part of a try-except or if-clause. However, the else-clause behaves just as it does in the classic try-except-else-finally structure mentioned earlier: there, too, the else-clause is executed once the try-part has run its course (and did not throw an error). Likewise, the else-clause of a for-loop is always executed when the loop has exhausted its iterations; that’s it. In this case, we obviously hope we can break out of the loop before it ever finishes (i.e. before we have used up all our retries).
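If the behaviour still feels counter-intuitive, this tiny standalone snippet shows both outcomes side by side:

```python
def ran_else(break_at):
    """Return True when the for-loop's else-clause executes."""
    for attempt in range(3):
        if attempt == break_at:
            break          # breaking out skips the else-clause
    else:
        return True        # only reached when all iterations ran
    return False

print(ran_else(1))   # False: we broke out early
print(ran_else(99))  # True: the loop exhausted its iterations
```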

The for-catch loop has helped me in unexpected ways as well and has often served me as a less time-wasting and script-breaking alternative to the while clause.

Take the following example from a script I wrote for switching between NordVPN servers on Linux or Windows (available on Github right here). Somewhere within the script, I wanted to:

  • Fetch and display the current ip
  • Connect to a new server
  • Fetch and display the new ip

Although this seems simple, the third part can be somewhat tricky for two reasons:

  1. Even after connecting to a new server, it can take a little while (especially on Windows) before you can successfully request your new ip. So if the NordVPN app is still busy switching servers, you’ll get a connection error. At the same time, though, you don’t want to perform the new ip-request too early either, since the odds are relatively high that you’re actually still browsing the web through your old NordVPN server. In that case, you’re just requesting the old ip.
    This means we’ll need a for-catch loop with an additional check whether the ip requested is different from the previous ip.
  2. Sometimes (again, only on Windows) the NordVPN app gets stuck and you need to reconnect to a different server.
    This means we’ll need another for-catch loop.

So essentially we end up with a for-catch loop within another for-catch loop (catch-ception!). The simplified version of the code I actually used looks something like this:
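The actual script is not embedded in this copy, but a reconstructed, simplified sketch looks like this. Here `get_current_ip` and `connect_to_server` are hypothetical stand-ins for the NordVPN-specific calls in the real script, and the sleep durations are exposed as parameters:

```python
import time

def rotate_ip(get_current_ip, connect_to_server,
              connect_retries=5, ip_retries=12,
              ip_pause=5, reconnect_pause=10):
    """Switch servers until a genuinely new ip is observed (simplified sketch)."""
    current_ip = get_current_ip()
    for connect_attempt in range(connect_retries):        # outer for-catch
        connect_to_server()                               # ask the app to switch servers
        for ip_attempt in range(ip_retries):              # inner for-catch
            try:
                new_ip = get_current_ip()                 # may fail mid-switch
                if new_ip == current_ip:
                    raise ValueError("still the old ip")  # requested too early
            except (ConnectionError, ValueError):
                time.sleep(ip_pause)                      # give the app a moment
                continue
            else:
                break                                     # success: a new ip came back
        else:
            new_ip = current_ip                           # retries exhausted: let it be
        if new_ip != current_ip:
            return new_ip
        time.sleep(reconnect_pause)                       # app got stuck: reconnect
    raise Exception("Could not obtain a new ip.")
```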

This code snippet avoids an infinite loop, incorporates multiple retries, and at the same time avoids wasting time. The script makes a total of 12 retries to fetch a new ip. The first one or two tries will inevitably come too early, resulting in the same ip (new_ip == current_ip), and the script will pause for 5 seconds before retrying. However, as soon as a new ip is successfully requested, the for-catch breaks. If there’s still no new ip after a minute (5 seconds * 12), the else-clause lets it be (pass), but then the script gets caught up in another for-catch clause (although I opted for an if-check there instead of an except). If the ip hasn’t changed, the script sleeps for 10 seconds and tries to connect to a new server again (for a maximum of 5 times).

I hope I have demonstrated the usefulness of the for-catch loop and why it is especially helpful for many web-scraping applications. As a more flexible alternative to the while statement, it grants the script a finite number of retries, letting you pause the scraping process when necessary and do whatever you want once the script has exhausted its retries.

Happy scraping!
