Hybrid Web Automation

Isaac de la Peña
Algonaut
Aug 30, 2020 · 7 min read

When the captcha pops, the algo stops

In our prior article Web Automation we covered our main options when it comes to programmatically interacting with web-based content, depending on the level of complexity required for the task at hand: from raw transactions, structured content, sessions, and browsers, to traffic control via proxies. If you are new to this field I recommend you start there, as it provides a step-by-step introduction and plenty of examples in Python.

In fact, all you need is there… except that many content owners hate web automation (often because they want to apply discriminatory pricing tactics and force you to subscribe to their expensive premium API for the same functions) and try to prevent it by any means possible. The most popular of these is the captcha screen, which presents a challenge that requires human cognitive capabilities to solve before you can proceed with your browsing.

Sample Instagram Captcha screen

A Better Mousetrap

Of course you can deploy countermeasures in your code, both traditional and AI-based, in order to avoid triggering such captchas: throttle the velocity of your actions, space out your requests unevenly, add sporadic mouse movements and clicks… but no matter how carefully the rodent tiptoes through the site, it is bound to fall into one trap or another eventually, because in this cat-and-mouse game the feline always holds the upper hand (or paw): the precise triggers are unknown, they may change over time, and some sites even implement trigger-less captchas at unexpected times “just in case”, similar to the random TSA screenings that happen at airports.
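As an illustration, the "space out your requests unevenly" countermeasure can be as simple as a randomized delay between browser actions. This is a sketch under my own naming (`human_delay` is not part of the article's repository):

```python
import random
import time

def human_delay(base=2.0, jitter=1.5):
    # Sleep for a base interval plus a random extra, so that consecutive
    # requests are never evenly spaced (even spacing is a bot fingerprint).
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling it between every `driver.get` or click keeps the traffic pattern irregular without slowing the whole run down too much.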

And while the first generation of captchas (e.g. “type the digits that you see in this image”) could be defeated by implementing context-specific AI algorithms, the captchas widely used nowadays have a level of sophistication (e.g. “find specific objects in this series of images, which change over time”) that makes any attempt at automation too time-consuming and impractical.

Dropbox’s “turn the animal until it is standing” sophisticated Captcha

So it seems that we are condemned to this motto: when the captcha pops, the algo stops. Our process breaks, our tasks don’t complete and frustration ensues. But there’s an alternative: if the captchas require human cognition… then let’s bring a human into the equation! That’s what we call Hybrid Web Automation: a system that executes independently for the most part but, when faced with an unexpected situation (such as a captcha screen), requests the assistance of a human counterpart and waits patiently until all is clear to resume normal operations, instead of crashing down on the spot.

Example: Automating Instagram

To make our explanation as practical as possible, we are going to apply hybrid automation in Python to the particular use case of downloading all the pictures of an Instagram profile of our choosing.

It is important to remember, though, that “web automation” goes beyond the mere collection of content: it also includes the possibility of interacting with web pages by filling in forms, providing data and activating services. That is, bi-directional autonomous interaction. But for our purposes web scraping is the simplest case to portray, an MVP of sorts.

Essentially what we need to do is to create a wrapper around our methods using the Proxy Design Pattern such that we don’t call them directly but always via the proxy. Thus instead of:

def do_something(param1):
    driver.get(param1)

We will write our functionality as:

def do_something(param1):
    proxy(_do_something, param1)

def proxy(fun, param1=None):
    try:
        return fun(param1)
    except:
        pass  # Manual error handling goes here!

def _do_something(param1):
    driver.get(param1)

Which reads as: when we request do_something it calls the proxy, which calls the inner _do_something method which in turn executes the required functionality from the browser. Should the task fail at any point, the process rolls back to the proxy where it stops (and calls the human using, for instance, visible and audible signals) until the eventuality is handled.
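The same idea can also be expressed as a Python decorator, which saves writing a separate `_do_something` wrapper for every method. This is a sketch under my own naming (`proxied` and its `handler` callback are not the repository's actual implementation):

```python
import functools

def proxied(handler):
    # Decorator factory: wrap a function so that, on failure, a handler
    # (typically a human at a prompt) decides whether to retry or re-raise.
    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            while True:
                try:
                    return fun(*args, **kwargs)
                except Exception as exc:
                    if not handler(exc):  # handler returns True to retry
                        raise
        return wrapper
    return decorator
```

In practice the handler would prompt on the console, exactly as the reusable proxy shown later in this article does.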

Humans and Machines working together!

Sample Code

The complete working code has been published in the same Git repository used in the prior Web Automation article, using Selenium Chrome as our programmatic web driver:

https://github.com/isaacdlp/scraphacks

First we have to log in to Instagram. Our code supports both direct input of credentials and retrieval of stored cookies to reuse sessions (login frequency is often one of the trigger criteria for captchas, so we cover our backs by minimizing the need for re-authentication):

def _login(self, site):
    # [...]
    elif site == "instagram":
        if not self._cookies(site):
            self._print("Login to instagram")
            self.browser.get("https://www.instagram.com")
            self.wait()

            login_form = self.browser.find_element_by_css_selector("article form")

            login_email = login_form.find_element_by_name("username")
            login_email.send_keys(creds["username"])

            login_pass = login_form.find_element_by_name("password")
            login_pass.send_keys(creds["password"])

            login_pass.send_keys(Keys.ENTER)

            self.wait(5)
            self.browser.find_element_by_css_selector("nav a[href='/%s/']" % creds["username"])

            if self.use_cookies:
                cookies = self.browser.get_cookies()
                with open("%sscrap.cookie" % site, "w") as f:
                    json.dump(cookies, f, indent=2)

    return True
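The `_cookies` helper referenced above is not reproduced here. As a minimal sketch of what such a loader might look like (the function name and file-naming convention mirror the `json.dump` above, but this is an assumption, not the repository's exact code):

```python
import json
import os

def load_cookies(browser, site):
    # Counterpart to the json.dump above: restore a saved session so we
    # do not have to re-authenticate (and risk triggering a captcha).
    path = "%sscrap.cookie" % site
    if not os.path.exists(path):
        return False
    with open(path) as f:
        for cookie in json.load(f):
            browser.add_cookie(cookie)
    return True
```

Note that Selenium only accepts `add_cookie` once the browser is already on the matching domain, so a loader like this should run after an initial `browser.get` to the site.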

Then, our implementation of the do_something function (called _instagram) is rather simple and can be found below:

def _instagram(self, url):
    props = self._base(url)

    media = []

    try:
        self.browser.execute_script("document.querySelector('article a').click()")

        while True:
            self.wait()

            try:
                image = self.browser.find_element_by_css_selector("article.M9sTE img[decoding='auto']")
                srcset = image.get_attribute("srcset")
                srcs = [src.split(" ") for src in srcset.split(",")]
                srcs.sort(reverse=True, key=lambda x: int(x[1][:-1]))
                src = srcs[0][0]
                media.append({"type": "jpg", "src": src})
            except:
                try:
                    video = self.browser.find_element_by_css_selector("article.M9sTE video")
                    src = video.get_attribute("src")
                    media.append({"type": "mpg", "src": src})
                except:
                    pass

            try:
                self.browser.execute_script("document.querySelector('a.coreSpriteRightPaginationArrow').click()")
            except:
                break
    except:
        pass

    props["Media"] = media

    return props

It basically follows this routine:

  • Navigates to the target profile.
  • Traverses all media items sequentially, from last to first (because Instagram, like many other social sites, is built with a Progressive Feed pattern in mind: the older content loads once you scroll down the page).
  • Grabs the unique URL of each media item (thus decoupling the gathering of items from the actual download, again a strategy for captcha prevention).
  • Returns the list of media items as a “Media” property.
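The srcset handling inside the image branch above can be isolated into a small helper, which makes the “grab the highest resolution” step easier to follow and test. This is a sketch; `best_src` is my own name for it, not a function in the repository:

```python
def best_src(srcset):
    # An Instagram srcset looks like "url1 640w, url2 750w, url3 1080w".
    # Split each candidate into (url, width-descriptor), sort by the
    # numeric width descending, and keep the largest candidate's URL.
    srcs = [s.strip().split(" ") for s in srcset.split(",")]
    srcs.sort(reverse=True, key=lambda x: int(x[1][:-1]))
    return srcs[0][0]
```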

Please note that, despite its simplicity, the example has been extended to handle both images and videos off the shelf.

Running our Hybrid Automation (background) to scrape my own Instagram profile (foreground)

Reusable Hybrid Wrapper

What is most interesting in the example above is that both the login and the scraping methods use the same proxy function! Namely, this one:

def _proxy(self, fun, var=None):
    if self.browser:
        successful = False
        while not successful:
            try:
                return fun(var)
            except Exception as e:
                if self.interactive:
                    exc_type, exc_obj, exc_tb = sys.exc_info()
                    print("ERROR '%s' at line %s" % (e, exc_tb.tb_lineno))
                    cmd = self.default_cmd
                    if not cmd:
                        props = {"loop": 0}
                        if self.audible:
                            thread = threading.Thread(target=self._play, args=(props,))
                            thread.start()
                        cmd = input("*(r)epeat, (c)ontinue, (a)bort or provide new url? ")
                        props["loop"] = self.max_loop
                    if cmd == "r" or cmd == "":
                        pass
                    elif cmd == "c":
                        successful = True
                    elif cmd == "a":
                        raise e
                    else:
                        var = cmd
                else:
                    raise e

Reusability is the whole point: the code above handles errors gracefully no matter what the original task at hand was, loops a sound a configurable number of times (to call the human’s attention without being annoying in case they are busy with other matters) and presents a command-line prompt with the options of retrying the last URL, switching to a new URL, moving on to the next step, or exiting in case the error could not be addressed.
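The `_play` method itself is not shown above. As an illustration, its looping-sound logic could be sketched along these lines (hypothetical: `play_alert` is my own naming, and the real implementation may emit sound differently than a terminal bell):

```python
import time

def play_alert(props, max_loop=5, interval=1.0,
               beep=lambda: print("\a", end="", flush=True)):
    # Beep at regular intervals until either the human answers the prompt
    # (the main thread then bumps props["loop"] to max_loop, as _proxy
    # does above) or we give up after max_loop rounds.
    while props["loop"] < max_loop:
        beep()
        props["loop"] += 1
        time.sleep(interval)
```

The shared `props` dictionary is what lets the main thread silence the alert as soon as the operator responds.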

Even in this last situation (forced exit), implementing hybrid automation helps enormously: we are still left with an interactive session in the current browser to understand and debug the issue, instead of a crashed session, a closed browser and an ugly exception printout.

Help me help you, human :)

Furthermore, we can encapsulate our proxy into a Python object to keep the code and its actions portable across different web automation projects. That is precisely what we have done in the scrapper folder of our demo:

https://github.com/isaacdlp/scraphacks/tree/master/scrapper

Now the requirements to complete our particular Instagram example are simplified a great deal: first instantiate the object we just created, and then download the specific media items as we see fit. The code we showcase below can be found in the socialscrap.py file:

https://github.com/isaacdlp/scraphacks/blob/master/socialscrap.py

from scrapper import *
import requests as req

target = "isaacdlp"

folder = "download/%s" % target
if not os.path.exists(folder):
    os.mkdir(folder)

scrapper = Scrapper()
try:
    scrapper.start()
    scrapper.login("instagram")
    props = scrapper.instagram("https://www.instagram.com/%s" % target)
finally:
    scrapper.stop()

for i, prop in enumerate(props["Media"], start=1):
    res = req.get(prop["src"])
    if res.status_code != 200:
        break
    with open("%s/%s-%s.%s" % (folder, target, i, prop["type"]), "wb") as bout:
        bout.write(res.content)

Beyond that, please check the __init__.py file for more details on the wrapper implementation. It includes other advanced functions such as generalized cookie handling, page scrolling, and screenshot captures of websites. Feel free to make the code your own, adapt it to your specific needs and extend it to support other use cases.

You are most welcome! 😃
