Web Automation

Isaac de la Peña · Published in Algonaut · 7 min read · Jun 25, 2019

(Originally published November 23rd, 2017)

Information is power. This is such a basic principle in finance that those who make privileged use of it get seriously penalized. On the other hand, it is perfectly legitimate to analyze widely available information in a novel way to obtain an advantage in the markets. But even when such information is publicly available on the Internet, there will not always be a beautiful JSON or XML API ready for consumption by algorithms, so having some basic notions of web automation will be of great help in our work.

Actually the heading “web automation” goes beyond the mere collection of data, a reviled concept known as web scraping, and also includes the possibility of interacting with the web pages themselves, verifying credentials, filling in forms, providing data and activating services. That is, full bi-directional autonomous interaction.

A practical example would be the connectors that we have developed at Ágora Asesores Financieros to find the daily quotes of certain exotic funds (not available in the data pipes of Bloomberg or Factset) and update them selectively in the cloud that holds the portfolios of our clients, saving a lot of time, effort and mistakes.

The Python programming language provides us with a series of simple ways to perform web automation. We are going to give a brief introductory tour through them with increasing degrees of functionality.

1. Bare, Naked and Raw

The basic module for Internet interaction in Python is requests. With that alone we already have enough to work with. For example, let’s get the X-Trackers ETF quote on the Euro Stoxx 50 from Morningstar.

import requests as req

# Download the full page and isolate the value between two text markers
res = req.get("http://www.morningstar.es/es/etf/snapshot/snapshot.aspx?id=0P0000HNXD")
text = res.text
start = ">EUR\xa0"
end = "<"
start_pos = text.index(start)
end_pos = text.index(end, start_pos)
print(text[start_pos+len(start):end_pos])

First we download the entire web page from Morningstar, and then find the value that is between the two text strings contained in the start and end variables.

No big deal, but it is not a very robust method, and the expressions needed to isolate the text strings can become very ugly very fast. Regular expressions can save us some of that trouble, at least temporarily; for example, the previous code would become:

import requests as req
import re

res = req.get("http://www.morningstar.es/es/etf/snapshot/snapshot.aspx?id=0P0000HNXD")
# \W matches the non-breaking space after "EUR"; [^<]* captures the quote itself
print(re.findall(r">EUR\W([^<]*)<", res.text)[0])

Brief, no doubt about it, but regular expressions can also be devilishly complex to decipher. Nor does this solve the fundamental problem of fragility: if the text of the page changes even a bit, without any change to its structure, or if for instance the text “EUR” appears in an earlier section, the system will fail.

2. With Structure

Luckily we have another option: instead of searching through plain text, we can navigate the logical structure of the web page we are visualizing, as defined by its HTML code. Python offers the BeautifulSoup module, which lets us navigate that structure with CSS selectors once we have installed it:

> pip install bs4

Our code would then become this:

import requests as req
from bs4 import BeautifulSoup as soup

res = req.get("http://www.morningstar.es/es/etf/snapshot/snapshot.aspx?id=0P0000HNXD")
html = soup(res.text, "html.parser")
# First cell of class "text" inside the table of the overviewQuickstatsDiv element
print(html.select("#overviewQuickstatsDiv table td.text")[0].text)

We still have to strip the “EUR” part if we want, and that is it. The CSS search can be read as “find the element identified as overviewQuickstatsDiv and give me the content of the first cell of class text inside its table”. As long as the structure of the page does not change, which is rather unlikely, our search will succeed.
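If we want the quote as a number rather than a string, the clean-up takes only a couple of string operations. Here is a minimal sketch, assuming the cell text looks like “EUR 43,59”, with a non-breaking space and a Spanish decimal comma:

raw = html.select("#overviewQuickstatsDiv table td.text")[0].text  # e.g. "EUR\xa043,59"
price = float(raw.replace("EUR", "").replace("\xa0", "").replace(",", "."))
print(price)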

3. With Sessions

Sometimes the web services will not be available directly; instead we will have to identify ourselves (in Internet terms, start a session) and perform a series of steps before completing our task. No problem, Python is also willing to lend us a hand here.

Although the requests module itself has support for sessions, the handling of forms, cookies and state can quickly become complex, so it is advisable to use a higher-level library that encapsulates these tasks and turns our journey into a simple walk through the digital park: following links, filling out forms and pressing buttons.

Mechanize has traditionally been a highly dependable library, but unfortunately it has become somewhat outdated, since it only supports Python 2.x. RoboBrowser, in contrast, gives us full support for Python 3.x.

> pip install robobrowser

For these more elaborate examples I have created a project-lab on GitHub called ScrapHacks that you can clone on your local machine to perform your own experiments. You’re welcome!

import requests as req
from lxml import etree

url = "https://www.quefondos.com/es/planes/ficha/?isin=N2676"
resp = req.get(url)
html = etree.HTML(resp.text)
# XPath expression pointing at the span that holds the quote
value = html.xpath("//*[@id=\"col3_content\"]/div/div[4]/p[1]/span[2]")
print(value[0].text)

For session management it is worth reviewing the file pricescrap.py, where we combine RoboBrowser with an alternative to BeautifulSoup called lxml, used in the snippet above and interesting because it supports XPath expressions in addition to CSS:

> pip install lxml

In the example file we download the prices of three financial assets with methods similar to the previous ones, but then we navigate to a service in the cloud, log in with private credentials and push those price updates into specific client portfolios.
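The session flow in pricescrap.py follows the usual RoboBrowser pattern: open the login page, fill in the form, submit it and keep navigating as an authenticated user, with the cookies handled for us. Here is a minimal sketch of that pattern; the URL, form id, field names and selector are placeholders, not the ones used in the actual script:

from robobrowser import RoboBrowser

browser = RoboBrowser(parser="html.parser")
browser.open("https://example.com/login")     # hypothetical login page

form = browser.get_form(id="login-form")      # hypothetical form id
form["username"].value = "my_user"            # placeholder credentials
form["password"].value = "my_password"
browser.submit_form(form)                     # the session cookies are kept for us

# From here on we browse as an authenticated user
browser.open("https://example.com/portfolios")
print(browser.select("td.price")[0].text)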

4. With Browser

Although in the previous example we talked about “surfing” and used the term browser in our code, it is good to understand that it is just a metaphor. RoboBrowser is not a full web browser in the sense that Chrome, Firefox or Safari are: it emulates part of their functionality but lacks many other parts, such as the ability to execute the JavaScript code associated with the pages.

This is important. Sometimes the JavaScript code is purely decorative, but in other cases executing it is fundamental to interpret the page correctly. Traditionally, the active composition of the page was done on the server side and what reached the client was a static document; but due to certain development frameworks, as well as for reasons of security and flexibility, more and more of that composition is now performed on the client side. In such cases, being unable to interpret JavaScript means failing miserably.

But let’s not throw in the towel so soon. Python has an excellent integration with Selenium, a project that will allow us to take control of the browser of our choice and act as if we were sitting in front of the machine, clicking on buttons and filling in boxes so that it is practically impossible to distinguish a human session from an automated one.

from selenium import webdriver

driver = webdriver.Chrome()
driver.set_window_size(1000, 1000)
driver.get("https://www.duolingo.com")
# "credentials" is a dictionary with the username and password, loaded elsewhere
driver.find_element_by_id("sign-in-btn").click()
driver.find_element_by_id("top_login").send_keys(credentials["username"])
driver.find_element_by_id("top_password").send_keys(credentials["password"])
driver.find_element_by_id("login-button").click()

Please excuse me while I open a small parenthesis: I am a regular user of the amazing services of Duolingo, with which I have already learned several languages and hope to learn many more. However, the mobile app does not show the grammar lessons (for example the “Tips and Notes” section at the bottom of the page), which can be ignored when a Spaniard learns Portuguese but become essential if you want to survive learning a language as distant as Russian.

However, a gentleman does not complain about a gift, and in any case I am more a person of action, so for this example I have created a simple script called duolingoscrap.py that extracts the grammar lessons and groups them into a convenient summary. The challenge is that the Duolingo website is rendered on the client side, which makes the use of Selenium, combined with BeautifulSoup, essential.
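The combination itself is simple: Selenium drives a real browser so the JavaScript gets executed, and once the page has been rendered we hand the resulting HTML to BeautifulSoup for the structured search. A minimal sketch of that hand-off, where the URL and the selector are placeholders rather than the ones used in duolingoscrap.py:

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.duolingo.com")   # placeholder URL
time.sleep(5)                            # crude wait; a WebDriverWait on a concrete element is more robust

# Parse the rendered DOM, not the original server response
html = soup(driver.page_source, "html.parser")
for paragraph in html.select("p"):       # placeholder selector
    print(paragraph.text)

driver.quit()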

5. With Traffic Control

Up, up we go in our pyramid. What do we have left, now that we are able to navigate the web like a human? Well, maybe some extra-human capabilities. Sometimes the JavaScript code is so incredibly convoluted, often with the explicit intent of keeping it safe from eyes as curious as ours, that it is impossible to find the web element we want to extract.

In such cases what we can do is observe directly the traffic that enters and leaves our machine in order to find that element. We achieve this by combining Selenium with BrowserMob Proxy as an intermediary (a proxy) between our browser and the world. It is a program written in Java, so we will need a JRE running on our machine, but its convenient Python wrapper lets us work with it as if it were just another piece of our Python arsenal.

import os
from selenium import webdriver
from browsermobproxy import Server

# Path to the BrowserMob Proxy launcher that ships with the Java distribution
browserMob = ".%sbrowsermob-proxy-2.1.4%sbin%sbrowsermob-proxy" % (os.path.sep, os.path.sep, os.path.sep)
server = Server(browserMob)
server.start()
proxy = server.create_proxy()

# Route the Chrome session through the proxy so we can record its traffic
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server={0}".format(proxy.proxy))
driver = webdriver.Chrome(chrome_options=chrome_options)

proxy.new_har("safaribooks")   # start recording a HAR archive of the traffic
driver.get(url)                # "url" is the page we want to inspect
har = proxy.har
for entry in har['log']['entries']:
    pass  # processing

So we want to download a video hidden in the code? I may not understand how the video gets activated, but I just set it in motion, and when I see a multimedia element in my traffic I capture it. Done.

This is precisely the use case of the safarihacks.py script, which uses Selenium to log in with a test account to the SafariBooks online library and then serially download the books in the catalogs of our interest. In the case of multimedia courses, the code relies on BrowserMob Proxy to identify and download the video files.
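The “# processing” step above boils down to scanning the recorded HAR entries for responses whose MIME type looks like video and saving their URLs. A minimal sketch of that filter, assuming the files can be fetched again directly with requests (in practice the session cookies may need to be passed along as well):

import requests as req

# Keep only the requests whose response carried a video payload
video_urls = [
    entry["request"]["url"]
    for entry in har["log"]["entries"]
    if "video" in entry["response"]["content"].get("mimeType", "")
]

for i, url in enumerate(video_urls):
    resp = req.get(url, stream=True)
    with open("video_%02d.mp4" % i, "wb") as f:   # the extension is an assumption
        for chunk in resp.iter_content(chunk_size=1 << 16):
            f.write(chunk)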

Finally, the icing on the cake comes in the form of an integration with PDFReactor to automatically convert the downloaded books to PDF format.

With these basic techniques you can create very powerful web robots (also called spiders). Have fun experimenting with these ideas in your own projects, and if this post has been useful to you, I would appreciate it if you recommended it and shared it on your social networks.

Check as well this continuation article in which we cover the very relevant topic of Hybrid Web Automation.

Good luck and see you next time!

Summary

Python provides means to turn web automation into a very easy task.

Don’t miss https://github.com/isaacdlp/scraphacks with practical examples: pricescrap.py (RoboBrowser and lxml), duolingoscrap.py (Selenium and BeautifulSoup) and safarihacks.py (Selenium and BrowserMob Proxy).

Useful components: requests, BeautifulSoup, RoboBrowser, lxml, Selenium and BrowserMob Proxy.
