Fighting NULL in web scrapers

James McKinney
5 min read · Jun 26, 2017

One of the last things _why wrote is:

To me, fighting NULL is the epitome of why I struggled as a programmer. I am not a natural at it, but I wanted very much to be — and I found no use for NULL. I never needed it, but it was always there. I kept pushing it down, painting over it, shutting it up, constantly checking for it — “Are you NULL? Are you NULL? What about you?” — and sometimes I would deceive myself, that my problems were other things, but then NULL would pop up, I would find that it was the cause — however, NULL is never really the cause. It is someone you always run into in bad situations, someone you never want to see. NULL penetrates all the layers to find you, and can only say, helplessly, “Looks like you’re having a problem.” Endemic to the problem, not the problem, complicit, and might be the problem.

I’ll come back to NULL.

A web scraper is a computer program that extracts machine-readable data from human-readable websites. EveryPolitician, for example, uses scrapers to maintain the largest open database of politicians in the world. When a legislature changes the layout of its politicians’ webpages, the scraper may no longer function — it may ‘break’ — because data elements (like dates of birth or death) are no longer where the scraper expects to find them.

A break is good, because it alerts the programmer that they need to fix an issue. It’s much worse when the scraper continues to function despite a layout change and extracts the data incorrectly. For example, imagine a scenario where a legislature swaps the positions of data elements, such that the scraper finds the death date where it expects to find the birth date, and vice versa. Suddenly, all the politicians are dead before they are born. Or the legislature moves the position of the death date such that the scraper finds nothing for it. Suddenly, all the politicians are alive (forever!).

Obviously, no programmer wants their scraper to behave like this, but to avoid it, they would need to write code in a different way from how they would in an environment with more predictable, better behaved input. As a result of writing and maintaining hundreds of scrapers over a decade, I’ve learned this way of writing code. This post describes some of the differences.

My approach to writing a scraper is to extract the data such that the scraper tests whether the layout has changed at the same time. The scraper is strict (if it can’t find a required data element like a politician’s given name, it will break), validates itself (if the text for a birth date can’t be read as a date, it will break), and otherwise fails on unexpected input. In other words, the scraper is a view spec for the webpage it is scraping.
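
As a rough sketch of what this can look like in Ruby (the page structure, element names, and date format here are invented for illustration):

require 'date'
require 'nokogiri'

page = Nokogiri::HTML(response.body)

# Break immediately if a required element isn't where we expect it.
node = page.at_xpath('//span[@class="given-name"]')
raise 'expected //span[@class="given-name"] to be found' if node.nil?
given_name = node.text

# Break if the birth date text can't be parsed in the expected format.
birth_date = Date.strptime(page.at_xpath('//span[@class="birth-date"]').text, '%Y-%m-%d')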

It can be counter-intuitive to write code that handles fewer inputs and that breaks more easily and regularly. But the greater effort upfront to write a scraper in this way lessens the maintenance burden and error rate later.

A few examples of do’s and don’ts of this defensive approach, in the Ruby programming language:

Don’t

This passes silently if no tr HTML tags are found on the webpage (if, for example, the scraper is extracting a list of people from a table):

nodes = Nokogiri::HTML(response.body).xpath('//tr')
nodes.each do |node|
  # extract data from each table row
end

Do

Instead, this would raise an error:

nodes = Nokogiri::HTML(response.body).xpath('//tr')
raise "expected //tr to be found" if nodes.empty?
nodes.each do |node|
  # extract data from each table row
end

Note: The Python programming language has an assert statement for this:

assert nodes, "expected //tr to be found"

Don’t

This sets name to NULL if the 'full-name' key is missing:

name = JSON.load(response.body)['full-name']

Do

Instead, this would raise an error:

name = JSON.load(response.body).fetch('full-name')

Note: In Python, the [] operator raises a KeyError if the key is missing; you don’t need to use a less common method, like fetch in Ruby, to cause a failure.

Don’t

This sets value to NULL if the regular expression doesn’t match:

value = string[/re(ge)x/, 1]

Do

Instead, this would raise an error:

value = string.match(/re(ge)x/)[1]

Note: In Python, the corresponding re.search('re(ge)x', string).group(1) fails if the pattern doesn’t match (search returns None, so .group raises an AttributeError); you don’t need to use a less common method, like match in Ruby.

Don’t

These set number to zero if value is NULL or a non-numeric string:

number = value.to_i
number = value.to_f

Do

Instead, these would raise an error:

number = Integer(value)
number = Float(value)

Note: In Python, the built-in functions int() and float() fail if value is NULL or a non-numeric string; you don’t need to use less common functions.

Ruby versus Python, and NULL

In general, Python is more strict than Ruby; it raises errors if expectations aren’t met, and requires you to relax expectations deliberately. Ruby, on the other hand, allows NULL to creep in all over the place. You need to learn many unconventional methods to avoid NULL—to raise expectations and force errors as above. This is what _why referred to as “fighting NULL.”

Some other errors that don’t occur in Python but do in Ruby:

Below, accidentally using #each instead of #map assigns to result the original list instead of the transformed list. Impossible in Python:

result = list.each do |item|
  item.strip.downcase
end
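
Using #map instead would assign the transformed list:

result = list.map do |item|
  item.strip.downcase
end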

Below, accidentally adding a line after the value that should be returned from an if-statement’s else branch introduces a bug. Impossible in Python:

result = if criterion
  'a'
else
  'b'
  logger.info('debug')
end
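
Because each branch returns its last expression, the logging call has to come before the value (or move outside the if-statement entirely):

result = if criterion
  'a'
else
  logger.info('debug')
  'b'
end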

Many bugs like these creep in through the cracks created by programmers wanting to write code that is as brief as possible.

If you’ll be writing and maintaining a large number of scrapers, or if you’ll be accepting scrapers contributed by volunteers, you may prefer Python, because it’s stricter by default than Ruby, which will make code easier to validate and maintain.

That said, scrapers in Python aren’t invulnerable to NULL creeping in. The following sets name to the empty string if the data element isn’t found:

name = page.xpath('string(//div[@id="fullname"])')

Instead, this would raise an error:

name = page.xpath('//div[@id="fullname"]/text()')[0]

To round off, here are a few more things I’ve learned about this way of scraping:

  • In Ruby, call DateTime.strptime(string, format) on date strings that are expected to be consistently formatted (like %m/%d/%Y for “12/31/2010”) in order to raise an error if the date string is incorrectly formatted (see the sketch after this list).
  • Relax your scraper’s expectations as little as possible. For example, if the scraper uses the regular expression C\d{3} and it encounters “D123”, don’t change it to [A-Z]\d{3}. Change it to [CD]\d{3} instead.
  • Investigate every case where an expectation is broken. If you’re writing a scraper and an XPath selector fails on a page, look at the page to see why it fails. Don’t just assume the data element is missing or optional.
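
As a sketch of the first point, assuming a page that provides dates like “12/31/2010”:

require 'date'

DateTime.strptime('12/31/2010', '%m/%d/%Y')  # parses as December 31, 2010
DateTime.strptime('12-31-2010', '%m/%d/%Y')  # raises ArgumentError: invalid date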

And, finally, some tips on writing stricter regular expressions (a short example follows the list):

  • Use the \A and \z anchors as much as possible, to avoid matching text at unexpected positions.
  • Use the {n,m} quantifier, instead of * or +, if you know the number of characters to expect.
  • Use [A-Za-z] to match letters only, instead of \w, because \w will also match numbers and underscore.
  • Use short bracket expressions like [abc] instead of character classes like \w unless the text you want to match actually covers the character class.
  • Avoid using .+, because it matches almost anything. In many cases, you can use \S+ or .{n,m} instead.
  • Avoid using case-insensitive matching. Instead, change the appropriate bracket expressions to allow uppercase and lowercase letters.
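
Putting a few of these tips together, a stricter version of the earlier [CD]\d{3} example might look like this (the identifier format is invented for illustration):

# Loose: \w+ also matches underscores, longer strings, and fragments of other text.
value = string[/\w+/]

# Strict: anchored, exact length, letters restricted to the expected set.
match = string.match(/\A[CD]\d{3}\z/)
raise "unexpected identifier format: #{string.inspect}" if match.nil?
value = match[0]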

Thanks to Lex Gill for editing this post.
