Crawling the Web

Samir Gadkari
6 min readNov 28, 2018

--

I tried extracting information from HTML pages by writing Python code. This is what I had to think through to process the data.

TLDR; It worked , but I still had to do some manual editing. I should have explored the data, before writing code to do the actual task. Next time.

Also, I’m using the Chrome browser, and if you’re not, these instructions will vary.

An example

I decided to get the CIA Factbook data from the CIA website. I considered the option of using a web crawler to get the data. Here’s some information on what a web crawler should consider.

robots.txt

Downloading data from a website is not as easy as it looks. The robots.txt file on a website indicates which areas your bot can see. Here’s an example of the file from Yahoo:

User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /bin/
Disallow: /includes/
Disallow: /blank.html
Disallow: /_td_api
Disallow: /_tdpp_api
Disallow: /_remote
Disallow: /_multiremote
Disallow: /_tdhl_api
Disallow: /digest
Disallow: /fpjs
Disallow: /myjs

As you can see, there are many parts of the Yahoo website that a robot may not go. Anything that is not disallowed, is open for web crawlers. If the website wanted to not allow any robots at all, it would put this in the robots.txt file: User-agent: * Disallow: /

User-agent

Anyone can spoof the user-agent. This value is not considered important. To find your user-agent, look in your browser. For Chrome:

1. Open the "Developer Tools"2. Select the "Network" tab 3. Load Google's home page. You will see a lot of scrolling in the Network tab. 4. Select the topmost item in the list (this will be the GET request to google.com). 5. On the "Request Headers" - click "view source". 6. Check the "User-Agent" line in your browser. In my case, the user-agent was: 
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36

Delay

Now that you can visit some of Yahoo’s website, you must make sure that you don’t visit too often. It is better to put a delay between visits — something like a few seconds. This way, the website owners will not block you from visiting.

Other better options

If you can avoid scraping the website, that would be the option to take. For my case, the CIA Factbook was available for download here.

Another option is if you can get an API (ex. Twitter API to download tweets). Then the work you have to do is to understand the API and use it. Many people have used their API, and you can too. Search in your favorite browser for anything you don’t understand about the API.

Understanding HTML

Looking at the HTML code

Now that you have the HTML, you have to extract information from it. One way is to see the HTML side-by-side with the web page:

  1. Open the downloaded file in your browser. You should give the full pathname to your file (ex. file:///Users/YourName/data/file1.html)

2. Open Developer Tools. This will open a window on the right side of the page. Select the Elements tab on that page.

3. Select a value on the HTML page. Right click, and select ‘Inspect’. The right side will expand to the correct location. This beats having to browse through the whole HTML and opening sections until you find the spot.

4. Keep selecting the little arrows in the right window to open some other sections in the HTML. As you move your mouse over the HTML, it will highlight the selection on the left hand side. This way, you can narrow down to the section you want to look at

Understanding basic HTML syntax

Angle brackets contain HTML tags. The name within those brackets is the name of the tag. Here we show two tags (the start and end tags of the element). A ‘/’ prepends the end tag’s name.

<body>
</body>

Some tags do not have an end tag. These tags end with a space followed by a ‘/’ after their name. This is a line break tag and produces an extra vertical space:

<br />

BeautifulSoup

Python’s BeautifulSoup allows you to process HTML code. This is the official website, and the documentation is good. I will only go through a minor part of what BeautifulSoup can do. You should read the documentation to get the most out of it.

To process the HTML file, you read it in it’s entirety as a string, and pass it to BeautifulSoup.

with open(filename, 'r') as f: 
soup = BeautifulSoup(f.read(), 'html_parser')

Of course, you should put the above in a try/except block. Library functions keep changing. If you update your libraries (pandas, numpy, etc.), the library code changes. If some function you’re using misbehaves and throws an exception, you want to catch it.

soup.find() finds the first tag with the given name. soup.findall() returns an object containing all tags with the given name. Specify a name, attribute, or both in those function calls.

Finding data

This is what my HTML looked like:

<td valign="top" width="200" style="border-right:2px solid white; padding-left:4px;" class="fl_region"><a href="../geos/af.html" style="font-weight: bold;" >Afghanistan</a>
<td class="category_data" valign="top" style="padding-left:5px;">
2.629% (2009 est.)
</td>

Yes, I know it’s grungy. Customers don’t look at HTML code, so it may not be pretty.

Notice the first tag has the ‘class’ attribute whose value is ‘fl_region’. Its child is the tag, whose child is the string ‘Afghanistan’. The tag’s sibling is the tag with the class ‘category_data’, and its child is ‘2.629% (2009 est.)’.

What’s more, there is another hidden child, between the tag and the second tag, and that is the ‘\n’ (carriage return) value.

This code gets you all the tags that match the class value, and the country name. This is an example. You should find the data using “category_data”, and then put it in a dictionary with the key as the country name.

matching_tds = soup.find_all('td', class_ = 'fl_region')
for child in matching_tds:
if (child.name == 'a') and len(child.string > 0):
country = child.string.strip()
if ((child.name == 'a') and \
(len(child.string) > 0)):
country = child.string.strip()

Conclusion

Text can contain whatever the user wants to put in. Since HTML is text, it can be different from tag to tag. In the above example, if the tags were not exactly the same, your code would need to change. If the values were not exactly the same, your code will need to change further.

Having gone through this now, I would change my approach like this:

1. If possible, use an API to get the data.  This puts the onus on the API creator to have the data in a good format.  It puts the onus on you to learn the API.  If you're going to use the API over and over again, it's less work to learn it in the beginning.2. Write code to find out more information about how similar your tags/values are.  Or read the HTML file to check if they're consistent.  For a small file, this is fine.  For a large file, or many files, you may have to write at least some code.3. Put the output of your code into a dictionary.  The keys will be the country name (for example), and the value the tag with the data.  The idea is to put the more variable data as the key. Since dictionary keys are unique, you will see all unique keys without repeats.
You can see how to structure your code to process them further. This is in case there are sub-structures under the key.
4. Decide how to code for the variability in the keys, and go for it.

Originally published at samirgadkari.github.io.

--

--

Samir Gadkari

Software/Firmware Engineer with 20 years experience. Looking to solve problems in the Data Science field.