Debugging a Python Web Crawler

About a month ago, we implemented a Python web crawler (thanks to David Liu) for collecting job information from job-seeking websites like zhaopin.com. It has been productive, crawling over 17,000 jobs over the past several weeks. But along the way, coffee has been spilled and servers have been taken down. Here’s what happened.

Details of how the crawler is built are beyond the scope of this article and will not be covered here.

Request line too large

The first “emergency” happened one day when we found that the scheduled crawl was not returning any data. I took the following steps:

  1. check the error log.
  2. resend the request.

When a bug emerges, always dig into the error log and try restoring the crime scene first.

In this case, there was no error log to be found on the main server, which accepts requests from our schedule service and commands the crawler service to execute. That took me to step 2: resending the same request to the crawler service. As it turned out, the same request got ‘Request line too large’, not as an error, but as a normal response. The problem was easily solved by switching from a GET API to a POST API, moving the oversized query parameters out of the request line and into the request body.
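As an illustration, here is a minimal sketch of that switch using the requests library; the endpoint URL and parameters are hypothetical, not our actual API.

```python
import requests

CRAWLER_URL = "http://crawler.internal/crawl"  # hypothetical endpoint

# A long keyword list pushes the query string past the server's
# request-line limit when sent as a GET request.
params = {"site": "zhaopin.com", "keywords": ",".join(["python"] * 500)}

# Before: everything is packed into the URL, so the request line can
# exceed the limit and come back as "Request line too large".
# resp = requests.get(CRAWLER_URL, params=params)

# After: the same data travels in the request body, which has no such limit.
resp = requests.post(CRAWLER_URL, json=params)
print(resp.status_code)
```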

WebDriver Exception

In a test run to figure out the staggering number of errors returned by crawling tasks, two exceptions caught my eye:

  1. WebDriverException: Message: Can not connect to the Service /root/…/linux/phantomjs
  2. OSError: [Errno 12] Cannot allocate memory

Between the two, I gathered that the second one must be caused by the first (wrong, as it would only be solved much later).

So I started to dig into the first one, and by dig I mean Google it. The top result was a friendly Stack Overflow question whose answer suggested using a different PhantomJS build. I did, and the problem was solved.
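The swap itself is just replacing the binary; for completeness, here is a sketch of pointing Selenium at an explicit PhantomJS executable, with the path as a placeholder.

```python
from selenium import webdriver

# Point Selenium at the replacement PhantomJS build explicitly;
# the path below is a placeholder for wherever the new binary lives.
driver = webdriver.PhantomJS(executable_path="/opt/phantomjs/bin/phantomjs")
driver.get("https://www.zhaopin.com")
print(driver.title)
driver.quit()
```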

MySQL server has gone away

This appeared just like the ‘Request line too large’ error did, in the crawler API response. It’s really a straightforward connection error, but to be fair, I had never seen it before.

I did some research, and most results suggested a connection-management problem, i.e. the connection may not have been closed correctly. But after digging into the crawler source code, no such misbehavior was found.

Then I asked for help, and David suggested modifying the mysqld config to set wait_timeout higher.

Conclusion: our crawler service was only called once a day, which means 24 hours between crawling tasks. With the default MySQL config, the connection gets closed by the server during that idle time.
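The actual fix was raising wait_timeout on the server side. For reference, here is a sketch of inspecting that variable and defensively reconnecting from Python; it assumes PyMySQL, and the connection details are placeholders (the article doesn’t say which driver the crawler uses).

```python
import pymysql

# Connection details are placeholders.
conn = pymysql.connect(host="localhost", user="crawler",
                       password="secret", database="jobs")

# Inspect the server-side idle timeout; with the default (28800 s = 8 h),
# a connection left idle for 24 h between daily runs gets dropped,
# producing "MySQL server has gone away".
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'wait_timeout'")
    print(cur.fetchone())

# Defensive complement to raising wait_timeout: re-establish the
# connection if the server has already closed it.
conn.ping(reconnect=True)
```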

When the server went down

It happened after a manual start of the crawler: our main server went AWOL. The first image above shows the memory data around that day; I didn’t realize what had actually happened until much later.

Back then, I only knew it was caused by the crawler service. It had happened many times before in our test environment, which I had blamed on the small amount of memory on our test server.

The problem is, a lot of our apps and services run on our main server, which at the time included the crawler service. In the moment, there was nothing to be done but a quick forced reboot of the main server. Several days later I moved the crawler service to a new server.

‘ascii’ codec can’t encode characters in position 1–3: ordinal not in range(128)

As I dug more into the crawling results, one error didn’t seem right. Let me explain: a ‘normal’ error would be something like ‘there’s no search block’, which the crawler returns as a crawling result. But this one was an encoding error, which clearly suggests a code-level bug that should be resolved.

I started by googling the error with the prefix ‘python’; the results were all related to the str and encode methods. So I simply searched the whole code repo for calls to str, and got nothing.

Then I thought, why not just run the crawler app in the console and watch the output? That paid off: when I saw a log line below the error that said ‘at publish_date’, it struck me that this must be a date-format conversion error, and it was.

When the crawler hits a mapped field called ‘publish_date’, it converts the value into a date string before the data flows into our admin app. This error occurred when some ‘publish_date’ values surprised our crawler with non-date text like ‘yesterday’, ‘2 days ago’, or even ‘just now’. So I changed the crawler’s config to tell the app not to convert ‘publish_date’ into a date, to treat it as a plain string, and to leave the rest to our admin app.
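The real fix was that config change, but as an illustration of the failure mode, here is a minimal sketch of defensive handling; the function name and date formats are hypothetical, and on a Chinese site like zhaopin.com the relative phrases are typically non-ASCII, which is what tripped the ascii codec in the first place.

```python
from datetime import datetime

def normalize_publish_date(raw_value):
    """Try to parse a crawled publish_date; fall back to the raw string.

    Relative phrases like 'yesterday', '2 days ago', or 'just now'
    don't match any date format, and force-converting them is what
    triggered the encoding error described above.
    """
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(raw_value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    # Leave anything unparseable as a plain string for the admin app.
    return raw_value

print(normalize_publish_date("2017-08-01"))  # parsed date
print(normalize_publish_date("2 days ago"))  # passed through untouched
```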

After solving this error, we were able to add over 2,000 new job entries.

Down to memory hell

The charts above show how the crawler server went to hell after some adjustments. A similar situation had actually happened repeatedly before; it was only by this point that I began to realize it was a memory problem.

When I looked into the memory usage, most of it was consumed by PhantomJS processes. After replaying the scene several times, I concluded that the cause was PhantomJS WebDrivers not closing properly.
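For reference, one quick way to confirm this kind of diagnosis is to total up the resident memory held by leftover PhantomJS processes. The sketch below uses psutil, which is an assumption; plain top or ps works just as well.

```python
import psutil

# Total up resident memory held by leftover PhantomJS processes.
total_rss = 0
for proc in psutil.process_iter(["name", "memory_info"]):
    name, mem = proc.info["name"], proc.info["memory_info"]
    if name and mem and "phantomjs" in name.lower():
        total_rss += mem.rss
        print(proc.pid, mem.rss // (1024 * 1024), "MB")

print("total:", total_rss // (1024 * 1024), "MB")
```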

I solved the problem by sending a signal to the PhantomJS process before calling quit() to terminate it. But that still left about 5% of PhantomJS processes hanging. With no clue as to the cause, I decided to add a timeout mechanism to the WebDriver.
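Here is a sketch of that cleanup plus the timeouts, assuming Selenium’s PhantomJS driver; the timeout values and URL are arbitrary, and the exact mechanism in our crawler may differ.

```python
import signal
from selenium import webdriver

driver = webdriver.PhantomJS()

# Cap how long a single page or script may run so a stuck page
# can't keep the driver (and its memory) alive indefinitely.
driver.set_page_load_timeout(60)
driver.set_script_timeout(60)

try:
    driver.get("https://www.zhaopin.com")
finally:
    # Ask the PhantomJS process to exit before quit(); calling quit()
    # alone sometimes left the process, and its memory, behind.
    driver.service.process.send_signal(signal.SIGTERM)
    driver.quit()
```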

Now, a week after that last patch, the crawler service is alive and well, collecting over 1,000 job positions every day.
