Published in Python Pandemonium

Implementing Beanstalk to create a scalable web scraper

Image Credit (http://blog.hqc.sk.ca/wp-content/uploads/2012/12/Queue-2012-12-11.jpg)

Queues are often used to make applications scalable by offloading data and processing it later.

In this post I am going to use the Beanstalk queue management system from Python. Before I get into the real task, allow me to give a brief introduction to Beanstalk.

What is Beanstalk?

From the official website:

Beanstalk is a simple, fast work queue.

Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.

Its daemon is called beanstalkd, and you can run it on *nix-based machines. Since I am on macOS, I ran brew install beanstalkd to install it.

Once installed, you can start the daemon by running the following command:

./beanstalkd -l 1.2.3.4 -p 11300

where 1.2.3.4 is the IP address of the machine the queue is running on and 11300 is the port number.

You can also run ./beanstalkd on its own, in which case it listens on localhost. Verbose mode can be enabled with the -V switch.

Queue implementation

Our goal is a scraper that holds a list of URLs to be scraped and the parsed data for each. Ideally the data is stored in a database, but as the number of URLs grows, the extensive I/O operations put a heavy burden on our MySQL server, making the entire system slow and inefficient; since the data is saved in real time, MySQL response times suffer.

So how do we cope with it? I am going to create two named queues, also called tubes; let's name them unprocessed and parsed. The links, which we assume are stored in a DB table, will be put into the unprocessed tube, and the consumer app, the script that pulls each link, will put the parsed data into the parsed tube.
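Before bringing beanstalkd into the picture, the two-tube flow can be sketched with plain Python data structures. In this sketch a dict of deques stands in for the tubes; the tube names and sample links match the scripts below, everything else is illustrative:

```python
from collections import deque

# Each named tube is a FIFO; a deque stands in for a beanstalkd tube here.
tubes = {'unprocessed': deque(), 'parsed': deque()}

# Producer side: links (assumed to come from a DB table) go into 'unprocessed'.
for link in ('http://1.com', 'http://2.com', 'http://3.com'):
    tubes['unprocessed'].append(link)

# Consumer side: pull each link, "parse" it, push the result into 'parsed'.
while tubes['unprocessed']:
    link = tubes['unprocessed'].popleft()
    tubes['parsed'].append('Processed the link:- ' + link)

print(list(tubes['parsed']))
```

The real version simply swaps the deques for tubes on a beanstalkd server, which lets the two sides run as separate processes on separate machines.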

The first script, which I am going to call producer.py, will put URLs into a named queue, aka tube.

from pystalkd.Beanstalkd import Connection

links1 = []
links1.append('http://1.com')
links1.append('http://2.com')
links1.append('http://3.com')

c = Connection("localhost", 11300)
print('Putting jobs in links')
c.use('unprocessed')  # Unprocessed links
for l in links1:
    c.put(str(l))

After importing the pystalkd library, I appended a few arbitrary links to a list and then created a connection to the Beanstalk queue server running on port 11300. After that I called use to create a named queue, or tube, and put the links into it.

I am also doing another thing: opening a telnet connection to the queue server to run different commands, in our case the stats command. Open another terminal window and run:

telnet localhost 11300

and then execute the stats command in it. If you run stats right away, you see something like this:

$ telnet localhost 11300
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
stats
OK 898
---
current-jobs-urgent: 0
current-jobs-ready: 0
current-jobs-reserved: 0
current-jobs-delayed: 0
current-jobs-buried: 0
cmd-put: 0
cmd-peek: 0
cmd-peek-ready: 0
cmd-peek-delayed: 0
cmd-peek-buried: 0
cmd-reserve: 0
cmd-reserve-with-timeout: 0
cmd-delete: 0
cmd-release: 0
cmd-use: 0
cmd-watch: 0
cmd-ignore: 0
cmd-bury: 0
cmd-kick: 0
cmd-touch: 0
cmd-stats: 1
cmd-stats-job: 0
cmd-stats-tube: 0
cmd-list-tubes: 0
cmd-list-tube-used: 0

Everything is in its default state. Now I run the producer.py code, which inserts the links into the queue. stats now looks like this:

stats
OK 900
---
current-jobs-urgent: 0
current-jobs-ready: 3
current-jobs-reserved: 0
current-jobs-delayed: 0
current-jobs-buried: 0
cmd-put: 3
cmd-peek: 0
cmd-peek-ready: 0
cmd-peek-delayed: 0
cmd-peek-buried: 0
cmd-reserve: 0
cmd-reserve-with-timeout: 0
cmd-delete: 0
cmd-release: 0
cmd-use: 1
cmd-watch: 0
cmd-ignore: 0
cmd-bury: 0
cmd-kick: 0
cmd-touch: 0
cmd-stats: 2
cmd-stats-job: 0
cmd-stats-tube: 0
cmd-list-tubes: 0
cmd-list-tube-used: 0
cmd-list-tubes-watched: 0
cmd-pause-tube: 0
job-timeouts: 0
total-jobs: 3
max-job-size: 65535
current-tubes: 2
current-connections: 1

current-jobs-ready and cmd-put are both set to 3 since we added 3 links to the queue.
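The stats reply is a small YAML document: an OK &lt;bytes&gt; line, a --- marker, then key: value pairs. If you'd rather inspect it programmatically than eyeball the telnet output, a tiny parser like this (my own helper, not part of pystalkd) does the job; raw_reply is a trimmed sample of the reply shown above:

```python
def parse_stats(reply):
    """Parse a beanstalkd stats reply into a dict, converting numeric values to int."""
    stats = {}
    for line in reply.splitlines():
        # Skip the status line, the YAML document marker, and blank lines.
        if line.startswith('OK') or line.strip() == '---' or not line.strip():
            continue
        key, _, value = line.partition(':')
        value = value.strip()
        stats[key.strip()] = int(value) if value.lstrip('-').isdigit() else value
    return stats

raw_reply = """OK 900
---
current-jobs-ready: 3
cmd-put: 3
current-tubes: 2"""

print(parse_stats(raw_reply)['current-jobs-ready'])
```

This is handy for monitoring scripts that poll the queue depth and alert when jobs pile up.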

Now I am going to make another script, called consumer.py, which will consume these links for parsing and store the results in another named queue/tube.

from pystalkd.Beanstalkd import Connection

processed = []

# dummy method standing in for actual scraping
def parse(u):
    return 'Processed the link:- ' + u

c = Connection("localhost", 11300)
c.watch('unprocessed')

# pulling links from the tube for parsing
while True:
    job = c.reserve(0)
    if job is None:
        break
    processed.append(parse(job.body))
    job.delete()  # Delete so it does not haunt us back

c.use('parsed')
# Storing scraped and parsed data in another tube for later DB processing
if len(processed) > 0:
    for p in processed:
        c.put(p)

Again, after making the connection it's time to pull links from the queue for scraping. In producer.py we used use; here watch is used instead, since we are going to pull links. I made a dummy parse() method that stands in for the processing and returns the data, which is collected in a list. Each job is then deleted so it does not come back and haunt us. The loop ends as soon as all jobs are dequeued.
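The reserve-with-zero-timeout loop is the heart of the consumer: reserve(0) hands back the next ready job, or None once the tube is empty. To make that pattern concrete without a running server, here is the same loop against a tiny stub; FakeJob and FakeConnection are hypothetical stand-ins for illustration only, not part of pystalkd:

```python
class FakeJob:
    """Hypothetical stand-in for a pystalkd job: a body plus a delete() method."""
    def __init__(self, body, tube):
        self.body = body
        self._tube = tube
    def delete(self):
        self._tube.remove(self.body)  # mimic job.delete() removing it from the tube

class FakeConnection:
    """Hypothetical stand-in for pystalkd's Connection, for illustration only."""
    def __init__(self, bodies):
        self._tube = list(bodies)
    def reserve(self, timeout=None):
        # With a zero timeout, reserve returns None when nothing is ready.
        return FakeJob(self._tube[0], self._tube) if self._tube else None

c = FakeConnection(['http://1.com', 'http://2.com'])
processed = []
while True:
    job = c.reserve(0)
    if job is None:
        break
    processed.append('Processed the link:- ' + job.body)
    job.delete()

print(processed)
```

Deleting inside the loop matters: a reserved job that is never deleted goes back to the ready state after its time-to-run expires, so you would process it again.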

Now it's time to put the parsed data; again use is called, but this time for the parsed tube.

Simple, right? I am ending it here, but in a third step you would pull the parsed data again and store it in the DB.
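That third step would look roughly like this: drain the parsed tube and write everything to the database in one batch. The sketch below uses an in-memory sqlite3 table and a plain list standing in for jobs drained from the parsed tube, since the real schema and MySQL connection details are up to you:

```python
import sqlite3

# Stand-in for jobs drained from the 'parsed' tube via reserve()/delete().
parsed_jobs = [
    'Processed the link:- http://1.com',
    'Processed the link:- http://2.com',
    'Processed the link:- http://3.com',
]

db = sqlite3.connect(':memory:')  # your MySQL connection in real life
db.execute('CREATE TABLE results (data TEXT)')

# One batched insert instead of a write per scraped page.
db.executemany('INSERT INTO results (data) VALUES (?)',
               [(p,) for p in parsed_jobs])
db.commit()

count = db.execute('SELECT COUNT(*) FROM results').fetchone()[0]
print(count)
```

Batching the writes like this is exactly what takes the per-page I/O pressure off the database.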

This entire exercise was done to make the system scalable. Notice that all the intermediate work has been moved off the MySQL DB and is taken care of by Beanstalkd; the DB is only used at the first and last stages.

That's it. This beginner tutorial should help you give beanstalkd a try and make your scraper scalable across multiple machines. It is very simple to use for small to medium applications.

ScrapeUp helps you automate your workflows or extract data from different websites. We also provide services that deliver recurring data without you worrying about infrastructure. To learn more, visit the ScrapeUp website.

This article was originally published here.

Adnan Siddiqi

Pakistani | Husband | Father | Software Consultant | Developer | blogger. I occasionally try to make stuff with code. http://adnansiddiqi.me
