How to scrape businesses’ info with Python and Beautiful Soup

The state of Amsterdam coffeeshop map — 2016 (http://www.coffeeshop.freeuk.com/Map.html)

I hear a lot of people say Python is great for web scraping, and I believe them. Mostly I write JavaScript, and using tools like the cheerio package can be cumbersome. Python code runs synchronously, so there are no callbacks or promises to deal with. We will use the Beautiful Soup Python package to scrape information about where the “coffeeshops” in downtown Amsterdam are.

“Coffeeshops” is in quotes because in Amsterdam a coffeeshop is a place where you can legally smoke marijuana.

tl;dr — SOURCE CODE

A Note on Virtual Environments

Seasoned Python developers can skip this part. For people coming from another programming language, or n00bs: in Python you should set up a virtual environment separate from the Python install on your machine. We will use virtualenvwrapper to make life easier. The first step is to install the package and make it available on the command line.
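A minimal install and shell setup looks something like the commands below. The path to virtualenvwrapper.sh is an assumption here; it varies by system, so check where pip actually placed it.

$ pip install virtualenvwrapper
$ echo 'export WORKON_HOME=$HOME/.virtualenvs' >> ~/.bash_profile
$ echo 'source /usr/local/bin/virtualenvwrapper.sh' >> ~/.bash_profile

Once it is installed and configured, run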

$ source ~/.bash_profile

Or restart the terminal. Some new commands will now be available. The full list of virtualenvwrapper commands is in its documentation; below are some popular ones.

Virtual Environment Commands

mkvirtualenv — Make a new virtual environment for a project.

workon — List or change to an existing virtual environment.

lsvirtualenv — Show all virtual environments available on your machine.

showvirtualenv — Show details about the current virtual environment.

rmvirtualenv — Delete an existing virtual environment.

deactivate — Switch from a virtual environment back to the system-installed version of Python (close the virtual environment). See the short example after this list.
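For example, when you are done working you can deactivate the current environment, and delete one you no longer need (“old_project” is a hypothetical environment name, just for illustration):

$ deactivate
$ rmvirtualenv old_project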

Create a Project and a Virtual Environment

With virtual environments covered, we’ll now set up a brand-new folder for our project and a virtual environment in which we can install the Beautiful Soup web scraping package.

$ mkdir web_scraper && cd web_scraper
$ touch scraper.py requirements.txt
$ mkvirtualenv coffeeshops

After you’ve created the virtual environment for the first time, use

$ workon coffeeshops

to change into the “coffeeshops” virtual environment. If you forget what the environment is called, use the lsvirtualenv command; a list of all the virtual environments you have will show up in the terminal.
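A quick sketch of what that looks like (-b prints the brief listing, names only; your environment names will differ):

$ lsvirtualenv -b
coffeeshops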

Install required packages

The requirements.txt file will contain all the packages that need to be installed in order to run the project. To install all the packages listed in a requirements.txt file, run

$ pip install -r requirements.txt

However, right now our requirements.txt file is totally empty! The first step will be to install the Beautiful Soup package:

$ pip install beautifulsoup4

Note: we are using pip for this tutorial, but there are other ways you can install and manage Python packages; easy_install and conda come to mind. pip works great for me.

After installing the package we have to add the installed packages to requirements.txt. In Node.js there is the “--save” option. This command is kind of like that for Python:

$ pip freeze > requirements.txt
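Your requirements.txt will now contain a pinned entry for Beautiful Soup, something like the line below (the exact version number is an assumption and will vary with your install):

beautifulsoup4==4.4.1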

Scrape the Coffeeshop locations

Coffeeshop Smokey | Since 1964

We will scrape http://www.coffeeshop.freeuk.com/Map.html for the coffeeshop locations. Put the code below in scraper.py and execute it by running

$ python scraper.py

The Code

from bs4 import BeautifulSoup
import urllib  # Python 2; on Python 3 use urllib.request instead
import json

r = urllib.urlopen('http://www.coffeeshop.freeuk.com/Map.html').read()
soup = BeautifulSoup(r, 'html.parser')

# Grab all the links from the image map
coffeeshops = []
for link in soup.find_all('area'):
    coffeeshop = {}

    # Link for the coffeeshop's page
    coffeeshop['full_link'] = 'http://www.coffeeshop.freeuk.com/' + link.get('href')

    # HTTP request to the link
    coffeeshop_site = urllib.urlopen(coffeeshop['full_link']).read()
    coffeeshop_soup = BeautifulSoup(coffeeshop_site, 'html.parser')

    # Check that the title is there
    title_elm = coffeeshop_soup.select('.goldBig')

    # Make sure the title element selector matched something
    if title_elm is not None and len(title_elm) > 0:
        coffeeshop['title'] = title_elm[0].get_text()
        # Select the iframe URL
        iframe_elm = coffeeshop_soup.select('#iCr > iframe')
        if iframe_elm:
            coffeeshop['iframe_url'] = iframe_elm[0].get('src')

    # Add the coffeeshop to the list
    coffeeshops.append(coffeeshop)
    print(coffeeshop)

print(coffeeshops)

# Write to a JSON document
with open('coffeeshops.json', 'w') as outfile:
    json.dump(coffeeshops, outfile)

# Prettify the JSON output:
# https://jsonformatter.curiousconcept.com/

Full source here. After running this script you will have all the coffeeshops in central Amsterdam available as a JSON document.
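Each entry in coffeeshops.json follows the shape below; the values shown are placeholders, not real scraped data:

[
  {
    "full_link": "http://www.coffeeshop.freeuk.com/...",
    "title": "...",
    "iframe_url": "..."
  }
]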

Further Reading…

Like this post? Here are some other ones I’ve written. Also follow me on GitHub or Twitter.

Create a custom blog theme with Hexo.js

How to get up and running with Golang install

Handle all authentication with Node, Angular and Stormpath

Learning to code? Email list signup link