How to scrape websites with Python and BeautifulSoup

Guillaume Odier · Published in Captain Data · Nov 8, 2018

What do you do when you can’t download a website’s information? Do you do it by hand? Wow, you’re brave!

I’m a web developer, so I’m way too lazy to do things manually 🙂

If you’re about to scrape data for the first time, go ahead and read How To Scrape A Website. You can also read a small intro about web scraping.

Today, let’s say that you need to enrich your CRM with company data.

To make it interesting for you, we will scrape AngelList.

More specifically, we’ll scrape Uber’s company profile.

Please scrape responsibly!

Getting started

Before starting to code, make sure you have Python 3 installed; we won’t cover the installation here. Chances are you already have it.

You also need pip, Python’s package manager. Modern Python 3 installations ship with it; if yours doesn’t, you can bootstrap it with:

python3 -m ensurepip --upgrade

The full code and dependencies are available here.

We’ll be using BeautifulSoup, a standard Python library for parsing HTML.

pip install beautifulsoup4

You could also create a virtual environment and install all the dependencies listed in the requirements.txt file:

pip install -r requirements.txt
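If you go the virtual-environment route, a minimal setup looks like this (assuming a Unix-like shell; on Windows, the activation script is venv\Scripts\activate):

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt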

Inspecting Content

Open https://angel.co/uber in your web browser (I recommend using Chrome).

Right-click and open your browser’s inspector.


Hover your cursor over the description.

This example is pretty straightforward: you want the <h2> tag with the js-startup_high_concept class.

Thanks to those class attributes, this tag uniquely identifies the data we want.

Extracting Data

Let’s dive right in with a bit of code:

# imports we need: urllib for the HTTP request, BeautifulSoup for parsing
from urllib import request
from bs4 import BeautifulSoup

# we'll get back to this
headers = {}
# the Uber company page you're about to scrape!
company_page = 'https://angel.co/uber'
# build the request and open the page
page_request = request.Request(company_page, headers=headers)
page = request.urlopen(page_request)
# parse the HTML using BeautifulSoup
html_content = BeautifulSoup(page, 'html.parser')
# find the description: an <h2> with the js-startup_high_concept class
description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
print(description)

Let’s get into the details:

  • We create a variable headers (more on this very soon)
  • The company_page variable is the page we’re targeting
  • Then we build our request: we inject the company_page and headers variables into the Request object, and open the URL with the parameterized request.
  • We parse the HTML response with BeautifulSoup
  • We look for our text content with the find() method
  • We print our result!

Save this as script.py and run it from your shell: python script.py.

You should get the following:

urllib.error.HTTPError: HTTP Error 403: Forbidden

Oh 🙁 What happened?

Well, it seems that AngelList has detected that we are a bot. Clever people!

Okay, so change the headers variable to this one:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

Run the code with python script.py. Now it should be good:

<h2 class="js-startup_high_concept u-fontSize15 u-fontWeight400 u-colorGray3"> The better way to get there </h2>

Yeah! Our first piece of data 😀

Want to find the website? Easy:

# we extract the website 
website = html_content.find('a', attrs={'class': 'company_url'})
print(website)

And you get:

<a class="u-uncoloredLink company_url" href="http://www.uber.com/" rel="nofollow noopener noreferrer" target="_blank">uber.com</a>

Ok, but how do I get the value of the website?

Easy. Tell the program to extract the href:

print(website['href'])

Make sure to call the strip() method on the extracted text, otherwise you’ll end up with lots of surrounding whitespace:

description = description.text.strip()
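One caveat worth knowing (my own addition, not part of the original script): find() returns None when nothing matches, and calling .text on None raises an AttributeError. A minimal defensive version of the extraction above:

# find() returns None if the tag isn't on the page
description_tag = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
# only touch .text when we actually found the tag
description = description_tag.text.strip() if description_tag else ''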

I won’t cover in detail all the elements you could extract. If you’re having issues, you can always check this amazing XPath cheatsheet.

Save results to CSV

Pretty useless to print data, right? We should definitely save it!

The Comma-Separated Values format is really a standard for this purpose. You can import it very easily in Excel or Google Sheets.

First, import the csv module at the top of your script:

import csv

Add the following lines:

# open the csv in append ('a') mode so runs don't overwrite each other
with open('angel.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # write the stripped description and the extracted href
    writer.writerow([description, website['href']])

What you get is a single line of data. Since we told the program to append every result, new lines won’t erase previous results.
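If you’d also like column names in the file, one option (my own suggestion, not part of the original script) is to write a header row only when angel.csv doesn’t exist yet:

import csv
import os

# write the header only on the very first run, before the file exists
write_header = not os.path.exists('angel.csv')
with open('angel.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    if write_header:
        writer.writerow(['description', 'website'])
    writer.writerow([description, website['href']])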

Check out the whole script

The script is available here.

Conclusion

It wasn’t that hard, right?

We covered a very basic example. You could also add multiple pages and parse them inside a for loop.
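As a rough sketch of that idea (the list of company slugs is my own hypothetical example; the selector is the one we used above):

from urllib import request
from bs4 import BeautifulSoup

# same User-Agent trick as before
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

# hypothetical list of AngelList company slugs to scrape
companies = ['uber', 'stripe', 'airbnb']

for slug in companies:
    page_request = request.Request('https://angel.co/' + slug, headers=headers)
    page = request.urlopen(page_request)
    html_content = BeautifulSoup(page, 'html.parser')
    description = html_content.find('h2', attrs={'class': 'js-startup_high_concept'})
    print(slug, description.text.strip() if description else 'no description found')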

Remember how we got blocked by the website’s security and resolved this by adding a custom User-Agent? We wrote a small paper about anti-scraping techniques. It’ll help you understand how websites try to block bots.

If you feel like web scraping is too difficult for you or you’re getting blocked, you can always contact us!

You can also use a more advanced version of this script on our platform.

Originally published at captaindata.co on November 8, 2018.
