How to Scrape Email Addresses from a Website using Python?

Vanessa Leung
Jul 9 · 2 min read

If you are working at a startup and want to reach out to more potential leads, you may need to collect as many business email addresses as possible.

Though there are many email extraction tools on the Internet, most of them come with free-quota limits. This tutorial will help you get email addresses from any website, at any time, without limits!

Step 1: Import modules

We import six modules for this project; a minimal version of the imports appears after this list.

  1. re for regular expression matching operations
  2. requests for sending HTTP requests
  3. urlsplit for breaking URLs down into their component parts
  4. deque, a list-like container with fast appends and pops on either end
  5. BeautifulSoup for pulling data out of the HTML of web pages
  6. pandas for formatting emails into a DataFrame for further manipulation
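A minimal sketch of the imports (re, deque, and urlsplit ship with Python; requests, bs4, and pandas need to be installed with pip):

```python
import re                           # regular expression matching
from collections import deque      # double-ended queue for the crawl frontier
from urllib.parse import urlsplit  # split URLs into component parts

import pandas as pd                # tabular output
import requests                    # HTTP requests
from bs4 import BeautifulSoup      # HTML parsing
```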

Step 2: Initialize variables

Then, we initialize a deque for saving unscraped URLs, a set for scraped URLs, and a set for saving emails scraped successfully from the website.

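A sketch of the setup; the starting URL here is a hypothetical placeholder, so substitute the site you actually want to crawl:

```python
original_url = "https://www.example.com"  # hypothetical starting URL

unscraped = deque([original_url])  # URLs waiting to be crawled
scraped = set()                    # URLs we have already visited
emails = set()                     # unique email addresses found so far
```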

Elements of a set are unique; duplicates are not allowed, which is exactly what we want for URLs and email addresses.

Step 3: Start scraping

1. First, move a URL from the unscraped queue to the scraped set.
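A sketch of the top of the crawl loop, using the variables defined above:

```python
while len(unscraped):
    # move the next URL from the frontier to the visited set
    url = unscraped.popleft()
    scraped.add(url)
```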

2. Then we use urlsplit to extract different parts of the URL.


urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).
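For example:

```python
>>> from urllib.parse import urlsplit
>>> urlsplit("https://medium.com/search?q=python#results")
SplitResult(scheme='https', netloc='medium.com', path='/search',
            query='q=python', fragment='results')
```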


In this way, we can get the base and path parts of the website URL.
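A sketch of how the base and path might be derived (still inside the loop; the variable names are my own):

```python
    parts = urlsplit(url)
    base_url = f"{parts.scheme}://{parts.netloc}"  # e.g. "https://medium.com"
    # everything up to the last "/", used later to resolve relative links
    path = url[:url.rfind("/") + 1] if "/" in parts.path else url
```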

3. Send an HTTP GET request to the website.

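A sketch that simply skips pages that cannot be fetched:

```python
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema,
            requests.exceptions.ConnectionError):
        continue  # skip broken or malformed URLs
```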

4. Extract all email addresses from the response using a regular expression, and add them to the email set.

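One possible pattern; this is a deliberate simplification, since matching every valid email address with a regex is famously hard:

```python
    # find anything shaped like an email address in the page source
    new_emails = set(re.findall(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}",
                                response.text, re.I))
    emails.update(new_emails)
```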

If you are not familiar with Python regular expressions, check Python RegEx for more information.

5. Find all linked URLs in the website.

To do so, we first need to create a Beautiful Soup object to parse the HTML document.

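Assuming the lxml parser is installed (Python's built-in "html.parser" also works):

```python
    soup = BeautifulSoup(response.text, "lxml")
```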

Then we can find all the linked URLs in the document by searching for <a href=""> tags, which indicate hyperlinks.

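A sketch that also resolves relative links against the base_url and path computed earlier:

```python
    for anchor in soup.find_all("a"):
        link = anchor.attrs.get("href", "")
        # resolve relative links against the site base or the current path
        if link.startswith("/"):
            link = base_url + link
        elif not link.startswith("http"):
            link = path + link
```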

Add each new URL to the unscraped queue if it is not yet in unscraped or in scraped.

We also need to exclude links to files like http://www.medium.com/file.gz that cannot be scraped.

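Continuing inside the loop over anchors (the .gz check stands in for whatever file types you want to exclude):

```python
        # skip non-HTML resources and avoid re-queuing known URLs
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)
```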

Step 4: Export emails to a CSV file

After successfully scraping emails from the website, we can export the emails to a CSV file.

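A sketch using pandas; note that a set is unordered, so it is converted to a sorted list first:

```python
df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv("email.csv", index=False)
```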

If you are using Google Colaboratory, you can download the file to the local machine by:

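Using Colab's built-in files helper:

```python
from google.colab import files
files.download("email.csv")
```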

The output is a CSV file with a single Email column, one scraped address per row.

Complete Code

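Putting the pieces together, a complete sketch. The starting URL is a hypothetical placeholder, and a page cap is added here so the crawl is guaranteed to terminate on large sites:

```python
import re
from collections import deque
from urllib.parse import urlsplit

import pandas as pd
import requests
from bs4 import BeautifulSoup

original_url = "https://www.example.com"  # hypothetical starting URL
MAX_PAGES = 100                           # cap added so the crawl terminates

unscraped = deque([original_url])  # URLs waiting to be crawled
scraped = set()                    # URLs already visited
emails = set()                     # unique email addresses found

while unscraped and len(scraped) < MAX_PAGES:
    # move the next URL from the frontier to the visited set
    url = unscraped.popleft()
    scraped.add(url)

    # split the URL into parts to build base and path for relative links
    parts = urlsplit(url)
    base_url = f"{parts.scheme}://{parts.netloc}"
    path = url[:url.rfind("/") + 1] if "/" in parts.path else url

    print(f"Crawling URL {url}")
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema,
            requests.exceptions.ConnectionError):
        continue  # skip broken or malformed URLs

    # collect every email address on this page (simplified pattern)
    new_emails = set(re.findall(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}",
                                response.text, re.I))
    emails.update(new_emails)

    # follow hyperlinks to other pages
    soup = BeautifulSoup(response.text, "lxml")
    for anchor in soup.find_all("a"):
        link = anchor.attrs.get("href", "")
        if link.startswith("/"):
            link = base_url + link
        elif not link.startswith("http"):
            link = path + link
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv("email.csv", index=False)
```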

