If you are working at a startup and want to reach out to more potential leads, you may need to collect as many business email addresses as possible.
Though there are many email extraction tools on the Internet, most of them impose quotas on free usage. This tutorial will help you get email addresses from any website, at any time, without limits!
Step 1: Import modules
We import six modules for this project.
- re for regular expression matching operations
- requests for sending HTTP requests
- urlsplit for breaking URLs down into component parts
- deque for a list-like container with fast appends and pops on either end
- BeautifulSoup for pulling data out of the HTML files of websites
- pandas for formatting emails into a DataFrame for further manipulation
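Assuming the third-party packages are installed (e.g. pip install requests beautifulsoup4 pandas), the imports might look like:

```python
import re                               # regular expression matching
from collections import deque           # fast appends/pops on either end
from urllib.parse import urlsplit       # break URLs into component parts

import requests                         # send HTTP requests
import pandas as pd                     # format emails into a DataFrame
from bs4 import BeautifulSoup           # pull data out of HTML documents
```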
Step 2: Initialize variables
Then, we initialize a deque for saving unscraped URLs, a set for scraped URLs, and a set for saving emails scraped successfully from the website.
Sets hold only unique elements; duplicates are discarded automatically.
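A minimal sketch of the initialization, assuming a hypothetical starting URL original_url:

```python
from collections import deque

original_url = "https://example.com"   # hypothetical starting URL

unscraped = deque([original_url])      # URLs waiting to be scraped
scraped = set()                        # URLs already processed
emails = set()                         # unique email addresses found
```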
Step 3: Start scraping
1. First, move a URL from the unscraped queue to the scraped set, so it will not be processed twice.
2. Then, use urlsplit to extract different parts of the URL.
urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).
In this way, we are able to get the base and path parts of the website URL.
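As an illustrative sketch, with a hypothetical article URL:

```python
from urllib.parse import urlsplit

url = "https://medium.com/@author/article"   # hypothetical sample URL
parts = urlsplit(url)

# Base of the site, e.g. for resolving root-relative links.
base_url = f"{parts.scheme}://{parts.netloc}"

# Everything up to (and including) the last "/" — used to resolve relative links.
path = url[:url.rfind("/") + 1] if "/" in parts.path else url
```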
3. Send an HTTP GET request to the website.
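One way to sketch this step is to wrap the request in error handling, so that malformed or unreachable URLs are simply skipped. The helper name fetch is an assumption introduced here for illustration:

```python
import requests

def fetch(url):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=10)
        return response.text
    except requests.RequestException:
        # Covers connection errors, invalid URLs, missing schemes, timeouts.
        return None
```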
4. Extract all email addresses from the response using a regular expression, and add them into the emails set.
If you are not familiar with Python regular expressions, check Python RegEx for more information.
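A sketch of the extraction, using a common (deliberately simplified, not fully RFC-compliant) email pattern:

```python
import re

# Matches most everyday addresses like name.surname+tag@sub.domain.com.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+")

text = "Contact us at sales@example.com or support@example.org."
emails = set(EMAIL_RE.findall(text))
# emails -> {'sales@example.com', 'support@example.org'}
```

Using a set here means each address is stored only once, no matter how many pages mention it.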
5. Find all linked URLs in the website.
To do so, we first need to create a Beautiful Soup object to parse the HTML document. Then we can find all the linked URLs in the document by looking for the <a href=""> tags, which indicate hyperlinks.
Add the new url to the unscraped queue if it is in neither unscraped nor scraped.
We also need to exclude links like http://www.medium.com/file.gz that cannot be parsed as HTML.
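The sub-steps of step 5 can be sketched as follows; the sample HTML, URLs, and set contents here are hypothetical:

```python
from collections import deque
from bs4 import BeautifulSoup

# Hypothetical values carried over from the earlier steps.
base_url = "https://example.com"
path = "https://example.com/about/"
scraped = {"https://example.com/about/team"}
unscraped = deque()

html = '<a href="/contact">Contact</a> <a href="team">Team</a> <a href="file.gz">Data</a>'
soup = BeautifulSoup(html, "html.parser")

for anchor in soup.find_all("a"):
    link = anchor.get("href", "")
    # Resolve relative links against the base URL or the current path.
    if link.startswith("/"):
        link = base_url + link
    elif not link.startswith("http"):
        link = path + link
    # Exclude files that cannot be parsed as HTML, e.g. .gz archives.
    if link.endswith(".gz"):
        continue
    # Enqueue only links seen in neither unscraped nor scraped.
    if link not in unscraped and link not in scraped:
        unscraped.append(link)
```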
Step 4: Export emails to a CSV file
After successfully scraping emails from the website, we can export the emails to a CSV file.
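A minimal sketch of the export, assuming the scraped addresses live in a set named emails and the output file is called email.csv (both names are assumptions):

```python
import pandas as pd

# Hypothetical set of scraped addresses.
emails = {"sales@example.com", "support@example.org"}

# Sort for a stable row order, since sets are unordered.
df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv("email.csv", index=False)
```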
If you are using Google Colaboratory, you can download the file to the local machine by:
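For reference, the download step uses Colab's files helper; it only works inside a Colab notebook, and email.csv is the assumed filename from the export above:

```python
from google.colab import files  # available only inside Google Colaboratory

files.download("email.csv")     # prompts the browser to save the file locally
```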
Sample output CSV file: