Automating Lead Generation/Email Crawling with Python

Amit Upreti
Nov 14 · 3 min read

Today we will learn to automate lead generation/email crawling with a simple Python script.

Photo by Miguel Á. Padriñán from Pexels

Want to skip the post and see the good stuff directly? Here is the GitHub repo.

Lead generation is a lucrative business, and people earn a lot of money just by finding emails for their clients.

Let’s see what our end product will look like so that I won’t waste your time in case you don’t find this interesting.

Our crawler will visit each and every sub-page of the provided website and look for emails and then save them in a CSV file.

See the code

First, let’s see the code and then I will explain each step

email_crawler.py

Let’s understand what is happening here

First part: the __init__() function

We have defined the following sets:

processed_urls → will hold the URLs that we have visited (so that we won’t visit the same URL twice)

unprocessed_urls → will hold the URLs that are in the queue, waiting to be parsed

emails → will hold the parsed emails

We also store the base URL, which we will use later to make sure our crawler doesn’t visit outside URLs. For example, if the user passes https://www.medium.com, then the base URL would be medium.com, and the crawler will only visit URLs within this domain.
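
To make the setup concrete, here is a minimal sketch of that state. It is not the code from the repository: the class name CrawlerState and the helper get_base_url are hypothetical, and stripping a leading "www." is an assumption made only so that the result matches the medium.com example above.

```python
from urllib.parse import urlsplit


def get_base_url(website: str) -> str:
    """Return the host part of a URL without a leading 'www.' (hypothetical helper)."""
    host = urlsplit(website).netloc.lower()
    # Assumption: drop "www." so https://www.medium.com yields medium.com.
    if host.startswith("www."):
        host = host[len("www."):]
    return host


class CrawlerState:
    """Hypothetical container for the three sets and the base URL."""

    def __init__(self, website: str):
        self.processed_urls = set()        # URLs we have already visited
        self.unprocessed_urls = {website}  # URLs waiting in the queue
        self.emails = set()                # unique emails found so far
        self.base_url = get_base_url(website)


state = CrawlerState("https://www.medium.com")
print(state.base_url)
```

Using sets rather than lists means that adding the same URL or email twice has no effect, which is what keeps the crawler from repeating work.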

crawl() function

The crawl function is the starting point of our crawler. It keeps visiting the URLs in the queue until every URL on the website has been visited.
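
The loop described here can be sketched roughly as follows. This is an illustration, not the repository's implementation, which may structure the loop differently. The get_links parameter is an assumption: it stands in for fetching and parsing a page, so the sketch runs without any network access, and the example.com URLs are made up.

```python
def crawl(start_url, get_links):
    """Visit every reachable URL once.

    get_links(url) is a stand-in that returns the URLs found on that page.
    """
    processed_urls = set()
    unprocessed_urls = {start_url}
    while unprocessed_urls:
        url = unprocessed_urls.pop()
        processed_urls.add(url)
        for link in get_links(url):
            # Only queue URLs that have not been visited yet.
            if link not in processed_urls:
                unprocessed_urls.add(link)
    return processed_urls


# A tiny in-memory "website" replaces real HTTP requests in this sketch.
fake_site = {
    "https://example.com/": ["https://example.com/about", "https://example.com/contact"],
    "https://example.com/about": ["https://example.com/"],
    "https://example.com/contact": ["https://example.com/about"],
}
visited = crawl("https://example.com/", lambda url: fake_site.get(url, []))
print(sorted(visited))
```

The loop ends on its own once the queue is empty, because every popped URL moves into processed_urls and is never queued again.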

parse_url() function

Our parse_url function is where extraction happens. Here we

  • parse all the URLs found on the given page.
  • filter out duplicate URLs, URLs outside the domain, and already visited URLs.
  • skip URLs that lead to files such as JPGs, MP4s, and ZIPs.
  • finally parse the page for emails and write them to a CSV file.
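
The filtering steps in the list above could look something like the sketch below. It uses only the standard library so that it runs as-is, and the real script may rely on other libraries. The names filter_links, LinkCollector, and SKIPPED_EXTENSIONS are hypothetical, and the extension list is a short illustrative subset.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Illustrative subset of file extensions to skip (an assumption).
SKIPPED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4", ".zip", ".pdf")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def filter_links(html, page_url, base_url, processed_urls):
    """Return the new, same-domain, non-file URLs found in html."""
    collector = LinkCollector()
    collector.feed(html)
    links = set()  # a set drops duplicate URLs automatically
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, and similar links
        host = parts.hostname or ""
        if host != base_url and not host.endswith("." + base_url):
            continue  # outside the domain we are crawling
        if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
            continue  # points to a file, not a page
        if url in processed_urls:
            continue  # already visited
        links.add(url)
    return links
```

Comparing the host against base_url with a leading dot keeps subdomains such as www.medium.com while rejecting unrelated hosts that merely end in the same letters.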

parse_emails() function

It takes a text input, finds the emails in that text, and finally writes these emails to a CSV file.
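
A rough sketch of this step is shown below. The regular expression, the function signature, and the one-email-per-row CSV layout are all assumptions made for illustration and are not taken from the repository.

```python
import csv
import re

# A deliberately simple pattern (an assumption); it can produce
# false positives on real pages.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_emails(text, seen_emails, csv_path):
    """Append emails found in text to csv_path, skipping ones already seen."""
    new_emails = set(EMAIL_PATTERN.findall(text)) - seen_emails
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for email in sorted(new_emails):
            writer.writerow([email])  # one email per row
    seen_emails.update(new_emails)
    return new_emails
```

Opening the file in append mode lets every crawled page add its emails to the same CSV, while the seen_emails set keeps duplicates out. A loose pattern like this one will also match strings that only look like emails, so the results are worth reviewing before use.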

How do I run this code?

To get a local copy up and running, follow these simple steps.

Installation

1. Clone the Email-Crawler-Lead-Generator

git clone https://github.com/nOOBIE-nOOBIE/Email-Crawler-Lead-Generator.git

2. Install dependencies

pip install -r requirements.txt

Usage

Simply pass the URL as an argument

python email_crawler.py https://medium.com/

Output

➜  email_crawler python3 email_crawler.py https://medium.com/
WELCOME TO EMAIL CRAWLER
CRAWL : https://medium.com/
1 Email found press@medium.com
2 Email found u002F589e367c28ca47b195ce200d1507d18b@sentry.io
CRAWL : https://medium.com/creators
3 Email found joshsrose@me.com
4 Email found yourfriends@medium.com
5 Email found partnerprogram@medium.com
6 Email found dominiquemattiwrites@gmail.com
7 Email found hihumanparts@gmail.com
CRAWL : https://medium.com/@mshannabrooks
CRAWL : https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fmedium.com%2F%40mshannabrooks&source=listing-----5f0204823a1e---------------------bookmark_sidebar-
CRAWL : https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fmedium.com%2F%40mshannabrooks&source=-----e5d9a7ef4033----6------------------

Sample Data

Sample data from crawling Medium

If you have suggestions or find any issues, feel free to open an issue or a pull request on GitHub.

Thank you for reading.

YOU ARE AWESOME

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
