Today we will learn to automate lead generation/email crawling with a simple Python script.
Want to skip the post and see the good stuff directly? Here is the GitHub repo.
Lead generation is a very lucrative business, and people earn a ton of money just by finding emails for their clients.
Let’s see what the end product will look like, so I don’t waste your time in case you don’t find this interesting. Our crawler will visit each and every sub-page of the provided website, look for emails, and save them to a CSV file.
See the code
First, let’s see the code, and then I will explain each step.
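The post embeds the full script from the repo; as a stand-in, here is a minimal skeleton of the crawler. The class and method names follow this post’s description and may differ from the repo’s exact code; the stubs are filled in step by step below.

```python
import re
import csv
from collections import deque
from urllib.parse import urlsplit, urljoin

import requests
from bs4 import BeautifulSoup


class EmailCrawler:

    def __init__(self, website):
        self.website = website
        self.unprocessed_urls = deque([website])  # URLs queued for crawling
        self.processed_urls = set()               # URLs we have already visited
        self.emails = set()                       # unique emails found so far
        # e.g. 'https://www.medium.com' -> base URL 'medium.com'
        self.base_url = urlsplit(website).netloc.replace('www.', '', 1)

    def crawl(self):
        """Keep visiting queued URLs until every URL on the site is processed."""
        ...

    def parse_urls(self, current_url):
        """Fetch a page, queue its in-domain links, and scan it for emails."""
        ...

    def parse_emails(self, text):
        """Find emails in raw text and write them to a CSV file."""
        ...
```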
Let’s understand what is happening here
First part: the __init__() function
We have defined the following collections:
- processed_urls → holds the URLs we have already visited (so we won’t visit the same URL twice)
- unprocessed_urls → holds the URLs that are queued for crawling
- emails → holds the parsed emails
We will use the base URL later to make sure our crawler doesn’t wander off to outside domains. For example, if the user passes https://www.medium.com, the base URL would be medium.com, and the crawler will only visit URLs within that domain.
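One plausible way to derive the base URL is urllib’s urlsplit (the repo may do this differently):

```python
from urllib.parse import urlsplit

netloc = urlsplit('https://www.medium.com').netloc  # 'www.medium.com'
base_url = netloc.replace('www.', '', 1)            # 'medium.com'
```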
The crawl function is the starting point of our crawler: it keeps visiting the URLs in the queue until every URL on the website has been processed.
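A minimal sketch of that loop, filling in the crawl stub from the skeleton above:

```python
# EmailCrawler.crawl — fills in the stub from the skeleton above
def crawl(self):
    # keep going until there is nothing left in the queue
    while self.unprocessed_urls:
        url = self.unprocessed_urls.popleft()
        self.processed_urls.add(url)
        print(f'CRAWL : {url}')
        self.parse_urls(url)
```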
The parse_urls function is where the extraction happens (a sketch follows this list). Here we:
- parse all the URLs found on the given page
- filter out URLs that are outside the base domain and URLs we have already visited
- make sure we don’t try to visit URLs that point to files such as .jpg, .mp4, or .zip
- finally, parse the page for emails and write them to a CSV file
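Here is one way parse_urls could look. The extension list and error handling are my assumptions; the repo’s exact filtering may differ:

```python
# file types the crawler should never try to open as pages (illustrative list)
SKIP_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.pdf', '.mp4', '.zip')

# EmailCrawler.parse_urls — fills in the stub from the skeleton above
def parse_urls(self, current_url):
    try:
        response = requests.get(current_url, timeout=10)
    except requests.RequestException:
        return  # skip pages that fail to load

    soup = BeautifulSoup(response.text, 'html.parser')
    for anchor in soup.find_all('a', href=True):
        # resolve relative links against the current page
        link = urljoin(current_url, anchor['href'])
        if (self.base_url in urlsplit(link).netloc          # stay on the same domain
                and link not in self.processed_urls         # skip visited URLs
                and link not in self.unprocessed_urls       # skip already-queued URLs
                and not link.lower().endswith(SKIP_EXTENSIONS)):  # skip file links
            self.unprocessed_urls.append(link)

    # hand the raw page text to the email parser
    self.parse_emails(response.text)
```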
The last piece is the email parser: it takes a block of text as input, finds the emails in that text, and writes them to a CSV file.
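A plausible implementation of that helper, assuming a simple email regex and an emails.csv output file (both illustrative, not necessarily the repo’s exact choices):

```python
# a common, deliberately simple email pattern (illustrative)
EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# EmailCrawler.parse_emails — fills in the stub from the skeleton above
def parse_emails(self, text):
    for email in set(re.findall(EMAIL_REGEX, text)):
        if email not in self.emails:
            self.emails.add(email)
            print(f'{len(self.emails)} Email found {email}')
            # append, so emails collected from earlier pages are kept
            with open('emails.csv', 'a', newline='') as f:
                csv.writer(f).writerow([email])
```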
How do I run this code?
To get a local copy up and running, follow these simple steps.
1. Clone the Email-Crawler-Lead-Generator repo
2. Install the dependencies:
pip install -r requirements.txt
3. Run the crawler, simply passing the URL as an argument:
python email_crawler.py https://medium.com/
➜ email_crawler python3 email_crawler.py https://medium.com/
WELCOME TO EMAIL CRAWLER
CRAWL : https://medium.com/
1 Email found email@example.com
2 Email found u002F589e367c28ca47b195ce200d1507d18b@sentry.io
CRAWL : https://medium.com/creators
3 Email found firstname.lastname@example.org
4 Email found email@example.com
5 Email found firstname.lastname@example.org
6 Email found email@example.com
7 Email found firstname.lastname@example.org
CRAWL : https://medium.com/@mshannabrooks
CRAWL : https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fmedium.com%2F%40mshannabrooks&source=listing-----5f0204823a1e---------------------bookmark_sidebar-
CRAWL : https://medium.com/m/signin?operation=register&redirect=https%3A%2F%2Fmedium.com%2F%40mshannabrooks&source=-----e5d9a7ef4033----6------------------
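Under the hood, the script just reads the URL from the command line and kicks off the crawler. A minimal entry point could look like this (an illustrative sketch, not necessarily the repo’s exact code):

```python
import sys

if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        sys.exit('Usage: python email_crawler.py <url>')
    EmailCrawler(url).crawl()  # EmailCrawler from the skeleton above
```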
If you have suggestions or find any issues, please let me know.
Thank you for reading.