How I spent 5 minutes making an email scraper that extracts 1000 emails from a directory listing site
Web scraping can be powerful, and with the help of ChatGPT, it’s easier than ever to whip up a functional scraper in no time. Here’s a step-by-step guide on how I built a scraper in just 5 minutes to collect at least 1000 emails from a directory site, with absolutely zero programming knowledge needed!
It might sound magical, but trust me, it’s ridiculously simple.
Step 1: Check if the Directory Site Uses WordPress
Here’s the trick: Many recent directory sites are built using scrapers that compile info into a WordPress site. WordPress gives every post a numeric ID, and you can usually load a post directly using the format: http://sitename.com/?p=PAGE_ID.
To figure out if the site uses WordPress:
- Look for common WordPress elements in the site’s HTML source code, like wp-content, wp-includes, or wp-admin.
- Spot any 3-digit, 4-digit, or 5-digit numbers that appear more than once. That’s likely the PAGE_ID.
- Try accessing a few pages with the http://sitename.com/?p=PAGE_ID format and see if they load correctly.
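The first check in the list above can be automated. Here is a minimal sketch, assuming you have already saved a page's HTML source into a string. The function name find_wordpress_markers and the sample HTML are made up for illustration; only the three marker strings come from the list above.

```php
<?php
// Hypothetical helper: returns the common WordPress markers found in a page's HTML.
function find_wordpress_markers(string $html): array {
    $markers = ['wp-content', 'wp-includes', 'wp-admin'];
    $found = [];
    foreach ($markers as $marker) {
        // stripos() is case-insensitive and returns false when the marker is absent.
        if (stripos($html, $marker) !== false) {
            $found[] = $marker;
        }
    }
    return $found;
}

// Made-up sample HTML, standing in for a real page source.
$sample = '<link rel="stylesheet" href="/wp-content/themes/x/style.css">'
        . '<script src="/wp-includes/js/jquery/jquery.min.js"></script>';

echo implode(', ', find_wordpress_markers($sample)), "\n"; // prints: wp-content, wp-includes
```

An empty result does not prove the site isn't WordPress, since some sites hide these paths. It just means you should fall back to checking the URLs by hand.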
Even if the site isn’t WordPress, many run on similar ID systems. Click around a few pages and see if a number in the URL changes. That’s your PAGE_ID.
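If you'd rather not eyeball the URLs, a small script can pull out the candidate numbers for you. This is a rough sketch, assuming the ID is a 3- to 5-digit number somewhere in the URL's path or query string. The function name extract_candidate_ids and the sample URLs are hypothetical.

```php
<?php
// Hypothetical helper: returns every 3- to 5-digit number in a URL's path or query string.
function extract_candidate_ids(string $url): array {
    $parts = parse_url($url);
    if ($parts === false) {
        return []; // seriously malformed URL
    }
    // Only look at the path and query, so a port number is never mistaken for an ID.
    $target = ($parts['path'] ?? '') . '?' . ($parts['query'] ?? '');
    // The lookarounds make sure we match whole numbers, not part of a longer one.
    preg_match_all('/(?<!\d)\d{3,5}(?!\d)/', $target, $matches);
    return $matches[0];
}

echo implode(', ', extract_candidate_ids('http://sitename.com/?p=1234')), "\n";        // prints: 1234
echo implode(', ', extract_candidate_ids('http://sitename.com/listing/4821/')), "\n";  // prints: 4821
```

Run it on a few URLs from the same site. The number that changes from page to page is most likely the PAGE_ID.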
You’ll be amazed how many sites can be scraped this way.
Step 2: Use ChatGPT to Write the Scraper
Just ask ChatGPT to write a scraper in your preferred programming language. Here’s the prompt:
Write a PHP email address scraper that does the following: (1) Visit the URL https://sitename.com/?p=page_number where page_number runs from 10 to 5000. (2) Each URL visit should pause for at least 10 seconds before checking the next URL. (3) For each URL visit, scrape the email addresses found on the page source. (4) Append each email address, followed by a new line, to a text file called output.txt.
Here is example PHP code from ChatGPT that scrapes email addresses from a range of pages on a WordPress site and saves them to output.txt:
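<?php
// Scrape email addresses from a range of ?p= pages and append them to a text file.
function scrape_emails($start_page, $end_page, $site_url, $output_file, $delay_seconds = 10) {
    // Plain ASCII hyphens in the character classes; {2,} also allows longer TLDs.
    $email_pattern = '/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/';
    $seen = []; // emails already written, used to skip duplicates
    for ($page_num = $start_page; $page_num <= $end_page; $page_num++) {
        $url = $site_url . '/?p=' . $page_num;
        // @ hides the warning for missing pages; a failed request returns false.
        $content = @file_get_contents($url);
        if ($content !== false) {
            preg_match_all($email_pattern, $content, $matches);
            foreach (array_unique($matches[0]) as $email) {
                if (!isset($seen[$email])) {
                    $seen[$email] = true;
                    // Append each new address on its own line, as the prompt asks.
                    file_put_contents($output_file, $email . "\n", FILE_APPEND);
                }
            }
        }
        sleep($delay_seconds); // pause between requests, as the prompt asks
    }
}
// Example usage
scrape_emails(1, 100, 'http://sitename.com', 'output.txt');
?>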
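One thing to watch out for: an email regex like this also matches image file names such as logo@2x.png, which show up a lot in page source. Here's a small sketch of how you could filter those out, assuming a short list of image extensions is enough for your site. The function name extract_emails, the extension list, and the sample HTML are my own illustration, not part of ChatGPT's output.

```php
<?php
// Hypothetical helper: extracts email-like strings and drops ones that look like image files.
function extract_emails(string $html): array {
    $pattern = '/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/';
    preg_match_all($pattern, $html, $matches);

    // Assumed list of extensions to reject; adjust it for the site you are working with.
    $image_extensions = ['png', 'jpg', 'jpeg', 'gif', 'webp', 'svg'];

    $emails = [];
    foreach ($matches[0] as $candidate) {
        // pathinfo() returns the text after the last dot, e.g. "png" or "com".
        $ext = strtolower(pathinfo($candidate, PATHINFO_EXTENSION));
        if (!in_array($ext, $image_extensions, true)) {
            $emails[] = $candidate;
        }
    }
    return $emails;
}

// Made-up sample page source.
$html = 'Email info@example.com <img src="logo@2x.png">';
echo implode(', ', extract_emails($html)), "\n"; // prints: info@example.com
```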
You don’t even have to know any programming language to get the job done.
Step 3: Use Your Favorite Online IDE
Now that ChatGPT has written the code, you need somewhere to run it. Use an online IDE like Replit or PythonAnywhere:
- Replit: Go to replit.com, create a new PHP repl, paste the code, and run it.
- PythonAnywhere: This service runs Python, not PHP, so ask ChatGPT for a Python version of the scraper first. Then sign up at pythonanywhere.com, create a new file, paste the code, and run it.
Step 4: Open the Output File
After running the scraper, open the output.txt file to harvest the emails. Each email address will be on a new line.
To filter out duplicates, use a spreadsheet app like Excel or Google Sheets:
- Copy the email addresses from output.txt and paste them into a column in your spreadsheet.
- Use the “Remove duplicates” feature to ensure each email is unique.
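If you'd rather skip the spreadsheet, the same cleanup can be done in a few lines of PHP. This is a minimal sketch, assuming that addresses differing only in letter case or surrounding spaces should count as the same email. The function name dedupe_emails and the sample lines are hypothetical.

```php
<?php
// Hypothetical helper: trims, lowercases, and de-duplicates a list of email lines.
function dedupe_emails(array $lines): array {
    $clean = array_map(fn($line) => strtolower(trim($line)), $lines);
    $clean = array_filter($clean, fn($line) => $line !== ''); // drop blank lines
    return array_values(array_unique($clean));                // reindex from 0
}

// Made-up sample lines, standing in for the contents of output.txt.
$lines = ["Jane@Example.com ", "jane@example.com", "", "sales@example.org"];
echo implode("\n", dedupe_emails($lines)), "\n";
// prints:
// jane@example.com
// sales@example.org
```

To run it on the real file, you could load the lines with file('output.txt', FILE_IGNORE_NEW_LINES) and write the result back with file_put_contents().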
Step 5: You’re done!
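In about 5 minutes of setup, you’ve built a scraper that can extract 1000 emails from a directory site. The run itself takes longer if the scraper pauses 10 seconds per page as the prompt requests: 100 pages take roughly 17 minutes.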
This process showcases the efficiency and power of using ChatGPT to generate useful scripts quickly. Remember, always use scraping ethically and legally!