How I spent 5 minutes making an email scraper that extracts 1000 emails from a directory listing site

Mr. 6
Published in Generative Geeks
3 min read · Jun 3, 2024


Web scraping can be powerful, and with the help of ChatGPT, it’s easier than ever to whip up a functional scraper in no time. Here’s a step-by-step guide on how I built a scraper in just 5 minutes to collect at least 1000 emails from a directory site, with absolutely zero programming knowledge needed!

It might sound magical, but trust me, it’s ridiculously simple.

Step 1: Check if the Directory Site Uses WordPress

Here’s the trick: many recent directory sites are built by scrapers that compile information into a WordPress site. WordPress gives every post a numeric ID, and a post can usually be reached using the format http://sitename.com/?p=PAGE_ID, even when the site displays prettier URLs.

To figure out if the site uses WordPress:

  • Look for common WordPress elements in the site’s HTML source code, like wp-content, wp-includes, or wp-admin.
  • Look for a 3-, 4-, or 5-digit number that appears more than once in the source, for example in a postid-1234 body class or a ?p=1234 shortlink. That’s likely the PAGE_ID.
  • Try accessing a few pages with the http://sitename.com/?p=PAGE_ID format and see if they load correctly.
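
The checks in this list can also be run on a page's HTML source by a small script. What follows is only a minimal sketch: the function names looks_like_wordpress and find_candidate_page_ids and the sample HTML are assumptions made up for illustration, and it runs on a hard-coded string rather than a live site.

```php
<?php
// Hypothetical helpers: the names and the sample HTML are illustrative
// assumptions, not part of the original workflow.

// Returns true if the HTML contains any common WordPress path marker.
function looks_like_wordpress(string $html): bool {
    foreach (['wp-content', 'wp-includes', 'wp-admin'] as $marker) {
        if (strpos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// Returns the distinct numeric IDs found in "?p=123" links
// or "postid-123" class names.
function find_candidate_page_ids(string $html): array {
    preg_match_all('/(?:\?p=|postid-)(\d+)/', $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Made-up page source standing in for a real "View Source" paste.
$sample = '<link rel="shortlink" href="http://sitename.com/?p=1234">'
        . '<body class="single postid-1234">'
        . '<img src="/wp-content/uploads/logo.png">';

var_dump(looks_like_wordpress($sample));
print_r(find_candidate_page_ids($sample));
```

A number that shows up in both places, as 1234 does in the sample, is a strong candidate for the PAGE_ID.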

Even if the site isn’t WordPress, many run on similar ID systems. Click around a few pages and see if a number in the URL changes. That’s your PAGE_ID.

You’ll be amazed how many sites can be scraped this way.

Step 2: Use ChatGPT to Write the Scraper

Just ask ChatGPT to write a scraper in your preferred programming language. Here’s the prompt:

Write a PHP email address scraper that does the following: (1) Visit the URL https://sitename.com/?p=page_number where page_number starts from 10 to 5000. (2) Each URL visit should pause for at least 10 seconds before checking the next URL. (3) For each URL visit, scrape the email addresses found on the page source. (4) Append each email address, followed by a new line, to a text file called output.txt.

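Here is example PHP code from ChatGPT that scrapes email addresses from a range of pages on a WordPress site and saves them to output.txt:

<?php

// Scrape email addresses from $site_url/?p=N for every N from $start_page
// to $end_page and write the unique addresses to $output_file, one per line.
function scrape_emails($start_page, $end_page, $site_url, $output_file, $delay_seconds = 10) {
    // Character ranges need plain ASCII hyphens; {2,} also matches long TLDs.
    $email_pattern = '/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/';
    $emails = [];
    for ($page_num = $start_page; $page_num <= $end_page; $page_num++) {
        $url = $site_url . '/?p=' . $page_num;
        // @ hides the warning file_get_contents raises when a page fails to load.
        $content = @file_get_contents($url);
        if ($content !== false) {
            preg_match_all($email_pattern, $content, $matches);
            $emails = array_merge($emails, $matches[0]);
        }
        // Pause between requests, as the prompt asks.
        sleep($delay_seconds);
    }
    // Drop exact duplicates before writing the file.
    $emails = array_unique($emails);
    file_put_contents($output_file, implode("\n", $emails));
}

// Example usage
scrape_emails(1, 100, 'http://sitename.com', 'output.txt');

?>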

You don’t even have to know any programming language to get the job done.
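
If you are curious what the regular-expression step actually does, it can be tried on a plain string without visiting any site. The sketch below is an illustration only: the helper name extract_emails and the sample text are assumptions, and lowercasing the matches before removing duplicates is a design choice of this sketch.

```php
<?php
// Hypothetical helper for illustration; not taken from the scraper itself.
function extract_emails(string $text): array {
    // Plain ASCII hyphens in the character classes; {2,} allows long TLDs.
    $pattern = '/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/';
    preg_match_all($pattern, $text, $matches);
    // Lowercase first so "Bob@Example.com" and "bob@example.com" collapse.
    $emails = array_map('strtolower', $matches[0]);
    return array_values(array_unique($emails));
}

// Made-up text standing in for a downloaded page.
$sample = 'Contact Bob@Example.com or sales@example.travel. '
        . 'Again: bob@example.com';

print_r(extract_emails($sample));
```

On the sample text this yields two addresses, bob@example.com and sales@example.travel, because the repeated address differs only in capitalization.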

Step 3: Use Your Favorite Online IDE

Now that ChatGPT has written the code, you need somewhere to run it. An online IDE such as Replit or PythonAnywhere works:

  • Replit: Go to replit.com, create a new PHP repl, paste the code, and run it.
  • PythonAnywhere: This service runs Python rather than PHP, so first ask ChatGPT for a Python version of the scraper. Then sign up at pythonanywhere.com, create a new file, paste the code, and run it.

Step 4: Open the Output File

After running the scraper, open the output.txt file to harvest the emails. Each email address will be on a new line.

The script already drops exact duplicates, but you can double-check in a spreadsheet app like Excel or Google Sheets:

  • Copy the email addresses from output.txt and paste them into a column in your spreadsheet.
  • Use the “Remove duplicates” feature to ensure each email is unique.
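
If you would rather skip the spreadsheet, the same cleanup can be done in a few lines of PHP. This is a minimal sketch under stated assumptions: the function name dedupe_email_file is made up, and the demo runs on a throwaway temporary file so a real output.txt is not touched.

```php
<?php
// Hypothetical helper; the function name is an assumption for illustration.
function dedupe_email_file(string $path): array {
    // Read one address per line, skipping empty lines.
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    // Trim whitespace and lowercase so near-identical entries collapse.
    $clean = array_map(fn($line) => strtolower(trim($line)), $lines);
    $unique = array_values(array_unique($clean));
    // Write the cleaned list back, one address per line.
    file_put_contents($path, implode("\n", $unique) . "\n");
    return $unique;
}

// Demo on a temporary file with made-up addresses.
$demo = tempnam(sys_get_temp_dir(), 'emails');
file_put_contents($demo, "a@example.com\nA@example.com \nb@example.org\n");
print_r(dedupe_email_file($demo));
unlink($demo);
```

To use it on the real file, call dedupe_email_file('output.txt') after the scraper finishes.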

Step 5: You’re done!

In about 5 minutes of hands-on work, you have a scraper that can extract 1000 emails from a directory site. The run itself takes longer: with the 10-second pause the prompt asks for, crawling 100 pages takes about 17 minutes.

This process showcases the efficiency and power of using ChatGPT to generate useful scripts quickly. Remember, always use scraping ethically and legally!

Disclaimer: Web scraping can be illegal in many countries. It’s crucial to stay within legal boundaries and respect the terms of service of any site you’re scraping. This article is purely instructional and demonstrates the power of ChatGPT in writing programs fast. Always ensure you have permission to scrape a site.

Mr. 6
Generative Geeks

Book author. AI enthusiast. Daily-writing hobbyist. Built a startup using a primitive form of generative AI to mass-generate ads in 2017. Stanford graduate.