Web Scraping Process for Beginners

Mahesh Tiwari · Published in Nerd For Tech · Sep 8, 2023

Web data extraction, commonly referred to as web scraping, is an essential method in data analytics, business intelligence, market research, and e-commerce. It entails gathering and converting online data — including item pricing, news articles, social media posts, contact information, and financial data — into useful formats.

Photo by Ian Schneider on Unsplash

In an increasingly data-driven world, this process collects and organises data at scale, enabling decision-makers to gain insights, monitor trends, and make informed decisions.

The dynamic nature of web content, the need for reliable and flexible scraping methodologies, and ethical and legal considerations (Dilmegani, 2023) all pose challenges for web data extraction. This initial exploration lays the groundwork for using web data extraction responsibly and effectively.

Table of Contents:

1. Introduction to asynchronous methods for data extraction

2. Understanding the dataset

3. Extraction of email addresses

4. Data Preprocessing and Cleaning

5. Data Visualization

Introduction

Python's asyncio and aiohttp libraries, which enable concurrent task execution, are well suited to web scraping activities such as email extraction. By allowing several requests to be processed at the same time, they shorten the time needed for data extraction. Used together, ChromeDriver and Selenium offer a powerful automation solution for email extraction and site scraping. Selenium enables navigation, interaction, and data extraction by allowing programmatic control of web browsers, including Chrome. ChromeDriver serves as the link between Selenium and the Chrome browser, letting scripts open webpages, load content, and locate email addresses with regular expressions. Selenium can also simulate user interactions with webpages, which makes it applicable to many web scraping tasks. Asynchronous and automated email extraction methods improve productivity and accuracy when handling large datasets, making them well suited to research, marketing, and data analysis.
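As a small illustration of the Selenium/ChromeDriver side of this workflow, the sketch below loads a page in headless Chrome and pulls email-like strings out of the rendered HTML. The driver path and URL are placeholders, so treat it as indicative rather than the project's actual pipeline:

import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Run Chrome without a visible window
chrome_options = Options()
chrome_options.add_argument('--headless')

# Placeholder path to the ChromeDriver binary
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service, options=chrome_options)

driver.get('https://example.com')      # Placeholder URL
page_source = driver.page_source       # HTML after the browser has rendered the page

email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}'
print(re.findall(email_pattern, page_source))

driver.quit()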

The goal of this project is to use this method to extract email addresses from marketing websites around the world. The data for this project is available for download on Kaggle; click here for the data (Kaggle, 2023). The project focuses on the usefulness of web data extraction methods and the value of data-driven strategies in modern business, exploring the methodologies and tools used to extract information from the listed websites.

Part A: Dataset Understanding

This code shows how to import data from an Excel file ('Google_partner.xlsx') into a DataFrame using Python's Pandas package. The resulting DataFrame has 23,297 rows and 5 columns.

import pandas as pd

# Read the Excel file with the list of websites (the file has no header row)
input_file = './Google_partner.xlsx'
df_data = pd.read_excel(input_file, header=None)
# Check the shape of the DataFrame
df_data.shape
# Display the first few rows of the DataFrame
df_data.head()

Looking at the first five rows of the DataFrame, we can see that one of the columns is entirely empty and that the dataset has no column names. Let's remove the empty column and give the remaining columns suitable names. We also add a new 'ID' column so that the scraped email addresses can later be saved against a compact identifier, which takes less storage. We did this as below:

# Remove the unnamed column (third column)
df_data = df_data.drop(columns=[3])
# Add column names (adjust to match the actual number of columns in your data)
column_names = ['Company Name', 'Location', 'Website', 'Country']
# Create the DataFrame with column names
df = pd.DataFrame(df_data.values, columns=column_names)
# Add an 'ID' column with unique identifiers
df['ID'] = range(1, len(df) + 1)

# Save the modified DataFrame to a CSV file with the desired columns
output_csv_file = 'Google_partner_with_id.csv'
df.to_csv(output_csv_file, index=False)
print(f"DataFrame with desired columns saved to {output_csv_file}")
df.head()

Then we saved the data into a new CSV file for the web scraping.

Part B: Extraction of Email Addresses

B.1: Importing Libraries and Defining Functions

In this part of the code, various Python libraries are imported.

import pandas as pd
import aiohttp
import asyncio
import cachetools
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import nest_asyncio
nest_asyncio.apply()

Here’s a brief summary of the role of each library in the code:

  • pandas: used for data management, including loading and manipulating CSV data.
  • re (regular expressions): used to define patterns for extracting valid email addresses.
  • aiohttp: used to send asynchronous HTTP requests.
  • asyncio: the framework for asynchronous programming.
  • cachetools: provides caching utilities so the same website is not fetched twice.
  • selenium: used for web automation and web scraping jobs.
  • selenium.webdriver.chrome.options: lets us set Chrome browser options, such as headless mode.
  • selenium.webdriver.chrome.service: manages the ChromeDriver executable used for browser automation.
  • nest_asyncio: allows event loops to be nested, which is needed to run asyncio inside a notebook.

Together, these tools make coding tasks like web scraping, data extraction, and data storage easier.
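To make the asynchronous pattern concrete before the full scraper, the toy sketch below fetches a few pages concurrently with asyncio and aiohttp. The URLs are placeholders and the snippet is purely illustrative:

import asyncio
import aiohttp

async def fetch(session, url):
    # Request a single page and return the length of its HTML
    async with session.get(url) as response:
        html = await response.text()
        return url, len(html)

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # Schedule all requests at once and wait for them together
        results = await asyncio.gather(*(fetch(session, url) for url in urls))
        for url, size in results:
            print(f"{url}: {size} characters of HTML")

asyncio.run(main(['https://example.com', 'https://example.org']))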

B.2: Loading and Preparing Data

This section of the code loads data from the CSV file 'Google_partner_with_id.csv' into a Pandas DataFrame (df). The websites are then handled in ascending order of the 'ID' column, so each batch processes IDs in sequence.

# Specify the path to the ChromeDriver executable
chrome_driver_path = '/Users/mahesh/Desktop/chromedriver-mac-x64/chromedriver'

# Set Chrome options for headless browsing
chrome_options = Options()
chrome_options.add_argument('--headless') # Enable headless mode

# Read the CSV file with the list of websites
input_file = './Google_partner_with_id.csv'
df = pd.read_csv(input_file)
df.shape
df.head()

# Create a new DataFrame to store the results (ID, Website, and Email)
output_df = pd.DataFrame(columns=['ID', 'Website', 'Email'])

# Initialize a simple in-memory cache with a maximum size of 100 items
cache = cachetools.LRUCache(maxsize=100)

# Specify batch size
batch_size = 100

# Initialize the start_index
start_index = 0
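If the CSV file were not already stored in ascending ID order, a single extra line, shown here only as an optional safeguard, would make the ordering described above explicit before batching begins:

# Optional: sort by 'ID' so batches run in ascending ID order
df = df.sort_values('ID').reset_index(drop=True)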

B.3: Crawling and Extracting Email Addresses

Our code defines two key asynchronous functions, scrape_website and scrape_batch, to retrieve email addresses from webpages.

The scrape_website function is an asynchronous scraper that takes the aiohttp session, the row index, and a DataFrame row containing the website URL. It first checks whether the URL is already in the cache; if not, it issues an aiohttp HTTP GET request to fetch the HTML content. It then searches the HTML source for valid email addresses using a regular expression pattern. If no addresses are found on the homepage, it also checks common pages such as '/contact', '/contact-us', and '/about-us'. If that is still unsuccessful, it records 'No email found'. The findings are written to the output_df DataFrame, the extracted addresses are cached, and a progress message reports which email addresses were found.

# Define an asynchronous function to scrape a single website
async def scrape_website(session, index, row):
    try:
        website_url = row['Website']

        # Check if the URL is in the cache
        if website_url in cache:
            email_addresses = cache[website_url]
        else:
            async with session.get(website_url) as response:
                page_source = await response.text()

            email_pattern = r'\b(?![A-Za-z0-9._%+-]*\.png\b)(?![A-Za-z0-9._%+-]*\.svg\b)(?![A-Za-z0-9._%+-]*\.jpeg\b)(?![A-Za-z0-9._%+-]*\.wixpress\.com\b)[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'
            email_addresses = re.findall(email_pattern, page_source)

            if not email_addresses:
                pages_to_check = [website_url + '/contact', website_url + '/contact-us', website_url + '/about-us']

                for page_url in pages_to_check:
                    async with session.get(page_url) as response:
                        page_source = await response.text()

                    email_addresses = re.findall(email_pattern, page_source)
                    if email_addresses:
                        break

            if not email_addresses:
                email_addresses = ['No email found']

            # Store the result in the cache
            cache[website_url] = email_addresses

        # Store the results in the output DataFrame
        output_df.loc[index] = [row['ID'], website_url, ', '.join(email_addresses)]

        print(f"Processed {website_url}, Email Addresses Found: {', '.join(email_addresses) if email_addresses else 'No email found'}")

    except Exception as e:
        print(f"Error: {str(e)}. Skipping entry {index} and continuing to the next.")

The scrape_batch function controls the scraping operation in batches over the DataFrame. Using the scrape_website function, it builds a list of asynchronous tasks so that every website in the batch is scraped concurrently, which greatly improves throughput when dealing with a large number of websites. The surrounding loop calculates the index range for each batch, runs the batch with asyncio.run and asyncio.gather, and, after each batch finishes, sorts the extracted email addresses by website ID and saves them to a CSV file.


# Define an asynchronous function to scrape websites in a specified range for each batch
async def scrape_batch(start_index, end_index):
    async with aiohttp.ClientSession() as session:
        tasks = []

        for index in range(start_index, end_index):
            if index >= len(df):
                break
            row = df.iloc[index]
            task = scrape_website(session, index, row)
            tasks.append(task)

        await asyncio.gather(*tasks)


while start_index < len(df):
    # Calculate the end_index for the current batch
    end_index = start_index + batch_size

    # Check if the end_index exceeds the length of the DataFrame
    if end_index > len(df):
        end_index = len(df)

    # Display progress for the current batch
    print(f"Scraping batch {start_index + 1}-{end_index}...")

    # Run the scraping function for the current batch using asyncio.run()
    asyncio.run(scrape_batch(start_index, end_index))

    # Sort the DataFrame by the 'ID' column
    output_df.sort_values(by='ID', inplace=True)

    # Construct the output file name with end_index included
    output_csv_file = f'scraped_emails_end{end_index}.csv'

    # Save the sorted results to a CSV file for the current batch
    output_df.to_csv(output_csv_file, index=False)
    print(f"Saved sorted results (end_index={end_index}) to {output_csv_file}")

    # Update start_index for the next batch
    start_index = end_index

Part C: Filling Missing Data and Merging Datasets

C.1: Filling Missing Data and Preparing for Merging: The results from the CSV files 'scraped_emails_end7400.csv' and 'scraped_emails_end23297.csv' are loaded into Pandas DataFrames, concatenated, and sorted by the 'ID' column. To find missing IDs, a range of expected IDs is built from the first to the last ID in the sorted combined_df and compared with the actual IDs present. For every missing ID, a new entry is added with the email value 'Skipped During Scrapping'. The DataFrame is then sorted again by 'ID' and its index reset to guarantee a continuous index sequence. The filled DataFrame is saved as a new CSV file called 'Scrapped_email.csv', ready to be merged with the original website dataset.

import pandas as pd

# Load the first CSV file
file1 = "scraped_emails_end7400.csv"
df1 = pd.read_csv(file1)

# Load the second CSV file
file2 = "scraped_emails_end23297.csv"
df2 = pd.read_csv(file2)

# Concatenate the two DataFrames
combined_df = pd.concat([df1, df2], ignore_index=True)

# Drop the 'Website' column
combined_df = combined_df.drop(columns=['Website'])

# Sort the DataFrame by the 'ID' column
combined_df.sort_values('ID', inplace=True)

# Create a range of expected IDs from the first to the last element
expected_ids = range(combined_df['ID'].iloc[0], combined_df['ID'].iloc[-1] + 1)

# Find missing IDs by comparing the expected IDs with the actual IDs in the DataFrame
missing_ids = set(expected_ids) - set(combined_df['ID'])

# Print the missing IDs
print("Missing IDs:", missing_ids)

# Insert new entries for the missing IDs with 'Skipped During Scrapping' as the Email value
new_entries = [{'ID': missing_id, 'Email': 'Skipped During Scrapping'} for missing_id in missing_ids]
combined_df = pd.concat([combined_df, pd.DataFrame(new_entries)], ignore_index=True)

# Sort the DataFrame again by 'ID' to maintain order
combined_df.sort_values('ID', inplace=True)

# Reset the index of the resulting DataFrame
combined_df.reset_index(drop=True, inplace=True)

# Save the combined and filled DataFrame to a new CSV file
combined_df.to_csv('Scrapped_email.csv', index=False)

C.2: Merging Two Datasets Based on a Common Identifier:

The second part of the process involves combining two datasets, Google_partner_df and scrapped_email_df, to consolidate information. The first dataset is loaded from a CSV file, while the second is from a previously prepared DataFrame. The merging operation uses the pd.merge() function to combine the two datasets based on a common identifier, with an inner join to retain records with matching ‘ID’ values.

The code then reorders the columns of the merged_df DataFrame so that the 'ID' column appears first, and displays the result. This makes subsequent data analysis and visualisation tasks easier. Finally, the combined dataset is saved as a new CSV file, showing how two datasets can be merged on a shared identifier.

#merge two data sets:

import pandas as pd

# Load the first CSV file with Google Partner data
google_partner_file = "Google_partner_with_id.csv"
google_partner_df = pd.read_csv(google_partner_file)

# Load the second CSV file with Scrapped Email data
scrapped_email_file = "Scrapped_email.csv"
scrapped_email_df = pd.read_csv(scrapped_email_file)

# Merge the two DataFrames based on the 'ID' column
merged_df = pd.merge(google_partner_df, scrapped_email_df, on='ID', how='inner')
# Shift the 'ID' column to the first position
merged_df = merged_df[['ID'] + [col for col in merged_df.columns if col != 'ID']]

merged_df
# Save the merged DataFrame to a new CSV file
merged_df.to_csv("Web_scrapping_first_attempt.csv", index=False)

Part D: Data Preprocessing

D.1: Categorizing data

The dataset is then divided into three groups based on the 'Email' column: entries with 'No email found', entries marked 'Skipped During Scrapping', and entries with valid email addresses. Each category is saved to a separate CSV file. In addition, a subset of entries whose website URLs do not contain 'http' is set aside for further investigation. This code separates and organises the dataset so that it is easier to maintain for the subsequent processing and analysis steps.

#Import libraries
import numpy as np
import pandas as pd

#loading data
My_scrapped_file = './Web_scrapping_first_attempt.csv'
df = pd.read_csv(My_scrapped_file)

#Dividing dataset into three categories based on email

# Entries where no email was found
df1 = df[df['Email'] == 'No email found']
df1.to_csv('No_email.csv', index=False)

# Entries skipped during scraping
df2 = df[df['Email'] == 'Skipped During Scrapping']
df2.to_csv('Skipped.csv', index=False)

# Entries that contain emails
df3 = df[(df['Email'] != 'No email found') & (df['Email'] != 'Skipped During Scrapping')]
df3.to_csv('Scrap_cleaning.csv')

# Combine the entries that do not contain an email
df4 = pd.concat([df1, df2], ignore_index=True)
df4.to_csv('Residual_Scrapping.csv', index=False)

# Subset of entries whose website URL does not contain 'http'
without_http = df[~df['Website'].str.contains('http')]

D.2: Complete Data Cleaning:

On a dataset of email addresses, we run a number of data cleaning and categorisation procedures in this part. Here is a list of the essential actions:

  1. Loading Data and Removing the Unnamed Column: The code first loads 'Scrap_cleaning.csv' into a Pandas DataFrame called df, then drops the leftover 'Unnamed: 0' index column to tidy up the data structure.
import pandas as pd
import re
my_file = './Scrap_cleaning.csv'
df = pd.read_csv(my_file)
df.drop(columns=['Unnamed: 0'], inplace=True)

2. Converting Email Addresses to Lowercase: To maintain uniformity in style, all email addresses in the ‘Email’ column are changed to lowercase.

# Convert all email addresses in the "Email" column to lowercase
df['Email'] = df['Email'].str.lower()

3. Custom Function to Remove Email Extensions: A custom function called remove_extensions removes email addresses containing unwanted substrings (such as "wixpress", "jpg", "jpeg", or "png") from a cell value. It splits the cell into individual addresses, filters out the unwanted ones, and joins the remaining addresses back together.

# Function to remove entries containing specified extensions from a given cell value
def remove_extensions(emails):
    extensions_to_remove = ['wixpress', 'jpg', 'jpeg', 'png', 'svg', 'webp', 'sentry', '@example', 'tiff', 'example@']

    email_list = re.split(r',\s*|\s*,', emails)  # Split using either ', ' or ','
    valid_emails = [email for email in email_list if not any(ext in email for ext in extensions_to_remove)]

    return ', '.join(valid_emails) if valid_emails else None

# Apply the updated function to the "Email" column
df['Email'] = df['Email'].apply(remove_extensions)

4. Cleaning Duplicate Email Addresses: Duplicate addresses within each cell are removed by splitting the cell's email list, deduplicating it with a set, and joining the result back into a single string, so that no cell contains the same address twice.

# Check for duplicated rows in the DataFrame
duplicates = df.duplicated()
# Count the duplicated rows
count_of_duplicates = duplicates.sum()
# Remove duplicates within each cell and join them back into a single string
df['Email'] = df['Email'].apply(lambda x: ', '.join(set(x.split(', '))) if x else None)

5. Custom Function to Remove Numbers at the Beginning: The custom function remove_numbers_at_beginning strips any digits that appear at the start of an email address.


# Function to remove numbers at the beginning of each email address
def remove_numbers_at_beginning(emails):
    if emails is None:
        return None

    email_list = emails.split(', ')
    cleaned_emails = []

    for email in email_list:
        # Check if the email is non-empty and starts with digits
        if email and email[0].isdigit():
            # Use regex to remove numbers at the beginning
            cleaned_email = re.sub(r'^\d+', '', email)
            cleaned_emails.append(cleaned_email)
        else:
            cleaned_emails.append(email)

    return ', '.join(cleaned_emails)

# Apply the function to the "Email" column
df['Email'] = df['Email'].apply(remove_numbers_at_beginning)

6. Email Categories: A custom function called categorize_email assigns each entry to one of four categories: 'Valid Email', 'Invalid Email', 'No email found', or 'Skipped During Scrapping'.

def categorize_email(email):
    if email == "No email found":
        return 'No email found'            # No email found on the website
    elif email == "Skipped During Scrapping":
        return 'Skipped During Scrapping'  # Skipped during scraping
    elif email == "Invalid Email":
        return 'Invalid Email'             # Invalid email
    else:
        return 'Valid Email'               # Valid email

# Apply the custom function to create the 'Email_category' column
df['Email_category'] = df['Email'].apply(categorize_email)

# Display the DataFrame with the new 'Email_category' column
print(df)

7. Saving the Final DataFrame: The final cleaned and organised DataFrame, df, is saved to a CSV file ('final.csv') without the index.

# Save the DataFrame to a CSV file without including the index
df.to_csv('final.csv', index=False)

Through this cleaning, extension removal, duplicate handling, and categorisation, the code produces a clean, organised dataset of email addresses ready for analysis and visualisation.
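As a quick sanity check before moving on to visualisation, the share of each email category can be inspected directly in Pandas. This is a small illustrative snippet that assumes the 'Email_category' column created above:

# Percentage of entries in each email category
category_share = df['Email_category'].value_counts(normalize=True).mul(100).round(2)
print(category_share)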

Part E: Visualization using Tableau

Visualization by the author using Tableau

This image can be viewed using Tableau here.

Conclusion

Email addresses were successfully collected from a dataset of websites using Python's asynchronous techniques together with ChromeDriver and Selenium. Parallel task execution allowed this approach to speed up data extraction considerably. Selenium, through ChromeDriver, provided the bridge between the code and the Chrome browser, giving programmatic control over online interactions, navigation, and content extraction.

From the visualization, we can see that email addresses were scraped for 53.70% of the websites in the dataset. Out of 23,297 entries, no email address was found on around 8,000 websites; after checking some samples, we observed that those sites simply did not publish an email address. During the scraping process, around 1,540 entries were skipped, meaning those websites produced no output; these can be scraped again after separating them into their own DataFrame, which we did, recovering approximately 100 email IDs. Roughly 1,000 websites contained only pseudo email strings (ending in .png, .svg, .wixpress.com, and so on), which we classified as invalid emails.

In summary, asynchronous methods and tools like Chromedriver/Selenium offer superior efficiency, automation, and versatility when compared to Beautiful Soup. These advantages make them a preferred choice for tasks that involve complex web interactions and require rapid and efficient data extraction, such as email extraction from websites.
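For contrast, the synchronous approach this comparison refers to would fetch one site at a time with requests and Beautiful Soup, blocking on every request. A simplified sketch with a placeholder URL, shown only for illustration:

# Synchronous alternative: one blocking request at a time with requests + BeautifulSoup
import re
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'                   # Placeholder URL
html = requests.get(url, timeout=10).text     # Blocks until this single request completes
text = BeautifulSoup(html, 'html.parser').get_text()

email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}'
print(re.findall(email_pattern, text))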

References:

  1. Dilmegani, C. (2023, March 12). Is Web Scraping Legal? Ethical Web Scraping Guide in 2023. AIMultiple.
  2. Kaggle. (2023, Sep 4). Websites Domain in the UK and US.

FOLLOW ME to be part of my Data Analyst Journey on Medium.

Let’s get connected on Twitter or you can Email me for project collaboration, knowledge sharing or guidance.
