
Step by Step Tutorial: Extracting and Storing Real Estate Prices Using Python, Selenium, and AWS S3

Chesta Dhingra · Published in Data And Beyond · 5 min read · May 17, 2024

In today’s data-driven world, the ability to efficiently gather and process information is crucial. Our latest project focuses on two pivotal stages of the ETL process — Extraction and Loading — tailored specifically for real estate data analysis.

Extraction: Ethical Scraping from Realtor.com

Our journey begins with the extraction of house pricing data from realtor.com, a leading property listing website. It’s vital to address the ethical considerations of web scraping right from the outset.

Our approach ensures compliance with the website’s terms of service and respects data privacy regulations. We utilize Python’s Selenium library, a robust tool for interacting with web elements, particularly effective with JavaScript-heavy sites.

Selenium isn’t just about simple data retrieval; it lets us simulate real-user interactions. For greater efficiency and to respect site resources, we also employ a headless browser setup using the undetected-chromedriver library. This setup allows our script to scroll through web pages, mimicking human behavior and dynamically loading more listings, including the crucial card-price elements that hold each house’s price.
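To make the scrolling idea concrete, here is a minimal sketch of such a helper. It is not part of the final script below; the pause length and scroll count are arbitrary placeholders you would tune for the target page.

import time

def scroll_to_load_listings(driver, pause=2, max_scrolls=10):
    """Scroll down in steps so lazily loaded listing cards get rendered.

    Illustrative only: pause and max_scrolls are placeholder values.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to fetch and render new cards
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no new content was loaded
            break
        last_height = new_height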

Loading: Storing Data Securely on AWS S3

After successfully extracting data, the next critical step in our data pipeline is securely loading it into an AWS S3 bucket. This process involves several important actions to ensure both security and functionality.

  1. Setting Up an IAM User: First, create an IAM user in AWS. This user should have only the permissions needed for the task at hand, not unrestricted access to other AWS services. Grant the user programmatic access for use with the AWS CLI and attach only the necessary policies, such as AmazonS3FullAccess; avoid broad policies like AdministratorAccess. This least-privilege approach minimizes security risks while still enabling the required operations.
  2. Managing User Credentials: After creating the user, download the credentials as a CSV file for record-keeping. Then generate an Access Key ID and Secret Access Key from the IAM user’s security credentials settings. These keys are vital for integrating our Python script with AWS S3, as they authenticate and authorize our actions programmatically.
  3. Creating the S3 Bucket: Log in with the IAM user credentials and create an S3 bucket. When setting up the bucket, adjust its access settings carefully: 1) Uncheck “Block all public access” only if your application requires external accessibility, and be cautious with this setting to avoid unintended public exposure. 2) Configure a bucket policy to specify access permissions. Here is a template for such a policy, which you will need to customize with your own details (IAM user account ID, user name, and bucket name):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::iamuseraccountID:user/nameofIAMUser"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::bucketname/*"
    }
  ]
}

Be sure to replace the placeholders with your actual IAM user account ID, IAM user name, and bucket name.
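If you prefer to script this step rather than clicking through the console, the same bucket and policy can be set up with boto3. This is only a sketch: the bucket name, account ID, and user name below are placeholders that must be replaced with your own values.

import json
import boto3

BUCKET_NAME = "nameofthebucket"  # placeholder bucket name
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::iamuseraccountID:user/nameofIAMUser"},  # placeholders
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET_NAME}/*",
        }
    ],
}

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket=BUCKET_NAME)  # us-east-1 needs no CreateBucketConfiguration
s3.put_bucket_policy(Bucket=BUCKET_NAME, Policy=json.dumps(POLICY))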

With these steps, your AWS IAM account and S3 bucket are ready to securely store the data extracted by your Python script. This setup not only ensures the integrity and confidentiality of your data but also provides a scalable solution for your data storage needs.

In this section, we will explore the complete Python script that integrates Selenium for extracting real estate data and subsequently loading it into an AWS S3 bucket. This script is a practical example of how powerful tools can be combined to automate and streamline data collection and storage processes.

Prerequisites

  • Ensure that you have installed Python, Selenium, boto3 (AWS SDK for Python), and other necessary libraries.
  • Configure your AWS credentials (as discussed in the previous sections) to allow programmatic access to your S3 bucket.

Python Script Overview

  1. Data Extraction Using Selenium: We’ll extract real estate pricing data from a webpage.
  2. Data Loading to AWS S3: We’ll securely upload the extracted data to the S3 bucket.
import boto3
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def fetch_prices(url):
    """
    Fetches prices from the given URL using Selenium and returns them as a list.
    """
    chr_options = Options()
    chr_options.add_experimental_option("detach", True)
    chr_options.add_experimental_option('excludeSwitches', ['enable-logging'])

    # Initialize the driver (update the chromedriver path for your machine)
    driver = webdriver.Chrome(service=Service('C:/Users/Chesta/ETL_real_estate_project/chromedriver.exe'), options=chr_options)
    driver.get(url)
    driver.maximize_window()

    # Wait for the price elements, then extract them by XPATH.
    # Update the XPATH according to your target website.
    wait = WebDriverWait(driver, 25)
    wait.until(EC.presence_of_element_located((By.XPATH, "//*[@data-testid='card-price']")))
    prices = driver.find_elements(By.XPATH, "//*[@data-testid='card-price']")
    price_list = [price.text for price in prices]
    print(price_list)

    driver.quit()
    return price_list


def upload_data_to_s3(data, bucket_name, file_name):
    """
    Uploads given data to a specified S3 bucket using explicit AWS credentials.
    """
    # Create a session with explicit credentials
    # (replace the placeholders; avoid hardcoding real keys in source code)
    session = boto3.Session(
        aws_access_key_id='accesskeyid',
        aws_secret_access_key='secretaccess',
        region_name='us-east-1'  # e.g., 'us-east-1'
    )

    # Creating a connection to S3 using the session
    s3 = session.client('s3')

    # Data needs to be in byte form to upload to S3
    data_string = '\n'.join(data).encode('utf-8')

    # Uploading data to the specified bucket under the specified file name
    s3.put_object(Bucket=bucket_name, Key=file_name, Body=data_string)
    print("Data uploaded successfully to S3.")


if __name__ == "__main__":
    # URL of the page you want to scrape
    url = 'https://www.realtor.com/realestateandhomes-search/Henderson_NV/show-price-reduced/sby-6'

    # Fetch prices from the website
    prices = fetch_prices(url)

    # S3 bucket and file name where the data will be stored
    bucket_name = 'nameofthebucket'
    file_name = 'prices.txt'

    # Upload the prices to AWS S3
    upload_data_to_s3(prices, bucket_name, file_name)

After our data has been uploaded to AWS S3, it’s good practice to verify that everything went as planned. Here are straightforward command line instructions to check the files in your S3 bucket and to copy a file from S3 to your local machine for a sanity check.

## List the files in the S3 bucket
C:\Users\Chesta>aws s3 ls s3://extracted-data-store/prices.txt
2024-05-09 01:20:20 71 prices.txt

## Copy prices.txt from the S3 bucket to the Downloads directory on your
## local machine as a sanity check that the data was loaded. If the
## download is successful, you should see a message similar to:
C:\Users\Chesta>aws s3 cp s3://extracted-data-store/prices.txt C:\Users\Chesta\Downloads\prices.txt
download: s3://extracted-data-store/prices.txt to Downloads\prices.txt
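For an equivalent check from Python, boto3 can confirm the object exists and peek at its contents. This is a quick sketch; the bucket and key names simply mirror the CLI example above.

import boto3

s3 = boto3.client('s3')

# Confirm the object exists and report its size in bytes
head = s3.head_object(Bucket='extracted-data-store', Key='prices.txt')
print(head['ContentLength'], 'bytes')

# Download the object and print the first few lines as a sanity check
body = s3.get_object(Bucket='extracted-data-store', Key='prices.txt')['Body'].read().decode('utf-8')
print(body.splitlines()[:5])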

Hope you had as much fun reading this as I did writing it! Got thoughts, feedback, or just want to chat about it? Drop your comments below — I’d love to hear from you!
