The Web Scraper That Got My IP Address Black-Listed From Zillow (Part I)

A guide for giving your web scraper wings, and a warning against flying too close to the Sun.

Ryan Sherby
Pipeline: Your Data Engineering Resource
7 min read · Apr 28, 2023


Photo by Zdeněk Macháček on Unsplash

Today I’m turning the publication over to fellow data professional Ryan Sherby to tell the story of how he built on an existing project of mine to efficiently scrape housing data.

This is the first installment in a two-part series.

“How Do I Scrape Housing Data?”

If you search this question on Google, you’ll likely find an overwhelming number of results. It can quickly feel like finding a proper guide is more difficult than demystifying the process itself.

The resource I used to guide my development of the web scraper is an article published by Zach Quinn. This article contains great advice on dynamic URL creation, effective HTML parsing, and Pythonic ETL.

I strongly encourage you to check out that article now before proceeding.

Let’s briefly look at a modified web scraper built using the logic outlined in that article. For a full read-through of the code, this Jupyter Notebook is available.

The Setup

This is my version of the Zillow web scraper in its alpha form. Running this version will not get your IP address black-listed, so you’re free to use it as much as you’d like.

First, let’s get those basic imports out of the way.

import re
import datetime as dt
import json

import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

Second, you need a configuration file.

# config.py

OW_API_CALL="http://api.openweathermap.org/geo/1.0/zip?zip={zip},US&appid={api_key}"

API_KEY=<your_api_key>
# GET YOUR OWN AT OPENWEATHER.COM


ZILLOW_HOMES_URL="https://www.zillow.com/{city}-{state}/{property_type}/{page}_p"

HOME_HEADER_CLASS=("div",{"class":"StyledPropertyCardDataWrapper-c11n-8-82-3__sc-1omp4c3-0 iKSlva property-card-data"})
HOME_PAGE_CLASS=("span",{"class":"Text-c11n-8-82-3__sc-aiai24-0 rmCHB"})
HOME_PRICE_CLASS=("div",{"class":"StyledPropertyCardDataArea-c11n-8-82-3__sc-yipmu-0 gMDnGj"})
HOME_SPACE_CLASS=("ul",{"class":"StyledPropertyCardHomeDetailsList-c11n-8-82-3__sc-1xvdaej-0 kVWnTV"})
HOME_ADDRESS_CLASS=("a",{"class":"StyledPropertyCardDataArea-c11n-8-82-3__sc-yipmu-0 hiBOYq property-card-link"})

LIST_ITEM_CLASS="li"
LIST_ITEM_IDENTIFIER="abbr"


ERROR_404_CONSTANT='zillow-error-page'
BOT_CAUGHT_CONSTANT='robots'

Third, you need the core program.

# main.py

from config import *  # URL TEMPLATES, API KEY, AND HTML CLASS CONSTANTS

city_states: list[tuple[str, str]] = []  # POPULATE WITH (CITY, STATE) PAIRS

l = []
h = {}

for city, state in city_states:
    while True:

        # GET THE PAGE MAX

        break

    for p_idx in range(2, (_page_max + 2)):
        while True:

            # REQUEST THE PAGE DATA

            for home in homes:

                # EXTRACT PRICE DATA

                # EXTRACT SPACE DATA

                # EXTRACT ADDRESS DATA

                # REQUEST GEO-COORDINATE DATA USING OPENWEATHER API

                l.append(h)
                h = {}

            break

The core program extracts the number of pages, iterates through each page, and parses the relevant data. To acquire geo-coordinate data, a call is made to the OpenWeather API using each home’s zip code as a key.

The data is structured as a list of dictionaries, with each dictionary containing information for a single home.
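
For illustration, a single home’s dictionary might look something like this once every section has run (the field names and values below are examples of the shape, not actual scraper output):

h = {
    'price': 350000.0,            # EXTRACTED PRICE DATA
    'beds': 3.0,                  # EXTRACTED SPACE DATA
    'baths': 2.0,
    'sqft': 1850.0,
    'address': '123 Example St',  # EXTRACTED ADDRESS DATA
    'zip': '77005',
    'lat': 29.72,                 # FROM THE OPENWEATHER API
    'lon': -95.43
}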

Lastly, this list is converted to a DataFrame.

df=pd.DataFrame(l)

This script will do its job provided that you populate city_states, update the HTML class definitions, and navigate to OpenWeather.com to get your own API key.
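
For example, city_states might be populated like this (the lowercase, hyphenated slugs are my assumption based on the URL template in config.py; check how Zillow formats your target cities):

city_states = [
    ('houston', 'tx'),
    ('san-antonio', 'tx'),
    ('austin', 'tx')
]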

Sample DataFrame output. Image generated by the author.

Onwards and Upwards

In total, it takes about 1 minute to collect all the data for a big city like Houston, Texas.

This is good enough for most purposes, but let’s increase the scale. What if you wanted housing data for every city in the US with a population greater than 25,000?

If we take an average of 40 seconds per city and multiply it across every city that meets the population criterion, we get a total runtime of about 8 hours. That’s… not great.

The first step to improving our script is to identify where the biggest time wasters are. We can run each section of our script individually to capture how long it takes. After running each section a hundred times, we find:

Runtime by process in milliseconds. Graph generated by the author.
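
If you want to reproduce this kind of measurement, here’s a minimal sketch of one way to do it using time.perf_counter (the repetition count and the "houses" placeholder argument are illustrative, not the exact profiling code used for the graph):

import time
import statistics

import requests

from config import ZILLOW_HOMES_URL

def time_section(section, repetitions=100):
    """Run a zero-argument callable repeatedly; return its average runtime in milliseconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        section()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.mean(samples)

# EXAMPLE: TIME THE ZILLOW PAGE REQUEST IN ISOLATION
avg_ms = time_section(lambda: requests.get(
    ZILLOW_HOMES_URL.format(city="houston", state="tx",
                            property_type="houses", page=1)))
print(f"Average Zillow request time: {avg_ms:.1f} ms")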

By far, the Zillow webpage call takes the longest, and the OpenWeather API call is nothing to scoff at either, especially when it is made ~9 times per page!

If you’ve worked with Input/Output (I/O) operations before, these results probably don’t surprise you.

The time it takes to prepare the request and parse the response is a fraction of the time it takes to send the request and receive the response.

With I/O operations identified as our biggest time-wasters, we can begin exploring possible solutions.

Memoization

At first glance, this word kind of looks like a toddler trying to say “memorization”. In fact, it would probably be simpler if we just referred to the concept as memorization, as that is basically what it is.

If I asked you which bird has the largest wingspan, you’d probably spend a minute on Google and tell me “an albatross”. And if I asked you again, “Which bird has the largest wingspan?” would you spend another minute on Google? Of course not! Because you’ve already memorized the answer.

Memoization is the process of storing the results of expensive function calls and returning those cached values if the same inputs are received again.

Googling something is a time-consuming procedure relative to accessing the information from your memory. It follows then that a computer should also minimize the time spent searching for an answer that it already knows.
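
As a quick, generic illustration (this toy example is mine, not part of the scraper), the pattern looks like this with a plain dictionary cache:

import time

cache = {}

def slow_square(n):
    """Stand-in for an expensive operation, such as a network call."""
    time.sleep(1)  # SIMULATE A SLOW LOOKUP
    return n * n

def memoized_square(n):
    if n not in cache:
        cache[n] = slow_square(n)  # PAY THE COST ONCE
    return cache[n]                # LATER CALLS ARE FAST DICTIONARY LOOKUPS

memoized_square(4)  # SLOW: ABOUT ONE SECOND
memoized_square(4)  # FAST: RETURNED STRAIGHT FROM cache

Python’s functools.lru_cache can do this automatically for functions with hashable arguments; it wouldn’t apply directly to our API call, which takes a dictionary, so a hand-rolled cache is a better fit here.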

In choosing a candidate for memoization, there are a few factors we must consider:

What is the likelihood of duplicate values in the data?

Memoization is most effective when there is a high likelihood of repeated calls to a function with the same arguments. If the function is rarely called with the same arguments, then memoization may not provide much benefit.

What is the mutability of the data relationship?

If the data relationship is mutable, meaning it can be changed over time, then memoization may not be appropriate as it could result in stale or incorrect data being returned.

What is the space complexity of the stored data?

Memoization can use a lot of memory if the function has a large number of possible input values. It is important to consider the space complexity of the stored data and ensure that it does not use too much memory.

How often is the function called?

Memoization is most effective when the function is called frequently. If the function is rarely called, then the overhead of memoization may not be worth it.

This makes the OpenWeather API call an excellent candidate for memoization.

  • Zip codes are repeated across several records within the same city
  • The relationship between a zip code and its geo-coordinates is immutable
  • Each stored entry holds only three fields (zip code, latitude, longitude)
  • The function is called about 9 times per page

First, we modularize the API call as a function.

def OW_api_call(home):
    header = {"content-type": "application/json"}
    req = requests.get(OW_API_CALL.format(zip=home['zip'],
                                          api_key=API_KEY),
                       headers=header)

    json_loads = json.loads(req.text)

    try:
        home['lat'] = float(json_loads['lat'])
        home['lon'] = float(json_loads['lon'])

    except (KeyError, TypeError, ValueError):
        home['lat'] = None
        home['lon'] = None

    return json_loads

We then design a decorator function that will store the responses of the API call in memory and circumvent the API call if the zip code is already memo(r)ized.

memory = {}

def memoize_zip_codes(func):
    def wrapper(home):
        if home['zip'] not in memory:
            # CALL OW_api_call
            json_loads = func(home)
            try:
                # ADD NOVEL ZIP CODE AND COORDINATES TO memory
                memory[home['zip']] = {'lat': json_loads['lat'],
                                       'lon': json_loads['lon']}
            except (KeyError, TypeError):
                pass
        else:
            # GET ZIP CODE COORDINATES FROM memory
            home['lat'] = memory[home['zip']]['lat']
            home['lon'] = memory[home['zip']]['lon']
        return home
    return wrapper

Here is an example of our memory cache.

Sample memory cache. Image generated by the author.
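
In shape, the cache is just a dictionary keyed by zip code (the coordinates below are illustrative, not taken from the real output):

memory = {
    '77005': {'lat': 29.72, 'lon': -95.43},
    '77019': {'lat': 29.75, 'lon': -95.41}
}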

Defining our memoization function as a decorator allows it to control the execution of our API call function without modifying it.

If you’re a bit confused about what is going on here, I suggest you check out this primer on decorators.

To call the decorator function, we assign it to the main function using the @ symbol.

@memoize_zip_codes
def OW_api_call(home):
    # CONTENTS OF FUNCTION ...

We then call the decorated function like any other function.

# CONTENTS OF SCRIPT ...
try:
    h['zip'] = a_list[2].strip().split()[1]
except:
    h['zip'] = None

OW_api_call(h)

# CONTENTS OF SCRIPT ...

A Feather In Our Cap

On average, it takes less than 1 ms to access an item in memory. This means that for every API call we circumvent by accessing a known zip code, we save 130 ms.

For a large city like Houston, this relatively minor change has a massive effect. We see a reduction in run-time of over 20 seconds!
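
To see roughly where those 20 seconds come from, here’s a back-of-the-envelope calculation; the page count and the number of distinct zip codes are assumptions for illustration, not measured values:

pages = 20             # ASSUMED NUMBER OF RESULT PAGES FOR A CITY LIKE HOUSTON
calls_per_page = 9     # API CALLS PER PAGE, FROM THE RUNTIME BREAKDOWN
unique_zips = 25       # ASSUMED NUMBER OF DISTINCT ZIP CODES ACROSS THOSE PAGES
ms_saved_per_hit = 130

cache_hits = pages * calls_per_page - unique_zips   # CALLS THAT SKIP THE API
print(cache_hits * ms_saved_per_hit / 1000)         # ~20 SECONDS SAVED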

Applying this decrease as a constant percentage (roughly a one-third reduction), the total run-time for our large-scale program drops to about 5 hours. Is it better? Yes.

Is it the best it could be? Definitely not.

But we’ll improve it in Part II.

If you enjoyed this article, please consider following me and this publication to stay updated.
