The Web Scraper That Got My IP Address Black-Listed From Zillow (Part II)

A guide for giving your web scraper wings, and a warning against flying too close to the Sun.

Ryan Sherby
Pipeline: Your Data Engineering Resource
7 min read · May 4, 2023


A hummingbird on a green background.
Photo by Zdeněk Macháček on Unsplash

Today I’m once again turning the publication over to fellow data professional Ryan Sherby to tell the story of how he built on an existing project of mine to efficiently scrape housing data.

This is the second installment in a two-part series.

“How Much Higher Can We Go?”

Now that we’ve got a taste for flight, it’s time to answer our biggest question. Memoization put us in the sky, but we’re going to the Moon.

If you’re not sure you’ve left the ground just yet, or you’d like to shake the water off your wings, check out Part I of our series. It provides an in-depth guide to understanding and applying memoization.

Let’s warm up our wings with a quick recap of our goal. As always, for a full reading of the code, this Jupyter Notebook is available.

The Flight Path

In total, it took about 8 hours for our alpha program to scrape the data for all US cities with a population greater than 25,000.

After identifying an opportunity to use memoization to reduce the number of calls made to the OpenWeather API, we brought this time down to 5 hours.

We can definitely improve on this, but first, we have to go back to the drawing board.

Runtime by process in milliseconds. Graph generated by the author.

With memoization, we avoided I/O operations as much as possible, but what about when the operations are unavoidable?

Such is the case for our Zillow webpage call. This call needs to be made every time, several thousand times in total. Any solution, then, must accept that this call will always be executed.

However, nothing requires that each call begin only after the previous call has finished. That requirement is often silently assumed because it conforms nicely to the generally sequential nature of most programs.

For example, take the default processing order for a series of nested ‘for’ loops.

for i in range(1, 4):
    for j in range(1, 4):
        for k in range(1, 4):
            print(i, j, k)

This logic is traversed sequentially, with an index (i, j, k) only progressing to its next value upon completion of the loops nested within it. This can be seen visually when we examine the output.

Sample Output. Generated by the author.
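For reference, the first few lines of that output look like this. The innermost index k cycles fully before j advances, and j cycles fully before i advances:

1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
...
3 3 3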

For our small example, there would be no benefit to avoiding sequentiality because the CPU is busy for the program's entire (tiny) runtime. But the I/O operations executed by our web scraper are not performed continuously; there are many pauses as we wait for each response.

In theory, we should be able to use this downtime to prime other I/O operations for execution. The next step is finding the logical solution to implement this in practice.

Multi-Threading

A thread represents a sequence of instructions that is passed to your OS for scheduling and execution. Most operations are executed using a single thread, and that includes the majority of Python programs.

In Python, the Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time. It exists as a simple way to protect the interpreter's internal state and prevent undefined behavior.

However, the GIL says nothing about splitting a program's work across multiple threads. Each of those threads can run as soon as resources are available, and a thread releases the GIL whenever it blocks on I/O, which is exactly why threads can still speed up an I/O-bound program.

Multi-threading is the ability to hand the CPU to a different thread any time the current thread frees up its resources.

Whenever there is downtime on the current thread, another thread is scheduled to absorb the available resources. This allows us to squeeze every last bit of runtime optimization out of our program.
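To make the idea concrete, here is a minimal sketch (mine, not from the original notebook) that uses time.sleep() as a stand-in for network latency. Run sequentially, five one-second waits take about five seconds; run through a thread pool, the waits overlap and the batch finishes in roughly one second on a typical machine:

import time
import concurrent.futures

def fake_request(i):
    time.sleep(1)  # stand-in for waiting on a server response
    return i

# Sequential: each "request" waits for the previous one to finish (~5 s)
start = time.perf_counter()
sequential = [fake_request(i) for i in range(5)]
print(f"sequential: {time.perf_counter() - start:.1f}s")

# Threaded: the waits overlap, so the whole batch takes ~1 s
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as executor:
    threaded = list(executor.map(fake_request, range(5)))
print(f"threaded: {time.perf_counter() - start:.1f}s")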

When considering multi-threading as a way to improve a program’s performance, it is important to consider:

Are the operations selected for multi-threading independent?

Multi-threading works best when the operations grouped together are independent. If the operations depend on each other, a thread may need to wait for other threads to complete before it can move on, which erodes the benefit. Each thread should be able to do its work without relying on the successful completion of any other thread in the pool.

Does the program have a lot of idle time during which other threads could execute useful work?

Multi-threading is extremely effective in programs where there is unavoidable idle time. Pauses in the current main thread free up resources for the execution of new threads. This allows the program to maximize throughput by utilizing the CPU during these idle periods to execute other useful work concurrently.

This makes the call to the Zillow web page an excellent candidate for multi-threading.

  • Each call is independent and does not depend on the success or failure of the other calls
  • As an I/O operation, the calls have a lot of idle time as they wait for the response from the server to continue their operations

While each call is independent, the current process is not structured in a way that highlights their independence. We need to restructure the process so that independent actions are grouped together.

This approach relies on a built-in library called concurrent.futures, so we will import it here.

import concurrent.futures

The first grouping of independent actions will be the creation of the URLs. We will design a function to modularize this process to avoid duplicating work.

def generate_urls(city_states, property_type: str, page_max_dict: dict = None):
    urls = []
    for city, state in city_states:
        try:
            page_max = page_max_dict[(city, state)]
        except (TypeError, KeyError):
            # no dict passed, or no page max recorded for this city-state
            page_max = 1
        for page in range(2, page_max + 2):
            urls.append(((city, state), ZILLOW_HOMES_URL.format(
                city=city.lower(),
                state=state.lower(),
                property_type=property_type,
                page=page)
            ))
    return urls

This function dynamically generates page_max URL(s) for each city, using the (city, state) tuple as the key into page_max_dict and defaulting to a single page when no entry is found. It returns a list of tuples pairing the city-state information with the requested URL.
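For a sense of what this produces, here is a quick usage sketch. The ZILLOW_HOMES_URL template and the sample inputs below are hypothetical stand-ins; the real template is defined in the notebook:

# Hypothetical template; the real one lives in the notebook
ZILLOW_HOMES_URL = "https://www.zillow.com/{city}-{state}/{property_type}/{page}_p/"

city_states = [("Austin", "TX"), ("Houston", "TX")]
page_max_dict = {("Houston", "TX"): 3}

for city_state_tup, url in generate_urls(city_states,
                                         property_type='homes',
                                         page_max_dict=page_max_dict):
    print(city_state_tup, url)
# ("Austin", "TX") yields one URL; ("Houston", "TX") yields three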

This next function will use these URLs to make the actual requests of the Zillow webpage.

def get_soup(city_state_tup, url):
    while True:

        header = # RANDOMIZED HEADER
        req = requests.get(url,
                           headers=header)
        req_text = req.text
        soup = BeautifulSoup(req_text, 'html.parser')

        # ERROR HANDLING ...

        break
    return soup

This function is pretty simple and just returns the “soupified” version of the HTML response. It is applied to a single URL rather than a list of them.
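If you want to see those placeholders filled in, here is one possible, fully runnable version. The USER_AGENTS pool and the retry-then-give-up error handling are my own assumptions, not the notebook's exact logic:

import random
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical pool of user agents to rotate through; swap in your own
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def get_soup(city_state_tup, url):
    for attempt in range(3):  # a few retries, then give up
        header = {"User-Agent": random.choice(USER_AGENTS)}  # randomized header
        req = requests.get(url, headers=header, timeout=30)
        if req.status_code == 200:
            return BeautifulSoup(req.text, 'html.parser')
        time.sleep(random.uniform(1, 3))  # back off before retrying
    raise RuntimeError(f"failed to fetch {url} for {city_state_tup}")

Raising on repeated failure is fine here because, as we will see next, the thread pool wrapper catches and skips any request that errors out.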

This next function is what actually loops through the URLs to apply the get_soup() function. In combination with that function, these requests make up our second grouping of independent actions.

def threaded_request(func, urls):
    soups = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # map each submitted future back to its city-state so we can tell
        # which city a response belongs to once the futures complete
        future_to_url = {executor.submit(func, city_state_tup, url): city_state_tup
                         for (city_state_tup, url) in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            city_state_tup = future_to_url[future]
            try:
                soups.append((city_state_tup, future.result()))
            except Exception:
                # a failed request is simply skipped
                pass
    return soups

This function makes use of the library we imported earlier to create an instance of the ThreadPoolExecutor() class. That class's submit() method, which accepts the target function as its first argument and that function's parameters as the remaining arguments, is used to schedule the passed function func for execution on a thread.

The function then loops over the futures as they complete and appends each result to soups. The associated city-state is tracked with city_state_tup, a value that has been carried along from URL creation to the actual request.
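One detail worth internalizing: as_completed() yields futures in the order they finish, not the order they were submitted. A toy example (mine, not from the notebook) makes this visible:

import time
import concurrent.futures

def slow_echo(label, delay):
    time.sleep(delay)  # pretend this is a network round trip
    return label

with concurrent.futures.ThreadPoolExecutor() as executor:
    future_to_label = {
        executor.submit(slow_echo, label, delay): label
        for label, delay in [("San Antonio", 2), ("Houston", 1)]
    }
    for future in concurrent.futures.as_completed(future_to_label):
        # "Houston" prints first even though it was submitted second
        print(future_to_label[future], future.result())

This is exactly why we carry city_state_tup through the future_to_url dictionary rather than relying on the order of the results.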

When we put this all together, we get a very clean web scraper with obvious groupings.

# main.py

city_states: list[tuple[str, str]]  # (CITY, STATE) TUPLES FOR ALL TARGET CITIES
l = []  # ACCUMULATED HOME RECORDS
h = {}  # SINGLE HOME RECORD
page_max_dict = {}

page_max_urls = generate_urls(city_states, property_type='homes')
page_max_soups = threaded_request(func=get_soup, urls=page_max_urls)
for city_state_tup, soup in page_max_soups:
    # GET THE PAGE MAX
    page_max_dict[city_state_tup] = # PAGE MAX

main_urls = generate_urls(city_states=city_states,
                          property_type='homes',
                          page_max_dict=page_max_dict)
main_soups = threaded_request(func=get_soup, urls=main_urls)
for city_state_tup, soup in main_soups:
    # REQUEST THE PAGE DATA
    for home in homes:
        # EXTRACT PRICE DATA
        # EXTRACT SPACE DATA
        # EXTRACT ADDRESS DATA
        # REQUEST GEO-COORDINATE DATA
        l.append(h)
        h = {}

It is important that each city-state is tracked alongside its URL(s) with the city_state_tup variable, as we need a way to determine which city-state a URL response belongs to.

In our old program, this relationship was preserved by sequentiality; the response for a URL was immediately passed to the next function to request the associated information. For example, the page max for San Antonio would be immediately passed to the function that requests that number of pages for San Antonio.

But with multi-threading, we never know the order in which the threads will be scheduled. The first thread executed in the first pool could be the page max call for San Antonio, and then when we begin requesting the actual pages with the main call, the first thread executed in that next pool could be Houston.

A Second Feather Is All The Better

When we run our new multi-threaded program, we get an average runtime of 12.5 seconds for a large city like Houston. This represents a decrease in processing time of over 75%!

If we apply this percentage to the previous average runtime of 40 seconds, we find that an average run should take less than 10 seconds.

We should be able to pull all the data for our chosen cities in only 2 hours. Woohoo!

But alas… we've accidentally soared right past the Moon and into the Sun: this many concurrent requests is exactly the kind of traffic that got my IP address black-listed by Zillow. To avoid going too high, I would recommend keeping your scrape to a single city at a time; a gentler throttling option is sketched below.
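If you do decide to push further, one gentler option (my own addition, not part of the original notebook) is to cap the size of the thread pool and add a small randomized delay before each request:

import random
import time
import concurrent.futures

def polite_get_soup(city_state_tup, url):
    # jittered pause so the requests don't all land at once
    time.sleep(random.uniform(0.5, 2.0))
    return get_soup(city_state_tup, url)

def throttled_request(func, urls, max_workers=4):
    # same pattern as threaded_request(), but with a bounded pool size
    soups = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(func, city_state_tup, url): city_state_tup
                         for (city_state_tup, url) in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            city_state_tup = future_to_url[future]
            try:
                soups.append((city_state_tup, future.result()))
            except Exception:
                pass
    return soups

main_soups = throttled_request(func=polite_get_soup, urls=main_urls)

You trade back some speed, but you also trade away the part where Zillow stops answering your requests entirely.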

Nonetheless, I’m proud of you for being so bold, and I hope you can use what we’ve learned to (carefully) go even higher.

And if you find that you really need a lot of housing data at scale, check out the RedfinScraper library.

That’s all for now. Thanks for flying with me!

If you enjoyed this article, please consider following me and this publication to stay updated.
