Web scraping and storing unstructured data: An example of scraping Apartments.com
A combination of web scraping and a document database in Python offers a convenient solution for collecting and storing unstructured information from websites.
Large quantities of data are available on the internet almost for free and can potentially be used to generate valuable insights in various domains. However, such data are often available only in an unstructured format, and downloading and storing them remains a challenging task. While web scraping from many websites may not be difficult, a reliable, automated web scraper must be able to quickly store the information without raising errors, in such a way that the data can be cleaned and processed later.
Earlier, I used only a SQL database to save the downloaded data, which meant I had to force all the information into a tabular format with known column names. This practice worked when the information was always available in a table format. However, when I tried storing property information from a listings site such as Apartments.com, the script frequently stopped due to mismatches in the data format. Using a SQL database for such non-tabular details required constant manual intervention. Then someone suggested using non-relational databases, which are designed for exactly this purpose.
In this blog, I show how a combination of web scraping and a document database in Python can be used to retrieve a large number of property details, which can later be used for a model-building exercise. The overall steps are displayed in the flow chart and explained in the following sections.
Generating base URLs
Apartments.com displays property listings for individual cities. Each city has a unique URL, which contains the name of the city as well as the state it is located in. For example, all the rentals in New York City are posted at https://www.apartments.com/new-york-ny/. It may also be noted that the URL https://www.apartments.com/new york-ny/ (a blank space instead of a dash ‘-’) also seems to work.
The URLs in the above format can be manually generated for a handful of cities by adding the city and state details. However, downloading the property data of many cities can be painful if done by hand. Luckily, there is always a Pythonic way of doing things. Python has the ‘uszipcode’ package, which provides city and state names, among other geographical details, for all the cities in the country. The following code shows how we can obtain the list of URLs that we need to get the data.
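A minimal sketch of how this can be done, assuming the uszipcode SearchEngine exposes a by_state query and a major_city attribute (exact names can vary across package versions):

```python
import pandas as pd
from uszipcode import SearchEngine

STATES = ["NY", "CA", "FL", "IL"]  # extend to all state abbreviations as needed

search = SearchEngine()
records = []
for state in STATES:
    # returns=0 asks for all matching zip codes in the state
    for z in search.by_state(state, returns=0):
        if z.major_city:
            records.append({"city": z.major_city, "state": state})

# Keep one row per (city, state) pair and build the Apartments.com base URL
geo = pd.DataFrame(records).drop_duplicates().reset_index(drop=True)
geo["url"] = ("https://www.apartments.com/"
              + geo["city"].str.lower().str.replace(" ", "-")
              + "-" + geo["state"].str.lower() + "/")
```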
The following block prints the shape and the first five rows of the geo dataframe built above. Although we have the URLs of all cities in the United States, we will restrict this exercise to exploring the rentals in New York City (NYC).
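For instance, assuming the dataframe from the sketch above is named geo:

```python
print(geo.shape)    # (number of city/state pairs, 3 columns)
print(geo.head())   # city, state, url for the first five rows
```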
Adding rent-range in URLs
The base URL of a city only displays a maximum of 25 properties per page and a total of 28 pages, so we also need to iterate over all pages. Modifying the URLs to include the page number is easy, as illustrated below:
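Based on the page URLs that appear later in this post (e.g. …/1300-to-1500/3/), the page number is simply appended to the URL:

```python
base_url = "https://www.apartments.com/new-york-ny/"
page_url = f"{base_url}2/"   # second page of the NYC listings
```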
However, one can only browse up to 700 properties (25*28) without mentioning any search criteria. For cities with more than 700 listings, we have to narrow down our search so that we can collect details on as many property listings as possible. NYC seems to have roughly 15,000 properties, but it should also be noted that a single advertisement may have multiple sub-units available. This implies that the number of actual available listings could be smaller than 15,000, but still much higher than 700.
I find entering a rent-range to be a decent way of achieving that goal (ensuring that fewer than 700 properties are displayed per search). For instance, a base URL can be further modified to look for properties within a small range of rents (increasing in steps of $200 up to, say, $15,000). Below is a code snippet to show how this can be done:
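A sketch of how the rent-range URLs can be generated; the exact lower bound and upper limit are assumptions:

```python
base_url = "https://www.apartments.com/new-york-ny/"

# Rent ranges in steps of $200, e.g. .../1300-to-1500/, up to roughly $15,000
rent_urls = []
for low in range(100, 15000, 200):
    rent_urls.append(f"{base_url}{low}-to-{low + 200}/")
```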
The output of the above code is shown below:
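With the $200 steps assumed in the sketch above, the list looks roughly like this:

```
['https://www.apartments.com/new-york-ny/100-to-300/',
 'https://www.apartments.com/new-york-ny/300-to-500/',
 ...
 'https://www.apartments.com/new-york-ny/1300-to-1500/',
 ...
 'https://www.apartments.com/new-york-ny/14900-to-15100/']
```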
The above exercise must be done for each base URL where the city has 700 or more listings (checked via the proxy indicator of the number of pages being equal to 28). In the case of fewer than 28 pages, this step can be skipped.
While this is not perfect, it still allows us to scrape as many listings as possible. It is possible that a property gets displayed under multiple rent categories, but because each property has a unique id, we can simply check whether we have already collected the information on that property and avoid repetition, as described in the next sections.
Adding page number in URLs
Once we have narrowed our search by incorporating the rent information in the base URL of the selected city (in this case NYC), we can begin our queries. Let’s start with the link https://www.apartments.com/new-york-ny/1300-to-1500/. This link directs us to the first page by default.
The first page of any URL shows the total number of pages (or the page range), which tells us how many page URLs we have to visit for that rent-range. As illustrated below, for this URL there are 12 pages. We can use this information to create page-specific URLs for that price range.
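A sketch of how the page count can be read and the page URLs generated, using requests and BeautifulSoup. The ‘pageRange’ class name and the User-Agent header are assumptions about the site’s markup and behaviour and may need adjusting:

```python
import re
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # the site tends to block default clients
url = "https://www.apartments.com/new-york-ny/1300-to-1500/"

soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
page_range = soup.find("span", class_="pageRange")            # e.g. "Page 1 of 12"
n_pages = int(re.findall(r"\d+", page_range.text)[-1]) if page_range else 1

page_urls = [f"{url}{page}/" for page in range(1, n_pages + 1)]
```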
Using document database
A document database differs from a typical relational database (e.g., SQL). A document database has several collections (analogous to tables in SQL), and each collection can have several documents (analogous to rows in SQL) with a flexible schema. A relational database, on the other hand, requires normalization, which means that data must be organized in a pre-defined structure for data integrity and storage efficiency.
While relational databases have many advantages in terms of being consistent and intuitive, they are a poor fit when the incoming data has no fixed key/value structure and keeps changing in one way or another, which is often the case during web scraping. In this blog, I use MongoDB, a popular document database, to store the property details.
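Connecting to MongoDB from Python is straightforward with pymongo; a minimal sketch, assuming a local MongoDB instance and arbitrary database/collection names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["apartments"]        # database
listings = db["new_york_ny"]     # collection: one document per property
```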
Retrieving property information
Each search URL (with or without the fine-tuned search criteria) only provides basic information about each rental advertisement, which includes the property URL. The property URL is the dedicated webpage of that property and has all the details, such as amenities, rent, availability of sub-units, location, etc. The URL https://www.apartments.com/new york-ny/1300-to-1500/3/ has 25 property listings, and the following code block collects the details of each property.
The following snippet first scrapes the basic information of all 25 advertisements, and then it iterates over each listing only if its details have not already been scraped:
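A sketch of that logic, reusing the listings collection from the MongoDB snippet above. The ‘placard’ article tag and the data-listingid / data-url attributes are assumptions about the page markup and may need adjusting:

```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
page_url = "https://www.apartments.com/new-york-ny/1300-to-1500/3/"

soup = BeautifulSoup(requests.get(page_url, headers=headers).text, "html.parser")

for card in soup.find_all("article", class_="placard"):
    listing_id = card.get("data-listingid")
    property_url = card.get("data-url")
    if not listing_id or listings.find_one({"_id": listing_id}):
        continue  # skip listings we have already stored

    # Visit the property's own page and parse whatever details are present
    prop_soup = BeautifulSoup(requests.get(property_url, headers=headers).text,
                              "html.parser")
    details = {
        "_id": listing_id,
        "url": property_url,
        "name": prop_soup.title.text.strip() if prop_soup.title else None,
        # ... rent, amenities, schools, transportation, etc. parsed here ...
    }
    listings.insert_one(details)
```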
The individual property details comprise the address, rent, number of bedrooms/bathrooms, pet policy, amenities, features, schools, colleges, transportation, and whatever other such details are available. An example is displayed below:
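An illustrative (made-up) document, just to show the kind of nesting involved; the actual fields and values depend on the listing:

```python
{
    "_id": "abc123",
    "name": "Example Apartments",
    "address": {"street": "123 Example St", "city": "New York", "state": "NY"},
    "rent": {"studio": "$1,400", "1 bed": "$1,500 - $1,650"},
    "bedrooms": [0, 1],
    "pet_policy": {"cats": "allowed", "dogs": "not allowed"},
    "amenities": ["Laundry", "Elevator", "Fitness Center"],
    "schools": [{"name": "PS 1", "grades": "K-5"}],
    "transportation": [{"type": "subway", "distance": "0.3 mi"}],
}
```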
The data has a key/value structure, but the value can be a number, string, dictionary, or list. Imagine storing such information in a tabular form. As there is no uniform pattern or ‘structure’ in the data, converting it into a table format would be a nightmare. Due to the ‘nested’ categorization of the data and details, a SQL schema or a dataframe keeps raising errors due to mismatches in column names or values (I tried it, and it was messy). A document database provides a suitable schema for storing such information.
Cost of housing
Because we have worked so hard to get here, we should explore at least some basic details as our reward. Using the script I just described, I have downloaded the data for New York City, Los Angeles, Miami, and Chicago. Each city has roughly 10,000–12,000 properties listed. The maximum rent distribution is shown below, with markers on the x-axis indicating the simple average of the upper range of the rents. Chicago appears to be the cheapest place to live, whereas LA remains the costliest among the four.
Scheduling the script
While downloading the data even once has significant benefits, the ‘schedule’ package in Python allows you to repeat the task every day at a particular time. Storing the property details for a city over a period of time can then be used for trend or time-series analyses. A minimal sketch is shown below; read the package documentation to explore more options.
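A minimal sketch, assuming the third-party ‘schedule’ package (pip install schedule) is the scheduler in question:

```python
import time
import schedule

def scrape_job():
    ...  # run the scraping routine described in the sections above

schedule.every().day.at("02:00").do(scrape_job)  # run daily at 2 AM

while True:
    schedule.run_pending()
    time.sleep(60)
```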
Concluding thoughts
Obtaining data is only an initial step towards modeling and drawing meaningful insights. A large variety of data continues to remain unstructured. This blog shows how one can easily download unstructured property information and save it conveniently in a document database. The relevant information can then be retrieved from the database as and when required for modeling. In an upcoming blog post, I will use the data from a few large cities to see whether there are any interesting insights that we can find in the property data.