How to find good apartment deals on Craigslist
Predicting rental prices with linear regression
A year ago my partner and I were looking for a new apartment. We weren’t looking for anything fancy, just a one-bedroom that allowed dogs, had an outdoor space for a garden, and didn’t cost an arm and a leg.
If you live in the Bay Area, or likely any hub city in the U.S., you know where this story is going: multiple home viewings in which you wait in a line of prospective renters all vying for the landlord’s attention and favor.
I have never hated perfectly nice strangers more than in this moment.
The competitive nature of apartment hunting means we as seekers need to be ready to act immediately to secure a lease. Speed comes at the expense of careful evaluation and assurance that what we’re getting is a fair deal. With this in mind, I wanted to create a predictive model that would help renters like myself better and more quickly assess their options.
For this project, I scraped a little over 3,000 San Francisco apartment/housing listings on Craigslist and used linear regression to predict rental prices based on a set of standard features available on most postings. I’ll share a high-level outline of my web scraping, data cleaning, and preparation processes, and then get into my approach for building and evaluating the model.
All of the code and details of this project can be found on my GitHub repository.
Anatomy of a Craigslist post
For those new to apartment hunting on Craigslist, let’s take a quick look at a pretty typical post:
Craigslist listings can vary wildly as posters are only required to provide a title, rent amount, and fill out an open field text box. However, there are additional standard fields that can be utilized: location (for most major cities, there are neighborhood options), square footage, number of bedrooms and bathrooms, and then some basic information about amenities offered like laundry facilities, whether or not pets are allowed, and parking.
The consistency of these fields, both in phrasing and placement on the webpage, means they can be easily scraped and used as features in a model.
Web scraping Craigslist
When I first started this project and was searching for information on scraping Craigslist, I came across this great tutorial for scraping the results listings pages. This article was a helpful start, but I knew I’d need to go into each individual posting to get additional features like number of bathrooms and amenities.
The scraping process:
- Scrape the first page of listings results for date posted, title, url, rent amount, square footage, number of bedrooms, and neighborhood (location) and add it to a pandas DataFrame. There are usually about 120 listings on each page.
- Scrape each of those 120 urls from that listing results page to extract number of bathrooms and the amenities, and append those features to the DataFrame.
- Move to the next results listing page and repeat until all the results pages and individual posts have been scraped.
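The three steps above can be sketched as a single loop. This is only an outline under stated assumptions: `fetch_results_page` and `fetch_post_details` are hypothetical callables standing in for the actual request-and-parse code, and the page size of 120 comes from the typical count mentioned above.

```python
from typing import Callable, Dict, List, Optional

Listing = Dict[str, object]


def scrape_all(
    fetch_results_page: Callable[[int], List[Listing]],
    fetch_post_details: Callable[[str], Listing],
    page_size: int = 120,
    max_pages: Optional[int] = None,
) -> List[Listing]:
    """Walk the results pages and merge each listing with its post details."""
    rows: List[Listing] = []
    offset, pages_done = 0, 0
    while max_pages is None or pages_done < max_pages:
        page = fetch_results_page(offset)  # step 1: one page of summary rows
        if not page:  # an empty page means we ran out of results
            break
        for listing in page:
            # step 2: visit the individual post for bathrooms and amenities
            details = fetch_post_details(str(listing["url"]))
            rows.append({**listing, **details})
        offset += page_size  # step 3: move to the next results page
        pages_done += 1
    return rows
```

The returned list of dictionaries can be passed straight to `pandas.DataFrame(rows)`. Passing the two fetchers in as arguments keeps the loop easy to test without hitting the network.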
Here’s one of the helper functions to compile a results page worth of listings:
Data cleaning and preparation
With the listings safe and sound in a DataFrame, the next step is to clean and prepare the data for modeling:
- remove duplicates (in this scrape, there were a little over 600)
- format bedrooms and bathrooms (e.g. convert bathroom entries like ‘splitBa’ to 1)
- parse through the list of amenities and create distinct categories for laundry, pets, and parking
- reduce number of locations to a workable number of categories
- and finally, drop rows with missing or bad data (e.g. wildly incorrect square footage figures, listings for apartments that are not in San Francisco, etc.), and outliers (e.g. a listing for an apartment that was over $25k/month! Who are you and why are you looking on Craigslist anyways?)
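A few of these steps can be sketched in pandas. The toy rows, the column names, the choice to de-duplicate on `url`, and the `MAX_RENT` cutoff are all assumptions made for illustration, not the exact rules used in the project.

```python
import pandas as pd

# Toy rows standing in for scraped listings; every value here is made up.
raw = pd.DataFrame(
    {
        "url": ["a", "a", "b", "c", "d"],
        "price": [3200, 3200, 2800, 26000, 3500],
        "sqft": [700, 700, 650, 900, None],
        "bathrooms": ["1", "1", "splitBa", "2", "1.5"],
    }
)


def clean_bathrooms(value: str) -> float:
    """Convert the bathroom field to a number; 'splitBa' counts as 1."""
    return 1.0 if value == "splitBa" else float(value)


MAX_RENT = 25_000  # assumed cutoff, suggested by the $25k/month outlier

cleaned = (
    raw.drop_duplicates(subset="url")  # the same post scraped twice
    .assign(bathrooms=lambda df: df["bathrooms"].map(clean_bathrooms))
    .dropna(subset=["sqft"])  # rows missing square footage
    .loc[lambda df: df["price"] < MAX_RENT]  # extreme price outliers
    .reset_index(drop=True)
)
```

In this toy example only listings `a` and `b` survive: one duplicate, one row without square footage, and one price outlier are dropped.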
Here’s an example of searching through the amenities list to parse for pet restrictions:
    def pets_allowed(amen_list):
        # Check each tag separately: ('a' and 'b') in x only tests 'b'
        if 'dogs are OK - wooof' in amen_list and 'cats are OK - purrr' in amen_list:
            return '(a) both'
        elif 'dogs are OK - wooof' in amen_list:
            return '(b) dogs'
        elif 'cats are OK - purrr' in amen_list:
            return '(c) cats'
        return '(d) no pets'

    sf['pets'] = sf['amens_list'].apply(pets_allowed)
As I mentioned earlier, Craigslist allows posters to assign the location of their offering from a list of standard neighborhoods. For San Francisco, there are about 35 unique neighborhoods to choose from. Posters also have the option of using an open field text box to provide their own location details. As a result, there were over 100 unique values for location in the DataFrame. Many were duplicates (e.g. ‘soma’, ‘SoMa’, ‘SOMA’, and ‘South of Market’), and many were not in San Francisco (sometimes not even in the Bay Area).
The San Francisco Association of Realtors has provided a map dividing the city into 10 districts, and Data SF created an interactive version that allows users to easily find a district based on smaller neighborhood designations.
With this map as my guide, I was able to reduce the number of unique location values to these 10 districts, which would significantly reduce the number of dummy variables used in the model later.
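One way to collapse the free-text locations is a lookup table keyed on normalized text. The sketch below is hypothetical: `DISTRICT_BY_ALIAS` and `to_district` are illustrative names, and the district labels are placeholders rather than assignments read off the actual map.

```python
from typing import Optional

# Hypothetical lookup from cleaned-up location text to a district label.
# The district assignments are placeholders, not taken from the real map.
DISTRICT_BY_ALIAS = {
    "soma": "district_a",
    "south of market": "district_a",
    "inner sunset": "district_b",
}


def to_district(location: Optional[str]) -> Optional[str]:
    """Return a district label, or None when the location is unknown."""
    if not location:
        return None
    key = " ".join(location.lower().split())  # normalize case and whitespace
    return DISTRICT_BY_ALIAS.get(key)
```

With this in place, ‘soma’, ‘SoMa’, ‘SOMA’, and ‘South of Market’ all land in the same category, and anything that maps to `None` (unrecognized or outside the city) can be reviewed or dropped.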
After a thorough scrub, the dataset was whittled down to just under 1,000 rows and 8 features that could be used to help predict rental prices.
Model build, results, and evaluation
Exploratory data analysis
Before modeling, it’s important to explore the data and understand the relationships between possible features and the target variable.
Below, we can see that the type of laundry facilities offered in these listings is positively correlated with price, with more desirable options associated with higher rents.
Similarly, parking availability may also command a premium. As anyone in the Bay Area knows, it’s not a matter of if your car gets broken into — it’s when.
Surprisingly, building type (whether it was a single-residence home or a multi-unit building such as a duplex or apartment building) was not strongly correlated with price, so this feature didn’t make it into the model.
Feature selection and engineering
The final features used in the model could be categorized as:
- continuous (sqft)
- discrete (number of bedrooms, bathrooms)
- ordinal (parking, laundry), and
- categorical (pet restrictions, neighborhood district)
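As a rough sketch of how these four feature types might be encoded, the ordinal columns can be mapped to ranked integers and the categorical ones one-hot encoded. The category labels and their ordering in `LAUNDRY_ORDER` and `PARKING_ORDER` are assumptions, as are the toy rows.

```python
import pandas as pd

# Assumed category labels, ordered from least to most desirable.
LAUNDRY_ORDER = {"no laundry": 0, "laundry in bldg": 1, "w/d in unit": 2}
PARKING_ORDER = {"no parking": 0, "street parking": 1, "garage": 2}

listings = pd.DataFrame(
    {
        "sqft": [650, 900, 720],
        "bedrooms": [1, 2, 1],
        "bathrooms": [1.0, 1.5, 1.0],
        "laundry": ["no laundry", "w/d in unit", "laundry in bldg"],
        "parking": ["street parking", "garage", "no parking"],
        "pets": ["(a) both", "(d) no pets", "(b) dogs"],
        "district": ["district_a", "district_b", "district_a"],
    }
)

# Ordinal features become integers that preserve their ranking.
features = listings.assign(
    laundry=listings["laundry"].map(LAUNDRY_ORDER),
    parking=listings["parking"].map(PARKING_ORDER),
)
# One-hot encode the unordered categories, dropping one level of each
# so the dummy columns are not perfectly collinear.
features = pd.get_dummies(features, columns=["pets", "district"], drop_first=True)
```

Continuous and discrete columns pass through unchanged. With 10 districts, this kind of encoding adds 9 dummy columns instead of the 100 or so that the raw location values would have produced.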
To improve the model’s predictive power, I used sklearn’s PolynomialFeatures to generate interaction terms among the features. I then applied Ridge and Lasso regularization to reduce complexity, prevent overfitting, and tease out which features had the biggest impact on rental price.
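A minimal sketch of that setup on synthetic data is shown below. The `interaction_only=True` setting, the `StandardScaler` step, and the `alpha` values are assumptions rather than the project's actual configuration, and `build_model` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: columns are sqft, bedrooms, bathrooms.
X = np.column_stack(
    [rng.uniform(400, 1500, 200), rng.integers(0, 4, 200), rng.integers(1, 3, 200)]
)
y = 1500 + 1.8 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 200, 200)


def build_model(regressor):
    """Pairwise interaction terms, scaling, then a regularized linear model."""
    return make_pipeline(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        StandardScaler(),  # penalties are sensitive to feature scale
        regressor,
    )


ridge = build_model(Ridge(alpha=1.0)).fit(X, y)
lasso = build_model(Lasso(alpha=1.0, max_iter=10_000)).fit(X, y)

# Lasso can shrink coefficients to exactly zero, which hints at which
# terms matter; Ridge only shrinks them toward zero.
n_terms = lasso[-1].coef_.size
n_dropped = int(np.sum(lasso[-1].coef_ == 0))
```

Three input features expand to six terms here (the three originals plus three pairwise products), which is why regularization becomes important as the feature count grows.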
After a rigorous K-fold cross-validation process, both models scored similarly on R-squared. The Ridge model, however, scored more consistently between the training and validation sets on root mean squared error (RMSE) and mean absolute error (MAE), so I moved forward with Ridge for the final model.
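A comparison along those lines could look roughly like this, using sklearn’s `cross_validate` with several metrics at once. The data is synthetic, and the five folds and `alpha` values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))  # synthetic features
y = X @ np.array([400.0, 250.0, 0.0, 0.0, 100.0]) + 3000 + rng.normal(0, 150, 150)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}

summary = {}
for name, model in {"ridge": Ridge(alpha=1.0), "lasso": Lasso(alpha=1.0)}.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring, return_train_score=True)
    summary[name] = {
        # sklearn negates error metrics so that higher is always better
        "train_rmse": -scores["train_rmse"].mean(),
        "val_rmse": -scores["test_rmse"].mean(),
        "val_mae": -scores["test_mae"].mean(),
        "val_r2": scores["test_r2"].mean(),
    }

for name, row in summary.items():
    gap = row["val_rmse"] - row["train_rmse"]
    print(f"{name}: val RMSE {row['val_rmse']:.0f}, train/val gap {gap:.0f}")
```

A small gap between training and validation error is the kind of consistency described above; a large gap would point to overfitting.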
Final model and results
After fitting the model on the full training set, it scored reasonably well on the test set:
Root Mean Squared Error (RMSE): $477
Mean Absolute Error (MAE): $366
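For readers less familiar with these two metrics, here is a small worked example of how each is computed. The rents and predictions are made up purely to show the arithmetic.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: the average miss in dollars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up rents and predictions, only to show the arithmetic.
actual = [3000, 3500, 2800, 5200]
predicted = [3100, 3300, 2900, 4400]

print(f"MAE:  ${mae(actual, predicted):.0f}")   # MAE:  $300
print(f"RMSE: ${rmse(actual, predicted):.0f}")  # RMSE: $418
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss badly, as the last listing does in this toy example.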
The predicted prices are plotted against the actual values below:
The fitted prediction line aligns almost perfectly with the identity line and starts to break away somewhere around the $4,500/month range. The residual plot provides a better view:
As most of the listings in the sample fell in the $2,500 to $4,000/month range, it seems reasonable that the model performs best in this window. The distribution of prices for the full sample can be seen below:
Final thoughts and suggestions for future work
Ultimately, I think the model does a pretty good job of predicting rental prices up to around $4,000/month. Whether the ~$400 margin of error (per RMSE and MAE) is tolerable depends on the rental price and on the seeker’s budget.
For example, a $400 delta might be significant for a single person, but may be completely acceptable when splitting between two or more people.
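To put numbers on that, the error can be expressed both as a share of the rent and as a per-person amount. The `error_in_context` helper and the $3,200 rent are hypothetical, chosen only for illustration.

```python
from typing import Tuple


def error_in_context(error: float, rent: float, renters: int = 1) -> Tuple[float, float]:
    """Return (error as a fraction of rent, error in dollars per renter)."""
    return error / rent, error / renters


share, per_person = error_in_context(error=400, rent=3200, renters=2)
print(f"{share:.1%} of rent, ${per_person:.0f} per person")  # 12.5% of rent, $200 per person
```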
Still, there are likely other factors influencing rental prices:
- rent control and history (especially in the Bay Area)
- other amenities offered but not standard in Craigslist postings (e.g. fitness facilities, common spaces, bike storage, etc.)
- other environmental or economic factors
On the last point, we’re currently seeing an unprecedented decline in San Francisco rents due to COVID migration. Those moving out of the city may be people who have the means to move and work remotely during the pandemic. Thus, the sample of the listings scraped in early October might not be the best representation of Craigslist’s normal offerings.
Suggestions for future work
I would love to take this project further and improve the model’s predictive power. Here are some ideas for future work:
- sample again once COVID’s influence subsides
- increase scope of scraping methods to find/create additional features (search body of text for keywords like ‘gym’ or ‘backyard’ or find information about security deposit and leasing requirements)
- implement image processing tools to evaluate a unit’s quality through posted photos
- scam post identification
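The keyword idea is the simplest of these to prototype. A minimal sketch, assuming the post body has already been scraped as plain text (`keyword_flags` is a hypothetical helper, and the keyword list only contains the two examples mentioned above):

```python
import re
from typing import Dict, Iterable

KEYWORDS = ("gym", "backyard")  # the two example keywords from the list


def keyword_flags(body: str, keywords: Iterable[str] = KEYWORDS) -> Dict[str, bool]:
    """Flag each keyword that appears as a whole word, ignoring case."""
    return {
        kw: re.search(rf"\b{re.escape(kw)}\b", body, flags=re.IGNORECASE) is not None
        for kw in keywords
    }


flags = keyword_flags("Sunny one-bedroom with a shared Backyard.")
# flags == {"gym": False, "backyard": True}
```

Each flag could become a binary feature in the model. Matching on word boundaries avoids false hits such as ‘gym’ inside ‘gymnasium’, though it will also miss variants like ‘back yard’.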
I’d also love to hear from you if you have additional ideas for making this model more precise or if you’re interested in adapting it to your own city or town — let’s talk!