Housing Prices and Affordability: A Comprehensive Analysis for Informed Home Buying Decisions

James van Doorn
INST414: Data Science Techniques
Dec 18, 2023

By: James van Doorn, Colin Clifford, Amaar Mir, and Tsegaye Umer

Decision Supported & Actionable Insight

The problem we are solving is the difficulty prospective homebuyers face in finding real estate that suits their needs. We plan to address this by predicting real-estate prices across neighborhoods, or clusters of similar houses, so that a homebuyer can assess whether a particular neighborhood or type of house is likely to be a good long-term investment.

To assist prospective homebuyers, our main approach is to compile the available information on housing costs and compare it to the costs faced by previous generations. This not only shows how prices have changed, but also surfaces the parameters that explain why those changes are occurring. A second insight is identifying which factors matter most in driving prices up or down; we can use this information to predict trends for locations where those same parameters are present and could cause further depreciation. A final insight comes from examining how the costs of comparable properties have appreciated or depreciated over time, which speaks to a property's long-term investment value. Together, this information allows us to build a model that informs home buyers and helps them reach a conclusion based on all of the factors presented.

As technology continues to improve, people interested in purchasing real estate have more ways than ever to evaluate whether a specific property is worth buying. Buyers are more educated than ever before, and it is important to provide them with the insights and information they need to make a safe and satisfying purchase. It is no secret that housing costs have fluctuated dramatically over the past couple of decades, so giving potential buyers as much relevant information as possible benefits them immensely.

The main metric we have used to measure the success of this project is the quality of the insights we have generated: we have checked that each insight makes sense and is valid. Another success metric we have kept in mind is how easily our insights and findings can be understood by our target audience. This is still something we are exploring; we will most likely measure it through survey data about housing prices.

Data Exploration & Processing

Initially, we searched for APIs on housing prices; however, we encountered paywalls, so we decided to look for free datasets instead. We found free datasets from Zillow, Kaggle, and the Federal Housing Finance Agency. To provide the most value to our target audience of prospective homebuyers, we decided to look at both aggregate datasets and datasets of individual housing prices and specifications.

The aggregate data can be used to find housing price trends in different areas, so new homebuyers can determine which locations are likely to be the best long-term investments when considering factors such as appreciation. The more granular data can help new homebuyers narrow down more specifically which type and location of housing they should choose. This granular data can also be aggregated so it combines more easily with the already aggregated data we found.

We started our data collection with a Google search and from there we investigated any sites we thought looked useful for our project. Along the way, we have considered how the datasets we found could be used in combination with each other and the analyses that could be done.

Currently, we are storing our data in flat CSV files. We decided on this format because CSV files are relatively easy to read and manipulate with the pandas module in Python.

In Zillow's housing price CSV file, we found that some values were missing and left blank. After filtering to the year range we wanted to analyze, we removed entries with missing values. We visualized the data using a histogram to see whether there was a trend or pattern, and then calculated new mean values for each month's housing prices. The histogram shows a similar pattern in each state: house prices trend upward through 2018–2023. The mean values we added to the data will help us analyze and compare prices across states.
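A minimal sketch of this cleaning step is shown below, assuming the wide Zillow layout in which each month is its own column; the file and column names are assumptions rather than the exact ones in our notebook.

```python
import pandas as pd

# Load the Zillow housing price export (file and column names are assumptions).
zillow = pd.read_csv("zillow_home_values.csv")

# In the Zillow export each month is its own column, labeled with a date string,
# so keep only the months that fall within the 2018-2023 range we analyze.
month_cols = [c for c in zillow.columns
              if c[:4].isdigit() and 2018 <= int(c[:4]) <= 2023]
zillow = zillow[["RegionName", "StateName"] + month_cols]

# Drop regions that are missing any values in the selected range.
zillow = zillow.dropna(subset=month_cols)

# Mean price per month across all regions, used to compare prices over time.
monthly_means = zillow[month_cols].mean()

# Histogram of the most recent month's prices to inspect the distribution.
zillow[month_cols[-1]].hist(bins=50)
```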

Using the pandas module in Python, we explored the Federal Housing Finance Agency (FHFA) and Kaggle real estate datasets. Using df.dropna, we dropped the rows in both datasets where at least one element was missing, so that we could more easily create visualizations and conduct analyses. As with the Zillow dataset, we first limited our scope by removing rows with years prior to 2018. For each dataset, we used pandas.to_datetime to convert the years to datetime format so they could be filtered to only include 2018 through 2023. However, we ultimately decided not to use these datasets, as we did not see them being helpful to our analysis.
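A rough sketch of that filtering step, using an assumed long-format table and a hypothetical 'date' column standing in for the actual FHFA/Kaggle fields:

```python
import pandas as pd

# Illustrative cleaning of a long-format price dataset; the file name and the
# 'date' column are assumptions, not the exact names in the FHFA/Kaggle files.
prices = pd.read_csv("house_price_index.csv")

# Drop rows where at least one element is missing.
prices = prices.dropna()

# Convert the period column to datetime so rows can be filtered by year.
prices["date"] = pd.to_datetime(prices["date"])
prices = prices[(prices["date"] >= "2018-01-01") & (prices["date"] <= "2023-12-31")]
```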

After exploring and testing different datasets and sources over the course of the semester, we settled on two datasets, which we combined using pandas. The first is the Zillow dataset mentioned earlier. The second is a CSV from World Population Review; it is simple, containing only two columns: the names of U.S. states and each state's median household income. This data was offered in both JSON and CSV formats, but we opted for the CSV to stay consistent and to merge it with the CSV dataset from Zillow.

One thing we decided while gathering and building our data collection was that we needed metrics that would be meaningful for our topic. After combining the datasets, we created a new column called affordability by dividing the state average housing price from the Zillow dataset by the state median household income from the World Population Review dataset. This means that the higher the affordability index, the less affordable the median housing price is in a given U.S. state.
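A minimal sketch of how the affordability column can be derived, assuming hypothetical file names, column names, and a shared 'State' key (the real names may differ):

```python
import pandas as pd

# Illustrative merge; file and column names are assumptions, not the exact ones we used.
state_prices = pd.read_csv("zillow_state_averages.csv")        # columns: State, StateAverage
incomes = pd.read_csv("median_household_income_by_state.csv")  # columns: State, MedianIncome

merged = state_prices.merge(incomes, on="State", how="inner")

# Higher index = the median home price is a larger multiple of median income,
# i.e. the state is less affordable.
merged["Affordability"] = merged["StateAverage"] / merged["MedianIncome"]
```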

Key Course Ideas and Rationale

In our project, we built upon many key ideas offered by the course. First and foremost, we built upon the principles of data science, specifically in the discovery of insights. These include:

  • Validity, meaning that the insight should hold up on new data with some certainty. Our data exploration, cleaning, and analysis practices can be applied repeatedly and successfully to any new data that is collected or identified.
  • Usefulness, meaning that the insight should be actionable in some way. First, we identified a topic that we were all interested in, which ended up being housing affordability. We then brainstormed to find the implications of housing affordability on our own lives. As young adults, we hope to buy housing sometime in the not-so-distant future, and that decision will depend on the affordability of housing. Thus, we decided to find out where homes are the most affordable. This allows our target audience to purchase a house that is within their price range.
  • Unexpectedness, meaning that the insight is non-obvious. Without doing an analysis, it is hard to guess where homes would be the most affordable, exactly. In addition, affordability looks different for every person and place. We took this into consideration in carrying out our analysis.
  • Understandability, meaning that humans should be able to interpret the pattern or model providing the insight. In order to set the groundwork for interpretable analysis results, we first cleaned and organized our data.

For this project, since the class used Python to analyze datasets, we used Jupyter Notebooks and Google Colab as our primary platforms for our code and analysis. As in our coursework, we used the pandas library to read, organize, merge, and clean our data. For data visualization, we made use of Matplotlib and Seaborn. To incorporate the course idea of clustering, we also used the scikit-learn library, which assisted in the analysis of house prices and affordability so that we can help our intended audience make an informed decision.

Throughout this project, we have kept in mind the data principles we were taught in class. For example, we considered the concept of “Bad Data,” or data that is skewed, dependent, distorted, or biased. To avoid bad data, we dropped invalid, incomplete, and irrelevant rows and columns from our datasets. The initial structure of our data and the outputs of our analyses did not make the most sense, and we worked to fix these with the data validation and data cleaning practices mentioned previously. On a similar note, we also took into account the five aspects of traditional data quality: accuracy, completeness, uniqueness, timeliness, and consistency.

  • Timeliness: The element of timeliness was the easiest to address, as data often is labeled with its time range. To incorporate timeliness, we limited our data to the past 5 years (2018–2023), in order to generate analyses on only the most recent and therefore relevant data.
  • Consistency: Consistency was another aspect we kept in mind throughout the project. To make our data more consistent, we eliminated rows with missing data and rows in which a zero was filled in where there should have been an empty cell. This allowed us to reduce the degree to which our data was skewed in one way or another.
  • Accuracy: Accuracy was at the forefront of our work. There is a great volume of data on the web; however, its accuracy varies greatly. To find the most accurate data, we searched the web and cross-referenced different sites and data sources, and after some deliberation and discussion, we settled on our datasets.
  • Completeness: For our project, our data’s completeness ties into its timeliness. Considering the last five years gives us assurance that our data has a satisfactory level of completeness.
  • Uniqueness: As we did with accuracy, we searched the web and cross-referenced different sites and data sources to ensure our data was not duplicated or redundant.

In addition to the data principles we documented, the main key idea from the course that we have built upon is clustering, specifically KMeans clustering. We decided to apply this method for a number of reasons. As a machine learning technique, clustering groups similar data points together based on certain features, and this fit our goal of clustering based on prices and incomes. Specifically, we used KMeans clustering to identify patterns and groupings within our dataset, exploring how regions with similar housing prices and income levels cluster together. Another reason we chose clustering is that it allows for easy visualization, and therefore understanding, of the results of different clusterings and analyses. The ultimate goal of our project is to provide actionable insights for new homebuyers, and insights are the most actionable when they are easily understood and communicated.

Analysis Supporting Actionable Insight

Data Cleaning:

We initiated the analysis by carefully cleaning the dataset, removing blank and missing values to preserve data integrity.

Feature Engineering:

A critical enhancement was the introduction of the ‘StateAverage’ column, the average housing price for each state, which enriched our ability to analyze housing price data effectively.
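One way such a column could be built, continuing the earlier Zillow sketch (column names remain assumptions): average each region's monthly prices, then roll those up to a per-state mean.

```python
import pandas as pd

# Continuing the Zillow sketch from earlier (file and column names are assumptions).
zillow = pd.read_csv("zillow_home_values.csv")
month_cols = [c for c in zillow.columns
              if c[:4].isdigit() and 2018 <= int(c[:4]) <= 2023]

# Average the monthly prices for each region, then roll those up to a per-state mean.
zillow["RegionAverage"] = zillow[month_cols].mean(axis=1)
zillow["StateAverage"] = zillow.groupby("StateName")["RegionAverage"].transform("mean")
```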

Incorporating Median Income Data:

For a more comprehensive view, we incorporated median income information specific to each state from a separate dataset.

Affordability Metric:

Affordability was derived by dividing the ‘StateAverage’ by each state's median income. This standardized metric provided a clear measure of affordability across diverse regions.

Scatterplot and Bar Chart:

Visual aids played a pivotal role in our analysis. A scatterplot was constructed to visualize the relationship between StateAverage, Median Income, and Affordability. In addition, a bar chart was designed with region names, focusing on the top 20 regions in each cluster, alongside affordability information and cluster categorization.
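A sketch of the two visualizations, assuming a combined table with one row per region and hypothetical file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 'df' stands in for our combined table, with one row per region and the assumed
# columns StateAverage, MedianIncome, Affordability, Cluster, and RegionName.
df = pd.read_csv("combined_housing_affordability.csv")

# Scatterplot of price vs. income, colored by the cluster assignment
# (the 'Cluster' column comes from the KMeans step described below).
sns.scatterplot(data=df, x="MedianIncome", y="StateAverage", hue="Cluster")
plt.title("Housing prices vs. median income by cluster")
plt.show()

# Bar chart of the 20 most affordable regions in one cluster.
top20 = df[df["Cluster"] == 0].nsmallest(20, "Affordability")
sns.barplot(data=top20, x="Affordability", y="RegionName", color="steelblue")
plt.title("Top 20 regions in Cluster 0 by affordability")
plt.tight_layout()
plt.show()
```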

KMeans Clustering:

The primary modeling technique applied was KMeans clustering, an unsupervised machine learning algorithm. This facilitated the grouping of regions into clusters based on their housing price features.
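A minimal sketch of this step with scikit-learn, assuming the combined table from the earlier sketches; the feature scaling shown is a common preprocessing choice included for illustration, not necessarily our exact pipeline:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Combined table with one row per region (file and column names are assumptions).
df = pd.read_csv("combined_housing_affordability.csv")
features = df[["StateAverage", "MedianIncome", "Affordability"]]

# Standardize the features so price and income contribute on comparable scales,
# then group the regions into the five clusters discussed below.
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(scaled)
```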

Optimal Cluster Selection:

The selection of an optimal number of clusters was determined using techniques such as the elbow method, ensuring the resulting clusters were meaningful and representative.
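A sketch of the elbow method under the same assumptions, reusing the standardized feature matrix from the clustering sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method: fit KMeans for several values of k and plot the inertia
# (within-cluster sum of squares); the bend in the curve suggests a reasonable k.
# 'scaled' is the standardized feature matrix from the clustering sketch above.
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method for choosing the number of clusters")
plt.show()
```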

Affordability Ranges:

Clusters were categorized based on affordability ranges, with Cluster 0 identified as the most affordable (Affordability Range: 0–3), progressing to Cluster 3 as the next most affordable (Affordability Range: 3–5), and so forth. Each cluster’s characteristics were thoroughly analyzed, pinpointing specific regions within each cluster. For instance, Cluster 0 featured regions like Jackson, MS; Vicksburg, MS; Tupelo, MS; Harrison, AR; Silver City, NM; and Cleveland, MS.
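A small sketch of how such ranges can be summarized per cluster, reusing the 'df', 'Cluster', and 'Affordability' names assumed above:

```python
# Summarize each cluster's affordability range so clusters can be labeled
# from most to least affordable, as described above.
cluster_ranges = (df.groupby("Cluster")["Affordability"]
                    .agg(["min", "max", "mean"])
                    .sort_values("mean"))
print(cluster_ranges)
```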

Communication of Insights (recommendations):

Clear recommendations were provided based on the affordability insights. For example, first-time home buyers were strongly recommended to consider purchasing homes in Cluster 0 because of its low affordability index values (housing prices that are low relative to income) and the favorable regions highlighted in the bar chart. Cluster 2 (Affordability Range: 9–15) was identified as the least affordable cluster, with the highest average housing prices; it includes regions like Jackson, WY; Vineyard Haven, MA; and San Francisco, CA, and is characterized by high housing costs, so it is not recommended for first-time home buyers seeking affordability.

Coherence of Clusters:

The effectiveness of the KMeans clustering model was assessed based on the coherence and meaningfulness of the resulting clusters. Regions within the same cluster were expected to share similar housing price features.

Informed Decision-Making:

The analysis aimed to empower first-time home buyers with the information needed to make informed decisions. The bar chart, specifically showcasing the top 20 regions in each cluster with affordability info, was instrumental in enhancing stakeholders’ understanding of the overall affordability landscape.

In summary, the analysis encompassed a comprehensive approach, leveraging insightful visualizations, and focusing on the top 20 regions in each cluster in the bar chart to provide actionable recommendations for first-time home buyers.

Answer for Stakeholders

Our analysis provides multiple answers for stakeholders. One of the more important answers concerns the trends of affordability in these regions. Our analysis offers a visual representation of popular regions and how their trends stack up against other popular locations. This matters because our data covers five years, which lets a consumer compare regions and form a reasonable prediction about which are poised to boom or bust. Most importantly, allowing consumers to visually see these trends and patterns also enables educated decisions about where to settle down.

There is no question that future home buyers need to take into consideration not just current real-estate prices, but also their personal income and assets. When deciding on a place to live, the ability to fund the purchase is a major factor. For this reason, we incorporated annual median income, which gives a rough picture of how much people in a specific region are willing or able to pay to sustain their living conditions. This allows a consumer to select a region where they can feel comfortable knowing the cost of living is well within their budget. There are many other potential answers to be uncovered from this data, and these factors should ultimately inform the decisions that future home buyers make. A visualization of the clusters is provided below with explanations.

The above scatterplot shows the clusters and their affordability:

  • Cluster 0 (blue): high affordability (lowest prices)
  • Cluster 3 (orange): moderate to high affordability (low to moderate housing prices)
  • Cluster 1 (purple): moderate affordability
  • Cluster 4 (yellow): moderate to low affordability (high housing prices)
  • Cluster 2 (pink): relatively low affordability (expensive)

Limitations

Although we feel the data and our analysis of it have produced valid, actionable insights into current housing market trends, the analysis has its limitations. The first concerns the size of our data: the dataset used for this analysis covers only 882 cities in the United States. Therefore, although we can generalize about the cost of living in certain cities, we cannot accurately speak to housing price fluctuations on a neighborhood-by-neighborhood basis. That information would be highly valuable to new homebuyers, so it is unfortunate this analysis does not provide such insight.

Another notable limitation lies in the time scope of our data. The dataset we used ranges from January 2018 to September 2023, which is only a small window in the history of the housing price index. Although this is likely enough to make reasonable predictions and analyses, it is nonetheless a limitation that should be acknowledged.

In addition to the limitations mentioned, there are some potential ethical concerns with this analysis. It accounts solely for housing prices over time; it does not consider the reasons behind the housing prices in each region. There is some potential for discrimination, as some regions deemed “undesirable” may be inhabited largely by individuals of a single ethnic or racial background. We do not intend to suggest that these regions should or should not be considered by new homebuyers. We simply hope this analysis provides helpful insight into current housing price trends in the US, so individuals can estimate how much it would cost to live in their desired location.

Another limitation is the scope of the project more broadly. Initially, we had also wanted to provide potential solutions for housing unaffordability. However, we realized that went beyond the scope of our project, especially considering our experience levels, expertise, and the time period available for work. If we had additional time for this project, that would be one area that we would be interested in exploring.

Data Sources:

  1. https://worldpopulationreview.com/state-rankings/median-household-income-by-state
  2. House Price Index Datasets | Federal Housing Finance Agency (fhfa.gov)
  3. https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/

Appendix:

GitHub Repository: jvand0/inst414_group_project: inst414 group project code and csv files (github.com)
