Boston Housing Market Insights

Aziz Vohra
INST414: Data Science Techniques
17 min read · Dec 18, 2023
Source: https://www.bostonmagazine.com/property/2022/06/28/boston-home-income/

Decisions

Understanding the housing market is an essential step toward making well-informed decisions that can significantly impact one's financial future. The goal of our project is to analyze Boston housing market data, extracting insights that not only enhance our own understanding but also help others navigate the complexities of the real estate market. While conventional factors such as the number of bedrooms, location, and property size have considerable influence over housing prices, our focus extends beyond the obvious: we aim to identify hidden variables whose influence can unexpectedly shape housing trends.

The housing market is shaped by a myriad of factors that contribute to price fluctuations. Our objective is to pinpoint the less obvious variables, such as proximity to lakes or crime rates, that may not be apparent at first glance but wield a substantial impact on housing prices. By highlighting these unexpected influencers, we aim to give our target audience, home buyers and realtors, a comprehensive understanding that goes beyond conventional wisdom. It is crucial to acknowledge that the housing market is inherently variable, often relying on estimations rather than concrete certainties. Therein, however, lies an opportunity to identify patterns and isolate the variables that prove most useful in predicting housing prices.

For instance, suppose our research reveals a correlation between a property's age and its price. A prospective home buyer armed with this information might strategically opt for an older house, securing more square footage or amenities for the same financial investment. This is just one example of how we seek to translate data into practical insights that users can leverage to secure favorable deals in the housing market. We want to provide a tool that aids decision-making and allows users to navigate the real estate landscape with confidence, offering not only clarity but also a sense of direction for those seeking to make prudent real estate decisions. Through our analysis, we hope to contribute to a broader understanding of the housing market: by uncovering the nuances of hidden variables, users can optimize their decision-making and align their preferences and financial goals with the ever-changing dynamics of the real estate market.

By focusing on variables that extend beyond the obvious, we hope to offer a valuable resource for home buyers and realtors alike. In the realm of real estate, where decisions carry significant financial implications, our analysis seeks to provide clarity, guidance, and a foundation for making more informed and strategic choices.

Data Exploration

In trying to understand the intricacies of the Boston housing market, we drew on two distinct datasets. The cornerstone of our investigation was a dataset from Kaggle, a CSV file covering various aspects of the Boston housing market. This initial dataset, complete with niche information, played a pivotal role in shaping the nature of our research. However, recognizing the dynamic nature of the real estate landscape and the need for a comprehensive understanding, we sought to augment it with a second source. To achieve this, we turned to RentSmart, a dataset that not only focuses on the Boston housing market but also includes data from Boston's City Inspectional Services Division. The decision to incorporate data from multiple sources was driven by our need to analyze a broader spectrum of factors influencing the housing market. By blending these two datasets, we aimed to create a robust and comprehensive foundation that would yield insights not readily apparent from a single source.

The process of merging these datasets had its own set of challenges. The Kaggle dataset, while rich in content, was problematic due to its inclusion of null values. We undertook a meticulous cleaning process to rectify the pitfalls of the incomplete data: null values, viewed as potential sources of distortion in our findings, were systematically addressed by removing the corresponding entries. This step was imperative to ensure the accuracy and reliability of our dataset, laying the groundwork for a more meaningful analysis.

After rectifying the null values, we turned our attention to unnecessary columns. Although seemingly innocuous, these extraneous columns had the potential to introduce noise into our analysis and obscure the clarity of our findings. Our commitment to precision led us to drop columns that did not significantly contribute to our research objectives. By doing so, we streamlined our dataset, focusing on only the most important variables and enhancing the interpretability of our findings.

The rationale behind this rigorous cleaning process was rooted in our dedication to producing insights that were not only comprehensive but also actionable. In the dynamic realm of the Boston housing market, where decisions carry significant financial implications, the integrity of our dataset was key. Addressing null values and extraneous columns was not just a procedural step; it was a strategic move to fortify the completeness of our analysis and provide a reliable foundation for our conclusions. Our commitment to data cleanliness also went beyond these two fixes: recognizing that even seemingly inconspicuous elements of the dataset could introduce subtle distortions, we scrutinized each element to ensure it contributed meaningfully to our research goals. This holistic approach not only improved the accuracy of our findings but also facilitated a clearer interpretation of the dynamics inherent to the Boston housing market.
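To make these steps concrete, here is a minimal sketch of the kind of pandas operations described above; the file name matches the one used later in the Analysis section, but the dropped column names are placeholders rather than our exact script.

```python
import pandas as pd

# Load the housing data (file name as used later in the Analysis section)
df = pd.read_csv("mass_house.csv")

# Remove rows containing null values so incomplete records cannot distort results
df = df.dropna()

# Drop extraneous columns that do not contribute to the research questions
# (the column names here are placeholders, not the exact ones we removed)
unneeded = [col for col in ["notes", "listing_url"] if col in df.columns]
df = df.drop(columns=unneeded)

print(df.shape)
```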

Finally, our exploration of the Boston housing market was characterized by a deliberate and comprehensive approach to data integration and cleaning. The amalgamation of datasets from Kaggle and RentSmart, coupled with meticulous attention to null values and unnecessary columns, resulted in a complete dataset. This dataset, refined to enhance reliability, laid the foundation for a nuanced analysis of the factors influencing the Boston housing market. Our commitment to data integrity and precision was driven by the overarching goal of providing valuable insights to stakeholders navigating the Boston real estate market. Through this meticulous approach, we aimed to contribute not only to the body of knowledge in real estate but also to help inform decision-making in the dynamic landscape of the Boston housing market.

Key Ideas

Our project serves as an in-depth exploration of the Boston housing market, utilizing key concepts discussed in this course to predict property values. This sort of predictive analysis is centered around the core principles of data hygiene and cleaning, similarity, dimension reduction, and model evaluation. During the project's initial phase, we laid the foundation by applying data hygiene and cleaning, the process of carefully preparing and cleansing the dataset to ensure its accuracy and reliability. In essence, it is like tidying up a messy room before engaging in a meaningful task.

In our specific context, the dataset initially presented a challenge with missing values, a potential source of error that could significantly impact subsequent calculations. Data hygiene, in this scenario, required us to identify and rectify missing or inaccurate data points. Imputing missing values and scrutinizing the dataset for outliers became important steps in this cleansing process, much as cleaning a lens before taking a photograph produces a sharper image. This meticulous attention to data hygiene laid the groundwork for a dependable dataset, forming the foundation on which the subsequent analyses were built.

Image from a part of our script showing cleaning techniques we used for our data.
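The screenshot itself does not reproduce here, so below is a rough stand-in for the kind of imputation and outlier checks described above; the median fill and the 1.5 × IQR rule are illustrative choices rather than a copy of our script.

```python
import pandas as pd

df = pd.read_csv("boston.csv")

# Impute missing numeric values with the column median instead of dropping rows
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Flag potential outliers with a simple 1.5 * IQR rule on one column of interest
q1, q3 = df["TAX"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["TAX"] < q1 - 1.5 * iqr) | (df["TAX"] > q3 + 1.5 * iqr)]
print(f"Potential TAX outliers: {len(outliers)}")
```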

As we navigated the multifaceted world of housing data, two concepts, similarity and dimension reduction, proved instrumental in simplifying complexities and extracting meaningful insights. Similarity, in the realm of data science, refers to the quantification of how alike two data points are. In our project, this concept was pivotal in normalizing the inherent variability among houses. Despite differences in attributes such as square footage or location, the algorithm aimed to identify and measure the similarities between seemingly dissimilar properties, creating a foundation for predicting housing prices based on the characteristics that otherwise diverse properties share.

Dimension reduction, on the other hand, involves simplifying the dataset by reducing the number of variables or features, much like summarizing a complex novel into its key themes. In our project, the great number of variables potentially influencing housing prices posed a challenge. The objective was to distill this complexity into a manageable subset while retaining the most relevant information. Techniques such as feature importance analysis were employed to identify and retain the variables that contributed most significantly to the predictive power of our model.
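As an illustration of how feature importance analysis can prune variables, here is a simple sketch that ranks features by the absolute value of their correlation with the target and keeps only the strongest ones; the target and file name follow the Analysis section, but this is one possible approach rather than our exact pipeline.

```python
import pandas as pd

df = pd.read_csv("boston.csv").dropna()

target = "TAX"                                   # target used later in the Analysis section
features = df.drop(columns=[target]).select_dtypes(include="number")

# Rank candidate features by the absolute value of their correlation with the target
importance = features.corrwith(df[target]).abs().sort_values(ascending=False)
print(importance)

# Dimension reduction: keep only the strongest few variables for modeling
top_features = importance.head(3).index.tolist()
reduced = df[top_features + [target]]
print(reduced.columns.tolist())
```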

A key aspect of our algorithm was the application of linear regression — an approach within the broader category of predictive modeling. Linear regression, a staple in statistical modeling, seeks to establish a linear relationship between one or more independent variables and a dependent variable. It is the analytical version of drawing the best-fitting line through a scatter plot of data points. In our context, the dependent variable was the property price, and the independent variables were carefully selected attributes deemed influential in predicting this price. The linear regression model, therefore, aimed to provide insights into how changes in these attributes relate to changes in property prices.
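In its general multiple-variable form, the standard linear regression model estimates price as a weighted sum of the chosen attributes, where each coefficient is learned from the data and the error term captures everything the attributes do not explain:

$$ \text{price} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon $$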

However, the efficacy of linear regression relies on its assumptions: a linear relationship between variables, normally distributed residuals, and independence of observations. Adjustments were made where necessary to balance capturing the nuances of the data against the pitfalls associated with violating these assumptions. As the predictive model took shape, the question of its effectiveness and reliability arose, leading us to the concept of model evaluation. This stage involves scrutinizing the performance of the model against predefined metrics or criteria. In our approach, we acknowledged the inherent uncertainties of predicting housing prices in a practical setting. Exact predictions are extremely difficult and impractical, which prompted us to categorize housing prices into groups, allowing for a nuanced evaluation on both micro and macro scales.
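Before turning to that evaluation, the regression assumptions themselves can be sanity-checked. The following is an illustrative sketch rather than our exact diagnostic code: it fits a simple AGE-to-TAX model on the boston.csv data described later and inspects the residuals.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

df = pd.read_csv("boston.csv").dropna()
X, y = df[["AGE"]], df["TAX"]

# Fit a simple one-variable model and compute its residuals
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Roughly normal residuals should look symmetric and bell-shaped
plt.hist(residuals, bins=30)
plt.title("Residual distribution")
plt.show()

# Shapiro-Wilk test: a very small p-value suggests the normality assumption is strained
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```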

Micro precision involved an in-depth scrutiny of the model's ability to predict prices accurately within each category. This kind of evaluation provided insights into the model's performance across different price ranges, highlighting areas of strength and potential refinement. On a macro scale, averages were calculated across the entire dataset, offering a holistic view of the model's efficacy in predicting housing prices for the Boston area. The regional focus of our model highlights the importance of localized accuracy: when metrics are averaged, minority groups often get overshadowed by the majority group because of the difference in weighting. While a model might perform well on a global scale, its true utility lies in its ability to provide accurate predictions in specific contexts within the Boston housing market. Our model not only demonstrated high micro and macro precision but also showcased its effectiveness in capturing nuances for specific geographic areas within the Boston real estate market. It acted as a valuable tool for stakeholders in the Boston housing market, offering insights into the variables influencing property prices in the local context.
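To make the micro/macro distinction concrete, here is a small illustrative sketch: prices are binned into low, mid, and high groups (the cut points and example values are hypothetical, not our actual thresholds) and scored with micro- and macro-averaged precision.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical actual and predicted prices from a regression model
y_true = np.array([150_000, 250_000, 420_000, 310_000, 500_000])
y_pred = np.array([180_000, 240_000, 390_000, 330_000, 460_000])

# Bin both into categories: 0 = low, 1 = mid, 2 = high (illustrative cut points)
edges = [200_000, 400_000]
true_cat = np.digitize(y_true, edges)
pred_cat = np.digitize(y_pred, edges)

# Micro precision pools every prediction together; macro averages precision per
# category, so smaller groups are not drowned out by the majority group
print("micro:", precision_score(true_cat, pred_cat, average="micro"))
print("macro:", precision_score(true_cat, pred_cat, average="macro"))
```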

In the pursuit of accurate predictive modeling, several challenges and considerations required further attention. One such consideration is the delicate balance between model complexity and interpretability. As we incorporated more variables into our model, the complexity increased, potentially making it more challenging to explain the model's predictions to stakeholders. Striking the right balance therefore involves a thoughtful consideration of the trade-offs between model sophistication and practical utility. Additionally, the dynamic nature of the real estate market poses a challenge in maintaining the relevance of the model over time. Factors influencing housing prices can evolve, and new variables may emerge as significant determinants. Regular updates to the dataset are crucial to ensure that the predictive model remains useful for predicting housing prices.

In conclusion, our project serves as a practical application of key concepts in the realm of data science. From the meticulous groundwork of data hygiene and cleaning to the strategic deployment of similarity, dimension reduction, and model evaluation, each concept played a distinct yet interconnected role in the development of an accurate and relevant predictive model for housing prices in the Boston area. The interplay of these concepts not only enhanced the accuracy of our predictions but also underscored the importance of a systematic and thoughtful approach to data analysis in real-world scenarios.

Analysis

The actionable insight we aimed to extract was which non-obvious factors most heavily contribute to housing prices in the Boston area. To draw these conclusions, we used linear regression as well as KNN regression.

For the first dataset, "mass_house.csv", we used strictly linear regression modeling. We first needed to drop all rows with null values in order to prepare the data. The target variable we selected was "pct_renters", the percentage of people renting in Boston, and the predictive variable was the number of "Households". We then split the dataset into two parts, a training set and a testing set, and created a new linear regression model. The model was trained on the training set through the "fit" method, and the testing set was then used to predict the target variable using the "predict" method. Once we did this, we were able to plot the regression line. The Mean Squared Error (MSE) was then calculated and printed as a performance metric. This linear regression model was applied to predict the share of renters from the number of households, and the regression line demonstrates the relationship between the two.

Linear Regression Graph Indicating the % of renters and households
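A condensed sketch of that workflow appears below; the column names match the ones described above, while details such as the split proportion and plotting style are illustrative rather than copied from our script.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Prepare the data: drop rows with null values
df = pd.read_csv("mass_house.csv").dropna()

X = df[["Households"]]          # predictive variable
y = df["pct_renters"]           # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model on the training set and predict on the testing set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))

# Plot the held-out points and the fitted regression line
X_line = X_test.sort_values("Households")
plt.scatter(X_test["Households"], y_test, label="actual")
plt.plot(X_line["Households"], model.predict(X_line), color="red", label="regression line")
plt.xlabel("Households")
plt.ylabel("pct_renters")
plt.legend()
plt.show()
```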

The second dataset was named "boston.csv". For this dataset we used not only linear regression but also KNN regression. For the linear regression, we approached the task similarly to the previous dataset. We dropped all rows where the "CHAS" column had a value of zero, then selected "TAX" as the target variable and "AGE" as the predictive variable. We again split the dataset in two, using half of the data for training and the other half for testing. After creating and training the regression model, we predicted the values of the target variable in the test set. The performance metrics we calculated and printed were R-squared, Adjusted R-squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). This regression model captures the relationship between the tax paid on each house and how old the house is.
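The metric calculations for this model look roughly like the following sketch; adjusted R-squared is computed by hand since scikit-learn does not expose it directly, and the random seed is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

df = pd.read_csv("boston.csv")
df = df[df["CHAS"] != 0]                     # drop rows where CHAS is zero

X, y = df[["AGE"]], df["TAX"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
n, p = X_test.shape                          # test-set observations and predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"R2={r2:.3f}  Adj R2={adj_r2:.3f}  MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
```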

In addition to linear regression, we also used KNN regression on this dataset. We created a KNeighborsRegressor instance and specified the number of neighbors to be 5, which is adjustable. We split the data into a training set and a testing set in the same way as before. To visualize the results, we created bar plots for both KNN regression and linear regression, comparing the actual values with the predicted values. We were then able to compare the same performance metrics from linear regression against the KNN regression values. To predict the "TAX" paid based on "AGE", we compared the results of each regression method.

Snippet from our code showing our KNN model.
Bar graph comparing the predicted Tax with the usage of LR and KNN
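Since the snippet and the bar graph do not reproduce here, below is a minimal sketch of the KNN side of the comparison: a KNeighborsRegressor with 5 neighbors trained on the same AGE/TAX split, followed by an illustrative bar plot of actual versus predicted values.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("boston.csv")
df = df[df["CHAS"] != 0]

X, y = df[["AGE"]], df["TAX"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# The number of neighbors is adjustable; we used 5
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# Bar plot comparing actual TAX values with the KNN predictions on the test set
idx = range(len(y_test))
plt.bar([i - 0.2 for i in idx], y_test, width=0.4, label="actual")
plt.bar([i + 0.2 for i in idx], knn_pred, width=0.4, label="KNN predicted")
plt.xlabel("Test-set house index")
plt.ylabel("TAX")
plt.legend()
plt.show()
```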

In this overall analysis, we used linear regression to capture direct relationships between values, and KNN regression to capture more complex patterns. The performance metrics calculated provide insights into the accuracy of the models.

Stakeholders

In this situation, the stakeholders are potential homebuyers and realtors looking to sell specific houses. This analysis of the Boston housing market serves to help these stakeholders make informed decisions when buying and selling houses. We aim to offer a nuanced understanding of the housing market in Boston that goes beyond the conventional factors one would expect. The dataset undergoes rigorous cleaning, addressing null values and dropping unnecessary columns to enhance accuracy, and the inclusion of diverse data sources allows for a holistic exploration of the variables potentially influencing housing prices. This cleaning ensures that the dataset in use is accurate and reliable.

The concepts of similarity and dimension reduction prove instrumental in simplifying the complexities of housing data. Similarity quantifies the likeness between data points, enabling the algorithm to identify shared characteristics among diverse properties. Dimension reduction involves paring the dataset down to a manageable subset while still retaining the relevant information. Linear regression is then employed to establish relationships between independent variables and housing prices; this modeling approach is akin to drawing the best-fitting line through a scatter plot, offering insights into how changes in different factors relate to changes in property prices.

Model evaluation becomes crucial to gauge the effectiveness and reliability of predictions. Acknowledging the evident uncertainties in predicting housing prices, the analysis categorizes prices into groups, allowing for detailed evaluation on both micro and macro scales. Micro precision involves a detailed assessment of the model's ability to predict prices accurately within each category, providing insights into the model's performance across different price ranges and highlighting areas of strength and potential improvement. On a macro scale, averages are calculated across the entire dataset, offering a holistic view of the model's efficacy in predicting housing prices for the Boston area. We also place a large emphasis on localized accuracy, since the real estate market is extremely diverse: performing well on a global scale is not enough, and we wanted to make sure the model could be used for the Boston housing market specifically.

The dynamic nature of the real estate market poses challenges in maintaining the model's relevance over time. Regular updates to the dataset are needed to ensure the model remains useful for predicting housing prices in a constantly evolving market. Balancing model complexity with interpretability is also a fine line to tread: while a sophisticated model may capture more of the data's structure, it is essential to strike the right balance so that predictions can be communicated effectively to stakeholders.

In conclusion, this analysis represents a practical application of key data science concepts in the realm of real estate. It provides stakeholders with actionable insights, derived from diverse modeling and evaluation techniques. The interplay of data hygiene, similarity, dimension reduction, and model evaluation not only enhances the accuracy of predictions but also underscores the importance of a holistic understanding of the Boston housing market. Stakeholders can leverage these insights to navigate the complexities of the market, make informed decisions, and optimize their real estate endeavors in the Boston area.

Conclusion

Over the course of our project, one aspect we thought warranted careful consideration is the set of limitations stemming from our deliberate choice to focus exclusively on the housing market in Boston. While this strategic decision allowed us an in-depth exploration of the Boston real estate landscape, it simultaneously imposes a constraint on the broader applicability of our findings. The insights derived from our data analysis and model training are inherently tailored to the unique dynamics of the Boston housing market.

The primary implication of this limitation is that the practical utility of our project is inherently localized. The predictive models we used and the trends uncovered through our methodologies are most relevant to users with a specific interest in purchasing property in the Boston area. This specificity is valuable for prospective homebuyers or investors looking to navigate the nuances of the Boston real estate landscape, but it is crucial to acknowledge that our project's insights might not seamlessly translate to other areas of the U.S., given the distinct regional factors that influence housing markets.

Despite these limitations, we firmly believe in the transferability of the methods we used. The code and data techniques documented in our GitHub repository serve as a resource for anyone who wants to uncover hidden variables dictating housing prices in other locations. By applying our methods to datasets from different regions, users can potentially glean valuable insights into their own housing markets.

Another noteworthy constraint encountered during the project was the limited size of the dataset at our disposal. Our search for comprehensive and extensive datasets was hampered because certain reputable sources were unable to provide information, for unspecified reasons. This constraint on dataset size could affect the robustness and generalizability of our models; larger datasets would have afforded a better understanding of housing market dynamics, potentially capturing a broader spectrum of influencing factors that could in turn reveal additional applications and trends.

In navigating the vast landscape of data analysis methods, our project strategically homed in on two prominent techniques: Linear Regression and K-Nearest Neighbors (KNN). This intentional focus serves as the cornerstone of our project framework, yet it is critical to acknowledge the limitations that come with it. Linear Regression, a widely employed predictive modeling technique, assumes a linear relationship between the independent and dependent variables. This assumption, while convenient, may prove restrictive in capturing the complicated, non-linear dynamics that often characterize real-world housing markets. The oversimplification inherent in linear models can underrepresent complex patterns, which in turn can cause inaccuracies in predicting housing prices, especially when dealing with multifaceted influences such as seasonality or market fluctuations. Moreover, Linear Regression is highly sensitive to outliers. The presence of extreme data points, which are common in real estate datasets, can disproportionately influence the regression line and skew predictions. This attention to outliers becomes particularly crucial when dealing with housing data.

With all this information, it is important to understand safe data-collection practices and to use data in ways that cannot serve ill intent, such as manipulating data to fit a certain agenda or to cause confusion or misrepresentation among the public. Manipulating data to fit a certain narrative not only compromises the integrity of the results but also has huge implications for the public's and stakeholders' trust. In the world of housing market analysis, such manipulation could lead to skewed perceptions of property values and potentially impact real estate decisions and market dynamics. We believe it is important for us, and for everybody, to handle data ethically. This is especially important in big-data jobs where many employees are constantly working with and collecting data together. We believe these values should be upheld and protected against both internal and external pressures. The ethical aspect of any project or data collection is worth mentioning, especially when the data is not accessible to the public.

This project was an eye-opener into using the data science techniques we learned in this class. It taught us about real-world development and data collection in data science roles. Having the project broken into multiple sections over the semester was highly beneficial: the deliberate ordering of the parts helped our understanding of the sequential steps involved in a data science project by mirroring the development process encountered in professional and corporate settings. The separation of project parts allowed us to dive deeply into each step, from finding our data to cleaning and analyzing it, and built a more comprehensive understanding of the intricacies of handling real-world data science challenges.

As mentioned before, we believe this project can provide insights to anyone currently trying to find property in Boston, and that the project and its data can serve as an example or even a template for others interested in the housing market. We also believe there should be a global, or even state-run, organization with information on properties and their values. Such a resource would help people who want to find connections between various factors and property prices without having to rely on non-reputable sources or missing and incomplete data. With better data, more reports on the housing market could be produced, the exploration of this topic could grow, and methods could be further developed to estimate which properties might become more or less valuable in the future. We believe it is imperative to know this information, especially if a property's value is likely to decline, because it can help investors and buyers before they make a permanent financial decision. We hope to have helped users who may be looking for properties in Boston or for a source of inspiration in the housing market.

Team Members: Anish Gandhi, Glory Ndu, Rohit Priyakumar, Lauren Rhoades, Yuvan Sundrani, and Aziz Vohra.

Code for this project below.

GitHub repository: https://github.com/Avohra980/INST414_Final_Project.git
