Going through the Data Science Process and the Ames, IA Housing Dataset

Roy Kim
6 min read · Jan 3, 2019


Predicting house prices in Ames, Iowa

On my Data Science journey, there were a few projects that I embarked on to learn and improve my analytical skills. The first of these projects was to analyze SAT and ACT data and make a fictional recommendation to the College Board about which state to target in order to increase SAT participation. That first project was extremely helpful in learning the basics of Exploratory Data Analysis, as well as drawing conclusions from data. The second project I tackled concerned the Ames, Iowa housing dataset. It was through this project that I got my first real taste of modeling with various linear regression techniques.

In the Data Science process, there are six definitive steps to solving a problem (though they are not always followed in a strictly linear order). They are the following:
1. Defining the problem
2. Gathering data
3. Exploring the data
4. Modeling with the data
5. Evaluating the data
6. Answering the problem
I like to use the acronym DGEMEA to remember these steps (just kidding, no I don’t actually use this acronym). In the process, each step is vital in arriving at the final solution. I will use this process as a guide through my analysis of the Ames, Iowa housing dataset.

Defining the problem: How must I frame this Data Science problem into a question statement that is quantifiable, verifiable, and reproducible? How a Data Science problem gets framed into a question will, undoubtedly, guide the quality of the rest of the analysis. That is not to say, however, that the problem can never be re-examined at a later time: perhaps the exploratory data analysis refines the problem, or perhaps the evaluation of the data reveals that the wrong question was asked. For this project, I defined my problem statement as the following: What features of a house are most important (have the highest correlation) in predicting its price in the Ames market? Furthermore, can a linear regression model be developed that achieves high accuracy (greater than a baseline R² score of 0.5)?

Gathering the data: Where can I find the data to be gathered? What techniques must be used to gather it? Does the way the data is gathered in any way invalidate or otherwise impact the outcome of the analysis? This step and the next are often the most time-consuming parts of a typical Data Science problem. The purpose of this project, however, was not necessarily to practice gathering data, as the Ames, Iowa housing dataset is readily available on Kaggle. Thus, data gathering was a simple step.
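Since the data comes as a CSV, a minimal sketch of this step (assuming the Kaggle file was saved locally as train.csv, a hypothetical path) is just a pandas one-liner plus a quick sanity check:

```python
import pandas as pd

# Assumes the Kaggle CSV was downloaded locally as "train.csv" (hypothetical path).
ames = pd.read_csv("train.csv")

print(ames.shape)                      # rows x columns
print(ames["SalePrice"].describe())    # quick summary of the target variable
```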

Exploring the data: Does the data need to be cleaned? What insights are gained through an introductory look at the data? What relationships between variables exist? What variables might play an influential role in the modeling process? Taking a deep dive into the data at this step will be invaluable for the rest of the process. In fact, one of the key takeaways from this project was that I needed to spend more time exploring the relationships that exist between variables.

In my own exploratory data analysis (EDA), I realized that the housing dataset had a high number of null (or NA) values. I used a null-visualization library called MissingNo to help me understand where many of the null values came from.
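As a rough sketch of that step (assuming the data has been loaded into a DataFrame called ames from a local train.csv), the two MissingNo plots I leaned on look like this:

```python
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

ames = pd.read_csv("train.csv")

# Bar chart of non-null counts per column -- columns with short bars
# are the ones dominated by missing values.
msno.bar(ames)
plt.show()

# Matrix view shows where the nulls fall, which helps spot columns
# whose missing values cluster together (e.g. all the garage fields).
msno.matrix(ames)
plt.show()
```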

The many NA values in the dataset seemed to be a problem, but the data dictionary provided showed that the dataset used "NA" to mean that a given entry (house) did not have the feature in question (for example, an alley). So it wasn't an issue of badly collected data, but rather that the naming convention conflicted with how Python (pandas) interprets "NA" as a missing value. In fact, most of the missing values could be handled simply by replacing those NA values with a different placeholder value (in my case, I used "None").
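A minimal sketch of that replacement, assuming the list of "NA means no such feature" columns is taken from the data dictionary (the subset below is illustrative, not the full list):

```python
import pandas as pd

ames = pd.read_csv("train.csv")

# Columns where the data dictionary says "NA" means "no such feature",
# not "value missing". (Illustrative subset -- the full list comes from
# the data dictionary.)
not_missing_cols = ["Alley", "Fence", "Pool QC", "Misc Feature", "Fireplace Qu"]

# pandas parses the literal string "NA" as NaN by default, so these rows show
# up as nulls; replace them with an explicit "None" placeholder instead.
ames[not_missing_cols] = ames[not_missing_cols].fillna("None")
```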

The next step in my EDA was to look at the correlations between my target variable (SalePrice) and the rest of the features. I visualized those correlations with a Seaborn heatmap.

Using the heat map, I could see that the features that had a correlation to the price of a house greater than 0.6 were ‘Overall Qual’, ‘Total Bsmt SF’, ‘1st Flr SF’, ‘Gr Living Area’, ‘Garage Cars’, and ‘Garage Area’. I used this information to build my first linear model to predict price.
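A sketch of how the heatmap and the 0.6 cutoff could be produced (assuming the cleaned data is in a DataFrame called ames):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ames = pd.read_csv("train.csv")

# Correlation matrix of the numeric columns only.
corr = ames.select_dtypes("number").corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Features whose correlation with SalePrice exceeds 0.6 (excluding the target itself).
strong = corr["SalePrice"].drop("SalePrice")
print(strong[strong > 0.6].sort_values(ascending=False))
```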

However, in retrospect, I realized that much more exploration of the data was needed. I only used minimal feature engineering in my analysis, and this is where background expertise in real estate would have been helpful the first time around. I should have combined the square footage of a house into one feature (instead of keeping separate square footages for the 1st floor, 2nd floor, basement, kitchen, bedrooms, etc.). Furthermore, I should have combined the number of bathrooms into a total bathroom count. Lastly, I should have manually created some interaction features to explore those correlations as well.
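For illustration, the combined features I had in mind might look something like the sketch below; the column names follow the Ames data dictionary, and the exact combinations are assumptions rather than features I actually built:

```python
import pandas as pd

ames = pd.read_csv("train.csv")

# Combine the separate square-footage columns into one total living area.
ames["Total SF"] = (
    ames["Total Bsmt SF"] + ames["1st Flr SF"] + ames["2nd Flr SF"]
)

# Combine full and half baths (above grade and basement) into one count,
# weighting half baths as 0.5.
ames["Total Baths"] = (
    ames["Full Bath"] + 0.5 * ames["Half Bath"]
    + ames["Bsmt Full Bath"] + 0.5 * ames["Bsmt Half Bath"]
)

# A hand-made interaction feature: overall quality scaled by living area.
ames["Qual x GrLivArea"] = ames["Overall Qual"] * ames["Gr Liv Area"]
```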

Modeling the data: Can a highly accurate predictive model be created with the data? What types of models can be used? Can any of the models be optimized? For this project, I used a basic linear regression model, coupled with the Ridge and Lasso regularization techniques. At this point in the project, I tried many different sets of features: some with only a few of the features, some with all of them, and some with the most impactful features selected using Recursive Feature Elimination. Of the two regularization techniques, I found Lasso to be more effective, and I used GridSearch to tune the hyperparameters of the Lasso model.
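A condensed sketch of that modeling pass with scikit-learn is below; the feature list and the alpha grid are illustrative stand-ins rather than my exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ames = pd.read_csv("train.csv")

features = ["Overall Qual", "Total Bsmt SF", "1st Flr SF",
            "Gr Liv Area", "Garage Cars", "Garage Area"]
X = ames[features].fillna(0)   # fill the handful of numeric nulls
y = ames["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Plain linear regression as a baseline.
lr = LinearRegression().fit(X_train, y_train)
print("Linear R²:", lr.score(X_test, y_test))

# Ridge with a fixed regularization strength, for comparison.
ridge = Ridge(alpha=10.0).fit(X_train, y_train)
print("Ridge R²:", ridge.score(X_test, y_test))

# Recursive Feature Elimination keeps only the most impactful features.
rfe = RFE(LinearRegression(), n_features_to_select=4).fit(X_train, y_train)
print("RFE kept:", [f for f, keep in zip(features, rfe.support_) if keep])

# Lasso with its regularization strength tuned by grid search
# (scaling first, since Lasso is sensitive to feature scale).
lasso = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
grid = GridSearchCV(lasso, {"lasso__alpha": [1, 10, 100, 1000]}, cv=5)
grid.fit(X_train, y_train)
print("Best alpha:", grid.best_params_, "Lasso R²:", grid.score(X_test, y_test))
```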

Evaluating the data: What metric will be used to evaluate the effectiveness or predictive power of the model? What features of the model can be examined to further improve it? After numerous attempts, my most accurate model achieved an R² score of 0.92 (meaning that 92% of the variance in the target variable, price, can be explained by the features in the model). I used a scatterplot to compare the actual prices against the prices predicted by the model.
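The actual-versus-predicted plot takes only a few lines of matplotlib (a sketch that reuses the fitted grid, X_test, and y_test from the modeling sketch above):

```python
import matplotlib.pyplot as plt

# `grid`, `X_test`, and `y_test` come from the modeling sketch above.
preds = grid.predict(X_test)

plt.scatter(y_test, preds, alpha=0.5)
# 45-degree reference line: points on it would be perfect predictions.
lims = [y_test.min(), y_test.max()]
plt.plot(lims, lims, color="red")
plt.xlabel("Actual SalePrice")
plt.ylabel("Predicted SalePrice")
plt.show()
```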

While the R² score was relatively high, there were some other tactics that I could have tried to increase the predictive power of my model. One such tactic is to apply a log transformation to the target variable (SalePrice), which is heavily right-skewed, and fit on the log scale instead. Another tactic is to average the predictions of a number of different models to get a better overall prediction.
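As a rough sketch of both tactics (reusing the train/test split and the fitted grid from the modeling sketch above; this is an assumption about how I would apply them, not something I did in the project):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# `X_train`, `X_test`, `y_train`, `y_test`, and `grid` come from the modeling sketch above.
# Fit on log(1 + SalePrice) so the right-skewed prices look more symmetric.
log_model = LinearRegression().fit(X_train, np.log1p(y_train))

# Transform predictions back to dollars before scoring.
log_preds = np.expm1(log_model.predict(X_test))
print("R² on the original scale:", r2_score(y_test, log_preds))

# Simple model averaging: mean of the two models' predictions.
blended = (log_preds + grid.predict(X_test)) / 2
print("Blended R²:", r2_score(y_test, blended))
```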

Answering the problem: Does the analysis answer the initial problem? Does the analysis indicate that changes have to be made to the initial problem (or to how data was gathered, analyzed, or modeled)? In my initial problem, I asked what features are important in determining the price of a house. I realized that while there are a few features that have high correlation to the sale price, I could have also done more feature engineering to find features that had higher correlation (and had more relevance than the initial set of features). Furthermore, I was able to build a model with an R² score that was greater than my initial goal, but more can be done to make the model stronger.

Conclusion and Moving Forward: There is always room for improvement. One of the most important lessons from this project came to me as I talked to a colleague about it: a model can only be as good as the data coming in. I realized that the more I got to learn about the data through cleaning, exploring, and engineering, the better my final model would be. Moving forward, I hope to apply stronger exploratory analysis skills coupled with more powerful models to expand my expertise and proficiency as a Data Scientist. Onwards and upwards!
