Flipping Homes with Data Science

May 9 · 7 min read

The housing market has been on fire lately. Houses are selling above asking price for several reasons: supply is short, and declining mortgage rates mean buyers can get more house for their money. These and other factors are causing an absolute frenzy in the marketplace, and home buyers are inevitably struggling to find good deals. For home flippers, even on the deals they do find, profit margins are getting razor thin.

You must spend money if you wish to make money — Plautus

So, how can data science help make sure that the money you spend on renovating a house is money well spent? Or in other words, how can you maximize your home’s potential sale price with renovations? Well, the answer is with a multiple linear regression model.

A multiple linear regression model is a machine learning model that tries to explain the relationship between multiple independent variables and a dependent variable. Think of independent variables as different attributes of the home (such as bedroom count, total square footage or quality of finishes) and the dependent variable as your “target” that you are modelling to predict, which in this case is the home’s sale price.
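To make that concrete, here is a toy sketch with made-up numbers (the features and prices below are assumptions for illustration, not real data): fitting the model recovers an intercept plus one coefficient per attribute.

```python
import numpy as np

# Hypothetical data: bedroom count and total square footage for five homes
X = np.array([[3, 1500],
              [4, 2000],
              [2, 1100],
              [5, 2600],
              [3, 1800]], dtype=float)

# Sale prices built as 50,000 + 20,000*bedrooms + 100*sqft (toy numbers, no noise)
y = 50000 + 20000 * X[:, 0] + 100 * X[:, 1]

# Ordinary least squares: solve for the intercept and one coefficient per feature
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # ≈ [50000, 20000, 100]
```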

Where do we begin?

To start, we need to have data to be able to do some data science magic. A lot of counties have their own real estate datasets that are readily available. You can take the most recent set, perform an exploratory data analysis (EDA) and get scrubbing. Once your data is sparkling clean from all the scrubbing, we can move on to building the model.

Did I say building the model? Well… not quite. Before building the model we need to make sure that our data meets the linear regression model’s assumptions:

1. Linearity
2. No multicollinearity
3. Normality of residuals
4. Homoscedasticity of residuals

What is linearity?

Checking for linearity means confirming that each of your independent variables has a linear relationship with your dependent variable. To do this, you can run a quick loop to plot these relationships and visually inspect them. Below is a code snippet I used when I was working on a similar project.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=len(df.columns), figsize=(10, 200))
for i, col in enumerate(df.columns):
    ax[i].scatter(df[col], df[target])
    ax[i].set_xlabel(col)
    ax[i].set_ylabel(target)
    ax[i].set_title(f"{col} vs. {target}")
```

Once we confirm linearity, we can move on to checking for multicollinearity.

What is multicollinearity?

Multicollinearity means that your independent variables are strongly correlated with each other. This is a problem because, if the independent variables (in this case, the attributes of the home) are affecting each other, they will introduce noise into your data and you will not be able to home in on each attribute's relationship with the target (in this example, the home's sale price).

To remove multicollinearity we have some options: we can either selectively remove the heavily correlated columns or run a Ridge or Lasso Regression (which is a whole other topic worthy of its own post). For now, let’s assume we will be removing some of the columns. You can start off by creating a correlation matrix which shows the correlation values of the dataframe you’re working with. This can be obtained by using the code below for your dataframe:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

mask = np.zeros_like(df.drop('price', axis=1).corr())
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(df.drop('price', axis=1).corr(),
            annot=True, mask=mask, cmap='Reds')
```
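Besides eyeballing the heatmap, a numeric check you can run is the variance inflation factor (VIF). The sketch below uses synthetic data with made-up column names: each column is regressed on the others, and a high VIF (often above roughly 5 to 10) flags a redundant column.

```python
import numpy as np

def vif(X, j):
    """VIF for column j: 1 / (1 - R²) from regressing it on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(others)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1 - resid.var() / y.var()
    return 1 / (1 - r_squared)

# Synthetic example: sqft_above nearly duplicates sqft_living
rng = np.random.default_rng(1)
sqft_living = rng.uniform(1000, 3000, 100)
sqft_above = 0.8 * sqft_living + rng.normal(0, 50, 100)
bedrooms = rng.integers(1, 6, 100).astype(float)

X = np.column_stack([sqft_living, sqft_above, bedrooms])
print(vif(X, 0))  # large: sqft_living is mostly explained by sqft_above
print(vif(X, 2))  # near 1: bedrooms is independent of the other two
```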

In my project, I went with the removal option and engineered some additional features (a process known as feature engineering) to keep some of the deleted information intact. Here's what the matrix looked like after I was done:

Normality and homoscedasticity of residuals

Residuals show the amount of error you have between the actual data points and your line of best fit. We want our error amounts to be normally distributed and to be homoscedastic. In simpler terms, we don’t want the errors of our model to increase or decrease drastically for different values of our independent variables. If we had a heteroscedastic model, our model’s accuracy would not be consistent. We will check these metrics after each iteration of our model and make adjustments to reach normality and homoscedasticity for the residuals as needed.

All the assumptions are verified, now what?

Phew, okay… Now that we know what all the assumptions are and we have verified linearity and handled multicollinearity, we can move on to building the model (for real this time). The process starts with splitting your data into a training set and a test set. This allows us to train our model with a portion of the data points we have and then test its ability to predict our target without introducing bias. Scikit-learn’s train_test_split is perfect for this purpose. You can import the method with the following code:

`from sklearn.model_selection import train_test_split`

And you can split your X and y values like this:

`X_train, X_test, y_train, y_test = train_test_split(X, y)`

You also have the option to enter a value for the test_size and train_size arguments to change your data split from the 75%-25% default split. Check out the documentation for more information.
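For instance, passing test_size=0.2 gives an 80%-20% split, and random_state makes the split reproducible (the toy arrays below are placeholders for your own features and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten rows of placeholder feature data
y = np.arange(10)                 # matching placeholder targets

# 80%-20% split instead of the default 75%-25%; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```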

Got my dataset split, what’s next?

The process of building the model is an iterative one and we will be continuously adjusting our model based on the normality and homoscedasticity of residuals, the R² value of the model and the p-values of our coefficients. A function would be helpful here if you don’t want to copy and paste your code again and again.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

def model_lin_reg(df, target='price'):
    features = ' + '.join(df.drop(target, axis=1).columns)
    f = f"{target} ~ " + features
    model = smf.ols(f, df).fit()
    display(model.summary())
    fig, ax = plt.subplots(ncols=2, figsize=(15, 5))
    # QQ plot to check normality of residuals
    sm.graphics.qqplot(model.resid, line='45', fit=True, ax=ax[0])
    # Scatter plot to check for homoscedasticity of residuals
    sns.scatterplot(x=model.predict(df, transform=True),
                    y=model.resid, ax=ax[1])
    ax[1].set_ylabel('Residuals')
    ax[1].set_xlabel('Predicted')
    plt.axhline()
    return model
```

Here is an example of what the before and after of my model looked like after I went through the iterations. We can see that the residuals became more normal and homoscedastic. If you would like to know more about what steps I took to go from the first model to the last one you can see my Jupyter notebook.

After all this work, how can we use our model to flip homes?

One of the outputs we get from our model is the set of coefficients of the parameters, or in other words, each parameter's effect on a home's sale price. For example, we can predict how adding a basement to the house or upgrading the finishes will change the sale price in the end. This can be a very powerful tool and is especially useful when weighing the costs of a renovation against the benefits. If finishing that attic space and thereby adding livable square footage to the home is going to cost you $25,000, but the renovation is predicted to increase the home's value by $50,000, it may be a good idea to go ahead and take that project on.
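The back-of-the-envelope math looks like this; note the $100-per-square-foot coefficient is an assumed number for illustration, not taken from a real fitted model:

```python
# Assumed coefficient from a fitted model: ~$100 of sale price per livable sqft
sqft_coef = 100
attic_sqft = 500            # livable square footage gained by finishing the attic
renovation_cost = 25_000

predicted_gain = sqft_coef * attic_sqft
print(predicted_gain)                     # 50000
print(predicted_gain - renovation_cost)   # 25000 expected profit
```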

You can also look at the effect size of each parameter to figure out which of the attributes of the home have the greatest impact on its sale price. Since the coefficients you are looking at are adjusted according to the parameter’s units, you would need to scale the data to see how the parameters stack up against each other. For example: the coefficient of the bedroom count may be 100,000 while the livable square footage’s coefficient may be at 100 since square footage is expressed in terms of hundreds or thousands while bedroom counts are single digits.
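Here is a sketch of that scaling with made-up numbers: standardizing each feature to zero mean and unit variance before fitting puts every coefficient in "price change per standard deviation" units, so they become directly comparable.

```python
import numpy as np

# Hypothetical homes: bedrooms are single digits, sqft is in the thousands
bedrooms = np.array([2, 3, 3, 4, 5], dtype=float)
sqft = np.array([1100, 1500, 1800, 2000, 2600], dtype=float)
price = 50000 + 20000 * bedrooms + 100 * sqft  # toy relationship

X = np.column_stack([bedrooms, sqft])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, std 1 per column

A = np.column_stack([np.ones(len(X_std)), X_std])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)

# coef[1:] are now comparable across features; here sqft's standardized
# coefficient comes out larger even though its raw coefficient (100) was smaller
print(coef[1:])
```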

So, a multiple linear regression model for your renovation can act as a reference on what to focus on and what not to focus on. By renovating only the attributes that have the greatest value — or the most impact for the least amount of money — you may be able to increase your profits and make your home the most expensive one on the block. Don’t you love the power of data science?

Check out my full notebook and analysis on GitHub.

CodeX

Everything connected with Tech & Code

Written by

Berke Tezcan

Aspiring data scientist, video game enthusiast, bookworm, engineer. https://www.linkedin.com/in/ekrem-berke-tezcan/
