Regression Problem Case Study | Housing in Buenos Aires (IV): Predict Price with Neighborhood

Sawsan Yusuf
17 min read · Nov 10, 2023



In our previous article, we analyzed the impact of latitude and longitude on apartment prices. Today, we will be examining the role of neighborhoods in home prices. We aim to build a model that predicts apartment prices based on neighborhood, using categorical data encoding to train our linear model. We will also address overfitting and use regularization to combat it.

Following the machine learning workflow, we will prepare the data by importing it with our wrangle function inside a for loop and then cleaning it to extract the neighborhood information. During the build phase, we will focus on using a one-hot encoder to encode categorical variables for model training. Next, we will evaluate our model and learn about overfitting. Finally, we will communicate our results, discussing the curse of dimensionality and regularization, specifically ridge regression, and visualize our coefficients with a bar chart. Let’s begin by reviewing the libraries we’ll be using.

import warnings
from glob import glob

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from category_encoders import OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge # noqa F401
from sklearn.metrics import mean_absolute_error
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

warnings.simplefilter(action="ignore", category=FutureWarning)

We will be using some familiar libraries, like warnings, which we’ll use to silence warning messages. Another is glob, which comes with the Python standard library and is great for finding files that follow a certain naming pattern. Matplotlib is used for visualizations, NumPy for mathematical operations, and pandas for data wrangling. These are all our good friends.

Next, we will use a new library called Category Encoders, specifically its OneHotEncoder class. Although scikit-learn has its own OneHotEncoder, we will use the one from Category Encoders because it is easier to use.

After that, we will use scikit-learn tools we already know: LinearRegression and Ridge as predictors, mean_absolute_error to evaluate our models, train_test_split to split the data, and make_pipeline to make the process smoother.

Lastly, we add a line that filters out FutureWarnings so they don’t clutter our output. These libraries are our friends, and we will use them to prep our data. Let’s get to work!

1. Prepare Data

1.1. Import

def wrangle(filepath):
    # Import CSV
    df = pd.read_csv(filepath)

    # Subset data: apartments in "Capital Federal", less than $400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Drop features with high null counts
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Drop leaky columns
    df.drop(
        columns=[
            "price",
            "price_aprox_local_currency",
            "price_per_m2",
            "price_usd_per_m2",
        ],
        inplace=True,
    )

    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    # Subset data: remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    return df

# Create a list that contains the filenames for all real estate CSV files
files = glob("buenos-aires-real-estate-*.csv")

# Use the wrangle function in a for loop to create a list named `frames`
frames = []
for file in files:
    df = wrangle(file)
    frames.append(df)

# Use `pd.concat` to concatenate the items in `frames` into a single DataFrame `df`
df = pd.concat(frames, ignore_index=True)

print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6582 entries, 0 to 6581
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price_aprox_usd 6582 non-null float64
1 surface_covered_in_m2 6582 non-null float64
2 lat 6316 non-null float64
3 lon 6316 non-null float64
4 place_with_parent_names 6582 non-null object
dtypes: float64(4), object(1)
memory usage: 257.2+ KB

We now have our beautiful data frame and it’s time to begin exploring.

1.2. Explore

We aim to create a model that can accurately predict apartment prices based on neighborhood. Therefore, our primary concern is to ensure that we have the neighborhood information in a format that can be used to build a model. After analyzing our data, we have found that the “place_with_parent_names” column contains the neighborhood information we need to create the model. This means that we have all the necessary information readily available to us.

df["place_with_parent_names"].head()

4 |Argentina|Capital Federal|Chacarita|
9 |Argentina|Capital Federal|Villa Luro|
29 |Argentina|Capital Federal|Caballito|
40 |Argentina|Capital Federal|Constitución|
41 |Argentina|Capital Federal|Once|
Name: place_with_parent_names, dtype: object

We have a problem. If we examine the individual strings, we’ll find a lot of extraneous information that we don’t need, such as the country name (Argentina) and the city name (Capital Federal). All we want from this column is the part of the text that names the neighborhood, like Chacarita or Villa Luro.

To extract this information and create a new column called “neighborhood”, we need to split the string on the vertical bar using .str.split(). Passing expand=True converts the result into a DataFrame, and the column we care about, the neighborhood, sits at index 3.

# Extract neighborhood
df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
# Drop the `"place_with_parent_names"` column
df.drop(columns="place_with_parent_names", inplace=True)

Let’s add all of this code to our wrangle function. After that, we need to drop the old column, which we won’t need anymore. Let’s redo the import.

def wrangle(filepath):
    # Import CSV
    df = pd.read_csv(filepath)

    # Subset data: apartments in "Capital Federal", less than $400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Drop features with high null counts
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Drop leaky columns
    df.drop(
        columns=[
            "price",
            "price_aprox_local_currency",
            "price_per_m2",
            "price_usd_per_m2",
        ],
        inplace=True,
    )

    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    # Subset data: remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Get place name
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    df.drop(columns="place_with_parent_names", inplace=True)

    return df

# Create a list that contains the filenames for all real estate CSV files
files = glob("buenos-aires-real-estate-*.csv")

# Use the wrangle function in a for loop to create a list named `frames`
frames = []
for file in files:
    df = wrangle(file)
    frames.append(df)

# Use `pd.concat` to concatenate the items in `frames` into a single DataFrame `df`
df = pd.concat(frames, ignore_index=True)

print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6582 entries, 0 to 6581
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price_aprox_usd 6582 non-null float64
1 surface_covered_in_m2 6582 non-null float64
2 lat 6316 non-null float64
3 lon 6316 non-null float64
4 neighborhood 6582 non-null object
dtypes: float64(4), object(1)
memory usage: 257.2+ KB

Our dataset now includes a ‘neighborhood’ column. That’s all for exploration. Let’s move on to splitting.

1.3. Split

The next step is to split our data into two parts: the feature matrix X and the target vector y.

# Create feature matrix and target vector
features = ["neighborhood"]
target = "price_aprox_usd"
y = df[target]
X = df[features]

The second split is the train-test split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

2. Model Build

Before we begin building the model, there are a few new concepts to cover. We will discuss OneHotEncoder and overfitting, but first, we will start by determining the baseline performance of our model, which should be a familiar process by now.

2.1. Baseline

We will combine the multiple steps from previous articles into a single step.

# Calculate the baseline mean absolute error
y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)
print("Mean apartment price:", round(y_mean, 2))
print("Baseline MAE:", mean_absolute_error(y_train, y_pred_baseline))


>>> Mean apartment price: 132015.15
Baseline MAE: 44393.95213998732

2.2. Iterate

We are currently in the iteration phase, where we focus on building our model and exploring any transformations needed to make accurate predictions. Before we proceed, note that if we try to fit a LinearRegression predictor to our training data at this point, we'll get an error that looks like this:

ValueError: could not convert string to float

What does this mean? When we fit a linear regression model, we’re asking scikit-learn to perform a mathematical operation. The problem is that our training set contains neighborhood information in non-numerical form. To create our model we need to encode that information so that it’s represented numerically. The good news is that there are lots of transformers that can do this. Here, we’ll use the one from the Category Encoders library, called a OneHotEncoder.

Before we build and include this transformer in our pipeline, let’s explore how it works. One-hot encoding (OHE) is an encoding strategy that goes through the column we want to encode, finds all the unique values, and creates a new column for each one.

So, instead of one column with, say, three unique values, we end up with three columns that represent the categorical data using zeros and ones. For instance, a row for “Belgrano” would have a “1” in the Belgrano column and “0” in the other two.

That is the fundamental idea behind one-hot encoding: a single column of text data becomes a larger set of columns, where each value is represented by zeros and ones. Once we understand how the `OneHotEncoder` works, we can integrate it into our pipeline.
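To make this concrete, here is a tiny, hypothetical example (the neighborhood names are just for illustration) of what the encoder does to a single categorical column. Depending on your category_encoders version the exact column set may differ slightly, but the idea is the same:

# Toy example (illustrative only): encode a small, made-up "neighborhood" column
toy = pd.DataFrame({"neighborhood": ["Belgrano", "Palermo", "Recoleta"]})
toy_encoded = OneHotEncoder(use_cat_names=True).fit_transform(toy)
print(toy_encoded)

Each unique value gets its own column of zeros and ones, which is exactly what our pipeline will do with the full “neighborhood” column.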

First, we use a OneHotEncoder with use_cat_names set to True so the new columns keep the original category names. Then we add a LinearRegression predictor with its default settings, and finally we fit the model to the training data, X_train and y_train.

model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    LinearRegression()
)
model.fit(X_train, y_train)

And next, we need to see how it’s performing.

2.3. Evaluate

The first step in evaluating our model is to see how it performs on the training data, so let’s start there again.

# Calculate the training mean absolute error
y_pred_training = model.predict(X_train)
print("Training MAE:", mean_absolute_error(y_train, y_pred_training))

>>> Training MAE: 39198.66121557455

Wow, that looks great! The training MAE has dropped from roughly $44,400 (our baseline) to about $39,200, which suggests the model is adding real value. However, the training data only tells half the story. The real question is whether our model can perform well on unseen data. So, in the next step, let’s evaluate its performance on the test set.

# Calculate the testing mean absolute error
y_pred_test = pd.Series(model.predict(X_test))
mae_testing = mean_absolute_error(y_test, y_pred_test)
print("Testing MAE:", round(mae_testing, 2))

>>> Testing MAE: 812368101112651.6

3. Communicate Results

In this section, we discuss our results, but first we have a concern about our model’s performance. The model performed exceptionally well on the training data, yet on the test data its performance is terrible: the training MAE was around $39,000, while the testing MAE is in the hundreds of trillions of dollars. This indicates that our model is not generalizing well.

Now, the question is, why is this happening? In this section, we will focus on the concept of dimensionality, which we discussed in the previous two articles.

To understand dimensionality, let’s go back to the beginning of building our model. Initially, we started with just one feature: the apartment’s size. We had one coefficient for size, which gave us the price. Then we added two more features, latitude and longitude, which required two more coefficients. So, we went from one feature to three.

With the neighborhood one-hot encoded, our feature matrix has expanded to 56 columns, making the model’s equation so long that it no longer fits on the screen; we would have to abbreviate it and imagine the remaining terms. Still, it is worth writing the equation out to understand what the model is doing.
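If we want to confirm how wide the encoded feature matrix really is, we can ask the fitted encoder inside our pipeline directly. A quick sanity check, assuming the pipeline we fit above:

# Sanity check: how many columns does the encoder produce from our single "neighborhood" feature?
X_train_encoded = model.named_steps["onehotencoder"].transform(X_train)
print("Encoded feature matrix shape:", X_train_encoded.shape)  # (rows, one-hot columns)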

Before diving into the topic, let’s take a moment to acknowledge that we’re now dealing with data that exists in a super high-dimensional space, unlike the two- or three-dimensional data that we’re used to. As a result, we might face issues like overfitting, and we need to use regularization techniques to make our model more generalized.

Moving on, we have an unusually large equation, so we need to extract the coefficients from our model and display them in a way others can understand. Let’s first focus on extracting the intercept and coefficients. The process is the same as last time: we access the model’s named_steps, select "linearregression" in square brackets, and read its intercept_ attribute. The same approach gives us the coefficients via coef_.

# Extract the intercept and coefficients for our model
intercept = model.named_steps["linearregression"].intercept_
coefficients = model.named_steps["linearregression"].coef_
print("coefficients len:", len(coefficients))
print(coefficients[:5])
coefficients len: 56
[-1.06988879e+18 -1.06988879e+18 -1.06988879e+18 -1.06988879e+18
-1.06988879e+18]

We have 56 coefficients in total, but we are only looking at the first five of them. These coefficients seem to have extreme values and look a little funky. While we have the values for the intercept and coefficients, we don’t have their corresponding neighborhood names. Therefore, the next step is to extract the names of the neighborhoods from our pipeline.

# Extract the feature names
feature_names = model.named_steps["onehotencoder"].get_feature_names()
print("features len:", len(feature_names))
print(feature_names[:5])
features len: 56
['neighborhood_Palermo' 'neighborhood_San Cristobal'
'neighborhood_Caballito' 'neighborhood_Villa Devoto'
'neighborhood_San Telmo']

After obtaining our coefficients and a list of 56 neighborhood values, we must combine them.

# Create a pandas Series named `feat_imp`
feat_imp = pd.Series(coefficients, feature_names)
feat_imp.head()
neighborhood_Palermo         -1.069889e+18
neighborhood_San Cristobal -1.069889e+18
neighborhood_Caballito -1.069889e+18
neighborhood_Villa Devoto -1.069889e+18
neighborhood_San Telmo -1.069889e+18
dtype: float64

In this Series, the index holds the feature names (the neighborhoods) and the values are the corresponding coefficients. We can print the model’s equation by first displaying the intercept and then using a for loop to iterate through the Series, printing each coefficient together with its feature name.

# Print the equation that model has determined for predicting
print(f"price = {intercept.round(2)}")
for f, c in feat_imp.items():
print(f"+ ({round(c, 2)} * {f})")
price = 1.0698887891127702e+18
+ (-1.0698887891126071e+18 * neighborhood_Palermo)
+ (-1.069888789112661e+18 * neighborhood_San Cristobal)
+ (-1.0698887891126405e+18 * neighborhood_Caballito)
+ (-1.0698887891126472e+18 * neighborhood_Villa Devoto)
+ (-1.0698887891126444e+18 * neighborhood_San Telmo)
+ (-1.0698887891125768e+18 * neighborhood_Recoleta)
+ (-1.0698887891126595e+18 * neighborhood_Barracas)
+ (-1.0698887891126523e+18 * neighborhood_Almagro)
+ (-1.0698887891126409e+18 * neighborhood_Villa Crespo)
+ (-1.0698887891126573e+18 * neighborhood_Chacarita)
+ (-1.0698887891126083e+18 * neighborhood_Nuñez)
+ (-1.0698887891126643e+18 * neighborhood_Agronomía)
+ (-1.0698887891126616e+18 * neighborhood_Congreso)
+ (-1.0698887891125998e+18 * neighborhood_Barrio Norte)
+ (-1.0698887891126586e+18 * neighborhood_Paternal)
+ (-1.069888789112659e+18 * neighborhood_Flores)
+ (-1.0698887891126546e+18 * neighborhood_Once)
+ (-1.0698887891126627e+18 * neighborhood_Balvanera)
+ (-1.069888789112666e+18 * neighborhood_Villa del Parque)
+ (-1.0698887891126536e+18 * neighborhood_Abasto)
+ (-1.0698887891126394e+18 * neighborhood_Saavedra)
+ (-1.06988878911264e+18 * neighborhood_Villa Urquiza)
+ (-1.0698887891126188e+18 * neighborhood_Colegiales)
+ (-1.0698887891126056e+18 * neighborhood_Belgrano)
+ (-1.069888789112665e+18 * neighborhood_Liniers)
+ (-1.0698887891126728e+18 * neighborhood_Monserrat)
+ (-1.0698887891126566e+18 * neighborhood_Boedo)
+ (-1.0698887891126734e+18 * neighborhood_Villa Santa Rita)
+ (-1.0698887891126664e+18 * neighborhood_Monte Castro)
+ (-1.0698887891126527e+18 * neighborhood_Villa Luro)
+ (-1.069888789112664e+18 * neighborhood_San Nicolás)
+ (-1.069888789112684e+18 * neighborhood_Parque Chas)
+ (-1.0698887891126472e+18 * neighborhood_Villa General Mitre)
+ (-1.069888789112658e+18 * neighborhood_Versalles)
+ (-1.0698887891126423e+18 * neighborhood_Coghlan)
+ (-1.0698887891126616e+18 * neighborhood_Centro / Microcentro)
+ (-1.0698887891126496e+18 * neighborhood_Tribunales)
+ (-1.0698887891126614e+18 * neighborhood_Parque Centenario)
+ (-1.0698887891125257e+18 * neighborhood_Puerto Madero)
+ (-1.0698887891126797e+18 * neighborhood_Boca)
+ (-1.069888789112657e+18 * neighborhood_Villa Pueyrredón)
+ (-1.0698887891126675e+18 * neighborhood_Floresta)
+ (-1.0698887891126682e+18 * neighborhood_)
+ (-1.0698887891126661e+18 * neighborhood_Parque Patricios)
+ (-1.0698887891126632e+18 * neighborhood_Villa Ortuzar)
+ (-1.0698887891126964e+18 * neighborhood_Constitución)
+ (-1.0698887891126222e+18 * neighborhood_Retiro)
+ (-1.069888789112702e+18 * neighborhood_Villa Lugano)
+ (-1.0698887891126853e+18 * neighborhood_Velez Sarsfield)
+ (-1.0698887891126611e+18 * neighborhood_Parque Chacabuco)
+ (-1.0698887891126721e+18 * neighborhood_Mataderos)
+ (-1.0698887891125793e+18 * neighborhood_Las Cañitas)
+ (-1.0698887891126788e+18 * neighborhood_Parque Avellaneda)
+ (-1.069888789112669e+18 * neighborhood_Villa Real)
+ (-1.069888789112706e+18 * neighborhood_Pompeya)
+ (-1.069888789112732e+18 * neighborhood_Villa Soldati)

Our model has determined an equation that it believes is the best way to predict apartment prices. However, its performance is not up to par. If we examine the equation, we can see why. The intercept and coefficients are extreme, with very large values, either positive or negative. This phenomenon is called the curse of dimensionality.

To illustrate this, consider a two-dimensional dataset: each pair of points sits some distance apart. As we add more dimensions, the points in our dataset move further and further apart. In such a high-dimensional space, a linear model becomes unstable and starts chasing individual data points, so the coefficients become huge and unpredictable, and the model struggles to generalize to new data.
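To get a feel for this, here is a small, self-contained NumPy sketch (separate from our housing pipeline) showing how the average distance between random points grows as we add dimensions:

# Illustrative sketch: average pairwise distance between random points grows with dimensionality
rng = np.random.default_rng(42)
for n_dims in [2, 3, 10, 56]:
    points = rng.uniform(size=(200, n_dims))          # 200 random points in the unit cube
    diffs = points[:, None, :] - points[None, :, :]   # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))        # Euclidean distances
    print(f"{n_dims:>2} dimensions: mean pairwise distance = {dists.mean().round(2)}")

The exact numbers don’t matter; the point is that the average distance keeps climbing as the number of dimensions grows, which is what makes high-dimensional data so sparse.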

The solution to this problem is regularization. Regularization helps to prevent the linear model from overfitting to the training data, allowing it to generalize better to new data.

Regularization is not a single technique, but rather a collection of techniques that we will explore. Essentially, regularization is used when a model is overfitting to the data and acting erratically. When this happens, we need to “cool down” the model and make it act more regular. One popular technique for linear regression is called ridge regression.

Linear regression is all about fitting the model as closely as possible to the data points. Ridge regression also does this, but adds a penalty that ensures that the coefficients of the model do not become too large. We want the model to fit the data as closely as possible, but we also want to keep the coefficients in check and avoid overfitting.
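In scikit-learn, this penalty is controlled by Ridge’s alpha parameter (its default is 1.0): the loss becomes the usual squared error plus alpha times the sum of the squared coefficients, so larger alpha values shrink the coefficients harder. As a rough sketch of how we could compare penalty strengths without touching the test set (carving a validation split out of the training data; the variable names here are our own):

# Sketch: compare a few penalty strengths on a validation split carved from the training data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
for alpha in [0.01, 1.0, 100.0]:
    candidate = make_pipeline(
        OneHotEncoder(use_cat_names=True),
        Ridge(alpha=alpha),  # larger alpha => stronger shrinkage of the coefficients
    )
    candidate.fit(X_tr, y_tr)
    mae_val = mean_absolute_error(y_val, candidate.predict(X_val))
    print(f"alpha={alpha}: validation MAE = {round(mae_val, 2)}")

In this article, though, we’ll simply keep the default alpha=1.0.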

To demonstrate the difference between linear regression and ridge regression, we will pull out the linear model and replace it with a ridge model, then evaluate on both the training and the test data. Keep in mind that evaluating on the test set more than once, as we do here, is only for demonstration; in practice it should never be done.

# To avoid overfitting, change the predictor in our model to `Ridge`
model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    Ridge()
)
model.fit(X_train, y_train)
y_pred_training = model.predict(X_train)
mae_training = mean_absolute_error(y_train, y_pred_training)
print("Training MAE:", round(mae_training, 2))

>>> Training MAE: 39231.61
# Calculate the testing mean absolute error
y_pred_test = pd.Series(model.predict(X_test))
mae_testing = mean_absolute_error(y_test, y_pred_test)
print("Testing MAE:", round(mae_testing, 2))

>>> Testing MAE: 40031.31

We observe that the mean absolute error (MAE) for our training data is about $39,200, consistent with the previous value, and the test MAE is about $40,000, comparable to the training MAE and well below our baseline of roughly $44,400. This indicates that we have successfully overcome the overfitting problem.

After addressing the issue of overfitting in our model, let’s examine the equation to verify any modifications.

intercept = model.named_steps["ridge"].intercept_
coefficients = model.named_steps["ridge"].coef_
feature_names = model.named_steps["onehotencoder"].get_feature_names()

feat_imp = pd.Series(coefficients, feature_names)

print(f"price = {intercept.round(2)}")
for f, c in feat_imp.items():
print(f"+ ({round(c, 2)} * {f})")
price = 118677.7
+ (45339.9 * neighborhood_Palermo)
+ (-10264.82 * neighborhood_San Cristobal)
+ (9967.37 * neighborhood_Caballito)
+ (5455.97 * neighborhood_Villa Devoto)
+ (6559.29 * neighborhood_San Telmo)
+ (73362.46 * neighborhood_Recoleta)
+ (-4605.37 * neighborhood_Barracas)
+ (-1277.93 * neighborhood_Almagro)
+ (9844.86 * neighborhood_Villa Crespo)
+ (-4758.37 * neighborhood_Chacarita)
+ (42652.58 * neighborhood_Nuñez)
+ (-10109.3 * neighborhood_Agronomía)
+ (-10076.05 * neighborhood_Congreso)
+ (53537.52 * neighborhood_Barrio Norte)
+ (-5944.79 * neighborhood_Paternal)
+ (-7017.33 * neighborhood_Flores)
+ (-1924.93 * neighborhood_Once)
+ (-10706.57 * neighborhood_Balvanera)
+ (-12999.99 * neighborhood_Villa del Parque)
+ (-393.82 * neighborhood_Abasto)
+ (14496.56 * neighborhood_Saavedra)
+ (11375.41 * neighborhood_Villa Urquiza)
+ (33082.75 * neighborhood_Colegiales)
+ (44827.24 * neighborhood_Belgrano)
+ (-12298.59 * neighborhood_Liniers)
+ (-20031.74 * neighborhood_Monserrat)
+ (-5405.43 * neighborhood_Boedo)
+ (-21728.24 * neighborhood_Villa Santa Rita)
+ (-14542.04 * neighborhood_Monte Castro)
+ (-998.46 * neighborhood_Villa Luro)
+ (-11458.74 * neighborhood_San Nicolás)
+ (-29675.46 * neighborhood_Parque Chas)
+ (5450.4 * neighborhood_Villa General Mitre)
+ (-6342.16 * neighborhood_Versalles)
+ (9865.18 * neighborhood_Coghlan)
+ (-9178.62 * neighborhood_Centro / Microcentro)
+ (1792.13 * neighborhood_Tribunales)
+ (-8747.59 * neighborhood_Parque Centenario)
+ (123616.51 * neighborhood_Puerto Madero)
+ (-27323.71 * neighborhood_Boca)
+ (-4338.21 * neighborhood_Villa Pueyrredón)
+ (-14796.34 * neighborhood_Floresta)
+ (-16089.24 * neighborhood_)
+ (-13538.6 * neighborhood_Parque Patricios)
+ (-10710.51 * neighborhood_Villa Ortuzar)
+ (-43187.56 * neighborhood_Constitución)
+ (29627.23 * neighborhood_Retiro)
+ (-48009.51 * neighborhood_Villa Lugano)
+ (-27342.16 * neighborhood_Velez Sarsfield)
+ (-8995.67 * neighborhood_Parque Chacabuco)
+ (-19434.82 * neighborhood_Mataderos)
+ (70209.44 * neighborhood_Las Cañitas)
+ (-24059.93 * neighborhood_Parque Avellaneda)
+ (-12118.47 * neighborhood_Villa Real)
+ (-40466.86 * neighborhood_Pompeya)
+ (-60164.92 * neighborhood_Villa Soldati)

If we examine the coefficients of the equation, they make much more sense now. Logically, being in a certain neighborhood might raise the value of an apartment by, say, $10,000, but certainly not by billions (let alone quintillions) of dollars. With these changes, our model looks much better.

We’re back on track with our model, so let’s create a visualization that will help a non-technical audience understand the most important features of our model in predicting apartment prices. And in this case, we should use something like a bar chart.

# Bar chart that shows the top 15 coefficients for our model, based on their absolute value
feat_imp.sort_values(key=abs).tail(15).plot(kind="barh")
plt.xlabel("Importance [USD]")
plt.ylabel("Features")
plt.title("Features Importance for Apartment Price");

The neighborhood of Puerto Madero is linked with a significant increase of almost 125,000 in the price of apartments, which is reasonable to those familiar with Buenos Aires. Similarly, the neighborhood of Recoleta also exhibits a similar trend. Conversely, neighborhoods such as Villa Soldati, situated in less affluent areas of the city, are linked with a decrease in apartment prices.

So, what we can observe from this model is that it is beginning to make sense with the information we know. If we look at it in the context of the problem, we can see that living in a wealthy neighborhood increases the predicted value of the property, while living in a working-class neighborhood decreases the predicted value of the property.

Conclusion

Well, we’ve come to the end of the article, so let’s think about some of the cool things we did. First, while preparing the data, we learned how to import multiple CSV files using a for loop. We also learned how to pull the neighborhood out of the "place_with_parent_names" column to create a new "neighborhood" column.

When it came to building the model, we learned about one-hot encoding: taking categorical data and encoding it numerically so that we can use it to train a linear model.

Then we ran into an important issue known as overfitting, which occurs when a model performs exceptionally well on the training data but fails to generalize to the test data. We also learned some communication techniques, such as extracting the coefficients and their names from our model. Additionally, we discussed the curse of dimensionality, which explains why a linear model in a high-dimensional space can become unstable and overfit. To tackle this problem, we learned about regularization and swapped our linear regression predictor for a ridge predictor. Although we looked at the test data twice, which is not recommended, we saw how a ridge regressor can help overcome overfitting.

Finally, we created a horizontal bar chart to show stakeholders the neighborhoods with the strongest positive and negative associations with apartment prices.

Today’s article was full of new concepts, and we will be building on them in the next article.

Goodbye!
