Regression Problem Case Study | Housing in Buenos Aires (II): Predict Price with Size

Sawsan Yusuf
15 min read · Oct 26, 2023


Photo by Jennifer Deacon on Unsplash

1. Overview

In our previous session, we created a wrangling function with six parts to prepare our dataset. First, we imported the CSV file into a data frame. In the second part, we filtered the data to only include apartments in Buenos Aires priced under $400,000. The remaining parts addressed issues such as missing values, leakage, multicollinearity, and cardinality. After the wrangling process, our dataset now contains three features: Size, location in terms of latitude and longitude, and location in terms of neighborhood.

Today, we will focus on the apartment size feature and analyze its impact on the apartment price. We will use a Linear Model to predict the price based on the apartment’s area in Buenos Aires.

2. Introduction

It’s crucial to take a moment to review the outline of this article before we start delving into the code. This overview is essential in helping us structure our approach to machine learning problems and building efficient machine learning models. Once we’ve covered that, I will introduce the libraries that we’ll be using throughout the article.

So with that in mind, let’s start with the outline:

1. Prepare
1.1 Import
1.2 Explore
1.3 Split
2. Build Model
2.1 Baseline
2.2 Iterate
2.3 Evaluate
3. Communicate Results

It’s important to note that when building a machine learning model, there are three main sections: preparing data, building a model, and communicating results. Each of these sections has its subsections.

The first step in the data preparation section is to import the data. Then, we perform exploratory data analysis (EDA) to explore the data and understand its structure. Finally, we split the data into two parts: one for training the model and another for testing the model.

The build-model section comprises three subsections. First, we establish a baseline to determine the minimum level of performance the model must achieve to be useful in solving the problem. Then, we iterate: building the model, checking it, and making adjustments until we are satisfied with the results. Finally, we evaluate the model to see how well it performs.

By following this outline, you can organize your thoughts and stay focused on the task at hand, ensuring that you achieve your desired results.

So that’s it for the outline. Now I would like to highlight the imports that will be used.

# Import libraries

import warnings
from glob import glob

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn import metrics
from sklearn.model_selection import train_test_split

warnings.simplefilter(action="ignore", category=FutureWarning)

As mentioned in the previous article, we have access to several libraries, including Matplotlib, Pandas, glob, and warnings. In this article, we will import several new tools from scikit-learn. First, we will use a linear regression model. Second, we will import performance metrics, such as the mean absolute error and R-squared. Finally, we will import the train_test_split function to perform a randomized train-test split.

3. Prepare Data

3.1. Import

It’s time to begin preparing the data, and luckily we already have the wrangle function we built in the previous article.

def wrangle(filepath):
    # Import CSV
    df = pd.read_csv(filepath)

    # Subset data: apartments in "Capital Federal", less than $400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Drop features with high null counts
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Drop leaky columns
    df.drop(
        columns=[
            "price",
            "price_aprox_local_currency",
            "price_per_m2",
            "price_usd_per_m2",
        ],
        inplace=True,
    )

    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    return df

We will need to utilize the previously mentioned wrangle function to load our data. We can begin by obtaining the names of the files we intend to wrangle. After wrangling them, we can merge them all together into a single dataframe. This will allow us to follow the same steps outlined in the previous article.

# Create a list that contains the filenames for all real estate CSV files
files = glob("buenos-aires-real-estate-*.csv")

# Use the wrangle function in a for loop to create a list named `frames`
frames = []
for file in files:
    df = wrangle(file)
    frames.append(df)

# Use `pd.concat` to concatenate the items in `frames` into a single DataFrame `df`
df = pd.concat(frames, ignore_index=True)

print(df.info())
df.head()
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   place_with_parent_names  8774 non-null   object
 1   price_aprox_usd          8774 non-null   float64
 2   surface_covered_in_m2    8036 non-null   float64
 3   lat                      8432 non-null   float64
 4   lon                      8432 non-null   float64
dtypes: float64(4), object(1)
memory usage: 342.9+ KB

We now have a DataFrame with 8,774 entries. Now it’s time to do a little more exploratory data analysis.

3.2. Explore

Let’s begin our analysis of the relationship between a house’s surface area and its price by examining the distribution using a histogram. We’ll be using Matplotlib to create the histogram.

# Create a histogram of `"surface_covered_in_m2"`
plt.hist(df["surface_covered_in_m2"])
plt.xlabel("Area [sq meters]")
plt.title("Distribution of Apartment Sizes");
Figure (1): Distribution of apartment sizes.

It seems we have a problem with our x-axis: it extends to roughly 60,000 m², yet we can barely see any of the data. Since Matplotlib sets the axis limits automatically, there must be at least one extremely large value in our dataset.

Based on the histogram, the vast majority of apartments appear to be far smaller than the axis suggests, clustered at the low end of the range, which is what we would expect for apartments in Buenos Aires. Therefore, we may have some outliers, and to confirm this, we should use the describe() method.

# The summary statistics for the `surface_covered_in_m2`
df.describe()["surface_covered_in_m2"]
count     1635.000000
mean        97.877064
std       1533.057610
min          0.000000
25%         38.000000
50%         50.000000
75%         73.000000
max      62034.000000
Name: surface_covered_in_m2, dtype: float64

Looking at the summary statistics, we see a minimum value of zero, which is odd, since it’s unlikely anyone is selling an apartment with no floor area. On the other hand, we have a maximum value of about 62,000 square meters, which is extremely high. The median tells us that 50% of our properties are 50 square meters or less, while 75% are 73 square meters or less. This indicates that our data is spread out and right-skewed.

Another way to look at this is by considering the mean and standard deviation. The mean value is around 100, but the standard deviation is around 1,500, which shows that the data is spread out. Therefore, we need to clean the data further to build a good linear model, which is our goal.

To do this, we should remove the extreme values at either end of our dataset. We want to keep all observations that fall between the 0.1 and 0.9 quantiles (the 10th and 90th percentiles), which means eliminating the bottom 10% and the top 10% of properties in terms of area. By clipping these extremes, we make our data more reliable.

# Remove outliers for "surface_covered_in_m2"
low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
mask_area = df["surface_covered_in_m2"].between(low, high)
df = df[mask_area]
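
For reference, here is roughly how those lines might sit at the end of our wrangle function; the exact placement is a choice, and the earlier steps are elided:

def wrangle(filepath):
    df = pd.read_csv(filepath)

    # ... the subsetting and column-dropping steps shown earlier ...

    # Remove outliers for "surface_covered_in_m2": keep only observations
    # between the 0.1 and 0.9 quantiles
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    return df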

After adding this code to our wrangle function and reloading our dataset, let’s recreate our histogram.

# Recreate a histogram of `"surface_covered_in_m2"`
plt.hist(df["surface_covered_in_m2"])
plt.xlabel("Area [sq meters]")
plt.title("Distribution of Apartment Sizes");
Figure (2): Distribution of apartment sizes.

Let’s examine whether there’s a correlation between an apartment’s price and its surface area. To investigate this, we’ll use Matplotlib to create a scatter plot.

# Exploring the relationship between apartment size and price
plt.scatter(x=df["surface_covered_in_m2"], y=df["price_aprox_usd"])
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Price vs Area");
Figure (3): The relationship between apartment size and price.

Based on the plot, it appears that there is a moderate positive correlation between the price and size of an apartment. This suggests that if we want to predict the price of an apartment, then size will be a valuable feature to consider. This information is useful, and we can use it in our model-building process.
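
To put a number on this visual impression, we could compute the Pearson correlation coefficient between the two columns. This is a quick check using pandas’ built-in .corr() method; the exact value will depend on the wrangled data.

# Quantify the linear association between apartment size and price
corr = df["surface_covered_in_m2"].corr(df["price_aprox_usd"])
print("Correlation between size and price:", round(corr, 2))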

Now, let’s take a moment to discuss linear models and what they entail. Linear models are primarily focused on straight lines and distance. By straight lines, we mean that our goal is to draw a line that best fits our data. As seen in Figure 4, this line describes the relationship between an apartment’s area and its price.

Figure (4): The best line describes the relationship between an apartment’s area and its price.

The reason we do this is prediction: when we encounter a new apartment, say one with a size of 80 square meters, we can locate the point on the line where the x-coordinate is 80 and then read off the y-coordinate to get an estimate of the cost. Based on our plot, we would predict that an 80-square-meter apartment costs around 200,000 dollars (Figure 5). In short, linear models are built to make predictions.

Figure (5): Make predictions with linear models.

Second, these models focus on distance, meaning that they aim to follow the training data closely. If the line is too far away from the data (as shown on the right side of Figure 6), it’s not a good model. Likewise, if the line doesn’t accurately follow the data (as shown on the left side of Figure 6), especially at the extremes, it’s also not a good model. Therefore, distance is a crucial factor in determining how effective a linear model will be.

Figure (6): Cases from non-effective linear models.

This discussion matters for two reasons. First, when we train a linear model, the algorithm finds the best-fit line by minimizing the distance between the line and all the data points; fitting is a computational process built around minimizing that distance.

Second, linear models work well when your data has a linear relationship, where a straight line can clearly describe the relationship between two variables, like apartment size and price. However, if your data doesn’t follow a straight line, a linear model won’t be effective. So, if exploratory data analysis doesn’t reveal a clear linear relationship, it’s better to consider other models.
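
To make the first point, minimizing distance, more concrete, here is a minimal sketch of the closed-form least-squares solution for a single feature. It uses NumPy, which is not among the imports above, and it minimizes the sum of squared vertical distances between the line and the points; scikit-learn’s LinearRegression, which we use below, solves the same problem for us.

import numpy as np

# Closed-form ordinary least squares for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x = df["surface_covered_in_m2"].to_numpy()
y_vals = df["price_aprox_usd"].to_numpy()

slope_ols = np.cov(x, y_vals)[0, 1] / np.var(x, ddof=1)
intercept_ols = y_vals.mean() - slope_ols * x.mean()
print("Best-fit slope:", round(slope_ols, 2), "intercept:", round(intercept_ols, 2))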

3.3. Split

Once we’ve done our exploration, it’s time to split our dataset. But what exactly do we mean by the term “split”? There are two types of splits we perform at this stage.

Firstly, we divide our dataset into a feature matrix and a target vector. The feature matrix contains the variables our model examines to generate predictions, while the target vector identifies the variable our model aims to predict. As the feature matrix is multidimensional, it is represented by a capital letter “X.” On the other hand, the target vector is one-dimensional, so it is represented by a lowercase “y.”

Although feature matrices usually contain multiple columns, we will keep it simple for learning purposes and use only one column in this case.

# Create feature matrix and target vector
features = ["surface_covered_in_m2"]
target = "price_aprox_usd"
y = df[target]
X = df[features]
print(X.shape)
print(y.shape)
(6582, 1)
(6582,)

Our feature matrix is a DataFrame. Inspecting its shape, we see that it has 6,582 rows and one column, so it is two-dimensional. Our target, on the other hand, is a Series, not a DataFrame. Its shape shows 6,582 values and no second dimension after the comma, which makes it a vector.
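
A quick way to confirm these container types is to print them; the lines below show what standard pandas objects report.

# Confirm the container types: X is two-dimensional, y is one-dimensional
print(type(X))  # <class 'pandas.core.frame.DataFrame'>
print(type(y))  # <class 'pandas.core.series.Series'>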

The second split is splitting our dataset into train-test sets. The train-test split is a procedure that divides the data set into two parts — a training set and a testing set. The reason behind this division is to simulate the model’s performance on new data. Approximately 80% of the data is randomly sampled without replacement and used for training, while the remaining 20% is kept for testing.

By doing this, the model can be trained on a subset of the original data and then tested on the remaining portion to evaluate its performance on unseen data. It is important to validate the model’s performance on unseen data to ensure that it can generalize well and make accurate predictions.

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
X_train shape: (5265, 1)
y_train shape: (5265,)
X_test shape: (1317, 1)
y_test shape: (1317,)

Now we are ready to build our model.

4. Build Model

We finally get to build our model. Now remember that there are three different parts to building a model: baseline, iterate, and evaluate. So, let’s start with our baseline.

4.1. Baseline

Understanding the baseline performance of our model is crucial to ensure it is useful for our stakeholders. Essentially, we establish a benchmark for the model’s performance that it must achieve to be worthwhile.

A baseline model is a simple model that serves as a reference point for comparison. It is often called a “naive model” and can only make a single prediction, regardless of the input. This prediction is based on the type of problem we are trying to solve, whether it’s a regression problem or a classification problem.

In our case, we are predicting the price of an apartment, which can range from zero dollars to millions of dollars, making it a regression problem. To establish a baseline, we could simply predict the mean price of the apartments. This approach would not work for a classification problem, but that is not relevant to our project.

# Calculate the mean of the target vector
y_mean = y_train.mean()
y_mean

>>> 132015.1484482431

We need to create a list that repeats this single prediction for every observation in our dataset.

# Generate a list that repeats the prediction for every observation in our dataset
y_pred_baseline = [y_mean] * len(y_train)
y_pred_baseline[:5]
[132015.1484482431,
132015.1484482431,
132015.1484482431,
132015.1484482431,
132015.1484482431]

Now how does our baseline model perform? One way to evaluate it is by plotting it on top of the scatter plot we made above.

# plotting the baseline model
plt.plot(X_train, y_pred_baseline, color="red", label="Baseline Model")
plt.scatter(X_train, y_train)
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Buenos Aires: Price vs. Area")
plt.legend();
Figure (7): Buenos Aires: price vs. area.

Let’s analyze the scatter plot we have here. On the x-axis, we can see the apartment sizes, and on the y-axis, we have their prices. The dots on the scatter plot represent the apartments, and each dot has two values associated with it — the size and the price. We see that our model always predicts the same price for all apartment sizes. So, this scatter plot suggests that our baseline model doesn’t follow the trend in the data.

However, as data scientists, we can’t rely on a subjective visualization to evaluate our model. We need to use mathematical performance metrics to measure its performance accurately.

Performance metrics are used to evaluate the performance of a model. And the type of metric we use depends on the type of problem. There are different metrics for different types of problems. For regression problems, the most commonly used metric is the mean absolute error.
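
Conceptually, the mean absolute error is simply the average of the absolute differences between the true and predicted values. As a small sketch, we could compute it by hand for our baseline, and the result should match the scikit-learn function we use next.

# Mean absolute error computed manually:
# average of |actual - predicted| across the training observations
abs_errors = (y_train - y_pred_baseline).abs()
print("Manual baseline MAE:", round(abs_errors.mean(), 2))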

To calculate the mean absolute error, we’ll use the mean_absolute_error function imported from scikit-learn’s metrics module. This function takes two arguments: the true values (y_train) and the predicted values (y_pred_baseline).

We’ll print two things below: the mean apartment price and our baseline MAE, rounding both to two decimal places to avoid an unwieldy number of decimal points.

# Calculate the baseline mean absolute error
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Mean apt price", round(y_mean, 2))
print("Baseline MAE:", round(mae_baseline, 2))
Mean apt price 132015.15
Baseline MAE: 44393.95

What can we infer from this information? If we always predicted an apartment’s price to be $132,015.15, our predictions would be off from the actual price by an average of $44,393.95. For our model to be considered useful, its mean absolute error must come in below $44,393.95. Hence, we know what needs to be done and can proceed with building our model.

4.2. Iterate

The next step is to build a model. That involves creating a model from scratch, training it, making predictions, and evaluating its performance. This process is repeated until you are satisfied with the model’s performance. So, to begin, we need to instantiate a linear regression model named ‘model’ without any additional arguments. We will keep everything as default.

# Instantiate a `LinearRegression` model
model = LinearRegression()

Now we need to train it using our training data.

# Fit the model
model.fit(X_train, y_train)

4.3. Evaluate

The final step is to evaluate our model. To do that, we’ll start by seeing how well it performs when making predictions for data that it saw during training. So let’s have it predict the price for the apartments in our training set.

# Create a list of predictions
y_pred_training = model.predict(X_train)
y_pred_training[:5]
array([172532.64780511, 119255.8759812 ,  92617.49006925,  81518.16260594,
117036.01048854])

Now that we have predictions, we’ll use them to assess our model’s performance with the training data. We’ll use the same metric we used to evaluate our baseline model: mean absolute error.

# Calculate the training mean absolute error
mae_training = mean_absolute_error(y_train, y_pred_training)
print("Training MAE:", round(mae_training, 2))
Training MAE: 30396.72

Good news: our model beat the baseline by roughly $14,000! That’s a good indicator that it will help predict apartment prices. But the real test is how the model performs on data that it hasn’t seen before: the test set.

# Calculate the testing mean absolute error
y_pred_test = pd.Series(model.predict(X_test))
print(y_pred_test.head())
mae_testing = mean_absolute_error(y_test, y_pred_test)
print("Testing MAE:", round(mae_testing, 2))
0    199171.033717
1    105936.683025
2    154773.723864
3     97057.221055
4    148114.127386
dtype: float64
Testing MAE: 30633.42

The performance of your model during testing should be similar to its performance during training. In practice, the testing metrics are often slightly worse, meaning a larger mean absolute error. However, if the training and testing performance are close to each other, you can be confident that the model will generalize well.
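
Since we imported scikit-learn’s metrics module at the start and mentioned R-squared, we could optionally report it on the test set as a second check. This is just a sketch; the exact value depends on the fitted model.

# Optional second metric: coefficient of determination (R-squared) on the test set
r2 = metrics.r2_score(y_test, y_pred_test)
print("Testing R^2:", round(r2, 2))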

During the iteration phase, you can change and retrain your model as many times as you want, and you can also repeatedly check the model’s training performance. However, once you evaluate the test performance, you can no longer make any changes.

A test is only valid if neither the model nor the data scientist has seen the data before. If you check the test metrics and then make changes to the model, you can introduce biases that can compromise its generalizability.

5. Communicate Results

Once our model is built and tested, it’s time to share it with others. If we’re presenting a simple linear model to a technical audience, they might appreciate an equation. When we created our baseline model, we represented it as a line. The equation for a line like this is usually written as y = mx + b.

Since data scientists often work with more complicated linear models, they prefer to write the equation as y = β₀ + β₁x

Regardless of how we write the equation, we need to find the values that our model has determined for the intercept and coefficient. Fortunately, all trained models in scikit-learn store this information in the model itself. Let’s start with the intercept.

# Extract the intercept from the model
intercept = round(model.intercept_, 2)
print("Model Intercept:", intercept)

>>> Model Intercept: 12702.33

Next comes the coefficient. We’ll extract it in a very similar way.

# Extract the coefficient associated `"surface_covered_in_m2"` in the model
coefficient = round(model.coef_[0], 2)
print('Model coefficient for "surface_covered_in_m2":', coefficient)

>>> Model coefficient for "surface_covered_in_m2": 2219.87

Now that we have our intercept and coefficient, we need to insert them into a string so that we can print out the complete equation.

# Print the equation that the model has determined for predicting
print(f"Apartment_price = {intercept} + {coefficient} * surface_covered_in_m2")
Apartment_price = 12702.33 + 2219.87 * surface_covered_in_m2
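
As a quick sanity check, we can plug the 80-square-meter apartment from earlier into this equation. With the coefficients above, the prediction comes out to roughly $190,000, close to the value we read off the plot in Figure 5.

# Predict the price of an 80 sq-meter apartment using the fitted equation
area = 80
predicted_price = intercept + coefficient * area
print(f"Predicted price for {area} sq meters: ${predicted_price:,.2f}")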

When you’re communicating with other data scientists and your models are not too complex, equations can be very useful. But when you’re dealing with stakeholders, equations are not the best option. Non-technical audiences prefer visual aids, such as a scatter plot, to comprehend data insights. In this article, we can use the scatter plot we created earlier to illustrate the line that the equation would generate.

# The relationship between the observations in `X_train` and the model's predictions
plt.plot(X_train, y_pred_training, color="red", label="Linear Model")
plt.scatter(X_train, y_train)
plt.xlabel("surface covered [sq meters]")
plt.ylabel("price [usd]")
plt.title("Linear Model: Price vs Area")
plt.legend();
Figure (8): Linear Model: Price vs Area.

We have created a linear model that fits the data much better than our baseline model. This is evident from the scatter plot, which shows the relationship between the size and price of an apartment in Buenos Aires and how closely our model follows the data. The mean absolute error was also much better than the baseline. This is an exciting start to our project, and we look forward to building on this linear model and creating new ones in upcoming articles. Thank you for reading, and see you in the next article.
