Ask Julio about Regression

Deepesh Wadhwani
7 min read · Apr 30, 2020

A property broker can teach ‘what is regression analysis’ better than most people. Let us learn about one of the most widely used techniques in Machine Learning from his story.

Julio becomes a Property Broker

Julio always wanted to work in real estate. He started his career as most people do, as an apprentice to an experienced master property broker. There is a lot to learn in real estate, and one of the most challenging problems is how to know beforehand the expected price of a house that has not been sold yet. During his apprenticeship, he followed the master in the hope of understanding how he estimated the expected market value of a house.

At this moment, if we ask ourselves what factors will fetch more money for a house, the answer is simple. The locality, the size of the house (also called its square footage), amenities like a swimming pool or gymnasium, and the distance from the nearest market and school: all these and countless other factors affect house prices.

In Machine Learning lingo, the ‘House Price’ (or any such variable that we want to predict) is called a TARGET VARIABLE. The other factors, like the square footage of the house (usually anything that is the answer to a simple factual question like ‘what is the size of your house?’), are referred to as EXPLANATORY VARIABLES.

Julio, too, realized that the price mainly depends on these features. However, some part of the final sale value still depends on the mood of the buyer and the seller. If they feel euphoric, they let go of some stubbornness during negotiations, whereas if they are low-spirited, they are more unyielding. Either way, Julio realized that an analysis could predict a price close to the final value, yet seldom with 100% accuracy. He accepted this as a fact of life and continued learning anyway.

The House Price, or any such variable whose exact value is not known in advance and may depend on a random phenomenon like the mood of a person, is called a RANDOM VARIABLE.
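To make the idea concrete, here is a minimal Python sketch, not Julio's actual method. It assumes the sale price is a predictable part (a made-up price per square foot times the size) plus a random 'mood' part. The function name, the price per square foot, and the size of the mood swing are all hypothetical, chosen only for illustration.

```python
import random

def simulated_sale_price(square_footage, price_per_sqft=100.0,
                         mood_spread=5000.0, rng=None):
    """Return one hypothetical sale price: a predictable part plus a random 'mood' part."""
    rng = rng or random.Random()
    predictable_part = price_per_sqft * square_footage
    # Random swing caused by how the negotiation happens to go (assumed bell-shaped).
    mood_effect = rng.gauss(0.0, mood_spread)
    return predictable_part + mood_effect

rng = random.Random(42)  # fixed seed so the run is repeatable
prices = [simulated_sale_price(1500, rng=rng) for _ in range(5)]
for price in prices:
    print(round(price))
```

The same 1,500 square foot house 'sells' for a slightly different amount on every simulated run, which is exactly why the price can be estimated closely but not known exactly in advance.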

Julio has now made his conclusion:

(Note: the sign ~ is read as ‘is dependent on’)

This way of learning, where a master is present to tell Julio whether his prediction of the house price was right or wrong (and, if wrong, by how much), is called SUPERVISED LEARNING.

From the story so far, we can make a few more observations:

  • Some variables, like house price, square footage, or distance from the market, are numbers that can take any value in a range, including decimal values. These are called CONTINUOUS VARIABLES.
  • Other variables, like swimming pool Yes/No, gym Yes/No, or house facing direction North/South/East/West, are discrete: they can take only one of a fixed set of values. These are called CATEGORICAL VARIABLES.
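The difference between the two kinds of variables shows up as soon as a house has to be written down as numbers. The sketch below is only an illustration: the field names and values are invented, and turning each category into 0/1 columns is one common convention, assumed here rather than taken from Julio's story.

```python
# Hypothetical ledger row; field names and values are invented for illustration.
house = {
    "square_footage": 1450.5,      # continuous
    "distance_to_market_km": 2.3,  # continuous
    "swimming_pool": "Yes",        # categorical (Yes/No)
    "facing": "East",              # categorical (North/South/East/West)
}

FACING_LEVELS = ["North", "South", "East", "West"]

def encode(house):
    """Turn one house into a list of numbers.

    Continuous values pass through unchanged; each categorical value
    becomes one or more 0/1 indicator columns.
    """
    row = [house["square_footage"], house["distance_to_market_km"]]
    row.append(1.0 if house["swimming_pool"] == "Yes" else 0.0)
    row.extend(1.0 if house["facing"] == level else 0.0 for level in FACING_LEVELS)
    return row

print(encode(house))  # [1450.5, 2.3, 1.0, 0.0, 0.0, 1.0, 0.0]
```

The continuous values keep their decimals, while 'East' becomes a single 1 in the East column and 0 everywhere else.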

Coming back to Julio-

He now knows that:

As we can easily observe, the target variable in Julio’s analysis is continuous. A problem of this kind, where a continuous target variable is predicted, is called REGRESSION ANALYSIS.

Julio continues to be the shadow of the master broker and, over the years, gets promoted from apprentice to Junior Broker. He saw hundreds of houses, sat in on many negotiations, and with experience he got better at predicting house prices. During this time with the master, he maintained a ledger of all the houses he visited for his own edification. The details look as follows:

The above dataset, or any other dataset that has more than one Explanatory Variable, is referred to as ‘MULTIVARIATE.’ It needs a “MULTIPLE REGRESSION ANALYSIS,” in which more than one Explanatory Variable is used to predict the value of a Target Variable.

Julio realized that handling so many variables at once might be difficult, and that he should start small. He decided to use just one Explanatory Variable for now, the Square Footage of the house, effectively reducing the dataset to:

This kind of problem, which uses only one Explanatory Variable, is called “SIMPLE REGRESSION ANALYSIS.”

Julio also realized that a visual representation of the data might make it easier to draw conclusions about pricing. He plotted Square Footage on the horizontal axis (aka the x-axis) and House Price on the vertical axis (aka the y-axis).

Every point on the plot represents a house from the table above. The X-coordinate is the Square footage of the house, and the Y-coordinate is its price. Thus the number of points on the plot is equal to the number of rows in the above dataset.

This plot is referred to as a SCATTER PLOT.

As soon as the plot was ready, Julio realized that the bigger the house, the pricier it will be. To understand how the price varied with size, he drew a central line through the points by hand.

This method of fitting a straight line through the points of a scatter plot is called LINEAR REGRESSION. Because we are using just one explanatory variable, we can call it SIMPLE LINEAR REGRESSION.
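Julio drew his line by eye. A computer needs a rule for choosing the line, and the most common rule is 'least squares', which picks the line with the smallest total squared vertical distance to the points. The sketch below swaps that rule in for Julio's hand-drawn line; the ledger values are invented for illustration and are not from Julio's dataset.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the points: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical ledger: square footage and sale price of five houses.
square_footage = [800, 1000, 1200, 1500, 2000]
price = [82000, 98000, 123000, 149000, 201000]

slope, intercept = fit_line(square_footage, price)
print(f"price = {intercept:.0f} + {slope:.2f} * square_footage")
```

The two numbers returned, the slope and the intercept, describe the whole line, so they are all Julio would need to carry with him instead of the graph.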

He argued that the hand-drawn line passes through the middle of all the points and hence reflects the very nature of how real estate pricing behaves.

He also noted that the line passes through the origin, which means that if the Square Footage is zero, the price will be zero as well. He knew this was an obvious conclusion. Still, he was happy that the analysis arrived at it on its own, without being explicitly told. This gave him more confidence in his study.

He observed another critical piece of information from the line: its tilt. He explained that the slope shows how expensive the city is. A higher slope means that the price rises more rapidly as the area of the house increases.
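A line through the origin is described by its slope alone, so the slope can be read as 'price per extra square foot'. The sketch below assumes a least-squares fit forced through the origin (a stand-in for Julio's hand-drawn line) and compares two hypothetical cities; all figures are invented for illustration.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for a line forced through (0, 0): price = slope * area."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Invented figures for two hypothetical cities: same house sizes, different prices.
area = [800, 1000, 1500, 2000]
price_city_a = [80000, 101000, 149000, 200000]
price_city_b = [160000, 198000, 302000, 401000]

slope_a = fit_through_origin(area, price_city_a)
slope_b = fit_through_origin(area, price_city_b)

print(f"City A: about {slope_a:.0f} per extra square foot")
print(f"City B: about {slope_b:.0f} per extra square foot")
```

The city with the steeper line is the more expensive one: every additional square foot there adds more to the price.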

Until now, working in the shadow of the master, Julio never used to give his price predictions to clients. But those times are gone: Julio is now a Junior Broker. He is supposed to go to client meetings alone, and so he will have to predict house prices himself. He feels confident after this analysis, as he has come up with an intuitive way to predict the price. Most of the information he needs comes from his graph, and the little else that he needs he gets from the client by asking a straightforward factual question: “What is the square footage of your house?”

He locates the square footage (SF) value given by the client on the x-axis of his graph and looks for the point on the red line directly above it. This point on the red line, in all likelihood, represents the client’s house, and “thus, the price corresponding to this point will be very close to the price at which the house will eventually be sold,” he argued.
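Reading the price off the line is the same as plugging the square footage into the line's equation. A minimal sketch, assuming a line through the origin and a made-up slope of 100 per square foot (both values are hypothetical, as is the function name):

```python
def predict_price(square_footage, slope, intercept=0.0):
    """Price at the point on the line directly above the given square footage."""
    return intercept + slope * square_footage

slope = 100.0     # hypothetical: price per square foot read from Julio's line
client_sf = 1200  # the client's answer to "What is the square footage of your house?"

estimate = predict_price(client_sf, slope)
print(estimate)  # 120000.0
```

The estimate is the point on the line, not a guarantee: the actual sale price will usually land a little above or below it.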

Julio knows that even if this predicted value is not 100% accurate, it comes very close to the real value.

While driving back from the client’s house, Julio thought to himself, “How wonderful it would be if I could use all the Explanatory Variables instead of just the Square Footage.” He felt optimistic and decided to formally learn REGRESSION ANALYSIS.

~~ The Author intended this article to be read as Pre-Read before a regression analysis session. ~~

About ‘Ask Julio…’ series:

Ask Julio (pronounced like this) is a series attempting to explain Machine Learning to the ordinary; to those who in high school questioned what math does in real life; to those who have seen a code screen only in movies; and also to those who hear the words artificial intelligence and imagine Arnold Schwarzenegger from The Terminator.

To speak to the 99%, Julio, the lead character of the series, steps into the shoes of many different professions to experience how they learn things. During this journey, he asks a critical question: ‘How can anything be “learned”?’

This series of articles hopes to inspire people to learn Machine Learning by demonstrating one key point — it’s easier done than said.

‘Ask Julio…’ is an ongoing work and you can find all the other articles in the series here.

About the Author:

Deepesh Wadhwani, a Mechanical Engineer turned Data Scientist, has been associated with Data Analysis and Prediction for many years. He created Risk Prediction models for Wells Fargo Bank and Pricing Models for Mitsubishi FUSO (Japan) before he pivoted to academics and started as Senior Faculty of Machine Learning and Data Science at the International School of AI and Data Science. The cornerstone of his pedagogy is the effective translation of the math (the primary language of ML algorithms) into everyday arguments, which enables everyone in the audience to connect with complex algorithms. In his classes, he uses a blend of logical reasoning and live demonstrations to reinforce the learning.
