How To Make Statistical Forecasts The Hard Way: DIY Regression Lines (#BusinessyBrunette HBX Week 7)

Creatrix Tiara
12 min readFeb 28, 2016

Welcome to #BusinessyBrunette! I am currently studying Harvard Business School’s HBX CORe, their online pre-MBA program teaching the fundamentals of financial accounting, business analytics, and economics. Every week I‘ll write up what I’ve learned — making it meaningful & accessible for artists, activists, geeks, nerds, fans, and anyone else who doesn’t fit the MBA Mould. I’m learning as I go, so feel free to critique, comment, tell me if I’ve messed up or did well, mash up, and share! [See the rest of the series here]

This piece was written with the help of my math consultant Elizabeth Goodman, a Mathematics Ph.D. from Stanford and an all-round badass. She’s currently looking for industry work, so if you need an intelligent, kind, passionate, friendly maths & social justice person on your team get in touch with her!

Business Analytics 4: Single Variable Linear Regression

Companies and organizations often use regression analysis to figure out the relationship between two variables. Often these relates to price — such as the selling price for houses by size, jewelry by carat, or ice-cream by temperature. Through regression analysis, they come up with a regression line, or “best fit” line to make forecasts about future data.

Now usually you’d use Excel or a specialist statistical program to create these lines for you, through functions such as =FORECAST, =CORREL, or the Regression Output Table. However, there is a way to do this manually —whether to better understand how these functions work, just for funsies, or because you’re some kind of mathematical masochist (I won’t kinkshame).

For this demonstration, we’ll use data about height and shoe size as collected by Constance H. McLaren and posted on the Journal of Statistics Education Data Archive, which has a ton of datasets for educational purposes. Since the data set is long, I’ll post screenshots of snippets, but you can also view the data for yourself right here on Google Docs.

1. Figure out your X-axis and Y-axis

The variable on your X-axis, the horizontal axis, is also known as the independent variable. This usually involves data that doesn’t change solely because of what the Y-axis is doing. Time is a common independent variable; also common are sizes (relative to sales), temperature (relative to human behavior), or past sales success (relative to future sales success).

The variable on your Y-axis, the vertical axis, is also known as the dependent variable. This is usually the factor you’re trying to predict, and will change in relation to the X-axis. The amount of money you make is a very common dependent variable. How much would a house sell at a given price? How much ice-cream gets eaten on a cold or warm day? We made an additional $100,000 ever year for the past few years — will that continue?

Think of it as building a sentence: “The more [X], the more/less likely [Y]”. The bigger the house, the more it’ll sell for. The warmer the weather, the more ice-cream you’ll sell, and so on.

In our example, both “the taller you are, the larger your shoe size” and “the larger your shoe size, the taller you are” kind of work, but the latter is a little more awkward (your shoe size doesn’t affect how tall you are), so we’ll make Height the X-Axis and Size the Y-Axis.

2. Figure out the average (mean) of both X and Y independently.

Pretty simple — add up everything in that column and then divide it by the # of items in your column. For simplicity’s sake, I had decided to round up the results because I didn’t want to deal with too many decimals when it comes to averages. (It’s more useful to say “the average height is 68” rather than “the average height is 68 point a gazillion”.)

However, rounding will remove accuracy. This introduces errors to your data, which will eventually add up. We’re lucky that in this case the errors are not that big, but if you really want a good prediction, try not to round up numbers so early in the process. We’ll see what errors this causes in a few steps from now.

3. Calculate the difference between each data point and its mean.

For example, if your data point has a height of 80 inches and the mean is 68, the difference is 12. Put the results in their own columns.

4. Make three columns: X*Y, X², and Y², and calculate them for each difference.

In each column, make each calculation for each equivalent difference per data point (usually the same row). So you’re not working with a height of 60 and a size of 9 — you’re working with a height difference of -8 and a size difference of -1. In this instance, -8 multiplied by -1 (X*Y) is 8, -8 squared (X²) is 64, and -1 squared (Y²) is 1.

5. Add up each column.

The most important columns to add up are the X*Y, X² and Y² columns. This will give us the sum of squares, which we’ll need for the rest of this exercise. Again, for the purposes of this demo, I’ve rounded the numbers to make them easier to handle.

This step shows you why you want squares: because you want to eliminate the negatives. If you just added up the total height and size differences, the negatives and the positives start cancelling each other out, sometimes to the point of negative results. Squares are always positive, so they make good working numbers.

Here’s where the errors of rounding come to play. If we didn’t round our numbers, Total Height and Size Differences should be zero. The idea is that the negatives and positives should be cancelling each other out. But because we rounded our information early, we suddenly see numbers that probably shouldn’t be there.

For the purposes of regression analysis, those totals aren’t that important, but they can be important for error-checking. Here’s what the totals would be without rounding, and with rounding to one decimal point rather than with no decimal points like earlier.

For the rest of this piece we’ll continue with the rounded totals, but keep that error in mind.

6. Calculate the correlation coefficient, or r.

The correlation coefficient tells you how strong the correlation is between two variables — basically, how true that sentence you made in #1 (“the more X the more/less likely Y”) actually is. It’s always between 0 and 1; the closer to 1, the more correlated your data. (Though be aware of spurious correlations! There are many ways that such supposed correlations can be misleading.)

The formula for r looks a little scary, but we basically have everything we need. It is the sum of squares for X*Y divided by the square root of the product of the sum of squares for X² and Y².

Let’s plug our numbers into this formula. Our total for X*Y is 3069, our total for X² is 7293, and our total for Y² is 1741.

Multiply the bottom half together (7293 * 1741) and you get 12697113.

The square root of 12697113 is 3563.300857.

3069 divided by 3563.300857 is 0.8612800667. This is our correlation coefficient, or r!

Remember what we said up ahead about how the closer r is to 1, the more correlated our data is. Here, we can see that there is a pretty strong correlation between height and shoe size — it is possible to say “the taller you are, the bigger your shoe size”. However, it’s not as strong as it could be; indeed, if you look at the data set, there’s a fair bit of variation.

Note that the acceptable r value is very different depending on circumstance. If you’re studying the effects of human behavior, you may be allowed a lower r value, because human behavior is hard to predict. For events that have a stronger causal relationship — say, house sizes and selling prices — you may need higher r values.

OK! So we’ve made it this far — well done! — but there’s still a bit left to do. Let me first introduce you to this formula — ŷ = a + bx:

This formula is what you need to calculate your regression/best fit line. You may notice that it’s very similar to the slope intercept form, y = mx + b — because that’s essentially what this is. You’re trying to create a line on a graph that has a particular slope, direction, and intercept points.

, charmingly also known as y-hat, is the dependent variable or response variable — here, it’s the value you’re trying to predict. What shoe size would someone that’s 65 inches (a little over 5'4", or 1.65 metres) tall have? That shoe size you’re looking for is y-hat.

a is the y-intercept, the point on the Y-axis where x = 0. The higher this number, the higher your line will sit on the chart. Calculating this is a bit of a process — to do this you need…

…b, which is also your slope. This determines how steep your line is — the higher the number, the steeper it is.

x is the independent variable, which you need to find y-hat. In that question above about the shoe size of someone who is 65inches tall, 65 is your x value.

Let’s keep going with our regression line!

7. Calculate the slope (b).

The formula for the slope is the sum of squares for X*Y divided by the sum of squares for X². In our case, we already have those numbers from earlier, 3069 and 7293, and when we divide them we get 0.4208144796.

8. Calculate the y-intercept (a).

The formula for the y-intercept sounds a lot scarier than it really is, especially since we’ve done the hard work already. It is the average of Y minus the product of the slope and the average of X.

Let’s plug in our numbers. Our average for Y, the average shoe size (also known as y-bar) is 10. Our slope, as calculated earlier, is 0.4208144796. Our average for X, the average height (also known as x-bar) is 68.

0.4208144796 multiplied by 68 is 28.61538462.

10 minus 28.61538462 is -18.61538462 — which is now our y-intercept, or a.

9. Build your regression line equation and test it.

Hurrah! You’ve done most of the heavy lifting, and you now have an equation for your regression line. For this demo, your regression line equation is ŷ = -18.61538462 + 0.4208144796(x):

Let’s try this with the question we had earlier. What is the likely shoe size for someone who is 65 inches tall?

Someone who is about 65 inches / 5'4" / 1.65 m tall would have a shoe size of about 9, rounded up. That sounds about right — that’s roughly my height and my shoe size. Let’s look at our data set for shoe sizes for people who are 65 inches tall:

There’s quite a bit of variation, but 8.5 to 9.5 in shoe size is within the realm of possibility for someone who is 65 inches tall. (To really measure the span of this realm of possibility you’d need to create prediction intervals, which are like confidence intervals but slightly different. That’s outside the scope of this article though!)

We don’t have anyone in our data set who’s 79 inches tall. What would their likely shoe size be?

Someone who’s 79 inches tall would have a shoe size of about 15, rounded up. Given that this is a person who is nearly 6'6" / over 2 meters tall, it makes sense that their shoe size will be big!

10. Calculate ŷ for every point on your data set.

You have this formula, now apply it to everything in your data set. Since this is a prediction, you’re going to have a lot of data points with the same results: that’s totally fine.

11. Graph it out!

On the left is a scatter plot of our shoe sizes by height, based on the data set. On the right is our DIY regression line, made with our height information and the rounded ŷ information for shoe size. Technically it should be one straight line, not staggered, but I couldn’t quite figure out how to get it to smooth itself out.

Here’s how Google Sheets does their trendline, which is also another term for a regression line or best fit line, and here’s how it compares to our line:

Not bad! Ours started a little higher than 5 on shoe size while Google’s started at 5, so they calculated regression for information slightly outside our data set — which makes sense since it’s a straight line. But it’s pretty close otherwise.

Here’s what happens when you put the two lines together — it’s a little tricky to do on Google Sheets but this is close:

Awesome! You now know how to make your own forecasts and predictions! You’d probably still use software for this, rather than brute-force your way through, but at least now you know how these numbers and charts came to be. Once again, you can view the data set and charts we used on Google Sheets, including data sets and calculations without rounding and with different roundings, and you should totally send love, admiration, and job offers to Elizabeth (since this was mostly her genius at work anyway).

Part of the #BusinessyBrunette Series | See More in the Series

Enjoyed this? Please support my work by subscribing to my Patreon and getting early access to my #BusinessyBrunette posts as well as other creative work! You can also make a one-off contribution or signal boost my work on social media! Thank you so much!

--

--

Creatrix Tiara

liminality, culture, identity, tech, activism, travel, intersectionality, fandom, arts. signs up for anything that looks interesting. http://creatrixtiara.com