Some examples using Python for simple linear regression modeling and visualization.

Linear regression is a technique for predicting a quantitative response and is fundamental to learn in economics, machine learning, and statistics in general. In economics it is used within econometrics to analyze and test hypotheses, such as the income effect, and in machine learning it is one of the basic techniques of supervised learning.
We will walk through two simple examples. The first is more explanatory and uses a sample data set from an econometrics textbook, Principles of Econometrics (linked here), which also has examples in R that can be found on the linked website. The second uses a data set from the book An Introduction to Statistical Learning, which is a great resource for introductory machine learning techniques and can be downloaded here. To learn more about regression modeling beyond the scope of this brief article, both books are great resources; I recommend starting with An Introduction to Statistical Learning, which is less detailed but covers all of the bases of regression analysis. Both also have many downloadable data sets that can be used for your own learning and work, and there are plenty of other online articles related to Python and regression analysis. In this article I try to keep things as simple as possible, with the least amount of code for both examples (outside the length of the arrays).
Also, if you are not familiar with some of the programming terms used, such as arrays, classes, and packages, I highly recommend the website realpython.com, which is one of the best websites I have come across for learning Python. I also used Jupyter notebooks for all the Python work (via the Anaconda Data Science Toolkit); these can be viewed via the links at the very end of the article.
Simple Linear Regression
Simple linear regression is a straightforward approach in predictive modeling where the goal is to predict a quantitative response (the dependent variable) based on a single predictor variable (the independent variable). For example, if X increases, Y increases, or if X increases, Y decreases. There is some similarity to correlation, but correlation quantifies the strength of the linear relationship between two variables, while regression quantifies the nature of that relationship, i.e., how the response is expected to change as the predictor changes.
Mathematically, the simple linear regression model line can be represented as: Y = b₀ + b₁X
Where Y is the dependent variable, b₀ is the intercept, b₁ is the slope/coefficient, and X is the independent variable.
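As a quick worked example with purely hypothetical numbers: if b₀ = 80 and b₁ = 10, then an X value of 20 gives a predicted Y of 80 + 10 × 20 = 280. In Python:
b0, b1 = 80, 10     # hypothetical intercept and slope
x = 20              # a hypothetical value of the independent variable
print(b0 + b1 * x)  # predicted Y: 280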
It’s important, and hopefully obvious, to note that mathematically there are many more details that go into simple linear regression equations and analysis (which I may expand upon in the future, in this article or another). To keep this article short and to the point, I will leave that for the reader to explore, either in the previously mentioned resources or in the very succinct explanation found in the book Practical Statistics for Data Scientists.
Writing the code
In order to do linear regression modeling in Python, a couple of packages need to be imported: NumPy and scikit-learn. NumPy, which stands for Numerical Python, is a Python package used for operations on arrays, and scikit-learn is a widely used package for machine learning. So first we import numpy with the alias np, along with the LinearRegression class from the sklearn.linear_model module, like so:
import numpy as np
from sklearn.linear_model import LinearRegression
The from statement is used to import the LinearRegression class from the linear_model module of the sklearn package.
Next we will need to provide our data. In this example we will use a small food sample data set, which can be seen here. In machine learning terms this sample data set is sometimes called training data, the data used to train or teach our model to make predictions. Below we can see a visual of the data in a scatter plot:

Looking at this, we may assume there is a relationship based on the data points. Income shows various weekly income levels (in $100), while food_exp shows various weekly food expenditure levels (in $). The assumed relationship here is that with increasing income levels there will be increasing food expenditures.
We will need to input this data into arrays and assign variables to represent the data. Below I have assigned our independent variable (often written as X or x) and our dependent variable (often written as Y or y) the lists of values from our data set.
income = np.array([3.69, 4.39, 4.75, 6.03, 12.47, 12.98, 14.2, 14.76, 15.32, 16.39, 17.35, 17.77, 17.93, 18.43, 18.55, 18.8, 18.81, 19.04, 19.22, 19.93, 20.13, 20.33, 20.37, 20.43, 21.45, 22.52, 22.55, 22.86, 24.2, 24.39, 24.42, 25.2, 25.5, 26.61, 26.7, 27.14, 27.16, 28.62, 29.4, 33.4]).reshape((-1, 1))

food_exp = np.array([115.22, 135.98, 119.34, 114.96, 187.05, 243.92, 267.43, 238.71, 295.94, 317.78, 216, 240.35, 386.57, 261.53, 249.34, 309.87, 345.89, 165.54, 196.98, 395.26, 406.34, 171.92, 303.23, 377.04, 194.35, 213.48, 293.87, 259.61, 323.71, 275.02, 109.71, 359.19, 201.51, 460.36, 447.76, 482.55, 438.29, 587.66, 257.95, 375.73])

print(income)
print(food_exp)
Here you may note a few things. We view our values via print, and our independent variable, income, is a one-dimensional array with .reshape((-1, 1)) called at the end. The reshape method converts our one-dimensional array into a two-dimensional array with as many rows as needed (-1) and one column (1). scikit-learn requires this shape for the independent variable in order to do the regression analysis. Our dependent variable, food_exp, can remain a one-dimensional array.
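To see what reshape is doing, here is a minimal sketch with a small made-up array (np is the NumPy alias imported above):
a = np.array([1, 2, 3])
print(a.shape)                   # (3,)   -> one-dimensional
print(a.reshape((-1, 1)).shape)  # (3, 1) -> two-dimensional: 3 rows, 1 column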
Next we create our model variable and fit it in a single line:
model = LinearRegression().fit(income, food_exp)
What this is essentially doing is assigning the variable named model to a new instance of the class LinearRegression(), then calling .fit(), which takes our two variables, income and food_exp, as arguments. .fit() calculates the estimators of the model, i.e., the optimal weights b₀ (our intercept) and b₁ (our slope), finding the best “fitted” line through our data points.
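For the curious, .fit() here is solving an ordinary least squares problem. As a minimal sketch (not scikit-learn’s actual implementation, which is more general), the simple linear regression weights can also be computed directly with NumPy’s closed-form formulas:
x = income.flatten()                                   # back to one dimension for the math
b1 = np.cov(x, food_exp, bias=True)[0, 1] / np.var(x)  # slope = cov(x, y) / var(x)
b0 = food_exp.mean() - b1 * x.mean()                   # intercept = mean(y) - b1 * mean(x)
print(b0, b1)                                          # should match model.intercept_ and model.coef_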
Now that the model is fitted we can check the following values: R-squared (R²), the intercept (b₀), and slope (b₁). To summarize:
R-squared (R²), also called the coefficient of determination, is essentially a measure of how close all the sample values from our data set are to the fitted regression equation (the regression line visually). It is the proportion of variance explained by the model, from 0 to 1. The closer R² is to 1 the more closely the sample values fall onto the line, i.e. if R² is equal to 1 then all the sample values fall on the line perfectly. The closer R² is to 0 the further the sample values are from the line, and if equal to 0 then the sample data values are uncorrelated and have no linear association or relationship.
Intercept (b₀): is the predicted value of the dependent variable when the independent variable (income in this example) is equal to 0.
Slope (b₁): is the slope of the regression line, i.e., the predicted change in the dependent variable for each one-unit increase in the independent variable.
r_squared = model.score(income, food_exp)
print('coefficient of determination:', r_squared)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
This will give us the following output, which shows an R² below 0.4, an intercept of roughly $83.42 of weekly food expenditures at $0 income, and a slope of roughly 10.21, meaning for every additional unit of income ($100 weekly), there is roughly a $10.21 increase in weekly food expenditures:
coefficient of determination: 0.3850022272112529
intercept: 83.41600202075946
slope: [10.20964297]
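To connect these numbers back to the definitions above, here is a minimal sketch recomputing R² by hand as one minus the ratio of the residual sum of squares to the total sum of squares, which is what .score() returns (pred is just a temporary name for the fitted values):
pred = model.predict(income)                        # fitted values on the regression line
ss_res = np.sum((food_exp - pred) ** 2)             # residual sum of squares
ss_tot = np.sum((food_exp - food_exp.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                          # ≈ 0.385, same value as model.score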
Next we will compute the predicted response, or the predicted values of our dependent variable food_exp:
food_exp_pred = model.predict(income)
print('predicted response:', food_exp_pred, sep='\n')
which provides the following output:
predicted response:
[121.08958457 128.23633465 131.91180612 144.98014912 210.73024983
215.93716775 228.39293217 234.11033223 239.82773229 250.75205027
260.55330752 264.84135756 266.47490044 271.57972192 272.80487908
275.35728982 275.45938625 277.80760413 279.64533987 286.89418638
288.93611497 290.97804356 291.38642928 291.99900786 302.41284369
313.33716166 313.64345095 316.80844027 330.48936185 332.42919401
332.7354833 340.69900482 343.76189771 355.0946014 356.01346927
360.50571218 360.70990503 375.61598377 383.57950528 424.41807716]
The predicted response is essentially showing us, for each income value in our data set, what dependent variable value our regression line predicts. For example, for our first income value, 3.69 ($369 weekly), we would predict roughly $121.09 of weekly food expenditures; at our second income value, 4.39 ($439 weekly), roughly $128.24; and so on.
If we wanted to provide a different set of income values, we could use our model to predict what the food expenditure amounts might be.
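For example, here is a minimal sketch using a few hypothetical weekly income levels (values made up for illustration):
new_income = np.array([10, 20, 30]).reshape((-1, 1))  # hypothetical income levels, in $100
print(model.predict(new_income))                      # predicted weekly food expenditures, in $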
Visualizing our simple linear regression analysis
Now we can visualize our simple linear regression using Python’s Matplotlib library, plotting our data points and regression line along with setting the title, labels, colors, axis limits, and figure size:
import matplotlib.pyplot as plt

plt.figure(figsize=(9,7))
plt.scatter(income, food_exp, color = "blue")
plt.plot(income, model.predict(income), color = "black")
plt.title("example")
plt.xlabel("income = weekly income in $100")
plt.ylabel("food_exp = weekly food expenditure in $")
plt.axis([0, 40, 0, 700])
plt.show()
which provides the following output:

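As a side note, if you would also like to save the chart to an image file, Matplotlib’s savefig function can be called right before plt.show(), for example:
plt.savefig("regression_example.png")  # hypothetical filename; writes the current figure to disk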
A quick second example in simple steps
Here we are doing essentially the same thing, using the same process, but this time with the advertising data set provided by the An Introduction to Statistical Learning website; you can also view the data here.
Step 1: import necessary packages
import numpy as np
from sklearn.linear_model import LinearRegression
Step 2: provide and view data
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6, 199.8, 66.1, 214.7, 23.8, 97.5, 204.1, 195.4, 67.8, 281.4, 69.2, 147.3, 218.4, 237.4, 13.2, 228.3, 62.3, 262.9, 142.9, 240.1, 248.8, 70.6, 292.9, 112.9, 97.2, 265.6, 95.7, 290.7, 266.9, 74.7, 43.1, 228, 202.5, 177, 293.6, 206.9, 25.1, 175.1, 89.7, 239.9, 227.2, 66.9, 199.8, 100.4, 216.4, 182.6, 262.7, 198.9, 7.3, 136.2, 210.8, 210.7, 53.5, 261.3, 239.3, 102.7, 131.1, 69, 31.5, 139.3, 237.4, 216.8, 199.1, 109.8, 26.8, 129.4, 213.4, 16.9, 27.5, 120.5, 5.4, 116, 76.4, 239.8, 75.3, 68.4, 213.5, 193.2, 76.3, 110.7, 88.3, 109.8, 134.3, 28.6, 217.7, 250.9, 107.4, 163.3, 197.6, 184.9, 289.7, 135.2, 222.4, 296.4, 280.2, 187.9, 238.2, 137.9, 25, 90.4, 13.1, 255.4, 225.8, 241.7, 175.7, 209.6, 78.2, 75.1, 139.2, 76.4, 125.7, 19.4, 141.3, 18.8, 224, 123.1, 229.5, 87.2, 7.8, 80.2, 220.3, 59.6, 0.7, 265.2, 8.4, 219.8, 36.9, 48.3, 25.6, 273.7, 43, 184.9, 73.4, 193.7, 220.5, 104.6, 96.2, 140.3, 240.1, 243.2, 38, 44.7, 280.7, 121, 197.6, 171.3, 187.8, 4.1, 93.9, 149.8, 11.7, 131.7, 172.5, 85.7, 188.4, 163.5, 117.2, 234.5, 17.9, 206.8, 215.4, 284.3, 50, 164.5, 19.6, 168.4, 222.4, 276.9, 248.4, 170.2, 276.7, 165.6, 156.6, 218.5, 56.2, 287.6, 253.8, 205, 139.5, 191.1, 286, 18.7, 39.5, 75.5, 17.2, 166.8, 149.7, 38.2, 94.2, 177, 283.6, 232.1]).reshape((-1, 1))

sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8, 10.6, 8.6, 17.4, 9.2, 9.7, 19, 22.4, 12.5, 24.4, 11.3, 14.6, 18, 12.5, 5.6, 15.5, 9.7, 12, 15, 15.9, 18.9, 10.5, 21.4, 11.9, 9.6, 17.4, 9.5, 12.8, 25.4, 14.7, 10.1, 21.5, 16.6, 17.1, 20.7, 12.9, 8.5, 14.9, 10.6, 23.2, 14.8, 9.7, 11.4, 10.7, 22.6, 21.2, 20.2, 23.7, 5.5, 13.2, 23.8, 18.4, 8.1, 24.2, 15.7, 14, 18, 9.3, 9.5, 13.4, 18.9, 22.3, 18.3, 12.4, 8.8, 11, 17, 8.7, 6.9, 14.2, 5.3, 11, 11.8, 12.3, 11.3, 13.6, 21.7, 15.2, 12, 16, 12.9, 16.7, 11.2, 7.3, 19.4, 22.2, 11.5, 16.9, 11.7, 15.5, 25.4, 17.2, 11.7, 23.8, 14.8, 14.7, 20.7, 19.2, 7.2, 8.7, 5.3, 19.8, 13.4, 21.8, 14.1, 15.9, 14.6, 12.6, 12.2, 9.4, 15.9, 6.6, 15.5, 7, 11.6, 15.2, 19.7, 10.6, 6.6, 8.8, 24.7, 9.7, 1.6, 12.7, 5.7, 19.6, 10.8, 11.6, 9.5, 20.8, 9.6, 20.7, 10.9, 19.2, 20.1, 10.4, 11.4, 10.3, 13.2, 25.4, 10.9, 10.1, 16.1, 11.6, 16.6, 19, 15.6, 3.2, 15.3, 10.1, 7.3, 12.9, 14.4, 13.3, 14.9, 18, 11.9, 11.9, 8, 12.2, 17.1, 15, 8.4, 14.5, 7.6, 11.7, 11.5, 27, 20.2, 11.7, 11.8, 12.6, 10.5, 12.2, 8.7, 26.2, 17.6, 22.6, 10.3, 17.3, 15.9, 6.7, 10.8, 9.9, 5.9, 19.6, 17.3, 7.6, 9.7, 12.8, 25.5, 13.4])

print(tv)
print(sales)
Step 3: create our model
model = LinearRegression().fit(tv, sales)
Step 4: get our results
r_squared = model.score(tv, sales)
print('coefficient of determination:', r_squared)
print('intercept:', model.intercept_)
print('slope:', model.coef_)
Step 5: predict response
sales_pred = model.predict(tv)
print('predicted response:', sales_pred, sep='\n')
Step 6: visualize our data
import matplotlib.pyplot as plt

plt.figure(figsize=(9,7))
plt.scatter(tv, sales, color = "blue")
plt.plot(tv, model.predict(tv), color = "black")
plt.title("example")
plt.xlabel("tv")
plt.ylabel("sales")
plt.axis([0, 300, 0, 30])
plt.show()

If one wants to view non-executable versions of the Jupyter notebooks, see the following links: