Data Scientist Must Know — Quick Guide to Linear Regression in RStudio

A quick tutorial about linear regression in RStudio

Insufficient
Jan 20, 2023
Picture from Wikipedia

When it comes to machine learning models, linear regression is one of the simplest yet most effective models available. It is old, dating back to the early 1800s, yet it remains reliable. If you’re just starting out on your data science journey, you will almost certainly encounter this model. This post introduces linear regression in RStudio. The topics covered are building a model, interpreting its results, and using it to make predictions.

The data

The data used in this tutorial is a small fictional dataset with one response variable, y, and two explanatory variables, x1 and x2.

The Dataset for this post

To create this dataset in RStudio, you can use the following code:

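# Response variable and two explanatory variables
y <- c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21)
x1 <- c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14)
x2 <- c(1.5, 4, 5.5, 4, 3, 1, 6, 8, 3, 8)

# Combine the vectors into a data frame
df <- data.frame(y, x1, x2)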

The dataset will be available in RStudio as a data frame named ‘df’.
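Before fitting anything, it can help to confirm that the data frame looks as expected. The following is a minimal sketch using base R functions; the particular checks are a suggestion and not part of the original walkthrough.

```r
# Recreate the fictional dataset so this block runs on its own
y  <- c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21)
x1 <- c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14)
x2 <- c(1.5, 4, 5.5, 4, 3, 1, 6, 8, 3, 8)
df <- data.frame(y, x1, x2)

str(df)      # structure: number of rows and the type of each column
head(df)     # first six rows
summary(df)  # minimum, quartiles, mean and maximum of each column
cor(df)      # pairwise correlations between y, x1 and x2
```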

Simple Linear Regression

The first model we are going to discuss is simple linear regression, which uses only one explanatory variable. For this post, the model will have the form y = beta0 + beta1 × x1 + error, where beta0 is the intercept and beta1 is the slope for x1:

Simple Linear Regression only has one explanatory variable
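For a model with a single explanatory variable, the least-squares estimates have a simple closed form, which can be sketched by hand as below. The names `b0` and `b1` are illustrative choices for this sketch. The two values should agree with the Estimate column that lm() reports later in this post.

```r
# Recreate the two variables used by the simple model
y  <- c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21)
x1 <- c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14)

# Least-squares slope: covariance of x1 and y divided by the variance of x1
b1 <- cov(x1, y) / var(x1)

# Least-squares intercept: the fitted line passes through the point of means
b0 <- mean(y) - b1 * mean(x1)

c(intercept = b0, slope = b1)
```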

The code to build this model is as follows:

# Fit y as a linear function of x1
model1 <- lm(y ~ x1, data = df)

The model is named ‘model1’. After running this code, the model should appear in the Environment tab like this:

Next, we are going to look at the model. To do this, use the following code:

summary(model1)

The results will be as follows:

The output of ‘model1’

Obviously there won’t be colored squares when you run it! Here is how to read the results of the model:

Black box: The terms in the model. Since this is a simple linear regression model, there are only two coefficients: the intercept, which lm() includes by default, and the slope for x1.

Red box: The estimates of the coefficients (the Estimate column), basically the beta0 and beta1 of the model. Therefore, the estimated model has the form y = beta0 + beta1 × x1, with both values read from this column:

The estimated model

Blue box: The t-statistic of each coefficient (the t value column) from its t-test.

Green box: The p-value of the t-test of each coefficient (the Pr(>|t|) column). The p-value is easier to interpret than the t-statistic itself. Here, the p-value for x1 is below 0.05, so x1 is significant at the 5% level.

Orange box: The R-squared of the model (Multiple R-squared). A value of 0.4033 means the model explains only about 40% of the variance in y, which is a weak fit.

Purple box: The adjusted R-squared of the model, which penalizes R-squared for the number of explanatory variables. At 0.3288, it is also low.

Gray box: The F-test of the overall model. Notice that its p-value is the same as the p-value of the t-test for x1. This is because there is only one explanatory variable in the model, so both tests check the same hypothesis.
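The quantities described above can also be pulled out of the summary object directly instead of being read off the printed output. This is a minimal sketch; the names `s` and `coef_table` are illustrative.

```r
# Recreate the data and the simple model so this block runs on its own
df <- data.frame(
  y  = c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21),
  x1 = c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14),
  x2 = c(1.5, 4, 5.5, 4, 3, 1, 6, 8, 3, 8)
)
model1 <- lm(y ~ x1, data = df)
s <- summary(model1)

# Coefficient table with columns Estimate, Std. Error, t value and Pr(>|t|)
coef_table <- coef(s)
coef_table[, "Estimate"]   # beta0 and beta1
coef_table[, "t value"]    # t-statistics
coef_table[, "Pr(>|t|)"]   # p-values of the t-tests

s$r.squared      # R-squared
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F value with its numerator and denominator degrees of freedom

# With one explanatory variable, the F statistic is the squared t-statistic of x1
all.equal(unname(s$fstatistic["value"]), unname(coef_table["x1", "t value"]^2))
```

The last line is why the F-test and the t-test for x1 report the same p-value in a simple linear regression.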

Multiple Linear Regression

Next, we are going to build a multiple linear regression model, which is simply linear regression with more than one explanatory variable. The model will have the form y = beta0 + beta1 × x1 + beta2 × x2 + error:

Multiple Linear Regression has more than one explanatory variable

The code to build this model is as follows:

# Fit y as a linear function of x1 and x2, then print the summary
model2 <- lm(y ~ x1 + x2, data = df)
summary(model2)

The results of the model are as follows:

The output of ‘model2’

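The way to read this output is similar to simple linear regression, except that the coefficient table now has a third row for x2. The estimated model has the form y = beta0 + beta1 × x1 + beta2 × x2, with the three values read from the Estimate column.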
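To see how the estimated formula relates to the fitted model, the coefficients can be extracted and the fitted values rebuilt by hand. This is a minimal sketch; `b` and `manual_fit` are illustrative names.

```r
# Recreate the data and the multiple regression model so this block runs on its own
df <- data.frame(
  y  = c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21),
  x1 = c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14),
  x2 = c(1.5, 4, 5.5, 4, 3, 1, 6, 8, 3, 8)
)
model2 <- lm(y ~ x1 + x2, data = df)

# Named vector of estimates: (Intercept), x1 and x2
b <- coef(model2)
b

# Rebuild the fitted values from the estimated formula
manual_fit <- b["(Intercept)"] + b["x1"] * df$x1 + b["x2"] * df$x2

# Should be TRUE: the manual values match what lm() stored
all.equal(unname(manual_fit), unname(fitted(model2)))

# R-squared and adjusted R-squared of the larger model
summary(model2)$r.squared
summary(model2)$adj.r.squared
```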

Using the model to predict

After building our model, the natural next step is to use it. Let’s say we want to predict y for certain values of x1 and x2. While we could substitute the values into the formula by hand, that quickly becomes tedious. Don’t worry, we can easily do it in RStudio!

For example, let’s say that we want to predict the y value for the following table:

First, let’s input this table into RStudio:

x1_new <- c(13, 12, 11)
x2_new <- c(5, 7, 4.5)

# Column names must match the variable names used in the model formula
df_new <- data.frame(x1 = x1_new, x2 = x2_new)

The new table is called ‘df_new’. Keep in mind that in order to predict using the model, the column names in the new table must match the names of the explanatory variables used in the model formula. Since we used the names x1 and x2 when building the model, we need to give the new columns the same names.
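The name requirement can be checked in code before predicting. The sketch below is one possible check, not something the original walkthrough requires; `needed` and `missing_cols` are illustrative names.

```r
# Recreate the data, the model and the new table so this block runs on its own
df <- data.frame(
  y  = c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21),
  x1 = c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14),
  x2 = c(1.5, 4, 5.5, 4, 3, 1, 6, 8, 3, 8)
)
model2 <- lm(y ~ x1 + x2, data = df)
df_new <- data.frame(x1 = c(13, 12, 11), x2 = c(5, 7, 4.5))

# Explanatory variables the model expects (the response y is dropped)
needed <- all.vars(delete.response(terms(model2)))

# Any expected variable that is absent from the new table
missing_cols <- setdiff(needed, names(df_new))

if (length(missing_cols) > 0) {
  stop("df_new is missing: ", paste(missing_cols, collapse = ", "))
}
```

A check like this is worth having in this tutorial's setup: x1 and x2 also exist as standalone vectors in the workspace, so a misnamed column may not produce an error, because predict() can fall back to those vectors instead.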

Now, we can predict the y values with the following code:

# Predict y for the new rows, then attach the predictions as a column
predictions <- predict(model2, newdata = df_new)
df_new$predictions_results <- predictions

The first line of code stores the predictions in a numeric vector named ‘predictions’, and the second line adds these values to the table ‘df_new’ as a column named ‘predictions_results’. If you look at your table now, it should look something like this:

The predictions of your model!
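As a sanity check, the predictions can be compared with substituting the new values into the estimated formula by hand. The sketch below also asks predict() for 95% prediction intervals, which go beyond the original walkthrough; `b`, `manual` and `intervals` are illustrative names.

```r
# Recreate the data, the model and the new table so this block runs on its own
df <- data.frame(
  y  = c(15, 32, 40, 50, 70, 16, 19, 65, 42, 21),
  x1 = c(3, 7, 9, 12, 10, 5, 8, 21, 10, 14),
  x2 = c(1.5, 4, 5.5, 4, 3, 1, 6, 8, 3, 8)
)
model2 <- lm(y ~ x1 + x2, data = df)
df_new <- data.frame(x1 = c(13, 12, 11), x2 = c(5, 7, 4.5))

# Point predictions from the model
predictions <- predict(model2, newdata = df_new)

# The same predictions computed by substituting into the estimated formula
b <- coef(model2)
manual <- b["(Intercept)"] + b["x1"] * df_new$x1 + b["x2"] * df_new$x2

# Should be TRUE: both approaches give the same values
all.equal(unname(predictions), unname(manual))

# Optional: point predictions with lower and upper 95% prediction bounds
intervals <- predict(model2, newdata = df_new, interval = "prediction", level = 0.95)
intervals
```

With only ten observations and a weak fit, the intervals are likely to be wide, which is a useful reminder of how uncertain each individual prediction is.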
