Building a Model with Only Two Columns, X and Y, Using Python

Derrick Dadson
4 min read · Jan 14, 2023


Ever wondered how YouTube recommends videos to a user, or how TikTok picks the videos that land on a user's FYP?

If no → we all learnt correlation, regression and scatter plots in high school, and at that level we were limited to just two variables, X and Y 😂

EPISODE 1

A typical example of what I am talking about is illustrated in the diagram above.

In the diagram we have two variables:
x → hours (h)
y → distance (d)

Step 1: Take any two points on the line; preferably the points that are at the intersection of the grid.

Let’s take two points on the line. We took points S and R as shown below.

  • The coordinates of S are (4,4).
  • The coordinates of R are (8,6).

Step 2: Find the slope of the line by substituting the coordinates into the slope formula.

  • Let the coordinates of R (8,6) be (x2,y2).
  • Let the coordinates of S (4,4) be (x1,y1).

Substituting the values in the formula to find the slope:

slope (m) = (y2 - y1) / (x2 - x1)

therefore m = (6 - 4) / (8 - 4) = 2/4

m = 1/2

Step 3: Find the y-intercept.

Take the coordinates of either point, S or R, that we chose on the line earlier. The answer will be the same whichever one you pick.

Let’s take the coordinates of point R(8, 6).

The value of slope (m) is 1/2 as found earlier in step 2.

Let’s substitute the values in the equation to find the value of the y-intercept (c).

y = mx + c
where m = 1/2, x = 8, y = 6

so we can now solve for c:

6 = (1/2)(8) + c
c = 6 - 4 = 2

Step 4: Find the equation of a given line of best fit for the given scatter plot.

Since we now have all the values for the equation y = mx + c,

therefore →

The equation of the line of best fit is y = (1/2)x + 2
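
The four steps above can be double-checked with a few lines of plain Python. This is only a minimal sketch: the function name line_through is made up for illustration, and it works from the two hand-picked points S and R rather than from a full dataset.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    x1, y1 = p1
    x2, y2 = p2
    m = (y2 - y1) / (x2 - x1)  # Step 2: slope
    c = y2 - m * x2            # Step 3: rearrange y = mx + c to get c
    return m, c


S = (4, 4)
R = (8, 6)

m, c = line_through(S, R)
print(f"y = {m}x + {c}")  # y = 0.5x + 2.0
```

Swapping the order of the points, line_through(R, S), gives the same slope and intercept, which matches the remark in Step 3 that either point works.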

--- NOW IN PYTHON ---

The libraries we are going to use are:
pandas → to read the dataset
sklearn → to create the model and to perform other important functions
matplotlib → to visualize the data
numpy → to turn the columns into arrays of the right shape

dataset.csv is →

So we start by importing them into our Python (.py) file.

Importing Libraries

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

Now we define a variable called model and assign a LinearRegression object from sklearn to it:

model=LinearRegression()

Loading the dataset

As I stated above, we use pandas to load the dataset from our directory. Here is an example:

df=pd.read_csv('dataset.csv')

Defining the X and Y variables and plotting them on a graph using Matplotlib

x=df['x']
y=df['y']

plt.scatter(x,y)
plt.show()

In the code above we pull the x and y columns out of the DataFrame with pandas, assign each to a variable, and pass them to matplotlib to draw the scatter plot. Below is the graph matplotlib generated,

which looks pretty much the same as the initial graph.

Redefining X and Y variables

x=np.array(df['x']).reshape(-1,1)
y=np.array(df['y'])

Now, you have two arrays: the input, x, and the output, y. You should call .reshape() on x because this array must be two-dimensional, or more precisely, it must have one column and as many rows as necessary. That’s exactly what the argument (-1, 1) of .reshape() specifies.
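
To see what .reshape(-1, 1) actually does, here is a small sketch with made-up numbers (not the values from dataset.csv). It just compares the shape before and after the call.

```python
import numpy as np

flat = np.array([1, 2, 3, 4])  # made-up values, one-dimensional
column = flat.reshape(-1, 1)   # -1 lets NumPy work out the number of rows

print(flat.shape)    # (4,)
print(column.shape)  # (4, 1)
```

The values are unchanged; only the layout goes from a flat list of 4 numbers to 4 rows of 1 column, which is the layout .fit() expects for its inputs.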

Creating the model

model = LinearRegression().fit(x, y)

This step creates a linear regression model and fits it to the existing data (the line of best fit 😂)
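
For a single input column, the line of best fit can also be worked out by hand with the textbook least-squares formulas. The sketch below assumes a few made-up points that lie exactly on y = 0.5x + 2 (they are not from dataset.csv). scikit-learn uses a more general solver internally, so treat this as an illustration of the idea rather than the library's own code.

```python
import numpy as np

# Hypothetical points that sit exactly on y = 0.5x + 2
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.0, 4.0, 5.0, 6.0])

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# the best-fit line always passes through (mean_x, mean_y)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

On real, noisy data the points will not sit exactly on the line, and the same two formulas give the slope and intercept that minimise the sum of squared vertical distances to the points.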

Getting the gradient/slope and the y-intercept

Since the model has been trained on x and y, we can get the y-intercept and the gradient using:

print(f"y-intercept: {model.intercept_}")
print(f"slope: {model.coef_}")


##Output##

#y-intercept: 2.2042913956284487
#slope: [0.45028825]
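
Notice that the slope is printed inside square brackets while the intercept is not. That is because coef_ is an array with one entry per input column, and here there is only one column. A small sketch, assuming made-up data on the line y = 0.5x + 2 as a stand-in for dataset.csv, shows how to pull both out as plain numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on the line y = 0.5x + 2 (stand-in for dataset.csv)
x = np.array([2, 4, 6, 8]).reshape(-1, 1)
y = np.array([3, 4, 5, 6])

model = LinearRegression().fit(x, y)

slope = float(model.coef_[0])        # coef_ holds one value per input column
intercept = float(model.intercept_)  # a single number when y is one-dimensional

print(f"y = {slope:.2f}x + {intercept:.2f}")
```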

The Equation

Since we have the values for the y-intercept and the slope, we can write down the formula
y = mx + c

but in Python code we can say:

y_pred = model.intercept_ + model.coef_ * x

Finally, let’s do some predictions 🎉

# 10-hour journey
y_pred = model.intercept_ + model.coef_ * 10
print(y_pred)
# output: [6.70717391], which is about 6.7 km

# 7-hour journey
y_pred = model.intercept_ + model.coef_ * 7
print(y_pred)
# output: [5.35630915], which is about 5.4 km
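
The same predictions can also be made with the model's own .predict() method instead of typing the formula out. The sketch below uses made-up hours and distances in place of dataset.csv, so its numbers will differ from the ones above; the point is only that both routes agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data (stand-in for dataset.csv)
x = np.array([1, 2, 4, 6, 8, 9]).reshape(-1, 1)
y = np.array([2.4, 3.1, 4.2, 4.8, 5.9, 6.3])

model = LinearRegression().fit(x, y)

# predict() wants a 2-D input: one row per journey, one column for hours
hours = np.array([[10], [7]])

by_hand = model.intercept_ + model.coef_[0] * hours.ravel()
by_predict = model.predict(hours)

print(by_hand)
print(by_predict)
```

Using .predict() saves retyping the formula and lets you score several journeys in one call.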

Thank you for reading

Thanks for making time for this tutorial. This is my first one, so next time
we go lit

Everything Together

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np


# load the dataset (the model itself is created and fitted further down)
df=pd.read_csv('dataset.csv')

x=np.array(df['x']).reshape(-1,1)
y=np.array(df['y'])

model = LinearRegression().fit(x, y)

print(f"y-intercept: {model.intercept_}")
print(f"slope: {model.coef_}")

y_pred = model.intercept_ + model.coef_ * 7

print(y_pred)

###@copyright Derrick Dadson
