Create time series and make predictions with scikit-learn
A story from diving into the world of machine learning with Python
My adventure in the world of machine learning started less than three months ago with the help of two famous AI assistants, Bing Chat and ChatGPT (Viridi, 2023). It leveled up a little about two days ago, when I set up a Python virtual environment for installing the required packages (Viridi, 2024) to make sure the environment is reproducible on other machines. In this story I follow a tutorial on how to use scikit-learn to make predictions from a self-created dataset (Ebner, 2023), but the datasets here are different, and you can make your own datasets too.
Packages
In this story, every piece of code begins with importing the following packages.
# load packages
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
You might need to change the import if you want to use a model other than the LinearRegression model, but perhaps later.
Linear data
In order to understand the functions provided by scikit-learn, nearly the simplest kind of data, linear data, is used in this part (the very simplest would be constant data). The first step is to create the dataset.
# create dataset
obs_count = 101
x_var = np.linspace(start=0, stop=100, num=obs_count)
y_var = 5 * x_var + 10
It is good practice to display the dataset to make sure the data is what you want.
# display dataset
fig = sns.scatterplot(x=x_var, y=y_var)
fig.get_xaxis().set_label_text('x')
fig.get_yaxis().set_label_text('y')
plt.grid()
plt.show()
It shows linear data consisting of 101 data points with x from 0 to 100, equally spaced by 1.
The next step is to split the dataset into train and test data using the following lines of code
from sklearn.model_selection import train_test_split
(X_train, X_test, y_train, y_test) = train_test_split(x_var, y_var, test_size=.2)
print(len(X_train), len(X_test))
which produces 80 and 21 as the numbers of train and test data points, since we pass test_size=.2 as a keyword argument to the train_test_split() function. Then plot the train data.
# display train data from dataset
fig = sns.scatterplot(x=X_train, y=y_train)
fig.get_xaxis().set_label_text('x_train')
fig.get_yaxis().set_label_text('y_train')
plt.grid()
plt.show()
Notice that some points are missing from the train data; these are the roughly 20% allocated as test data.
Plot the test data as well.
# display test data from dataset
fig = sns.scatterplot(x=X_test, y=y_test)
fig.get_xaxis().set_label_text('x_test')
fig.get_yaxis().set_label_text('y_test')
plt.grid()
plt.show()
The test data points are the 20% chosen randomly from the original dataset.
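If you want the random split to be reproducible across runs, you can pass the optional random_state keyword argument of train_test_split(); this is my own addition, and the value 42 is an arbitrary seed, not part of the original code.
# reproducible split (optional); random_state=42 is an arbitrary seed
(X_train, X_test, y_train, y_test) = train_test_split(x_var, y_var, test_size=.2, random_state=42)
print(len(X_train), len(X_test))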
It is time to initialize the model; for now we use the LinearRegression model.
linear_regressor = LinearRegression()
Then fit the model using the train data.
linear_regressor.fit(X_train.reshape(-1,1), y_train)
Notice that X_train has been reshaped into a 2-dimensional array, since the fit and predict methods expect 2D input arrays (Ebner, 2022). The value -1 indicates that we do not specify the number of rows and let NumPy infer it.
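As a small sketch of my own to illustrate the reshape, you can print the shapes before and after:
# the 1D array of 80 train points becomes a 2D array with one column
print(X_train.shape)                 # (80,)
print(X_train.reshape(-1, 1).shape)  # (80, 1)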
After the model is fitted, we can use it to predict, producing y_predict from X_test.
y_predict = linear_regressor.predict(X_test.reshape(-1,1))
To check that it predicts as we expect, compare the results with the y_test data points.
# compare predicted data and test data
plt.figure(figsize=(4.8, 4.8))
plt.grid()
fig = sns.scatterplot(x=y_predict, y=y_test)
fig.get_xaxis().set_label_text('y_predict')
fig.get_yaxis().set_label_text('y_test')
plt.show()
You can see that they are all aligned along the line y_predict = y_test, indicating that the prediction is very good. It must be, since we use noiseless linear data.
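If you want to put a number on "very good", one option, sketched here as my own addition, is to inspect the fitted coefficients and the coefficient of determination:
from sklearn.metrics import r2_score

# the fitted slope and intercept should be close to 5 and 10
print(linear_regressor.coef_, linear_regressor.intercept_)
# coefficient of determination; 1.0 means a perfect fit
print(r2_score(y_test, y_predict))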
Linear data with normal random noise
Now we introduce normal random noise to the linear data. Details that are similar to the process for producing the linear data are omitted; only the differences are shown, to save space and your reading time.
# create dataset
obs_count = 101
x_var = np.linspace(start=0, stop=100, num=obs_count)
mean = 0
stddev = 10
y_var = 5 * x_var + 10 + np.random.normal(loc=mean, scale=stddev, size=obs_count)
The code above creates a dataset of linear data perturbed by normal random noise with mean mean and standard deviation stddev. The relation between y and x is the same as in the previous part of this story, but it is now perturbed by normal random noise.
The final results show that the predicted data points are no longer aligned perfectly along the line y_predict = y_test, due to the normal random noise introduced into the data.
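To quantify that deviation, here is a sketch of my own, assuming y_test and y_predict come from the noisy dataset, using common error metrics:
from sklearn.metrics import mean_squared_error, r2_score

# average squared deviation between predictions and test data
print(mean_squared_error(y_test, y_predict))
# should now be slightly below 1 because of the noise
print(r2_score(y_test, y_predict))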
Quadratic and cubic data
Let us now play a little with y as a quadratic or cubic polynomial of x. For quadratic data, the steps are as follows.
# create dataset
obs_count = 101
x_var = np.linspace(start=0, stop=100, num=obs_count)
mean = 0
stddev = 0
y_var = 2.5 - 0.001 * (x_var - 100) * x_var + np.random.normal(loc=mean, scale=stddev, size=obs_count)
Data points produced by the code above are shown on the left, the train data in the middle, and the test data on the right. After creating the data, build the model through a pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
and fit it to the train data as before
model.fit(X_train.reshape(-1,1), y_train)
This is still linear regression, since the model is still linear in the coefficients (jakevdp, 2015).
The pipeline diagram shown in your Jupyter Notebook indicates the steps performed during model training.
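If you prefer inspecting the pipeline in code rather than in the notebook diagram, a sketch like the following works; note that make_pipeline names the steps after the lowercase class names automatically:
# steps generated by make_pipeline: 'polynomialfeatures' and 'linearregression'
print(model.named_steps)
# coefficients fitted for the quadratic features
print(model.named_steps['linearregression'].coef_)
print(model.named_steps['linearregression'].intercept_)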
The steps for generating cubic data are similar.
# create dataset
obs_count = 101
x_var = np.linspace(start=0, stop=100, num=obs_count)
mean = 0
stddev = 0
y_var = 0.6 + (1/80000)*(x_var - 10)*(x_var - 50)*(x_var - 90) \
+ np.random.normal(loc=mean, scale=stddev, size=obs_count)
The figure above shows the data points produced by the previous code on the left, the train data in the middle, and the test data on the right. Then, build the model through a pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model = make_pipeline(PolynomialFeatures(3), LinearRegression())
fit the train data
model.fit(X_train.reshape(-1,1), y_train)
and test it.
It shows that the predicted data fit the test data well.
There is also another way of using linear regression for a polynomial function, known as polynomial regression (Agrawal, 2023).
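A minimal sketch of that alternative, transforming the features explicitly instead of wrapping everything in a pipeline; the names poly, X_train_poly, and poly_regressor are my own:
# expand x into the columns [1, x, x^2, x^3] explicitly
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
X_test_poly = poly.transform(X_test.reshape(-1, 1))

# then fit an ordinary linear regression on the transformed features
poly_regressor = LinearRegression()
poly_regressor.fit(X_train_poly, y_train)
y_predict = poly_regressor.predict(X_test_poly)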
Summary
After reading this story, you should be able to
- create datasets for linear, quadratic, and cubic data,
- add normal random noise to a dataset,
- split a dataset into train and test data,
- use linear regression to predict linear, quadratic, and cubic data using a scikit-learn pipeline with polynomial features,
- predict dependent data points from given independent data points.
- Agrawal R (2023) “Master Polynomial Regression With Easy-to-Follow Tutorials”, Analytics Vidhya, 17 Nov, url https://www.analyticsvidhya.com/blog/2021/07/all-you-need-to-know-about-polynomial-regression/ [20240209].
- Ebner J (2022) “How to Use the Sklearn Predict Method”, Sharp Sight, 2 May, url https://www.sharpsightlabs.com/blog/sklearn-predict/ [20240209].
- jakevdp (2015) “Linear Regression with quadratic terms”, Stack Overflow, 14 Nov, url https://stackoverflow.com/a/33712121/9475509 [20240209].
- Viridi S (2023) “Binary Classification in Machine Learning as Suggested by AI-Assistants”, Medium, 17 Nov, url https://medium.com/p/c78a72d1c9eb [20240209].
- Viridi S (2024) “Install Pandas, Matplotlib, Jupyter Notebook, Scikit-Learn, Seaborn in Python virtual environment”, Towards Dev — Medium, 8 Feb, url https://medium.com/p/c625a5dd25df [20240209].