Stock Prediction using Regression Algorithm in Python
An end-to-end explanation on using ML algorithms to predict stock prices
For this exercise, I will use the yfinance library to scrape information straight from the Yahoo Finance website. Yahoo Finance is a great website that gives a quick glimpse of listed equities, funds, and investment data for the company you are looking for. I will also test and predict stock prices by applying a regression model to the data.
Importing all necessary libraries
First, we will import the libraries required for this exercise. The main library to pay attention to here is yfinance, which will let us extract different kinds of data from the Yahoo Finance website.
!pip install yfinance
Collecting yfinance
Successfully installed multitasking-0.0.9 yfinance-0.1.59
# Let's start with calling all dependencies that we will use for this exercise
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
import yfinance as yf # We will use this library to download the latest data from the Yahoo Finance API
Data collection and exploration
In this section, we will use certain methods to retrieve the company description, stock prices, close prices, volumes, and corporate actions associated with the stock.
# define the ticker you will use
nio = yf.Ticker('NIO')
# Display stock information; this gives a summary description of the ticker
nio.info
'sector': 'Consumer Cyclical',
'longBusinessSummary': 'NIO Inc. designs, develops, manufactures, and sells smart electric vehicles in Mainland China, Hong Kong, the United States, the United Kingdom, and Germany. The company offers five, six, and seven-seater electric SUVs, as well as smart electric sedans. It is also involved in the provision of energy and service packages to its users; marketing, design, and technology development activities; manufacture of e-powertrains, battery packs, and components; and sales and after sales management activities. In addition, the company offers power solutions, including Power Home, a home charging solution; Power Swap, a battery swapping service; Public Charger, a public fast charging solution; Power Mobile, a mobile charging service through charging vans; Power Map, an application that provides access to a network of public chargers and their real-time information; and One Click for Power valet service, where it offers vehicle pick up, charging, and return services. Further, it provides repair, maintenance, and bodywork services through its NIO service centers and authorized third-party service centers; statutory and third-party liability insurance, and vehicle damage insurance through third-party insurers; courtesy car services; and roadside assistance, as well as data packages; and auto financing services. Additionally, the company offers NIO Certified, an used vehicle inspection, evaluation, acquisition, and sales service. NIO Inc. has a strategic collaboration with Mobileye N.V. for the development of automated and autonomous vehicles for consumer markets. The company was formerly known as NextEV Inc. and changed its name to NIO Inc. in July 2017. NIO Inc. was founded in 2014 and is headquartered in Shanghai, China.',
'phone': '86 21 6908 2018',
'address1': 'Building 20',
'industry': 'Auto Manufacturers',
'address2': 'No. 56 AnTuo Road Anting Town Jiading District',
'shortName': 'NIO Inc.',
'longName': 'NIO Inc.',
Data preparation and cleaning for Regression analysis
The data from Yahoo Finance is straightforward and reflects real-time stock market data, so cleaning and processing the exported data will not be a difficult task.
history = nio.history(period="max")
df = pd.DataFrame(history)
# defining x and y
x = df.index
y = df['Close']
Name: Close, Length: 649, dtype: float64
# Data Exploration
# I like to set up a plot function so I can reuse it at later stages of this analysis
def df_plot(data, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

stock_name = 'NIO'  # assumed value; the original does not show where stock_name is defined
title = stock_name + " History stock performance till date"
df_plot(df, x, y, title=title, xlabel='Date', ylabel='Value', dpi=100)
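# Data Processing and scaling
df.reset_index(inplace=True)  # reset the index and convert it to a column
# Rename the Yahoo Finance columns to the short names used below (assumed mapping; this step is not shown in the original)
df = df.rename(columns={'Date': 'date', 'Open': 'open', 'High': 'high', 'Low': 'low',
                        'Close': 'close', 'Volume': 'vol',
                        'Dividends': 'divs', 'Stock Splits': 'split'})
df = df.drop(columns=['divs', 'split'])  # drop the unnecessary columns from the set
df['date'] = pd.to_datetime(df.date)
x = df[['open', 'high', 'low', 'vol']]
y = df['close']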
For this data, I’ve split the data into training and test datasets, with the test set making up 15% of the total. Afterward, we can check that the data was split successfully by inspecting the shape attribute of each piece.
# Linear regression Model for stock prediction
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.15 , shuffle=False,random_state = 0)
# let's check if total observation makes sense
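The shape check that the comment above refers to might look like the following minimal sketch. It uses a synthetic frame with 649 rows (the length reported earlier for the Close series) in place of the downloaded data; the random values and the `demo_` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real price data (assumed values, 649 rows)
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(rng.random((649, 4)), columns=['open', 'high', 'low', 'vol'])
demo_y = pd.Series(rng.random(649), name='close')

# shuffle=False keeps the rows in order, so the test set is the last 15%
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_x, demo_y, test_size=0.15, shuffle=False)

# shape is an attribute, not a method, so there are no parentheses
print(demo_train_x.shape, demo_test_x.shape)
print(demo_train_y.shape, demo_test_y.shape)

# The two pieces should add up to the full dataset
print(len(demo_train_x) + len(demo_test_x) == len(demo_x))
```

With 649 rows this gives 551 training rows and 98 test rows, so the first test row carries index 551.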
Regression algorithm model implementation
Before we get to the technical part of applying the regression model to the dataset, let’s talk a bit about the regression algorithm. Basically, regression is a set of techniques for estimating relationships between variables. For example, in real life we can relate the force applied to a spring to the distance it stretches (as in Hooke’s law), or describe how many transistors the semiconductor industry can pack into a circuit over time (Moore’s law).
The equation for linear regression can be written as follows:
from IPython.display import Image
print("**Linear Regression Formula**")
Image(filename="../input/stock-prediction-using-regression-algorithm/1.JPG", width= 250, height=100)
**Linear Regression Formula**
Image(filename="../input/stock-prediction-using-regression-algorithm/2.JPG", width= 280, height=110)
**Regression Formula**
where x1, x2, …, xn represent the independent variables and the coefficients θ1, θ2, …, θn represent the weights.
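The formula images are not reproduced in this text, so here is the standard form of the multiple linear regression equation, written with the symbols described above. The intercept term θ0 and the hat on the predicted value are notation assumed here rather than taken from the original images.

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
```

In this exercise n = 4, with x1 to x4 standing for open, high, low, and vol, and ŷ for the predicted close price.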
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(train_x, train_y)  # fit the model on the training data
print("regression coefficient", regression.coef_)
print("regression intercept", regression.intercept_)
regression coefficient [-6.51840470e-01 8.48419125e-01 8.12048390e-01 -3.50557805e-10]
regression intercept -0.0315814559475216
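To make the role of the coefficients and the intercept concrete, the sketch below fits a model on synthetic data and rebuilds its predictions by hand as the intercept plus the weighted sum of the features. The data, the `true_weights` values, and the `demo_` names are assumptions for illustration, not the NIO dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): y depends linearly on four features plus an intercept
rng = np.random.default_rng(42)
demo_X = rng.random((200, 4))
true_weights = np.array([-0.65, 0.85, 0.81, 0.0])
demo_y = 0.5 + demo_X @ true_weights

demo_model = LinearRegression().fit(demo_X, demo_y)

# Manual prediction: intercept plus the weighted sum of the features
manual = demo_model.intercept_ + demo_X @ demo_model.coef_

# The manual result should match what predict() returns
print(np.allclose(manual, demo_model.predict(demo_X)))
```

Read this way, each coefficient is the change in the predicted close price for a one-unit change in the corresponding feature, holding the others fixed.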
Prediction and Estimation
The coefficient of determination R²
Here we will compute the coefficient of determination, denoted by R². For a reasonable fit it takes values between 0 and 1, although on held-out data it can be negative when the model does worse than simply predicting the mean. The higher the value of R², the more successful the linear regression is at explaining the variation of the Y values; in our case, the Y values are the close prices of the company’s stock. Below is the math behind the coefficient of determination R².
print("**List of equations**")
Image(filename="../input/stock-prediction-using-regression-algorithm/3.JPG", width= 400, height=250)
**List of equations**
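The equation image is not reproduced in this text. The standard definition of the coefficient of determination, which is also what the score() method of a scikit-learn regressor returns, is given below; the symbols (y_i for actual values, ŷ_i for predictions, ȳ for the mean of the actual values) are notation assumed here.

```latex
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```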
# the coefficient of determination R²
regression_confidence = regression.score(test_x, test_y)
print("linear regression confidence: ", regression_confidence)
linear regression confidence: 0.9836914831421212
The coefficient of determination R² for our data is about 0.98, or 98%, which means that our linear model explains about 98% of the variation in the Y values of the test set.
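As a quick sanity check on what this number measures, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compared with scikit-learn's `r2_score`. The prices below are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted close prices
actual = np.array([45.0, 46.5, 48.0, 52.0, 55.5])
pred = np.array([44.6, 47.1, 48.3, 51.2, 56.0])

# Residual sum of squares: squared differences between actual and predicted
ss_res = np.sum((actual - pred) ** 2)
# Total sum of squares: squared differences between actual values and their mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(actual, pred))
```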
As we can see below, the test rows for open, high, low, and vol are labeled by row number rather than by date. At this point that is not a problem: the split used shuffle=False, so the rows are still in chronological order, and we will plot the results later using the scatter() method.
print(test_x.head())
open high low vol
551 45.750000 46.720001 42.500000 271678300
552 45.360001 48.919998 44.680000 233779100
553 48.270000 50.590000 47.880001 209106300
554 50.860001 55.700001 50.480000 270203000
555 56.990002 57.200001 51.500000 243669700
Prediction Table of Actual Prices vs Predicted values
The table below displays summary statistics of the actual values vs the predicted values of the dataset.
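The table itself is not reproduced in this text. A minimal sketch of how such a table could be built is shown below, using the `dfr`, `Actual_Price`, and `Predicted_Price` names that appear later in this article. The synthetic prices and the `demo` frame are assumptions standing in for the NIO data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the price data (assumed values)
rng = np.random.default_rng(1)
low = rng.uniform(10, 50, 300)
high = low + rng.uniform(0.5, 3, 300)
demo = pd.DataFrame({
    'open': rng.uniform(low, high),
    'high': high,
    'low': low,
    'vol': rng.integers(1_000_000, 5_000_000, 300),
})
demo['close'] = rng.uniform(low, high)

train_x, test_x, train_y, test_y = train_test_split(
    demo[['open', 'high', 'low', 'vol']], demo['close'],
    test_size=0.15, shuffle=False)

model = LinearRegression().fit(train_x, train_y)
predicted = model.predict(test_x)

# Side-by-side table of actual and predicted close prices
dfr = pd.DataFrame({'Actual_Price': test_y, 'Predicted_Price': predicted})
print(dfr.head())

# Summary statistics (count, mean, std, min, quartiles, max) for both columns
print(dfr.describe())
```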
MAE and RMSE are among the most common metrics used to measure prediction error for continuous variables, or in our case the accuracy of our regression model.
The math behind these metrics might be confusing or a bit of a mouthful, but think about it this way: we have the actual close prices and the close prices predicted by the model for the same days. We now need to calculate the error, that is, the difference between them, to see how accurate these predictions are compared to the actual values at hand.
Mean Absolute Error (MAE):
MAE measures the average magnitude of the errors in a set of predictions, without considering their direction.
print("**Mean Absolute Error (MAE)**")
Image(filename="../input/stock-prediction-using-regression-algorithm/4.JPG", width= 400, height=250)
**Mean Absolute Error (MAE)**
Root mean squared error (RMSE):
RMSE is a quadratic scoring rule that also measures the average magnitude of the error.
print("**Root mean squared error (RMSE)**")
Image(filename="../input/stock-prediction-using-regression-algorithm/5.JPG", width= 400, height=250)
**Root mean squared error (RMSE)**
Mean squared error (MSE) :
print("** Mean squared error (MSE)**")
Image(filename="../input/stock-prediction-using-regression-algorithm/6.JPG", width= 400, height=250)
** Mean squared error (MSE)**
Mean squared error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss.
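Since the formula images are not reproduced in this text, the standard definitions of the three metrics are written out below. The notation (y_i for the actual close price, ŷ_i for the predicted one, n for the number of test observations) is assumed here.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```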
All of the metrics above range from 0 to ∞ and are indifferent to the direction of errors. They are negatively-oriented scores, which means that lower values are better. Remember that RMSE is always greater than or equal to MAE. Because it squares the errors before averaging them, RMSE also penalizes large errors more heavily, so it can be a better measure than MAE when large errors are particularly undesirable.
In our case, the evaluation results are as follows:
predicted = regression.predict(test_x)  # predictions for the test set
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(test_y, predicted))
print('Mean Squared Error (MSE) :', metrics.mean_squared_error(test_y, predicted))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(test_y, predicted)))
Mean Absolute Error (MAE): 0.7581175544856527
Mean Squared Error (MSE) : 1.001586723642404
Root Mean Squared Error (RMSE): 1.000793047359145
Our MAE is about 0.76, while MSE and RMSE are both about 1.0. From an interpretation standpoint, I think MAE is a better metric for linear problems than RMSE, as RMSE does not describe average error alone and has other implications that are more difficult to tease out and understand. Also, RMSE gives much more importance to large errors, so models tuned on it will try to minimize these as much as possible.
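The difference in how the two metrics treat large errors can be seen with a small hand-made example. The numbers below are hypothetical: both prediction sets have the same total absolute error, but in the second set it is concentrated in a single point.

```python
import numpy as np

def mae(actual, pred):
    """Mean absolute error: average size of the errors."""
    return np.mean(np.abs(actual - pred))

def rmse(actual, pred):
    """Root mean squared error: square root of the average squared error."""
    return np.sqrt(np.mean((actual - pred) ** 2))

actual = np.array([10.0, 10.0, 10.0, 10.0])
even_errors = np.array([11.0, 9.0, 11.0, 9.0])      # four errors of size 1
one_big_error = np.array([10.0, 10.0, 10.0, 14.0])  # a single error of size 4

print(mae(actual, even_errors), rmse(actual, even_errors))
print(mae(actual, one_big_error), rmse(actual, one_big_error))
```

Both sets have an MAE of 1.0, but the RMSE rises from 1.0 to 2.0 when the error is concentrated in one prediction, which is the sense in which RMSE weights large errors more heavily.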
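dfr = pd.DataFrame({'Actual_Price': test_y, 'Predicted_Price': predicted})  # assumed construction; not shown in the original
x2 = dfr.Actual_Price.mean()
y2 = dfr.Predicted_Price.mean()
Accuracy1 = x2/y2*100  # ratio of the mean actual price to the mean predicted price, in percent
print("The accuracy of the model is " , Accuracy1)
The accuracy of the model is 99.68318915929602
plt.scatter(dfr.Actual_Price, dfr.Predicted_Price, color='Darkblue')
plt.title("Nio prediction chart")
plt.show()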
The stock market has always been one of the hottest topics when it comes to time series forecasting or trying to get a feel for where the market is going overall. It’s impossible to find a “go-to” formula to predict the direction of the stock market because of its constant volatility and the uncertainty of the many variables that can affect it, from company-specific risk to political instability and macroeconomic factors, and the list could go on.
To have better visibility on where the market is going, relying on regression models and predicting certain values based on past performance is not good enough. The following points should complement a full-fledged regression model report.
1- Fundamental analysis
Fundamental analysis is a method to analyze and estimate a company’s intrinsic value based on historical and current performance data. This data comes in the form of financial statements and balance sheet information, and it can be analyzed to compute the company’s current multiples and ratios such as P/E, P/B, liquidity ratios, debt ratios, return ratios, margins, etc. This information can give you conviction about the direction of the company and help you decide whether or not to consider investing in it.
2- Technical Analysis
Technical analysis is the use of statistical methods and trends based on historical data, for example the daily total volume or value of a traded stock, to evaluate historical patterns and predict future stock price movement.
3- Sentiment Analysis
Basically, sentiment analysis is the use of natural language processing to determine whether a given piece of text is positive, negative, or neutral. You might run this analysis on paragraphs, large collections of text, customer reviews, research theses, scientific papers, etc. In our case, you might use this method to analyze the subject company’s Twitter account, reviews from its Facebook page, and so on.
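As one small illustration of the technical analysis point above, a simple moving average is a common way to smooth historical prices and look at the trend. The sketch below uses hypothetical close prices and a 3-day window; both are assumptions, and the moving average is only one of many indicators, not a method used in this article.

```python
import pandas as pd

# Hypothetical daily close prices
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# 3-day simple moving average; the first two entries are NaN
# because a full window of three values is not yet available
sma_3 = close.rolling(window=3).mean()
print(sma_3)
```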
About the Writer : Abdalla A. Mahgoub
Data Scientist | Investment Ops Analyst | Data Science Enthusiast | ML | Big Data | Python | SQL | FinTech |Strategic Planner | Business developer |Speaker | Writer | Full Stack developer and a UX designer