# Stock Prediction using Regression Algorithm in Python

## An end-to-end explanation on using ML algorithms to predict stock prices

For this exercise, I will use the yfinance library to pull data directly from Yahoo Finance. Yahoo Finance is a great website that gives a quick overview of listed equities, funds, and investment data for the company you are looking for. I will then fit a regression model to that data and use it to predict stock prices.

# Importing all necessary libraries

First, we will install and import the libraries required for this exercise. The main library to pay attention to here is yfinance, which lets us pull different kinds of data from Yahoo Finance.

In [1]:

!pip install yfinance

Collecting yfinance

Downloading yfinance-0.1.59.tar.gz (25 kB)

Requirement already satisfied: pandas>=0.24 in /opt/conda/lib/python3.7/site-packages (from yfinance) (1.2.2)

Requirement already satisfied: numpy>=1.15 in /opt/conda/lib/python3.7/site-packages (from yfinance) (1.19.5)

Requirement already satisfied: requests>=2.20 in /opt/conda/lib/python3.7/site-packages (from yfinance) (2.25.1)

Collecting multitasking>=0.0.7

Downloading multitasking-0.0.9.tar.gz (8.1 kB)

Requirement already satisfied: lxml>=4.5.1 in /opt/conda/lib/python3.7/site-packages (from yfinance) (4.6.3)

Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.24->yfinance) (2.8.1)

Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.24->yfinance) (2021.1)

Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas>=0.24->yfinance) (1.15.0)

Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (1.26.3)

Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (2.10)

Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (2020.12.5)

Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (3.0.4)

Building wheels for collected packages: yfinance, multitasking

Building wheel for yfinance (setup.py) ... - \ done

Created wheel for yfinance: filename=yfinance-0.1.59-py2.py3-none-any.whl size=23442 sha256=33c81ce98e6e86be0df9c086ce8b9f54138a14a1079c79d74003a7cfc00b8974

Stored in directory: /root/.cache/pip/wheels/26/af/8b/fac1b47dffef567f945641cdc9b67bb25fae5725d462a8cf81

Building wheel for multitasking (setup.py) ... - done

Created wheel for multitasking: filename=multitasking-0.0.9-py3-none-any.whl size=8368 sha256=c14a0494e534aacbc1170fdaed133fb002cae59c00456e90086fe6970f5dd186

Stored in directory: /root/.cache/pip/wheels/ae/25/47/4d68431a7ec1b6c4b5233365934b74c1d4e665bf5f968d363a

Successfully built yfinance multitasking

Installing collected packages: multitasking, yfinance

Successfully installed multitasking-0.0.9 yfinance-0.1.59

In [2]:

*# Let's start with calling all dependencies that we will use for this exercise *

import pandas as pd

import numpy as np

import math

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn import metrics

from sklearn.model_selection import train_test_split

import yfinance as yf *# We will use this library to download the latest data from the Yahoo Finance API*

%matplotlib inline

plt.style.use('fivethirtyeight')

# Data collection and exploration

In this section, we will call a few methods to retrieve the company description, stock prices, close prices, volumes, and corporate actions associated with the stock.

In [3]:

*# define the ticker you will use*

nio = yf.Ticker('NIO')

*#Display stock information, it will give you a summary description of the ticker*

nio.info

Out[3]:

`{'zip': '201804',`

'sector': 'Consumer Cyclical',

'fullTimeEmployees': 7763,

'longBusinessSummary': 'NIO Inc. designs, develops, manufactures, and sells smart electric vehicles in Mainland China, Hong Kong, the United States, the United Kingdom, and Germany. The company offers five, six, and seven-seater electric SUVs, as well as smart electric sedans. It is also involved in the provision of energy and service packages to its users; marketing, design, and technology development activities; manufacture of e-powertrains, battery packs, and components; and sales and after sales management activities. In addition, the company offers power solutions, including Power Home, a home charging solution; Power Swap, a battery swapping service; Public Charger, a public fast charging solution; Power Mobile, a mobile charging service through charging vans; Power Map, an application that provides access to a network of public chargers and their real-time information; and One Click for Power valet service, where it offers vehicle pick up, charging, and return services. Further, it provides repair, maintenance, and bodywork services through its NIO service centers and authorized third-party service centers; statutory and third-party liability insurance, and vehicle damage insurance through third-party insurers; courtesy car services; and roadside assistance, as well as data packages; and auto financing services. Additionally, the company offers NIO Certified, an used vehicle inspection, evaluation, acquisition, and sales service. NIO Inc. has a strategic collaboration with Mobileye N.V. for the development of automated and autonomous vehicles for consumer markets. The company was formerly known as NextEV Inc. and changed its name to NIO Inc. in July 2017. NIO Inc. was founded in 2014 and is headquartered in Shanghai, China.',

'city': 'Shanghai',

'phone': '86 21 6908 2018',

'country': 'China',

'companyOfficers': [],

'website': 'http://www.nio.com',

'maxAge': 1,

'address1': 'Building 20',

'industry': 'Auto Manufacturers',

'address2': 'No. 56 AnTuo Road Anting Town Jiading District',

'previousClose': 38.12,

'regularMarketOpen': 37.96,

'twoHundredDayAverage': 43.683704,

'trailingAnnualDividendYield': None,

'payoutRatio': 0,

'volume24Hr': None,

'regularMarketDayHigh': 38,

'navPrice': None,

'averageDailyVolume10Day': 70446940,

'totalAssets': None,

'regularMarketPreviousClose': 38.12,

'fiftyDayAverage': 41.888824,

'trailingAnnualDividendRate': None,

'open': 37.96,

'toCurrency': None,

'averageVolume10days': 70446940,

'expireDate': None,

'yield': None,

'algorithm': None,

'dividendRate': None,

'exDividendDate': None,

'beta': 2.613181,

'circulatingSupply': None,

'startDate': None,

'regularMarketDayLow': 36.76,

'priceHint': 2,

'currency': 'USD',

'regularMarketVolume': 52839807,

'lastMarket': None,

'maxSupply': None,

'openInterest': None,

'marketCap': 60854632448,

'volumeAllCurrencies': None,

'strikePrice': None,

'averageVolume': 97445091,

'priceToSalesTrailing12Months': 3.743073,

'dayLow': 36.76,

'ask': 37.43,

'ytdReturn': None,

'askSize': 3000,

'volume': 52839807,

'fiftyTwoWeekHigh': 66.99,

'forwardPE': -371.4,

'fromCurrency': None,

'fiveYearAvgDividendYield': None,

'fiftyTwoWeekLow': 2.88,

'bid': 37.21,

'tradeable': False,

'dividendYield': None,

'bidSize': 1400,

'dayHigh': 38,

'exchange': 'NYQ',

'shortName': 'NIO Inc.',

'longName': 'NIO Inc.',

'exchangeTimezoneName': 'America/New_York',

'exchangeTimezoneShortName': 'EDT',

'isEsgPopulated': False,

'gmtOffSetMilliseconds': '-14400000',

'quoteType': 'EQUITY',

'symbol': 'NIO',

'messageBoardId': 'finmb_311626862',

'market': 'us_market',

'annualHoldingsTurnover': None,

'enterpriseToRevenue': 2.107,

'beta3Year': None,

'profitMargins': -0.34511003,

'enterpriseToEbitda': -9.618,

'52WeekChange': 11.664452,

'morningStarRiskRating': None,

'forwardEps': -0.1,

'revenueQuarterlyGrowth': None,

'sharesOutstanding': 1638520064,

'fundInceptionDate': None,

'annualReportExpenseRatio': None,

'bookValue': 17.315,

'sharesShort': 65406021,

'sharesPercentSharesOut': 0.0399,

'fundFamily': None,

'lastFiscalYearEnd': 1609372800,

'heldPercentInstitutions': 0.36421,

'netIncomeToCommon': -5610789888,

'trailingEps': -0.724,

'lastDividendValue': None,

'SandP52WeekChange': 0.45070732,

'priceToBook': 2.1449609,

'heldPercentInsiders': 0.00533,

'nextFiscalYearEnd': 1672444800,

'mostRecentQuarter': 1609372800,

'shortRatio': 0.53,

'sharesShortPreviousMonthDate': 1614297600,

'floatShares': 1327183283,

'enterpriseValue': 34251128832,

'threeYearAverageReturn': None,

'lastSplitDate': None,

'lastSplitFactor': None,

'legalType': None,

'lastDividendDate': None,

'morningStarOverallRating': None,

'earningsQuarterlyGrowth': None,

'dateShortInterest': 1617148800,

'pegRatio': 4298.73,

'lastCapGain': None,

'shortPercentOfFloat': None,

'sharesShortPriorMonth': 48084071,

'impliedSharesOutstanding': None,

'category': None,

'fiveYearAverageReturn': None,

'regularMarketPrice': 37.14,

'logo_url': 'https://logo.clearbit.com/nio.com'}

# Data preparation and cleaning for Regression analysis

The data from Yahoo Finance is well structured and reflects actual market data, so cleaning and processing the exported data is not a difficult task.

In [4]:

`history = nio.history(period="Max")`

df = pd.DataFrame(history)

df.head(10)

Out[4]:

In [5]:

*# defining x and y *

x = df.index

y = df['Close']

y

Out[5]:

`Date`

2018-09-12 6.600000

2018-09-13 11.600000

2018-09-14 9.900000

2018-09-17 8.500000

2018-09-18 7.680000

...

2021-04-06 40.000000

2021-04-07 37.270000

2021-04-08 38.700001

2021-04-09 38.119999

2021-04-12 37.139999

Name: Close, Length: 649, dtype: float64

In [6]:

*# Data Exploration*

*# I like to set up a plot function so I can reuse it at later stages of this analysis*

def df_plot(data, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):

    plt.figure(figsize=(16,5), dpi=dpi)

    plt.plot(x, y, color='tab:red')

    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)

    plt.show()

In [7]:

`stock_name= "NIO"`

title = f"{stock_name} historical stock performance to date" *# a single string, so the chart title is not rendered as a tuple*

df_plot(df , x , y , title=title,xlabel='Date', ylabel='Value',dpi=100)

In [8]:

*# Data Processing and scaling*

df.reset_index(inplace=True) *# to reset index and convert it to column*

In [9]:

`df.head(2)`

Out[9]:

In [10]:

`df.columns = ['date','open','high','low','close','vol','divs','split']`

In [11]:

`df.drop(columns=['divs','split']).head(2) `*# Preview the set without the unnecessary columns (drop() returns a copy, so df itself is unchanged)*

Out[11]:

In [12]:

`df['date'] = pd.to_datetime(df.date)`
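The preparation steps in the last few cells can be gathered into one short sketch. The two-row frame below is made up, and its column names (Open, High, Low, Close, Volume, Dividends, Stock Splits) are my assumption about what the history call returns, based on the eight names assigned above. Note that the result of `drop()` has to be assigned back if you want the columns to stay removed.

```python
import pandas as pd

# Made-up two-row frame; the column layout is an assumption about the history data
raw = pd.DataFrame(
    {
        "Open": [10.0, 11.0],
        "High": [12.0, 13.0],
        "Low": [9.0, 10.0],
        "Close": [11.0, 12.0],
        "Volume": [1000, 2000],
        "Dividends": [0, 0],
        "Stock Splits": [0, 0],
    },
    index=pd.DatetimeIndex(["2021-01-04", "2021-01-05"], name="Date"),
)

prepared = raw.reset_index()  # move the Date index into a regular column
prepared.columns = ["date", "open", "high", "low", "close", "vol", "divs", "split"]
prepared = prepared.drop(columns=["divs", "split"])  # assign back to keep the result
prepared["date"] = pd.to_datetime(prepared["date"])

print(prepared.columns.tolist())
```

Working on a copy like `prepared` also leaves the original frame untouched, which helps when a cell is re-run.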

In [13]:

`df.describe()`

Out[13]:

In [14]:

print(len(df))

649

In [15]:

`x = df[['open', 'high','low', 'vol']]`

y = df['close']

# Data Split

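For this data, I’ve split the rows into training and test datasets, with the test set making up 15% of the total. Because shuffle=False, the split is chronological: the earliest rows form the training set and the most recent rows form the test set. Afterward, we can check that the data was split successfully by inspecting the shape attribute.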

In [16]:

*# Linear regression Model for stock prediction *

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.15 , shuffle=False,random_state = 0)

In [17]:

*# let's check if the total number of observations makes sense*

print(train_x.shape )

print(test_x.shape)

print(train_y.shape)

print(test_y.shape)

(551, 4)

(98, 4)

(551,)

(98,)
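To see where the 551 and 98 come from, here is a minimal pure-Python sketch of an unshuffled split. The helper `chronological_split` is hypothetical, and rounding the test count up is my assumption about how the 15% is applied; it does reproduce the shapes printed above.

```python
import math


def chronological_split(rows, test_size):
    """Split rows in order: earliest rows for training, latest for testing."""
    n_test = math.ceil(len(rows) * test_size)  # assumed rounding: test count rounds up
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]


rows = list(range(649))  # stand-in for the 649 daily rows in df
train_rows, test_rows = chronological_split(rows, test_size=0.15)

print(len(train_rows), len(test_rows))
print(test_rows[0])  # row number where the test set starts
```

Keeping the order intact matters for price data, because a shuffled split would let the model train on days that come after the ones it is tested on.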

# Regression algorithm model implementation

Before we get to the technical part of applying the regression model to the dataset, let’s talk a bit about regression itself. Regression is a set of techniques for estimating relationships between variables. For example, we can relate the force applied to a spring to the distance the spring stretches (as in Hooke’s law), or describe how many transistors the semiconductor industry can pack into a circuit over time (Moore’s law).
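To make the spring example concrete, here is a minimal sketch that fits a straight line to force and extension values with NumPy's `polyfit`. The spring constant of 3.0 and the data points are made up for illustration and have nothing to do with the stock dataset.

```python
import numpy as np

# Hypothetical spring: the constant k = 3.0 is an illustrative assumption
true_k = 3.0
extension = np.array([0.0, 0.1, 0.2, 0.3, 0.4])  # metres
force = true_k * extension                        # Hooke's law: F = k * x

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(extension, force, 1)

print("estimated spring constant:", slope)
print("estimated intercept:", intercept)
```

The fitted slope plays the same role as the regression coefficients we estimate for the stock data later on.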

The equation for linear regression can be written as follows:

In [18]:

import os

from IPython.display import Image

print("**Linear Regression Formula**")

Image(filename="../input/stock-prediction-using-regression-algorithm/1.JPG", width= 250, height=100)

**Linear Regression Formula**

Out[18]:

Also

In [19]:

print("**Regression Formula**")

Image(filename="../input/stock-prediction-using-regression-algorithm/2.JPG", width= 280, height=110)

**Regression Formula**

Out[19]:

Here x1, x2, …, xn represent the independent variables, while the coefficients θ1, θ2, …, θn represent the weights.
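In case the formula images do not render, the two equations can be written out in their standard textbook form. I am assuming the images show the single-variable and the multi-variable case; the intercept θ0 is a symbol I have added for the constant term.

```latex
% Simple linear regression (one independent variable)
y = \theta_0 + \theta_1 x

% Multiple linear regression (n independent variables)
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
```

In our case n = 4, with open, high, low, and vol as the independent variables and the close price as y.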

In [20]:

from sklearn.linear_model import LinearRegression

from sklearn.metrics import confusion_matrix, accuracy_score

regression = LinearRegression()

regression.fit(train_x, train_y)

print("regression coefficient",regression.coef_)

print("regression intercept",regression.intercept_)

regression coefficient [-6.51840470e-01 8.48419125e-01 8.12048390e-01 -3.50557805e-10]

regression intercept -0.0315814559475216
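To show what these numbers mean, here is a small sketch that applies the printed coefficients and intercept by hand to one row of features. The row used is the first test row (index 551), whose open, high, low, and vol values are printed in the prediction section further down.

```python
# Coefficients (open, high, low, vol) and intercept as printed by the fitted model
coefficients = [-6.51840470e-01, 8.48419125e-01, 8.12048390e-01, -3.50557805e-10]
intercept = -0.0315814559475216

# First test row (index 551): open, high, low, vol
features = [45.750000, 46.720001, 42.500000, 271678300]

# Linear model: intercept plus the weighted sum of the features
predicted_close = intercept + sum(c * v for c, v in zip(coefficients, features))

print(round(predicted_close, 2))
```

This is the same arithmetic that `regression.predict` performs for every row of the test set.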

# Prediction and Estimation

# The coefficient of determination R²

Here we will compute the coefficient of determination, denoted by R². For a least-squares fit scored on its own training data, R² lies between 0 and 1; on held-out data it can even turn negative if the model does worse than simply predicting the mean. The higher the value of R², the more successful the linear regression is at explaining the variation of the Y values, which in our case are the closing prices of the stock. Below is the math behind the coefficient of determination R².

In [21]:

print("**List of equations**")

Image(filename="../input/stock-prediction-using-regression-algorithm/3.JPG", width= 400, height=250)

**List of equations**

Out[21]:

In [22]:

*# the coefficient of determination R²*

regression_confidence = regression.score(test_x, test_y)

print("linear regression confidence: ", regression_confidence)

linear regression confidence: 0.9836914831421212

The coefficient of determination R² on our test data is about 0.98, which means that the linear model explains roughly 98% of the variation in the Y values.
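For reference, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the quantity the `score` method of a scikit-learn regressor reports. The sketch below uses made-up numbers, and the helper name `r_squared` is my own.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]      # made-up values for illustration
close_fit = [1.1, 1.9, 3.2, 3.8]   # predictions near the actual values
poor_fit = [4.0, 3.0, 2.0, 1.0]    # predictions in reverse order

print(r_squared(actual, close_fit))
print(r_squared(actual, poor_fit))
```

The second call returns a negative value, which shows that R² is not bounded below by 0 when predictions are worse than the mean.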

# Prediction

As we can see below, the test rows of open, high, low, and vol are identified by row number rather than by date. At this point that is not important, as we will compare the predicted prices with the actual prices using a scatter plot.

In [23]:

predicted=regression.predict(test_x)

print(test_x.head())

open high low vol

551 45.750000 46.720001 42.500000 271678300

552 45.360001 48.919998 44.680000 233779100

553 48.270000 50.590000 47.880001 209106300

554 50.860001 55.700001 50.480000 270203000

555 56.990002 57.200001 51.500000 243669700

In [24]:

`predicted.shape`

Out[24]:

`(98,)`

# Prediction Table of Actual Prices vs Predicted values

In [25]:

`dfr=pd.DataFrame({'Actual_Price':test_y, 'Predicted_Price':predicted})`

dfr.head(10)

Out[25]:

The table below displays summary statistics of the actual values versus the predicted values of the dataset.

In [26]:

`dfr.describe()`

Out[26]:

# Model Evaluation

MAE and RMSE are among the most common metrics for measuring prediction error on a continuous variable, or in our case the accuracy of our regression model.

The math behind both metrics might look confusing or be a bit of a mouthful, but think about it this way: we have actual closing prices and predicted closing prices for the same days, and we need to calculate the error, or difference, between them to see how accurate the predictions are compared to the actual values at hand.

# Mean Absolute Error (MAE):

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction.

In [27]:

print("**Mean Absolute Error (MAE)**")

Image(filename="../input/stock-prediction-using-regression-algorithm/4.JPG", width= 400, height=250)

**Mean Absolute Error (MAE)**

Out[27]:

# Root mean squared error (RMSE):

RMSE is a quadratic scoring rule that also measures the average magnitude of the error.

In [28]:

print("**Root mean squared error (RMSE)**")

Image(filename="../input/stock-prediction-using-regression-algorithm/5.JPG", width= 400, height=250)

**Root mean squared error (RMSE)**

Out[28]:

# Mean squared error (MSE) :

In [29]:

print("** Mean squared error (MSE)**")

Image(filename="../input/stock-prediction-using-regression-algorithm/6.JPG", width= 400, height=250)

** Mean squared error (MSE)**

Out[29]:

Mean squared error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss.

All of the metrics above can range from 0 to ∞ and are indifferent to the direction of errors. They are negatively oriented scores, which means that lower values are better. Remember that RMSE is the square root of MSE, so it is expressed in the same units as the prices and is easier to interpret than MSE, and that it is always at least as large as MAE. Because the errors are squared before they are averaged, RMSE and MSE penalize large errors more heavily than MAE does.
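In case the formula images do not render, the three metrics can be written out in their standard form. The symbols are my own choice: n is the number of predictions, y_i is the actual close price, and ŷ_i is the predicted close price.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```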

In our case, the evaluation results are as follows:

In [30]:

print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(test_y, predicted))

print('Mean Squared Error (MSE) :', metrics.mean_squared_error(test_y, predicted))

print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(test_y, predicted)))

Mean Absolute Error (MAE): 0.7581175544856527

Mean Squared Error (MSE) : 1.001586723642404

Root Mean Squared Error (RMSE): 1.000793047359145

Our MAE is below 1, while MSE and RMSE sit just above 1. From an interpretation standpoint, I think MAE is a better metric for linear problems than RMSE, as RMSE does not describe average error alone and has other implications that are more difficult to tease out and understand. Also, RMSE gives much more importance to large errors, so models tuned on it will try to minimize these as much as possible.
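The three metrics are simple enough to compute by hand, which helps to see how one large error affects each of them. The prices below are made up for illustration.

```python
import math

actual = [10.0, 12.0, 15.0, 20.0]     # made-up prices for illustration
predicted = [11.0, 11.0, 17.0, 16.0]

errors = [p - a for a, p in zip(actual, predicted)]  # 1, -1, 2, -4

mae = sum(abs(e) for e in errors) / len(errors)   # average absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # average squared error
rmse = math.sqrt(mse)                             # back in price units

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```

The single error of 4 accounts for most of the MSE, which is why RMSE ends up above MAE in this example.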

In [31]:

`dfr.describe()`

Out[31]:

# Model Accuracy

In [32]:

x2 = dfr.Actual_Price.mean()

y2 = dfr.Predicted_Price.mean()

Accuracy1 = x2/y2*100

print("The accuracy of the model is " , Accuracy1)

The accuracy of the model is 99.68318915929602
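One caveat on this figure: a ratio of means compares averages, so individual errors can cancel out. As a complementary check, the sketch below compares it with the mean absolute percentage error (MAPE), which is a different measure from the one used above. The numbers are made up, and both helper names are my own.

```python
def ratio_of_means(actual, predicted):
    """The accuracy figure used above: mean(actual) / mean(predicted) * 100."""
    mean_actual = sum(actual) / len(actual)
    mean_predicted = sum(predicted) / len(predicted)
    return mean_actual / mean_predicted * 100


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual) * 100


actual = [10.0, 20.0]      # made-up prices for illustration
predicted = [20.0, 10.0]   # both predictions are off by 10

print(ratio_of_means(actual, predicted))
print(mape(actual, predicted))
```

Here the ratio of means looks perfect even though every prediction is far off, so it is worth reading it alongside MAE, RMSE, or MAPE.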

In [33]:

plt.scatter(dfr.Actual_Price, dfr.Predicted_Price, color='Darkblue')

plt.xlabel("Actual Price")

plt.ylabel("Predicted Price")

plt.show()

In [34]:

`plt.plot(dfr.Actual_Price, color='black', label='Actual Price')`

plt.plot(dfr.Predicted_Price, color='lightblue', label='Predicted Price')

plt.title("Nio prediction chart")

plt.legend();

# Conclusion

The stock market has always been one of the hottest topics when it comes to time series forecasting or trying to get a feel for where the market is going overall. It’s impossible to find a “go-to” formula to predict the direction of the stock market, because of the market’s constant volatility and the uncertainty of the many moving variables that affect it, from company-specific risk to political instability and macroeconomic factors. The list could go on.

To have better visibility on where the market is going, relying on regression models and predicting certain values based on past performance is not good enough. The following points should complement a full-fledged regression model report.

# 1- Fundamental analysis

Fundamental analysis is a method of estimating a company’s intrinsic value based on historical and current performance data, which come in the form of financial statements and balance sheet information. This information can be used to compute the company’s current multiples, such as P/E and P/B, as well as liquidity ratios, debt ratios, return ratios, margins, etc. These figures can give you a solid conviction about the direction of the company and help you decide whether or not to invest in it.
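As a small illustration, two of these multiples can be recomputed from the `nio.info` output shown earlier. This is only a sketch of the arithmetic, using the price, book value, and trailing EPS fields from that output.

```python
# Values copied from the nio.info output shown earlier
price = 37.14          # regularMarketPrice
book_value = 17.315    # bookValue (per share)
trailing_eps = -0.724  # trailingEps

price_to_book = price / book_value          # P/B multiple
price_to_earnings = price / trailing_eps    # trailing P/E multiple

print(round(price_to_book, 4))
print(round(price_to_earnings, 2))
```

The P/B result lines up with the `priceToBook` field in the same output, while the negative P/E simply reflects that trailing earnings were negative.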

# 2- Technical Analysis

Technical analysis is the practice of applying statistical methods to historical trading data, for example the daily traded volume or value of a stock, and evaluating historical patterns to predict future stock price movement.
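One of the simplest tools of this kind is a moving average of closing prices. The sketch below is a minimal pure-Python version; the helper name `simple_moving_average` and the 3-day window are my own choices, and the prices are the last five closes from the series printed earlier, rounded to two decimals.

```python
def simple_moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Last five closing prices from the Close series printed earlier, rounded
closes = [40.00, 37.27, 38.70, 38.12, 37.14]

print(simple_moving_average(closes, window=3))
```

A longer window smooths out more of the day-to-day noise, at the cost of reacting more slowly to a change in trend.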

# 3- Sentiment Analysis

Sentiment analysis is the use of natural language processing to determine whether a given piece of text is positive, negative, or neutral. You might run this analysis on paragraphs, large collections of text, customer reviews, research theses, scientific papers, etc. In our case, you might use it to analyze the Twitter account of the company in question, reviews from its Facebook page, and so on.

About the Writer : **Abdalla A. Mahgoub**

Data Scientist | Investment Ops Analyst | Data Science Enthusiast | ML | Big Data | Python | SQL | FinTech |Strategic Planner | Business developer |Speaker | Writer | Full Stack developer and a UX designer