Published in

CodeX

# Stock Prediction using Regression Algorithm in Python

## An end-to-end explanation on using ML algorithms to predict stock prices

For this exercise, I will use the yfinance library to pull data directly from Yahoo Finance, a website that gives a quick overview of listed equities, funds, and investment data for the company you are looking for. I will then fit a regression model to that data and use it to predict stock prices.

# Importing all necessary libraries

First, we will install and import the libraries required for this exercise. The main library to pay attention to here is yfinance, which lets us extract different kinds of data from Yahoo Finance.

In [1]:

```
!pip install yfinance
```

```
Collecting yfinance
  Downloading yfinance-0.1.59.tar.gz (25 kB)
Requirement already satisfied: pandas>=0.24 in /opt/conda/lib/python3.7/site-packages (from yfinance) (1.2.2)
Requirement already satisfied: numpy>=1.15 in /opt/conda/lib/python3.7/site-packages (from yfinance) (1.19.5)
Requirement already satisfied: requests>=2.20 in /opt/conda/lib/python3.7/site-packages (from yfinance) (2.25.1)
Collecting multitasking>=0.0.7
  Downloading multitasking-0.0.9.tar.gz (8.1 kB)
Requirement already satisfied: lxml>=4.5.1 in /opt/conda/lib/python3.7/site-packages (from yfinance) (4.6.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.24->yfinance) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.24->yfinance) (2021.1)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas>=0.24->yfinance) (1.15.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (1.26.3)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (2020.12.5)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.7/site-packages (from requests>=2.20->yfinance) (3.0.4)
Building wheels for collected packages: yfinance, multitasking
  Building wheel for yfinance (setup.py) ... done
  Created wheel for yfinance: filename=yfinance-0.1.59-py2.py3-none-any.whl size=23442 sha256=33c81ce98e6e86be0df9c086ce8b9f54138a14a1079c79d74003a7cfc00b8974
  Stored in directory: /root/.cache/pip/wheels/26/af/8b/fac1b47dffef567f945641cdc9b67bb25fae5725d462a8cf81
  Building wheel for multitasking (setup.py) ... done
  Created wheel for multitasking: filename=multitasking-0.0.9-py3-none-any.whl size=8368 sha256=c14a0494e534aacbc1170fdaed133fb002cae59c00456e90086fe6970f5dd186
  Stored in directory: /root/.cache/pip/wheels/ae/25/47/4d68431a7ec1b6c4b5233365934b74c1d4e665bf5f968d363a
Successfully built yfinance multitasking
Installing collected packages: multitasking, yfinance
Successfully installed multitasking-0.0.9 yfinance-0.1.59
```

In [2]:

```python
# Let's start by importing all the dependencies that we will use for this exercise
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.model_selection import train_test_split
import yfinance as yf  # We will use this library to download the latest data from the Yahoo API
%matplotlib inline
plt.style.use('fivethirtyeight')
```

# Data collection and exploration

In this section, we will use yfinance methods to retrieve the company description, stock prices, closing prices, volumes, and corporate actions associated with the stock.

In [3]:

```python
# Define the ticker you will use
nio = yf.Ticker('NIO')

# Display stock information; this gives a summary description of the ticker
nio.info
```

Out[3]:

`{'zip': '201804', 'sector': 'Consumer Cyclical', 'fullTimeEmployees': 7763, 'longBusinessSummary': 'NIO Inc. designs, develops, manufactures, and sells smart electric vehicles in Mainland China, Hong Kong, the United States, the United Kingdom, and Germany. The company offers five, six, and seven-seater electric SUVs, as well as smart electric sedans. It is also involved in the provision of energy and service packages to its users; marketing, design, and technology development activities; manufacture of e-powertrains, battery packs, and components; and sales and after sales management activities. In addition, the company offers power solutions, including Power Home, a home charging solution; Power Swap, a battery swapping service; Public Charger, a public fast charging solution; Power Mobile, a mobile charging service through charging vans; Power Map, an application that provides access to a network of public chargers and their real-time information; and One Click for Power valet service, where it offers vehicle pick up, charging, and return services. Further, it provides repair, maintenance, and bodywork services through its NIO service centers and authorized third-party service centers; statutory and third-party liability insurance, and vehicle damage insurance through third-party insurers; courtesy car services; and roadside assistance, as well as data packages; and auto financing services. Additionally, the company offers NIO Certified, an used vehicle inspection, evaluation, acquisition, and sales service. NIO Inc. has a strategic collaboration with Mobileye N.V. for the development of automated and autonomous vehicles for consumer markets. The company was formerly known as NextEV Inc. and changed its name to NIO Inc. in July 2017. NIO Inc. 
was founded in 2014 and is headquartered in Shanghai, China.', 'city': 'Shanghai', 'phone': '86 21 6908 2018', 'country': 'China', 'companyOfficers': [], 'website': 'http://www.nio.com', 'maxAge': 1, 'address1': 'Building 20', 'industry': 'Auto Manufacturers', 'address2': 'No. 56 AnTuo Road Anting Town Jiading District', 'previousClose': 38.12, 'regularMarketOpen': 37.96, 'twoHundredDayAverage': 43.683704, 'trailingAnnualDividendYield': None, 'payoutRatio': 0, 'volume24Hr': None, 'regularMarketDayHigh': 38, 'navPrice': None, 'averageDailyVolume10Day': 70446940, 'totalAssets': None, 'regularMarketPreviousClose': 38.12, 'fiftyDayAverage': 41.888824, 'trailingAnnualDividendRate': None, 'open': 37.96, 'toCurrency': None, 'averageVolume10days': 70446940, 'expireDate': None, 'yield': None, 'algorithm': None, 'dividendRate': None, 'exDividendDate': None, 'beta': 2.613181, 'circulatingSupply': None, 'startDate': None, 'regularMarketDayLow': 36.76, 'priceHint': 2, 'currency': 'USD', 'regularMarketVolume': 52839807, 'lastMarket': None, 'maxSupply': None, 'openInterest': None, 'marketCap': 60854632448, 'volumeAllCurrencies': None, 'strikePrice': None, 'averageVolume': 97445091, 'priceToSalesTrailing12Months': 3.743073, 'dayLow': 36.76, 'ask': 37.43, 'ytdReturn': None, 'askSize': 3000, 'volume': 52839807, 'fiftyTwoWeekHigh': 66.99, 'forwardPE': -371.4, 'fromCurrency': None, 'fiveYearAvgDividendYield': None, 'fiftyTwoWeekLow': 2.88, 'bid': 37.21, 'tradeable': False, 'dividendYield': None, 'bidSize': 1400, 'dayHigh': 38, 'exchange': 'NYQ', 'shortName': 'NIO Inc.', 'longName': 'NIO Inc.', 'exchangeTimezoneName': 'America/New_York', 'exchangeTimezoneShortName': 'EDT', 'isEsgPopulated': False, 'gmtOffSetMilliseconds': '-14400000', 'quoteType': 'EQUITY', 'symbol': 'NIO', 'messageBoardId': 'finmb_311626862', 'market': 'us_market', 'annualHoldingsTurnover': None, 'enterpriseToRevenue': 2.107, 'beta3Year': None, 'profitMargins': -0.34511003, 'enterpriseToEbitda': -9.618, 
'52WeekChange': 11.664452, 'morningStarRiskRating': None, 'forwardEps': -0.1, 'revenueQuarterlyGrowth': None, 'sharesOutstanding': 1638520064, 'fundInceptionDate': None, 'annualReportExpenseRatio': None, 'bookValue': 17.315, 'sharesShort': 65406021, 'sharesPercentSharesOut': 0.0399, 'fundFamily': None, 'lastFiscalYearEnd': 1609372800, 'heldPercentInstitutions': 0.36421, 'netIncomeToCommon': -5610789888, 'trailingEps': -0.724, 'lastDividendValue': None, 'SandP52WeekChange': 0.45070732, 'priceToBook': 2.1449609, 'heldPercentInsiders': 0.00533, 'nextFiscalYearEnd': 1672444800, 'mostRecentQuarter': 1609372800, 'shortRatio': 0.53, 'sharesShortPreviousMonthDate': 1614297600, 'floatShares': 1327183283, 'enterpriseValue': 34251128832, 'threeYearAverageReturn': None, 'lastSplitDate': None, 'lastSplitFactor': None, 'legalType': None, 'lastDividendDate': None, 'morningStarOverallRating': None, 'earningsQuarterlyGrowth': None, 'dateShortInterest': 1617148800, 'pegRatio': 4298.73, 'lastCapGain': None, 'shortPercentOfFloat': None, 'sharesShortPriorMonth': 48084071, 'impliedSharesOutstanding': None, 'category': None, 'fiveYearAverageReturn': None, 'regularMarketPrice': 37.14, 'logo_url': 'https://logo.clearbit.com/nio.com'}`
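The `info` dictionary is large, and many of its fields are `None`. As a minimal sketch, the snippet below pulls out just a handful of fields. The helper name `summarize_info` and the choice of keys are illustrative assumptions, and the sample values are copied from the output above rather than fetched live.

```python
# Hypothetical helper: pick a few fields out of a Ticker.info-style dictionary.
def summarize_info(info, keys=("shortName", "sector", "industry", "marketCap", "beta")):
    """Return the requested fields, using None for any key that is missing."""
    return {key: info.get(key) for key in keys}


# Sample values copied from the nio.info output above (not fetched live).
sample_info = {
    "shortName": "NIO Inc.",
    "sector": "Consumer Cyclical",
    "industry": "Auto Manufacturers",
    "marketCap": 60854632448,
    "beta": 2.613181,
}

summary = summarize_info(sample_info)
for field, value in summary.items():
    print(f"{field}: {value}")
```

Using `dict.get` means a field that Yahoo does not return simply shows up as `None` instead of raising a `KeyError`.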

# Data preparation and cleaning for Regression analysis

The historical price data from Yahoo Finance is already well structured, so cleaning and processing the exported data will not be a difficult task.
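Even so, a quick check for missing values, duplicate dates, and date ordering is cheap to run before modelling. The sketch below does this on a tiny hand-made frame that mimics the column layout of the yfinance history table. The three closing prices are taken from the series printed further down; every other number, including the deliberately missing `Open` value, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with the same column layout as the yfinance history table.
# Only the Close values come from the article; the rest are made-up numbers.
sample = pd.DataFrame(
    {
        "Open": [6.0, 6.5, np.nan],
        "High": [6.9, 12.7, 13.8],
        "Low": [5.4, 6.5, 9.2],
        "Close": [6.6, 11.6, 9.9],
        "Volume": [1000, 2000, 3000],
    },
    index=pd.to_datetime(["2018-09-12", "2018-09-13", "2018-09-14"]),
)

missing_per_column = sample.isna().sum()            # count of NaN values per column
duplicated_dates = sample.index.duplicated().sum()  # number of repeated dates
is_sorted = sample.index.is_monotonic_increasing    # are the dates in chronological order?

print(missing_per_column)
print("duplicate dates:", duplicated_dates)
print("chronological:", is_sorted)
```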

In [4]:

```python
history = nio.history(period="max")  # "max" requests the full available price history
df = pd.DataFrame(history)
df.head(10)
```

Out[4]:

In [5]:

```python
# Defining x and y
x = df.index
y = df['Close']
y
```

Out[5]:

```
Date
2018-09-12     6.600000
2018-09-13    11.600000
2018-09-14     9.900000
2018-09-17     8.500000
2018-09-18     7.680000
                ...
2021-04-06    40.000000
2021-04-07    37.270000
2021-04-08    38.700001
2021-04-09    38.119999
2021-04-12    37.139999
Name: Close, Length: 649, dtype: float64
```

In [6]:

```python
# Data exploration
# I like to set up a plot function so I can reuse it at later stages of this analysis
def df_plot(data, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16, 5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()
```

In [7]:

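```python
stock_name = "NIO"
# Build the title as a single string (a tuple would be rendered with its parentheses)
title = f"{stock_name} historical stock performance to date"
df_plot(df, x, y, title=title, xlabel='Date', ylabel='Value', dpi=100)
```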

In [8]:

```python
# Data processing and scaling
df.reset_index(inplace=True)  # reset the index and convert it to a column
```

In [9]:

`df.head(2)`

Out[9]:

In [10]:

`df.columns = ['date','open','high','low','close','vol','divs','split']`

In [11]:

```python
df = df.drop(columns=['divs', 'split'])  # drop the columns we do not need and keep the result
df.head(2)
```

Out[11]:

In [12]:

`df['date'] = pd.to_datetime(df.date)`

In [13]:

`df.describe()`

Out[13]:

In [14]:

```python
print(len(df))
```

```
649
```

In [15]:

```python
x = df[['open', 'high', 'low', 'vol']]
y = df['close']
```

# Data Split

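For this data, I’ve split the data into training and test datasets, with the test set taking 15% of the total. Because the split uses `shuffle=False`, it is chronological: the earliest rows go into the training set and the most recent ones into the test set. Afterward, we can check that the data was split successfully by inspecting the `shape` attribute of each part.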

In [16]:

```python
# Linear regression model for stock prediction
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.15, shuffle=False, random_state=0)
```

In [17]:

```python
# Let's check if the total number of observations makes sense
print(train_x.shape)
print(test_x.shape)
print(train_y.shape)
print(test_y.shape)
```

```
(551, 4)
(98, 4)
(551,)
(98,)
```
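To see what `shuffle=False` does to the ordering, here is a small sketch that splits a plain array of 649 row positions instead of the real price data. The array `rows` is only a stand-in for the chronologically ordered DataFrame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for 649 chronologically ordered rows (just their positions).
rows = np.arange(649)

# Same settings as in the article: 15% test size, no shuffling.
train_rows, test_rows = train_test_split(rows, test_size=0.15, shuffle=False)

print(len(train_rows), len(test_rows))  # sizes of the two parts
print(train_rows[-1], test_rows[0])     # last training row and first test row
```

Every training row comes before every test row, so the model is always evaluated on dates later than the ones it was fitted on.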

# Regression algorithm model implementation

Before we get to the technical part of applying the regression model to the dataset, let’s talk a bit about regression itself. Regression is a set of techniques for estimating relationships between variables. For example, we can relate the force applied to a spring to the distance the spring stretches (as in Hooke’s law), or describe how the number of transistors the semiconductor industry can pack into a circuit grows over time (Moore’s law).

The equation for linear regression can be written as follows:

In [18]:

```python
import os
from IPython.display import Image
print("**Linear Regression Formula**")
Image(filename="../input/stock-prediction-using-regression-algorithm/1.JPG", width=250, height=100)
```

```
**Linear Regression Formula**
```

Out[18]:

Also

In [19]:

```python
print("**Regression Formula**")
Image(filename="../input/stock-prediction-using-regression-algorithm/2.JPG", width=280, height=110)
```

```
**Regression Formula**
```

Out[19]:

Where x1, x2, …, xn represent the independent variables, while the coefficients θ1, θ2, …, θn represent the weights.
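The formula images themselves are not reproduced in this text. The standard forms that match the description above are given here. The intercept symbol $\theta_0$ and the hat on $\hat{y}$ are the usual notation and are an assumption about what the original images showed.

$$
\hat{y} = \theta_0 + \theta_1 x
$$

$$
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
$$

The first line is the single-variable case and the second is the general case with n independent variables.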

In [20]:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix, accuracy_score
regression = LinearRegression()
regression.fit(train_x, train_y)
print("regression coefficient", regression.coef_)
print("regression intercept", regression.intercept_)
```

```
regression coefficient [-6.51840470e-01  8.48419125e-01  8.12048390e-01 -3.50557805e-10]
regression intercept -0.0315814559475216
```
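To make the role of the coefficients and the intercept concrete, the following sketch fits a `LinearRegression` on made-up synthetic data (not the NIO prices) and reproduces its predictions by hand as the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (made up): the target depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2))
target = 1.5 + 2.0 * features[:, 0] - 0.5 * features[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(features, target)

# A prediction is the intercept plus the weighted sum of the features.
manual = model.intercept_ + features @ model.coef_
library = model.predict(features)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("manual matches predict():", np.allclose(manual, library))
```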

# The coefficient of determination R²

Here we will compute the coefficient of determination, denoted by R². For a linear model with an intercept it lies between 0 and 1 on the training data, although on held-out data it can turn negative when the model does worse than simply predicting the mean. The higher the R², the more of the variation in the Y values the linear regression explains; in our case the Y values are the closing stock prices of the company. The math behind the coefficient of determination R² is shown below.

In [21]:

```python
print("**List of equations**")
Image(filename="../input/stock-prediction-using-regression-algorithm/3.JPG", width=400, height=250)
```

```
**List of equations**
```

Out[21]:
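The equation image is not reproduced in this text. The usual definition of R², which is assumed here to be what the image showed, is built from the residual and total sums of squares:

$$
SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$

Here $y_i$ are the actual values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the actual values.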

In [22]:

```python
# The coefficient of determination R²
regression_confidence = regression.score(test_x, test_y)
print("linear regression confidence: ", regression_confidence)
```

```
linear regression confidence:  0.9836914831421212
```

The coefficient of determination R² on our test data is about 0.98, which means that the linear model explains roughly 98% of the variation in the Y values.
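As a rough illustration of what `score` computes, the sketch below calculates R² by hand on a few made-up numbers and compares it with scikit-learn's `r2_score`. The function name `r_squared` and the sample values are assumptions for this example only.

```python
import numpy as np
from sklearn.metrics import r2_score


def r_squared(actual, predicted):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
actual = [10.0, 12.0, 11.0, 15.0]
good = [10.5, 11.5, 11.0, 14.5]   # close to the actual values
bad = [15.0, 10.0, 16.0, 9.0]     # worse than always predicting the mean

print(r_squared(actual, good), r2_score(actual, good))
print(r_squared(actual, bad))
```

The second result is below zero, which shows that on held-out data the score is not guaranteed to stay between 0 and 1.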

# Prediction

As we can see below, the test rows from open, high, low, and vol are labelled by their integer index rather than by date, since the date became a regular column after the index reset. Because the split was not shuffled, the rows are still in chronological order, so no sorting is needed before we plot the predictions later with the scatter() and plot() methods.

In [23]:

```python
predicted = regression.predict(test_x)
print(test_x.head())
```

```
          open       high        low        vol
551  45.750000  46.720001  42.500000  271678300
552  45.360001  48.919998  44.680000  233779100
553  48.270000  50.590000  47.880001  209106300
554  50.860001  55.700001  50.480000  270203000
555  56.990002  57.200001  51.500000  243669700
```

In [24]:

`predicted.shape`

Out[24]:

`(98,)`

# Prediction Table of Actual Prices vs Predicted values

In [25]:

```python
dfr = pd.DataFrame({'Actual_Price': test_y, 'Predicted_Price': predicted})
dfr.head(10)
```

Out[25]:

The table below displays summary statistics of the actual values versus the predicted values of the dataset.

In [26]:

`dfr.describe()`

Out[26]:

# Model Evaluation

MAE and RMSE are among the most common metrics for measuring the error of predictions of continuous variables or, in our case, the accuracy of our regression model.

The math behind these metrics might seem confusing or a bit of a mouthful at first, but think about it this way: we have the actual closing prices and the closing prices predicted by the model for the same days, and we now need to calculate the error, or difference, between them to see how accurate the predictions are compared with the actual values at hand.

# Mean Absolute Error (MAE):

MAE measures the average magnitude of the errors in a set of predictions, without considering their direction.

In [27]:

```python
print("**Mean Absolute Error (MAE)**")
Image(filename="../input/stock-prediction-using-regression-algorithm/4.JPG", width=400, height=250)
```

```
**Mean Absolute Error (MAE)**
```

Out[27]:

# Root mean squared error (RMSE):

RMSE is a quadratic scoring rule that also measures the average magnitude of the error.

In [28]:

```python
print("**Root mean squared error (RMSE)**")
Image(filename="../input/stock-prediction-using-regression-algorithm/5.JPG", width=400, height=250)
```

```
**Root mean squared error (RMSE)**
```

Out[28]:

# Mean squared error (MSE) :

In [29]:

```python
print("** Mean squared error (MSE)**")
Image(filename="../input/stock-prediction-using-regression-algorithm/6.JPG", width=400, height=250)
```

```
** Mean squared error (MSE)**
```

Out[29]:

Mean squared error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss.
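Since the formula images are not reproduced in this text, the standard definitions of the three metrics are written out here, with $y_i$ the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of observations. They are assumed to match what the images showed.

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

$$
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$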

All of the metrics mentioned above range from 0 to ∞ and are indifferent to the direction of errors. They are negatively oriented scores, which means lower values are better. Remember that RMSE is simply the square root of MSE, so it is larger than MSE only when MSE is below 1 and smaller otherwise. Because the errors are squared before averaging, both give more weight to large errors than MAE does, and RMSE has the advantage of being expressed in the same units as the price itself, which makes it easier to interpret than MSE.

In our case, the evaluation results are as follows:

In [30]:

```python
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(test_y, predicted))
print('Mean Squared Error (MSE) :', metrics.mean_squared_error(test_y, predicted))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(test_y, predicted)))
```

```
Mean Absolute Error (MAE): 0.7581175544856527
Mean Squared Error (MSE) : 1.001586723642404
Root Mean Squared Error (RMSE): 1.000793047359145
```

Our MAE is about 0.76, while MSE and RMSE both come out at roughly 1.00, so a typical prediction misses the actual close by about a dollar or less. From an interpretation standpoint, I think MAE is a better metric for linear problems than RMSE, as RMSE does not describe average error alone and has other implications that are more difficult to tease out and understand. Also, RMSE gives much more importance to large errors, so a model tuned to minimize it will work hardest at avoiding those.
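The effect of a single large error on each metric can be seen in a small hand calculation. The sketch below uses made-up prices, and the helper name `regression_errors` is hypothetical; it simply applies the standard definitions with NumPy.

```python
import numpy as np


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) computed directly from their definitions."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    return mae, mse, rmse


# Made-up prices: three small misses and one large miss of 3.0.
actual = [45.0, 46.0, 48.0, 52.0]
predicted = [44.5, 46.5, 47.0, 55.0]

mae, mse, rmse = regression_errors(actual, predicted)
print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
```

The one large miss pulls RMSE well above MAE, which is the sensitivity to large errors described above.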

In [31]:

`dfr.describe()`

Out[31]:

# Model Accuracy

In [32]:

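```python
x2 = dfr.Actual_Price.mean()
y2 = dfr.Predicted_Price.mean()
# Ratio of the mean actual price to the mean predicted price, as a percentage
Accuracy1 = x2 / y2 * 100
print("The accuracy of the model is ", Accuracy1)
```

```
The accuracy of the model is  99.68318915929602
```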
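The figure above compares the two averages, so errors in opposite directions cancel each other out. As a complementary view, and not the method used in this article, the mean absolute percentage error (MAPE) averages the error of each individual prediction. The sketch below contrasts the two on made-up numbers; the function is written by hand for illustration.

```python
import numpy as np


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, expressed as a percentage."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100


# Made-up example: one prediction is 10 too high, the other 10 too low.
actual = [40.0, 50.0]
predicted = [50.0, 40.0]

ratio_of_means = np.mean(actual) / np.mean(predicted) * 100
mape = mean_absolute_percentage_error(actual, predicted)

print("ratio of means (%):", ratio_of_means)
print("MAPE (%):", mape)
```

Here the ratio of means reports 100% even though each prediction is off by 20% or more, while MAPE reflects the individual misses.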

In [33]:

```python
plt.scatter(dfr.Actual_Price, dfr.Predicted_Price, color='Darkblue')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.show()
```

In [34]:

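```python
# Label each line so that plt.legend() has entries to display
plt.plot(dfr.Actual_Price, color='black', label='Actual price')
plt.plot(dfr.Predicted_Price, color='lightblue', label='Predicted price')
plt.title("Nio prediction chart")
plt.legend();
```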

# Conclusion

The stock market has always been one of the hottest topics when it comes to time series forecasting or trying to get a feel for where the market is heading overall. It is impossible to find a “go-to” formula that predicts the direction of the stock market, because of the market’s constant volatility and the uncertainty of the many variables that can affect it, from company-specific risk to political instability and macroeconomic factors, and the list could go on.

To have better visibility on where the market is going, relying on regression models and predicting certain values based on past performance is not good enough. The following points should complement a full-fledged regression model report.

# 1- Fundamental analysis

Fundamental analysis is a method for estimating a company’s intrinsic value based on historical and current performance data, which come in the form of financial statements such as the income statement and balance sheet. This information can be analyzed to compute the company’s current multiples and ratios, such as P/E, P/B, liquidity ratios, debt ratios, return ratios, margins, etc. These can give you a solid conviction about the direction of the company and help you decide whether or not to invest in it.
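As a small worked example of such multiples, the sketch below recomputes the price-to-book ratio from three fields of the `nio.info` output shown earlier and checks the price-to-earnings ratio. The variable names are illustrative, and the rule of leaving P/E undefined for negative earnings is a common convention assumed here.

```python
# Values copied from the nio.info output shown earlier in this article.
price = 37.14          # regularMarketPrice
book_value = 17.315    # bookValue (per share)
trailing_eps = -0.724  # trailingEps

# Price-to-book: share price divided by book value per share.
price_to_book = price / book_value

# Price-to-earnings is normally only quoted when earnings are positive.
price_to_earnings = price / trailing_eps if trailing_eps > 0 else None

print(f"P/B: {price_to_book:.4f}")
print("P/E:", price_to_earnings)
```

The computed P/B agrees with the `priceToBook` value of 2.1449609 reported in the same output.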

# 2- Technical Analysis

Technical analysis is the practice of applying statistical methods to historical trading data, for example the daily total volume or value of a traded stock, and evaluating historical patterns and trends to predict future stock price movement.
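Two of the simplest technical indicators are a moving average and the daily return. The following sketch computes both with pandas on the last five closing prices from the series printed earlier, rounded to two decimals. The 3-day window is an arbitrary choice for illustration.

```python
import pandas as pd

# Last five closing prices from the series printed earlier, rounded to two decimals.
close = pd.Series(
    [40.00, 37.27, 38.70, 38.12, 37.14],
    index=pd.to_datetime(
        ["2021-04-06", "2021-04-07", "2021-04-08", "2021-04-09", "2021-04-12"]
    ),
    name="close",
)

sma_3 = close.rolling(window=3).mean()  # 3-day simple moving average
daily_return = close.pct_change()       # day-over-day fractional change

print(pd.DataFrame({"close": close, "sma_3": sma_3, "daily_return": daily_return}))
```

The first two moving-average values are empty because a 3-day window needs three observations before it can produce a result.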

# 3- Sentiment Analysis

Sentiment analysis is the use of natural language processing to determine whether a given piece of text is positive, negative, or neutral. You can run this analysis on paragraphs, large bodies of text, customer reviews, research theses, scientific papers, etc. In our case, you might use this method to analyze the Twitter account of the company in question, reviews on its Facebook page, and so on.
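To illustrate the positive, negative, or neutral labelling in the simplest possible way, here is a toy word-counting classifier. It is only a sketch: the word lists and the function name `classify_sentiment` are made up for this example, and a real project would rely on a trained NLP model rather than a fixed lexicon.

```python
# Toy word lists (made up); a real system would use a trained NLP model instead.
POSITIVE = {"good", "great", "strong", "growth", "beat"}
NEGATIVE = {"bad", "weak", "loss", "miss", "recall"}


def classify_sentiment(text):
    """Label text by counting positive and negative words from the toy lexicon."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up example sentences.
for sentence in [
    "Strong deliveries growth this quarter!",
    "Weak guidance and a wider loss.",
    "Earnings call scheduled for Tuesday.",
]:
    print(classify_sentiment(sentence), "<-", sentence)
```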