Applying Linear Regression on Bitcoin’s historical data

Márcio Oliveira
Coinmonks
5 min read · Sep 12, 2021

Hi folks! In this article I am going to share with you another learning experience on my path towards Artificial Intelligence: how I used a linear regression model to try to predict Bitcoin’s price based on its historical data.

Do you think it is possible? Let’s see in the article.

The general approach

First things first: I follow a process (a set of questions) that works as a kind of “general approach” to Machine Learning problems. I learned it in Codecademy’s course “Build a Machine Learning Model With Python”. Here is the deal:

  • What do we want to answer / accomplish? Predict Bitcoin’s price tomorrow based on its historical data.
  • What data is relevant to answering this question? BTC historical data provided by Yahoo Finance, such as open price, close price, trading volume, etc. The plan is to use this data to try to predict Bitcoin’s value on the next day.
  • What data cleaning and feature engineering can be done? Removal of empty values, feature normalization, maybe adding quadratic features.
  • Which model best fits the problem? I’m going to use a Linear Regression model because it is my object of study.
  • What is our success metric? Are we looking for accuracy? Precision? How much? I don’t know yet... 😎 I guess I will use the Mean Absolute Error to evaluate the model’s predictions, and anything below 100 bucks on average would be considered a success.
  • Use the model and present the results. Ok, let’s code!

Project Repository

You can check out the project at https://github.com/marciojmo/stock-price-predictor.git

Getting Started

Okay, so the first thing I did was actually get the data from Yahoo Finance. I used the pandas_datareader library to pull the data straight from the internet (so cool); we can also change the ticker and grab any stock data we want. After that, I used the pandas DataFrame .head() method to take a look at what we got.

Grabbing data from yahoo finance using pandas_datareader library.
Taking a look at the data we got.
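
A minimal sketch of the download step, not the exact code from the repository. The helper name fetch_history, the default ticker and the start date are assumptions made for illustration. The Yahoo Finance backend of pandas_datareader has been unreliable over the years, so the call may need adjusting.

```python
import datetime as dt


def fetch_history(ticker="BTC-USD", start=dt.date(2015, 1, 1), end=None):
    """Download daily price history for `ticker` from Yahoo Finance.

    Hypothetical helper: the name and default arguments are illustrative.
    """
    # Imported lazily so this module loads even if the library is missing.
    from pandas_datareader import data as web

    end = end or dt.date.today()
    # DataReader returns a pandas DataFrame indexed by date.
    return web.DataReader(ticker, data_source="yahoo", start=start, end=end)


# Usage (needs network access):
#   df = fetch_history("BTC-USD")
#   print(df.head())
```

Changing the ticker argument is all it takes to grab a different asset.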

The Goal

Since my goal was to use just this data to predict Bitcoin’s price tomorrow, I added a new column named “Prediction” that is a copy of the “Close” column shifted one position up. This way every row in the dataset has an array of features (including the close price) mapping to Bitcoin’s closing price on the next day.

Adding a new column to the data (Prediction)
Prediction value added. Note that it matches the Close price of the next day.
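
The shift can be sketched on a tiny, made-up frame (the prices and dates below are invented for illustration). Shifting by -1 moves every value one row up, so the last row has no target yet and ends up as NaN.

```python
import pandas as pd

# Toy closing prices; the real data comes from Yahoo Finance.
df = pd.DataFrame(
    {"Close": [100.0, 105.0, 103.0, 110.0]},
    index=pd.date_range("2021-09-01", periods=4, freq="D"),
)

# Each row now carries the next day's close as its target.
df["Prediction"] = df["Close"].shift(-1)
print(df)
```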

Data Visualization

With everything set, it was time to plot some values against the prediction price and see (visually) whether they have some kind of linear relationship. I used a for loop to iterate over all independent variables and plot each one against the Prediction value (our dependent variable in this case).

Plotting all independent variables against the dependent variable
Visualizing data relationships
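
A rough sketch of such a plotting loop, assuming matplotlib and a small invented data set with the usual Yahoo Finance column names. The layout (one row of scatter plots) is a choice made here and may differ from the original figure.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values, only to make the sketch runnable.
df = pd.DataFrame({
    "Open": [100.0, 104.0, 103.0, 109.0, 112.0],
    "High": [106.0, 107.0, 111.0, 113.0, 118.0],
    "Low": [98.0, 101.0, 102.0, 107.0, 110.0],
    "Close": [104.0, 103.0, 109.0, 112.0, 117.0],
    "Volume": [900.0, 400.0, 1200.0, 300.0, 800.0],
})
df["Prediction"] = df["Close"].shift(-1)

# One scatter plot per independent variable against the target.
features = [c for c in df.columns if c != "Prediction"]
fig, axes = plt.subplots(1, len(features), figsize=(4 * len(features), 3))
for ax, column in zip(axes, features):
    ax.scatter(df[column], df["Prediction"], s=10)
    ax.set_xlabel(column)
    ax.set_ylabel("Prediction")
fig.tight_layout()
```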

Cleaning and Normalizing features

After visualizing the data I decided to take the Volume column out of the equation because it doesn’t seem to provide a good linear relationship with the prediction price.

I also removed rows with empty values (NaNs and infinities) using the DataFrame isin() method and normalized the independent variables using the MinMaxScaler from scikit-learn’s preprocessing module.

Cleaning and normalizing features: removing nans and infs and scaling.
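
The cleaning step could look roughly like the sketch below. It uses a toy frame with one infinite value planted on purpose, and the variable names are illustrative rather than taken from the repository.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data: row 2 holds an inf, and the shift leaves a NaN in the last row.
df = pd.DataFrame({
    "Open": [100.0, 110.0, np.inf, 130.0, 125.0],
    "Close": [105.0, 115.0, 120.0, 128.0, 131.0],
    "Volume": [10.0, 12.0, 9.0, 11.0, 13.0],
})
df["Prediction"] = df["Close"].shift(-1)

# Drop the column that showed no clear linear relationship.
df = df.drop(columns=["Volume"])

# Keep only rows without NaN or infinite values.
bad_rows = df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
clean = df[~bad_rows]

# Scale every independent variable into the [0, 1] range.
X = clean.drop(columns=["Prediction"])
y = clean["Prediction"]
X_scaled = MinMaxScaler().fit_transform(X)
```

Note that fitting the scaler on the whole data set before splitting lets the test rows influence the scaling. Fitting it on the training rows only is the stricter option.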

Training and Testing

With our x’s and y’s set, it is time to train and test our model against the data.

I’m using the train_test_split() function from scikit-learn’s model_selection module to split the dataset into training (70%) and testing (30%) sets. Then I created a LinearRegression model from scikit-learn’s linear_model module and trained it on the training set using the fit() method.

After doing that, I tested the model against the test data and used the r2_score() and mean_absolute_error() functions from scikit-learn’s metrics module to evaluate the model’s performance.

Training and testing the Linear Regression model.
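
A self-contained sketch of the split, fit and evaluation steps. It uses a synthetic random walk instead of real prices, and the seed, step size and random_state are arbitrary choices made for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic "closing prices": a random walk starting around 30,000.
rng = np.random.default_rng(0)
close = 30000 + np.cumsum(rng.normal(0, 500, size=400))

X = close[:-1].reshape(-1, 1)  # today's close as the only feature
y = close[1:]                  # tomorrow's close as the target

# 70% of the rows for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R2: {r2:.4f}  MAE: {mae:.2f}")
```

train_test_split shuffles the rows by default, so the test set contains days scattered between the training days rather than a block at the end of the series.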

Results

As shown above, the model misses tomorrow’s price with a mean absolute error of $632.79 (2.53%). I really don’t know why the coefficient of determination is 1 in this case, and I will leave that to the statistics people to help me explain.
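
One plausible reading of the puzzling score (an interpretation, not something verified against the repository): the coefficient of determination compares the squared errors with the total spread of the target, R² = 1 - SS_res / SS_tot. Bitcoin’s price has ranged over tens of thousands of dollars, so errors of a few hundred dollars are tiny relative to that spread and the score rounds to 1. The numbers below are made up to illustrate the effect, and the helper names mae_manual and r2_manual are hypothetical.

```python
def mae_manual(y_true, y_pred):
    """Mean absolute error: average size of the misses."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented prices spread over a wide range, each missed by about $630.
y_true = [5000.0, 10000.0, 20000.0, 40000.0, 60000.0]
errors = [600.0, -650.0, 630.0, -640.0, 644.0]
y_pred = [t + e for t, e in zip(y_true, errors)]

mae = mae_manual(y_true, y_pred)
r2 = r2_manual(y_true, y_pred)
print(f"MAE: {mae:.2f}  R2: {r2:.4f}  rounded: {round(r2, 2)}")
```

So a score of 1 here does not mean the predictions are perfect, only that the misses are small compared with how much the price itself varies.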

I also ran the model against the last data row to predict tomorrow’s price, and here is what we got:

Predicting tomorrow’s BTC-USD price
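
A small sketch of predicting from the most recent row, using invented prices. The last row has no known “Prediction” value, so it is kept out of training and only used as input. Fitting the scaler on the known rows alone is an assumption made here, not necessarily what the repository does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Invented open/close prices; the last row is "today".
open_ = np.array([99.0, 100.0, 103.0, 101.0, 106.0, 108.0, 111.0, 109.0])
close = np.array([100.0, 103.0, 101.0, 106.0, 108.0, 111.0, 109.0, 114.0])
features = np.column_stack([open_, close])

X_known = features[:-1]   # rows whose next-day close is known
y_known = close[1:]       # the next-day close for those rows
X_latest = features[-1:]  # today's row, kept 2D for predict()

# Reuse the same scaler for training rows and for the latest row.
scaler = MinMaxScaler().fit(X_known)
model = LinearRegression().fit(scaler.transform(X_known), y_known)

tomorrow = model.predict(scaler.transform(X_latest))[0]
print(f"Predicted next close: {tomorrow:.2f}")
```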

Let’s take a look at the Yahoo Finance website...

BTC-USD data from Yahoo Finance

Pretty close, huh? Would you bet your money on this algorithm? I wouldn’t, haha. Better keep studying. See you!


Brazilian Software Engineer interested in Artificial Intelligence and Games.