Bitcoin Daily High Prediction

Jason Merwin

Published in

Coinmonks

13 min read · Apr 4, 2022

Using a machine learning classification algorithm to identify price patterns indicative of near term price movements.

Problem Introduction

Bitcoin is the leading digital currency in the world today, used by holders both for transactions and as a store of value. One barrier to broader Bitcoin adoption is its price instability, which creates a level of risk that most investors are uncomfortable with. Models that predict changes in Bitcoin prices more accurately are needed to better understand the drivers behind its volatility and to give potential investors tools to mitigate risk.

Project Overview

Most predictive models for market prices of stocks and cryptocurrencies are time-series based, such as Autoregressive Integrated Moving Average (ARIMA) models or machine learning algorithms like recurrent neural networks (RNN) and long short-term memory (LSTM) networks. Technical analysis of market price history is an alternative to time-series approaches for predicting market movements, as it is based on patterns present in the most recent price history. Well-known patterns like Wedges, Descending and Ascending Triangles, and Head and Shoulders are used as indicators of likely price movements. Machine learning models that recognize price patterns, rather than basing predictions on time-series data, could provide additional tools to better predict changes in Bitcoin prices.

Approximately three years of Bitcoin price data was downloaded from Yahoo Finance as a CSV file. This data was generously provided by Yahoo for free at https://www.yahoofinanceapi.com/. Historical Fear and Greed Index data for the same time period was downloaded using the free API provided by Alternative.me at https://api.alternative.me/. This index is a useful indicator of current market conditions, computed from a variety of sources to quantify investor emotion and sentiment regarding Bitcoin. The ta library was used to compute the Relative Strength Index (rsi) and the stochastic oscillators stoch_k and stoch_d from the historical data set. Daily price patterns of a fixed length were generated from the Bitcoin price history by transposing rows of daily highs into columns, in sequence, so that a series of prices is represented in a single row of data. Finally, a pair of simple moving averages (SMA) was calculated from the historical price data using 50 and 250 days of data, and a “Signal” flag indicating when the 50-day SMA was greater than the 250-day SMA was added to the training dataset. The technical details of how the data was compiled and prepared for machine learning can be found at the GitHub Repository for this project located here.
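
As a rough illustration of the transposition step, the sketch below turns a column of daily highs into fixed-length rows with pandas and NumPy. The toy prices, the `window` length, the `Day_i` column names, and the helper name `highs_to_patterns` are assumptions made for this example; it is not the project's own implementation.

```python
import numpy as np
import pandas as pd

def highs_to_patterns(highs: pd.Series, window: int) -> pd.DataFrame:
    """Return one row per day holding that day's high and the window-1 highs before it."""
    values = highs.to_numpy()
    # sliding_window_view yields every run of `window` consecutive values
    windows = np.lib.stride_tricks.sliding_window_view(values, window)
    columns = [f"Day_{i}" for i in range(window)]
    # index each pattern by the last day it contains
    return pd.DataFrame(windows, columns=columns, index=highs.index[window - 1:])

# toy daily highs (made-up numbers, not real Bitcoin prices)
highs = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0])
patterns = highs_to_patterns(highs, window=4)
print(patterns)
```

Six daily highs with a window of four give three pattern rows, each one shifted forward by a day.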

Strategy to solve the problem

The idea explored in this project is to use machine learning algorithms that are good at recognizing patterns for classification to identify Bitcoin price patterns associated with near-term increases. Pattern recognition models would take an approach similar to that of technical analysis, providing an alternative to time-series modeling. For this project the XGBoost classifier was chosen, an implementation of the gradient boosted trees algorithm with a history of high classification performance. The classifier is trained on historical daily high price patterns with a binary label indicating whether the subsequent day’s high is higher than the current day’s.
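
The binary label can be sketched in a few lines of pandas. This is a minimal example under assumed names (`add_next_day_label`, a `High` column, toy prices); a tie is labelled 0, matching the "higher than" wording above.

```python
import pandas as pd

def add_next_day_label(df: pd.DataFrame, column: str = "High") -> pd.DataFrame:
    """Add a binary Label: 1 if the next day's value is higher than today's, else 0."""
    out = df.copy()
    next_high = out[column].shift(-1)  # tomorrow's high aligned to today's row
    out["Label"] = (next_high > out[column]).astype(int)
    # the final row has no "tomorrow", so it cannot be labelled
    return out.iloc[:-1]

# toy daily highs (made-up numbers)
toy = pd.DataFrame({"High": [100.0, 102.0, 101.0, 105.0, 105.0]})
labelled = add_next_day_label(toy)
print(labelled)
```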

Model Performance Metrics

Model performance is measured using accuracy, computed as the ratio of correct predictions (true positives and true negatives) to the total number of predictions (true positives, true negatives, false positives, and false negatives). This metric was chosen because the model output is a binary prediction of the next day’s high, so it has two ways of being correct: both a true positive and a true negative should count. We therefore use accuracy to tune the model to maximize total correct predictions, both positive and negative, as opposed to metrics such as precision or recall, which focus on the positive class.
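
As a small worked example of the accuracy calculation, the snippet below uses hypothetical confusion-matrix counts (not results from this project).

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total

# hypothetical counts from 100 predictions
print(accuracy(tp=40, tn=33, fp=15, fn=12))  # 0.73
```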

Exploratory Data Analysis


Several aspects of the data set were explored, and optimal ranges for the engineered features were measured. An iterative approach was used to test a range of values for each engineered feature: each value in the range was used to produce a model, and its accuracy was measured a number of times in order to produce an accuracy boxplot.
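
The iterative approach can be sketched generically as follows. The `sweep` helper and the `fake_accuracy` stand-in are hypothetical; in the real workflow the scoring function would rebuild the dataset for the given value, fit a model, and return its measured accuracy.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterable, List

def sweep(values: Iterable[int],
          score_fn: Callable[[int], float],
          repeats: int = 5) -> Dict[int, List[float]]:
    """Score each candidate value `repeats` times and keep every score."""
    return {v: [score_fn(v) for _ in range(repeats)] for v in values}

# hypothetical stand-in for "prepare data with this value, fit, return accuracy";
# it only returns noise around a fixed curve so the example runs on its own
rng = random.Random(0)

def fake_accuracy(days: int) -> float:
    return 0.73 - 0.001 * abs(days - 4) + rng.uniform(-0.01, 0.01)

results = sweep(range(2, 8), fake_accuracy, repeats=5)
for days, scores in results.items():
    print(days, round(mean(scores), 3))
```

Each list of scores in `results` corresponds to one box in an accuracy boxplot.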

The optimal number of days to include in the price pattern

This model uses a set number of historical daily high values for Bitcoin from which to identify patterns associated with an increase in the daily high for the subsequent day. To determine the optimal number of historical observations for pattern recognition, a base XGBoost model was iterated through a range of days from 2 to 45, taking the average of 5 separate model accuracy scores for each day value. A box plot of the average accuracy score per number of historical days was produced from the results in order to identify the optimal number.

The plot indicates that a surprisingly small range of daily highs, between 3 and 5, produces the best accuracy for price prediction, approximately 0.73. Increasing the number of daily highs above 4 causes the accuracy to decline progressively until about 30 days, at which point it stabilizes and oscillates between roughly 0.68 and 0.7. Using this information, we will proceed with a daily observations value of 4.

Optimizing the length of history to include in the training data

After merging, calculating indicators, and removing rows with NAs (rows that occur before the indicators could be calculated), the complete data set contains around 1000 rows, or a little less than 3 years of historical Bitcoin prices. A plot of the dataset shows that Bitcoin daily highs experienced a period of explosive growth in 2020 (around day 650 in the dataset) before stabilizing in early 2021.

Since the model is using price patterns as input, it is possible that including the period of explosive growth will not accurately reflect the current period of oscillating values we have been in since 2021. An iteration of the model through a range of dates to include in the data set was carried out to identify the optimal number of days to include in the training data for maximum accuracy.

The resulting accuracy plot shows the accuracy of the model is improved by increasing the number of rows above 375 until about 475. Increasing rows beyond 475 causes a slight decrease in accuracy until we reach about 800, when it begins to increase again with increasing number of included rows. Since the largest value of 875 rows gives an accuracy of around 0.73, which is approximately equivalent to the accuracy measured from 450 rows, and 875 is reaching the maximum limit of the dataset, it seems reasonable to select 450 rows as our optimal number moving forward.

Optimizing Simple Moving Average Pairs

Simple Moving Averages (SMA) are common technical indicators that help investors predict long-term trends in asset value. Typically, SMAs generated from 50, 100, and 200 days are used. Since the SMAs are calculated directly from the historical dataset, we can use an iterative process to identify whether specific combinations of SMAs produce different levels of accuracy in the model. An iterative approach similar to the one above was used to test the effect of a range of SMA values on model accuracy.
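
A minimal sketch of how candidate SMA pairs and the crossover flag might be generated with pandas is shown below. The window lists, the toy rising series, and the helper name `add_sma_signal` are assumptions for illustration, not the values or code used in the study.

```python
from itertools import product

import numpy as np
import pandas as pd

def add_sma_signal(df: pd.DataFrame, minor: int, major: int,
                   column: str = "High") -> pd.DataFrame:
    """Add two simple moving averages and a flag for minor > major."""
    out = df.copy()
    out["minor_SMA"] = out[column].rolling(window=minor).mean()
    out["major_SMA"] = out[column].rolling(window=major).mean()
    # comparisons against NaN (not enough history yet) evaluate to False, so the flag is 0
    out["Signal"] = np.where(out["minor_SMA"] > out["major_SMA"], 1, 0)
    return out

# candidate window lengths (assumed values)
minor_windows = [10, 25, 50]
major_windows = [100, 200, 250]
pairs = [(m, M) for m, M in product(minor_windows, major_windows) if m < M]

# toy, steadily rising series so the short average ends above the long one
toy = pd.DataFrame({"High": np.arange(1.0, 301.0)})
with_sma = add_sma_signal(toy, minor=50, major=250)
print(len(pairs), with_sma["Signal"].iloc[-1])
```

Each pair in `pairs` would then be scored repeatedly, giving one cell of an accuracy heatmap.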

The heatmap appears to show an essentially random distribution of scores across the combinations tested, so the standard values of 50 and 250 days will be used for the SMAs. Repeating the grid multiple times, and with a greater number of runs per pair, confirmed the lack of correlation between accuracy and a given pair’s number of days (data not shown).

Methodology

All of the data processing steps and preparation for model fitting were carried out using Python. Each step in the process was modularized as a function, shown in the following code. A complete repository of the project code can be found here.

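#Functions for data processing, feature engineering, and data optimization

import numpy as np
import pandas as pd
from numpy import mean
from sklearn.model_selection import RepeatedKFold, cross_val_score
from xgboost import XGBClassifier

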
def cross_fold_accuracy(X, y):
    '''
    input - 
        X - the features list from training data
        y - the label list from training data
    output - an accuracy score
    '''    
    # define the classifier
    model = XGBClassifier()
    # evaluate the model with cross validation
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    # print performance and return accuracy score
    #print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    accuracy = mean(n_scores)
    
    return accuracy


def row_transpose(index, rows, BTC_data):
    '''
    input - 
        index - the row number of the dataset to start transposition on
        rows - the number of dates (rows) to transpose into columns and add to the dataframe
        BTC_data - the dataframe upon which to act
    output - a dataframe with a single row of the transposed dates and the row values of the final date in the series
    '''
    #transpose the daily High column
    High_row = pd.DataFrame(BTC_data['High'].iloc[index-rows:index])
    High_row_T = High_row.transpose()
    High_row_T = High_row_T.reset_index().drop(columns=['index'])

    #isolate the last row of the window for merging (the same day as the final High value)
    row_to_merge = pd.DataFrame(BTC_data.iloc[index-1])
    row_to_merge = row_to_merge.transpose()
    row_to_merge = row_to_merge.reset_index().drop(columns=['High','index'])

    #merge and change the column names
    merged_row = pd.concat([High_row_T, row_to_merge], axis=1)
    for i in range(rows+1):
        merged_row.rename(columns={ merged_row.columns[i-1]: i }, inplace = True)
        merged_row.columns = [*merged_row.columns[:-1], 'week_year_numeric'] 
    
    return merged_row


def iterate_dataframe(starting_index, rows, BTC_data):
    '''
    input - 
        starting_index = scalar value for starting row in dataframe, 
        rows = number of days (rows) prior to index to transpose, 
        BTC_data = dataset
    output - transposed row values as columns
    '''
    #set the parameters
    #print('iterating over rows...')
    total_rows = len(BTC_data.index)
    row_increase = 0

    #Get first row
    result = row_transpose(starting_index, rows, BTC_data)

    #transpose rows
    for i in range(starting_index,total_rows):
        row_increase = row_increase+1
        inter_results = row_transpose(starting_index+row_increase, rows, BTC_data)
        #DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        result = pd.concat([result, inter_results])
        
    #rename transposed columns    
    result = create_label(starting_index, rows, result)
    result.rename(columns = {'Day_0':'week_numeric'}, inplace = True)
     
    return result


def generate_lists(df):
    '''
    input - Dataset containing features and label identified as 'Label'
    output - 'labels' = label array, 'features' = features array, 'feature_list' = list of feature names
    '''
    #generate the labels and the features
    labels = np.array(df['Label'])
    features = df.drop('Label', axis = 1)

    # Saving feature names as list
    feature_list = list(features.columns)

    # Convert features to array
    features = np.array(features)
    
    return labels, features, feature_list


def create_label(index, rows, df):
    '''
    input - 
        index - the row number to start transposition on
        rows - the number of dates transposed as columns in the dataframe
        df - the dataframe upon which to act
    output - a dataframe with the final daily high value removed, replaced with a label column
    '''
    '''
    #replace row names
    for i in range(rows+1):
        row_name = 'Day_' + str(i)
        df.rename(columns={ df.columns[i-1]: row_name }, inplace = True)
        
    #drop final day and create label
    ultimate_col = 'Day_' + str(rows)
    penultimate_col = 'Day_' + str((rows-1))
    df['Pre_Label'] = df[ultimate_col] - df[penultimate_col]

    #label is 1 only when the final day's high exceeds the previous day's
    df.loc[df['Pre_Label'] <= 0, 'Label'] = 0 
    df.loc[df['Pre_Label'] > 0, 'Label'] = 1 

    #clean up NAs and auxiliary columns
    df = df.drop(columns=['Pre_Label', ultimate_col])
    df = df.dropna()
        
    return df


def SMA(data, period=30, column='High'):
    '''
    input - 
        data - dataset containing historical daily prices
        period - the number of days for which to calculate the moving average
        column - the name of the column containing the value to be averaged
    output - the simple moving average for the indicated period (days)
    '''
    return data[column].rolling(window=period).mean()


def SMA_minor_major(minor_SMA, major_SMA, data):
    '''
    input - 
        minor_SMA - the number of days from which the smaller SMA will be calculated
        major_SMA - the number of days from which the larger SMA will be calculated
        data - dataframe containing the historical price information
    output - the input dataframe with simple moving average columns added for the indicated days
    '''
    #add the minor and major SMA columns
    data['minor_SMA'] = SMA(data, minor_SMA)
    data['major_SMA'] = SMA(data, major_SMA)

    # Get buy and sell signals
    data['Signal'] = np.where(data['minor_SMA'] > data['major_SMA'], 1, 0)
    data['Signal_vol'] = data['minor_SMA'] - data['major_SMA']
    return data


def prep_for_ml(prepared_data, data_SMA_df, total_days=400):
    '''
    input - 
        prepared_data - a dataframe containing the features including the transposed rows as previous day high column values
        data_SMA_df - a dataframe containing the calculated SMA values
        total_days - value indicating the number of rows counted from the end to include in the output dataframe
    output - a merged data set containing all the features and label needed for model training
    '''
    #merge on the date
    merged_data = prepared_data.merge(data_SMA_df, left_on='date_format', right_on='date_format', suffixes=('', '_y'))
    #clean up the columns and NAs
    merged_data.drop(columns=['High', 'date_format', 'day_count_y', 'Open_y', 'Low_y', 'Close_y', 'Adj Close_y', 'Volume_y',
                          'rsi_y', 'stoch_k_y', 'stoch_d_y', 'Fear_N_Greed_y', 'month_numeric_y', 'weekday_numeric_y',
                          'week_month_numeric_y', 'minor_SMA_y', 'major_SMA_y', 'Signal_y'
                         ], inplace=True)
    #drop the na rows
    data_clean = merged_data.dropna()
    #take the most recent days (default = 400)
    data_clean = data_clean.tail(total_days)
    return data_clean

Modelling and Hyperparameter Tuning

The model was trained using a pipeline consisting of a standard scaler to scale the data and the XGBoost classifier. A parameter grid was defined as shown below to test a range of values for each parameter, and the model was fit using a grid search with cross-validation (GridSearchCV).

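from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

#generate training and testing datasets (features and labels come from generate_lists)
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.1)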

#set up pipeline
pipe = Pipeline([
                 ('scl', StandardScaler()),
                 ('m', XGBClassifier())
                ])

#define the parameter ranges as grid
param_grid = {
    "m__n_estimators": range(25,100,25),
    'm__max_depth':range(3,10,2),
    'm__min_child_weight':range(1,6,2),
    'm__subsample':[i/10.0 for i in range(6,10)],
    'm__colsample_bytree':[i/10.0 for i in range(6,10)],
    'm__reg_alpha':[1e-5, 1e-2, 0.1, 1, 100] 
}

#Instantiate the cross validation grid search
gs_cv = GridSearchCV(estimator=pipe,
                     param_grid=param_grid,
                     n_jobs=-1)

#fit the model
gs_cv.fit(train_features, train_labels)

#report accuracy and optimal parameter values
print("Best parameter (CV score=%0.3f):" % gs_cv.best_score_)
print(gs_cv.best_params_)

Best parameter (CV score=0.764):
{'m__colsample_bytree': 0.8, 'm__max_depth': 7, 'm__min_child_weight': 3, 'm__n_estimators': 75, 'm__reg_alpha': 0.01, 'm__subsample': 0.7}
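
To get a sense of the cost of this search, the sketch below counts the candidate combinations in the grid above. The fold count of 5 is an assumption based on GridSearchCV's default when no `cv` argument is passed.

```python
from math import prod

# same ranges as the parameter grid used for the search
param_grid = {
    "m__n_estimators": range(25, 100, 25),
    "m__max_depth": range(3, 10, 2),
    "m__min_child_weight": range(1, 6, 2),
    "m__subsample": [i / 10.0 for i in range(6, 10)],
    "m__colsample_bytree": [i / 10.0 for i in range(6, 10)],
    "m__reg_alpha": [1e-5, 1e-2, 0.1, 1, 100],
}

n_candidates = prod(len(list(v)) for v in param_grid.values())
n_folds = 5  # assumption: the default number of folds
print(n_candidates, n_candidates * n_folds)  # 2880 14400
```

With 2,880 parameter combinations, the search fits the pipeline 14,400 times, which is why `n_jobs=-1` is useful here.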

Results

Optimization of the model hyper-parameters improved its cross-validated accuracy from 0.725 to 0.764. This value suggests that the model can predict whether tomorrow’s high for Bitcoin will exceed today’s with 76.4% accuracy.

Conclusion/Reflection

The goal of this project was to train a classification model to recognize patterns in daily Bitcoin price history that predict an increase in the next daily high value. An XGBoost model was used, demonstrating an accuracy of 0.764 after optimization. Several features of the data set were investigated for their impact on model accuracy, which revealed useful insights regarding the modeling of Bitcoin volatility.

Too much of a good thing. Exploration of the dataset features revealed that the optimal ranges of data to include in the model were somewhat narrow. For example, accuracy box plots indicated that including a surprisingly small range of daily highs, between 3 and 5, produces the best accuracy for price prediction. Increasing the number of daily highs above 5 caused the accuracy to decline progressively until about 30 days, at which point it stabilized and oscillated between roughly 0.68 and 0.7.

Know your history. A similarly small optimal range was found for the number of days of historical data to include. Accuracy increased with the number of days up to 450, beyond which adding more days caused accuracy to decline. This is likely because 450 days of history corresponds to the beginning of 2021, when Bitcoin prices ended a phase of explosive growth, so including data from the growth period prior to that date likely confuses the model. Eventually the accuracy does start to recover after around 700 days, as the data reaches the stage of relative price stability prior to the explosive growth of 2020.

Averages average out. The number of days used for computing the pair of simple moving averages included in the training data does not appear to have an obvious trend or optimal range for the model’s performance. A heatmap showing the model’s accuracy across a range of combinations of days used for the smaller and larger moving averages (referred to in this study as minor and major, respectively) appeared essentially random. Repeating the grid multiple times, and with a greater number of runs per pair, confirmed the lack of correlation or obvious trend between accuracy and a given pair’s number of days (data not shown). While the inclusion of SMAs in the dataset improves accuracy, there does not appear to be a clear choice as to which pairs to use. This could be because all pairs perform equally well or because simple moving averages are better correlated with medium- to long-term price changes.

Improvements

The model’s performance could potentially be improved by using different classification algorithms or by combining multiple individually optimized algorithms as an ensemble model. Additionally, while the output of this model is binary, the underlying prediction is a probability between 0 and 1, which could potentially be calibrated against historical data to give an idea of prediction confidence and of the amplitude of the expected price movement. The insights derived from the exploration of the data and model performance described above could be useful considerations when building other market prediction algorithms.
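
One simple way to start checking calibration is to bin predicted probabilities and compare each bin's mean prediction with the observed rate of increases. The sketch below is a minimal illustration with made-up numbers; the helper name `reliability_table` is hypothetical, and in practice the probabilities would come from a classifier's `predict_proba` output on held-out data.

```python
from typing import List, Tuple

def reliability_table(probs: List[float], labels: List[int],
                      n_bins: int = 5) -> List[Tuple[float, float, int]]:
    """For each non-empty probability bin, return (mean predicted prob, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # a probability of exactly 1.0 is clamped into the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        table.append((mean_p, rate, len(members)))
    return table

# made-up probabilities and outcomes, for illustration only
probs = [0.1, 0.15, 0.45, 0.55, 0.9, 0.95, 1.0]
labels = [0, 0, 1, 0, 1, 1, 1]
table = reliability_table(probs, labels)
for mean_p, rate, n in table:
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={n}")
```

If the predicted and observed columns track each other closely, the raw probabilities can be read as rough confidence levels; large gaps would suggest a calibration step is needed.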

References

BTC historical data API (Free): https://www.yahoofinanceapi.com/
Alternative Me Fear and Greed API (Free): https://api.alternative.me/
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
https://www.machinelearningplus.com/plots/python-boxplot/
https://www.journaldev.com/32984/numpy-matrix-transpose-array
https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
https://medium.com/data-folks-indonesia/simple-moving-average-sma-indicator-using-machine-learning-e8951f61dd9b
