Predicting Ether Prices & Model Selection for Machine Learning
I will be taking a look at Ether pricing data and working through discovery, exploration, cleaning and general analysis, using visualizations, pair plots and correlation matrices to get a sense of the data before moving on to model selection.
I’ve chosen this data set because I have Ether and I’m a big fan of the decentralized information and infrastructure applications Ethereum & Consensys make possible.
A bit about the data gathered: the Ethereum dataset is obtained from Etherscan.
Etherscan is a Block Explorer and Analytics Platform for Ethereum, a decentralized smart contracts platform. Ethereum is an open source, public, blockchain-based distributed computing platform and operating system featuring smart contract (scripting) functionality.
It supports a modified version of Nakamoto consensus via transaction-based state transitions. Ether is a cryptocurrency generated by the Ethereum platform and used to compensate mining nodes for computations performed. Each Ethereum account has an ether balance and ether may be transferred from one account to another.
Original content can be found on Etherscan.io here: https://etherscan.io/charts
Dataset can be found on Kaggle here: https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory
Variables of Interest:
Our Target Y:
eth_etherprice : price of Ether 30 days in the future
Our 9 Variables X: This is a list of the variables I was interested in before I started testing various models.
eth_tx : # of transactions per day
eth_address : Cumulative address growth
eth_marketcap : Market cap in USD
eth_hashrate : hash rate in GH/s (blocks form a chain by referring to the hash, or fingerprint, of the previous block)
eth_difficulty : Difficulty level in TH
eth_blocksize : average block size in bytes (blocks of data: transactions and smart contracts)
eth_gaslimit : Gas limit per day
eth_gasused : total gas used per day
eth_uncles : number of uncles per day
The list of 5 variables I kept after testing various models is below:
eth_tx : # of transactions per day
eth_address : Cumulative address growth
eth_hashrate : hash rate in GH/s
eth_difficulty : Difficulty level in TH
eth_blocksize : average block size in bytes
For this Ethereum dataset, we will set our target Y to “eth_etherprice” (the price of Ether), predicted from several factors uncovered during the data exploration phase.
In essence, I will be building a predictor we can use to estimate the price 30 days in the future.
Before we start, here are a few things to consider:
- Find a model that accurately forecasts Ether prices
- Ensure the information is sufficient to forecast effectively
- Determine whether there are any patterns in our data before building models
- Select a model by choosing the best-performing one for our data
- Check whether our model is effective enough to make predictions!
Who would be most interested in this? Let’s explore the practical uses of our model for an audience of interest.
Now the fun part: Model Comparisons! Here I will explore how I chose my model specification and what alternatives I compared it to. Let’s start with exploratory data analysis and feature selection:
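To make the steps easier to follow, here is a minimal sketch of the loading step, assuming the Kaggle download contains a CSV named ethereum_dataset.csv with a Date(UTC) column (adjust the names to match your copy):

```python
import pandas as pd

# File and column names are assumptions based on the Kaggle download --
# adjust them to match your copy of the dataset.
df = pd.read_csv("ethereum_dataset.csv", parse_dates=["Date(UTC)"])

# Quick look at shape, dtypes and missing values before any modeling.
print(df.shape)
df.info()
print(df.isnull().sum())
```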
Next we will get a sense of the distribution:
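The distribution plots in my notebook were exported as images; here is a sketch of how to reproduce the idea with pandas histograms (the bin count and figure size are just reasonable defaults):

```python
import matplotlib.pyplot as plt

features = ["eth_etherprice", "eth_tx", "eth_address", "eth_marketcap",
            "eth_hashrate", "eth_difficulty", "eth_blocksize",
            "eth_gaslimit", "eth_gasused", "eth_uncles"]

# Histograms give a quick sense of skew and scale for each column.
df[features].hist(bins=50, figsize=(14, 10))
plt.tight_layout()
plt.show()

# Summary statistics for the same columns.
print(df[features].describe())
```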
As well as a sense of the correlations between the various variables (columns of data) we have that contribute most to the price of ether:
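The correlation matrix itself comes from .corr(); here is a hedged sketch of the heatmap version, assuming seaborn is available:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations among the candidate variables and the price.
corr = df[features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of Ether variables")
plt.show()

# Correlations with the price, sorted -- this is the ".corr list" referenced below.
print(corr["eth_etherprice"].sort_values(ascending=False))
```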
These are the 9 variables that I found most impactful to eth_etherprice based on the .corr list as well as other data exploration steps done above: ‘eth_tx’, ‘eth_address’, ‘eth_marketcap’, ‘eth_hashrate’, ‘eth_blocksize’, ‘eth_gasused’, ‘eth_gaslimit’, ‘eth_difficulty’, ‘eth_uncles’
For this reason I will be using these variables in my model preparations further in this challenge, exploring the best combination of them to include in each model.
The variables I ended up keeping, after a few iterations with different models and in consideration of p-values and correlation values, were:
‘eth_address’
‘eth_hashrate’
‘eth_tx’
‘eth_blocksize’
‘eth_difficulty’
‘eth_gasprice’
Before we get started, I will shift our pricing by 30 days so that each row's variables are matched with a date in the future. This allows our model to use existing variables to predict the price 30 days out, so that we can later use it for forecasting.
Doing this ensures our models predict Ether prices 30 days ahead; otherwise the model isn't telling us much, since the data we collect reflects prices in real time. Below I create a new column called ether30.
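In code, the shift looks roughly like this, assuming the data is daily and sorted by date, so shifting by 30 rows corresponds to shifting by 30 days (the Date(UTC) column name is an assumption):

```python
# Sort by date so that shifting by 30 rows corresponds to shifting by 30 days.
df = df.sort_values("Date(UTC)").reset_index(drop=True)

# ether30 is the Ether price 30 days after each row's features.
df["ether30"] = df["eth_etherprice"].shift(-30)

# The last 30 rows have no future price, so drop them (and any other nulls)
# into a fresh dataframe for modeling.
model_df = df.dropna(subset=["ether30"]).copy()
```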
It looks like we’re ready for the next step, now that we’ve accounted for our null values and created a new dataframe for us to work off of. Let’s begin our model exploration!
OLS:
First up is OLS, with a training R-squared of 0.898 and a test R-squared of 0.889.
A score this high could be indicative of overfitting, but my test and training scores are similar, so it looks like I can trust the results. Let's see how the other models perform!
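For reference, here is a sketch of the OLS fit with statsmodels plus a held-out test score. The 80/20 split, the random seed and the exact feature list are assumptions on my part, so they won't necessarily reproduce the 0.898/0.889 numbers to the third decimal:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

features = ["eth_tx", "eth_address", "eth_hashrate",
            "eth_difficulty", "eth_blocksize"]

X = model_df[features]
y = model_df["ether30"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# statsmodels needs an explicit intercept column.
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())            # training R-squared and per-feature p-values

test_pred = ols.predict(sm.add_constant(X_test))
print("Test R-squared:", r2_score(y_test, test_pred))
```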
A bit about model selection:
We want to have a sense of how each model is performing and ultimately this will influence the model we end up choosing to make the predictions. Below is a robust list of KPIs we want to look at when selecting a model. I am particularly interested in the mean absolute percentage error and the root mean squared error, as well as the R-squared.
You will need to think carefully about which metrics are most relevant to you and your data problem, but this is a good list to get you started as you’re sorting and choosing models yourself. The below list pertains to my OLS model.
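Here is a small helper showing how those KPIs can be computed for any model's predictions; I compute MAPE by hand to avoid depending on a particular scikit-learn version:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Print the KPIs I focus on: MAE, RMSE, MAPE and R-squared."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    print(f"MAE:  {mae:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAPE: {mape:.2f}%")
    print(f"R-squared: {r2:.4f}")

regression_report(y_test, test_pred)
```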
Next up is the Ridge Regression model:
Great! Let’s see how that performed against our OLS model.
Ridge produced a score of 0.897 for the training set and a score of 0.888 for the test set.
Next is Lasso, which got a score of 0.898 for the training set and 0.889 for the test set, just about the same as OLS.
ElasticNet got a score of 0.898 for the training set and 0.889 for the test set, in line with OLS and Lasso.
With Linear Regression we see the same scores as OLS, Lasso and ElasticNet: 0.898 for the training set and 0.889 for the test set.
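Ridge, Lasso, ElasticNet and plain Linear Regression can all be fit and scored the same way; here is a sketch of the comparison loop, with default regularization strengths assumed rather than the exact alphas I used:

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # .score() returns R-squared for sklearn regressors.
    print(f"{name}: train R-squared = {model.score(X_train, y_train):.3f}, "
          f"test R-squared = {model.score(X_test, y_test):.3f}")
```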
Random Forest: for fun, why not?
I had to use cross-validation here to get an equitable score.
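A sketch of the Random Forest with cross-validation; the number of trees and the five folds are assumptions, not necessarily the settings from my notebook:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Cross-validated R-squared gives a fairer score than a single split
# for a high-variance model like a forest.
cv_scores = cross_val_score(rf, X, y, cv=5, scoring="r2")
print("CV R-squared scores:", cv_scores)
print("Mean CV R-squared:", cv_scores.mean())
```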
And finally, KNN:
A score of 0.994.
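A sketch of the distance-weighted KNN regressor. Scaling the features first matters for KNN because the raw columns live on wildly different scales; k=5 and standard scaling are my assumptions here:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Distance weighting lets closer neighbors count for more than distant ones.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5, weights="distance"),
)

knn.fit(X_train, y_train)
print("Train R-squared:", knn.score(X_train, y_train))
print("Test R-squared:", knn.score(X_test, y_test))
```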
Remember the R-squared values we discussed above? This is where they all come together, and we can look at every model from a bird's-eye view. OLS, Lasso, ElasticNet and Linear Regression all had the same scores, so I grouped them together; Ridge showed a slight variation. Random Forest performed better than all of the linear regressions, but by far the best performer was the weighted KNN. Feel free to take a look at the train and test R-squared values below.
Now, for the purposes of this project I ended up choosing K Nearest Neighbors, but there are a few things to keep in mind with KNN: it is sensitive to feature scaling, it cannot predict values outside the range of prices it has already seen (a real concern for a trending price series), and predictions get slower as the training set grows.
I hope you enjoyed this walkthrough of a model selection process for making predictions with machine learning. The images here were taken from my Jupyter notebook and the slide deck I made to go along with it.