Cryptocurrency data exploration in Jupyter

Nick Wienholt (quantivfy.eth)
Jul 20, 2022



A lot has been written about the strengths of Jupyter. Perhaps the strongest endorsement of its robustness and quality came with the 2017 Nobel Prize in Physics, awarded for the LIGO and Virgo observation of gravitational waves: the collaboration performed and published its analysis in Jupyter notebooks. Earning a Nobel Prize in Physics is orders of magnitude harder than scraping a bit of alpha out of the stock market or cryptos, and there is little room left for debate over Jupyter's superiority as a product for data analysis and modelling.

The main competitor to Jupyter is Excel, and despite being a dreadful tool for most purposes, Excel still has some niche uses. I personally use Excel for the following use cases:

  • for simple data analysis (pivot tables, LOOKUPs) that needs to be shared with a non-technical stakeholder. Sharing a Confluence or GitHub link to a Jupyter notebook with a business analyst just leads to a blank stare in most cases. Excel is an acceptable form of communication in most corporate environments, and is fully self-contained.
  • when results look wrong in Jupyter, I like to do a full re-implementation in Excel to see if I get the same results. Excel functions are different enough from Python to force a fully independent re-implementation; another procedural language like C# is too similar, and would just result in the same errors being coded in (if there are any). Once the analysis has been validated, the Excel file is deleted.

Outside of these use cases, Excel is total crap. It's poison. It's fundamentally flawed on many levels, and should never be used for serious data analysis, especially in trading scenarios. Being sent a screenshot of an Excel chart to “prove” the robustness or profitability of some strategy has the intellectual weight and veracity of a crayon drawing. There are dozens of documented cases of Excel “results” leading to totally wrong conclusions because of the ease of making these mistakes and the difficulty of identifying and debugging them; the most famous case is probably the Reinhart-Rogoff spreadsheet error that underpinned the inane austerity applied to southern Europe during the PIIGS crisis. Even during the recent COVID pandemic, Excel managed to do its usual excellent job of data destruction, with Public Health England losing thousands of positive test results to an ageing .xls row limit.

Getting started with Jupyter is very easy, and totally free: simply download Anaconda and Visual Studio Code. The Spyder IDE that comes with the Anaconda distribution is fine if that is your preference, and the upcoming Fleet release from JetBrains may prove interesting, but the advances in VS Code and its compatibility with GitHub Codespaces (which will be covered in a future post) make it absolutely the preferred IDE for me in mid-2022. There are dozens of great YouTube videos and articles on getting started with Jupyter.

Jupyter in Practice — Examining Bull and Bear Market Price Behaviours of BTC and Altcoins

Casual observation of daily price movements in crypto markets suggests that altcoins are more volatile than Bitcoin and are, in the short term, essentially a leveraged play on it. This leads to the intuition that a market-neutral strategy could be developed: short BTC and long a basket of altcoins in bull market conditions, and vice versa in bear market conditions. Even a very simple strategy idea like this leaves a lot to unpack:

  • while identifying bull and bear markets in retrospect is trivial, identifying them in real-time is much more challenging.
  • how is a basket of altcoins chosen and rebalanced? Obviously, choosing the altcoins with the largest market caps today introduces incredible hindsight bias into any analysis, as these coins have a large market cap precisely because they have performed so well. A simple and intuitive criterion needs to be used to select an appropriate basket at a historical point in time.

Getting historical market capitalisation data for altcoins is hard. The number of coins in circulation for a particular cryptocurrency varies according to the rates of mining and burning for that currency, and is hard to determine historically at scale. Since market cap is coin price multiplied by circulating supply, the inability to determine historical supply makes calculating market cap infeasible when the only data available is coin price. CoinMarketCap.com has an API that can be used to retrieve historical market cap, but data older than 12 months is only available in their Enterprise tier, which has an “ask us” price.

A suitable data source exists on Kaggle, and this is serendipitous in the sense that it was created five years ago: no coins have been added to or removed from the dataset to reflect the most recent bull-and-bear cycle from early 2021 to mid-2022, so the coin selection is untainted by that cycle. This dataset was downloaded from Kaggle and all the data was read into a single Pandas DataFrame.

import os
import numpy as np
import pandas as pd

allCryptos = None
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".csv") and file.startswith("coin_"):
            # Concatenate every coin_*.csv into one combined DataFrame
            path = os.path.join(root, file)
            if allCryptos is None:
                allCryptos = pd.read_csv(path)
            else:
                allCryptos = pd.concat([allCryptos, pd.read_csv(path)])
allCryptos['Date'] = allCryptos['Date'].astype('datetime64[ns]')

For all the cryptos in the download, the market cap as of the final observation of 2020 is retrieved and sorted:

allCryptos[allCryptos.Date == np.datetime64('2020-12-31T23:59:59.000000000')].sort_values('Marketcap')
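The notebook selects the basket from this sorted output by inspection; a hypothetical equivalent in code (assuming the Kaggle dataset's Symbol column) would be:

snapshot = allCryptos[allCryptos.Date == np.datetime64('2020-12-31T23:59:59.000000000')]
# The six smallest coins by market cap at the end of 2020
basket = snapshot.sort_values('Marketcap').head(6)['Symbol'].tolist()
print(basket)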

As these values represent a relatively sensible set of coins untainted by the most recent bull market, the six coins with the smallest market cap in this sample are chosen to form a basket, and this basket is compared against Bitcoin in both the bull phase (defined as 1 Jan 2021 to BTC's peak price on 9 Nov 2021) and the current bear phase (post-9 Nov 2021 to the current date of 13 July 2022). Price data for each of the cryptos was downloaded as CSV files from Yahoo Finance and read into Pandas DataFrames.

aave = pd.read_csv('AAVE-USD.csv', parse_dates=True, index_col=[0])
atom = pd.read_csv('ATOM-USD.csv', parse_dates=True, index_col=[0])
btc = pd.read_csv('BTC-USD.csv', parse_dates=True, index_col=[0])
cro = pd.read_csv('CRO-USD.csv', parse_dates=True, index_col=[0])
doge = pd.read_csv('DOGE-USD.csv', parse_dates=True, index_col=[0])
miota = pd.read_csv('MIOTA-USD.csv', parse_dates=True, index_col=[0])
sol = pd.read_csv('SOL-USD.csv', parse_dates=True, index_col=[0])

A few quick checks were performed on this data once it was loaded:

from pandas.api.types import is_numeric_dtype

def PrintNanAndTypeSummary(name, df, throwOnError=True):
    # Check 1: missing values (empty fields become NaN)
    res = df.isnull().values.any()
    print("{} contains NaN values:\t{}".format(name, res))
    if res and throwOnError:
        raise ValueError("DataFrame contains NaN value")
    # Check 2: non-numeric (object) columns caused by stray strings
    foundObjectCol = False
    for dtype in df.dtypes:
        if not is_numeric_dtype(dtype):
            foundObjectCol = True
    print("{} contains object columns:\t{}".format(name, foundObjectCol))
    if foundObjectCol and throwOnError:
        raise ValueError("DataFrame contains object columns")

PrintNanAndTypeSummary('AAVE', aave)
PrintNanAndTypeSummary('ATOM', atom)
PrintNanAndTypeSummary('BTC', btc)
PrintNanAndTypeSummary('CRO', cro)
PrintNanAndTypeSummary('DOGE', doge)
PrintNanAndTypeSummary('MIOTA', miota)
PrintNanAndTypeSummary('SOL', sol)

There are two interesting features of this code. The first is the pair of checks for invalid data. When Pandas reads a CSV, each column's type is inferred, so if a column contains mostly numeric values but a few stray strings such as “-” (which is not in Pandas' default list of NA markers), the read will succeed and there will be no NaN values, but the column type will not be numeric. If a column value is missing entirely, as expressed by no characters between two commas, the column type will stay numeric but a NaN value will be stored in the DataFrame. Both of these error conditions need to be checked for independently.
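A minimal sketch of the two failure modes, using an in-memory CSV rather than the actual price files:

import io
import pandas as pd

# A stray "-" string: the read succeeds with no NaNs, but the
# column is inferred as object rather than a numeric dtype
objectCol = pd.read_csv(io.StringIO("Close\n1.0\n-\n2.0"))
print(objectCol['Close'].dtype)         # object
print(objectCol.isnull().values.any())  # False

# An empty field between commas: the dtype stays float64, but a
# NaN value is stored in the DataFrame
nanCol = pd.read_csv(io.StringIO("Date,Close\n1,1.0\n2,\n3,2.0"))
print(nanCol['Close'].dtype)            # float64
print(nanCol.isnull().values.any())     # True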

The second interesting point to note is the use of exceptions when an error occurs. This is important for when the data fed into the notebook is updated months or years later, the ‘Run All’ option is selected in Jupyter, and the final graphs and cells of the notebook are the only output that is analysed. It's very easy to miss a print statement a few screens up that is reporting an error. With an exception raised, all subsequent execution is halted until the error in the data is resolved.

It’s also worth checking that every day is present in the dataset, and only appears once. While handling multiple entries for a day is trivial, missing or repeated data could indicate more serious issues and should be identified and understood.

def CheckDateRange(name, df, throwOnError=True):
    # Every calendar day the index should span
    everyDay = pd.Series(data=pd.date_range(start=df.index.min(),
                                            end=df.index.max(), freq='D'))
    mask = everyDay.isin(df.index.values)
    hasMissingDays = (~mask).any()

    # If every day is present exactly once, the index length equals
    # the number of days spanned
    indexLength = len(df.index)
    indexLengthInDays = df.index.max() - df.index.min()
    indexLengthValid = (indexLengthInDays.days + 1) == indexLength
    everyDayPresentOnce = indexLengthValid and not hasMissingDays
    print("Every day is present for {}:\t{}".format(name, everyDayPresentOnce))
    if throwOnError and not everyDayPresentOnce:
        raise ValueError("DataFrame contains missing or repeated dates")

CheckDateRange('AAVE', aave)
CheckDateRange('ATOM', atom)
CheckDateRange('BTC', btc)
CheckDateRange('CRO', cro)
CheckDateRange('DOGE', doge)
CheckDateRange('MIOTA', miota)
CheckDateRange('SOL', sol)

The full version of the analysis, in the notebook at https://github.com/quantivfy/PublicNotebooks/blob/main/Crypto2/Crypto.ipynb, also uses the Pandas describe function to check that each DataFrame has data within the expected range in each column.
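That range check is not reproduced here, but a minimal sketch of the idea:

# Eyeball per-column ranges (e.g. prices should be positive and
# High should never fall below Low)
for name, df in [('AAVE', aave), ('ATOM', atom), ('BTC', btc), ('CRO', cro),
                 ('DOGE', doge), ('MIOTA', miota), ('SOL', sol)]:
    print(name)
    print(df.describe().loc[['min', 'max']])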

Conducting the Analysis

Now that the relevant data has been identified, loaded and validated, the analysis can be completed. The bull and bear market phases will be analysed independently. Crypto coins trade at vastly different USD prices per coin, so to allow comparison on a sensible scale, a new column is calculated with the coin's value scaled to 1 on the first day of each market period. For the altcoins, this scaled column is then averaged across the basket to create an index.

def CreateBullMarketDf(df):
    dfBull = df[(df.index >= '2021-01-01') & (df.index <= '2021-11-09')].copy()
    # Scale so every coin starts the period at a value of 1
    dfBull['RelativeClose'] = dfBull.Close / dfBull.Close.iloc[0]
    return dfBull

btcBull = CreateBullMarketDf(btc)  # repeat for all coins
...
# Equal-weighted index across the six-coin basket
altCoinIndex = ((aaveBull.RelativeClose + atomBull.RelativeClose +
                 croBull.RelativeClose + dogeBull.RelativeClose +
                 miotaBull.RelativeClose + solBull.RelativeClose) /
                6).to_frame('RelativeClose')

Resampling these values to a weekly average and plotting them demonstrates a massive outperformance for the altcoins.
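The plots themselves are in the notebook; a minimal sketch of the resample-and-plot step, using the DataFrames built above:

import matplotlib.pyplot as plt

# Weekly means smooth out daily noise; both series are scaled to
# start the period at 1
weekly = pd.DataFrame({
    'BTC': btcBull.RelativeClose.resample('W').mean(),
    'Altcoin basket': altCoinIndex.RelativeClose.resample('W').mean(),
})
weekly.plot(title='Bull phase: BTC vs altcoin basket (1 Jan 2021 = 1)')
plt.show()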

Looking at each altcoin individually over this period shows that two coins delivered massive outperformance.

Repeating the analysis with these two coins removed produces a similar result, but at a different scale.

For the bear market, BTC did hold up better, but the relative performance difference was much smaller.

The correlation of the altcoins during the downturn was relatively high.
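A sketch of that correlation check, assuming bear-phase DataFrames (aaveBear and so on) built analogously to the bull-phase ones:

# Pairwise correlation of daily returns across the basket in the
# bear phase
bearCloses = pd.DataFrame({
    'AAVE': aaveBear.Close, 'ATOM': atomBear.Close, 'CRO': croBear.Close,
    'DOGE': dogeBear.Close, 'MIOTA': miotaBear.Close, 'SOL': solBear.Close,
})
print(bearCloses.pct_change().corr())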

So the initial hunch is confirmed: Bitcoin held up better in the most recent bear market, and altcoins dominated incredibly in the bull market. The next step in the analysis is to evaluate various indicators for detecting market phase transitions, and to look at the lag of these indicators; this will form the topic of the next post.
