How to Properly Download and Validate Cryptocurrency Price & Volume Data

Dmitriy
5 min read · Nov 29, 2021


In this post, I will explain how to properly obtain cryptocurrency price and volume data for use in trading and statistical modeling.

Downloading the data

Let’s start by grabbing 200k rows of 1-minute candles from Coinbase Pro. We’ll use ccxt for the download, with a half-second sleep inside the loop to avoid hitting rate limits, and we’ll monitor progress with a few tricks from the standard library.
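A minimal sketch of such a loop might look like the following; the ETH/USD symbol, the 200k-minute start offset, and the carriage-return progress printout are assumptions, not the only way to do it:

import time
import ccxt  # pip install ccxt

exchange = ccxt.coinbasepro()
symbol = 'ETH/USD'   # assumption: swap in whatever pair you need
timeframe = '1m'
limit = 300          # Coinbase Pro returns at most 300 candles per request
since = exchange.milliseconds() - 200_000 * 60 * 1000  # start ~200k minutes ago

data = []
while sum(len(chunk) for chunk in data) < 200_000:
    chunk = exchange.fetch_ohlcv(symbol, timeframe=timeframe, since=since, limit=limit)
    if not chunk:
        break
    data.append(chunk)
    since = chunk[-1][0] + 60_000  # next request starts right after the last candle
    print(f'downloaded {sum(len(c) for c in data):,} rows', end='\r')
    time.sleep(0.5)  # stay under the rate limit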

Now that we have a workable chunk of 1-minute bars, we can proceed to validate it.

Validating the data

df = pd.concat([pd.DataFrame(d) for d in data])
df = df[:-1] # omit the last row since it's not a complete candle
assert df.shape[0] >= 200000  # at least 200k rows
assert df.shape[1] == 6       # timestamp plus OHLCV

Here we use simple assert statements to validate the overall shape of the data: at least 200k rows and exactly six columns, including the date. An assert statement throws an AssertionError if its condition is not met, and does nothing if it is.

Next, we assign column names, convert the timestamp to a readable format, and drop duplicates.

df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
df = df.sort_values(by='Date')  # timestamps are still raw milliseconds at this point
df = df.drop_duplicates(subset='Date').reset_index(drop=True)
df['Date'] = pd.to_datetime(df['Date'], unit='ms')  # epoch milliseconds -> datetime

Now here’s one of the biggest mistakes people make when working with financial data: not validating the continuity of the time intervals. Here’s how you do it:
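In code, it's essentially a one-liner (assuming the Date column we just created):

# time difference between consecutive rows, counted by how often each gap occurs
df['Date'].diff().value_counts()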

This calculates the difference between consecutive rows and then counts how often each time difference occurs. We can see that most of the data is indeed at the 1-minute interval we requested. We can also see that there are 4 instances of 2-minute gaps and one gap of 58 minutes. Let’s take a closer look at that.
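To see exactly which rows sit after a gap, we can filter on the same diff; the 1-minute threshold is the only assumption here:

# rows whose gap to the previous candle is larger than one minute
gaps = df[df['Date'].diff() > pd.Timedelta(minutes=1)]
gaps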

The above dataframe shows us the dates and times that are missing data immediately before them. For some reason, Coinbase Pro decided to give us these gaps. What do we do now? We grind it till we find it: we go to other exchanges, grab the data from there, and insert it into the original dataframe.
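A sketch of what that patching could look like, assuming the ETH/USDT pair on Binance, a hypothetical fill_gaps_from_binance helper, and one fetch per missing minute (fine for a handful of gaps, too slow for thousands):

import time
import ccxt
import pandas as pd

def fill_gaps_from_binance(df, symbol='ETH/USDT'):
    """Find minutes missing from df and patch them with Binance candles."""
    binance = ccxt.binance()

    # every minute we should have, minus the minutes we actually have
    expected = pd.date_range(df['Date'].min(), df['Date'].max(), freq='1min')
    missing = expected.difference(df['Date'])

    patches = []
    for ts in missing:
        since = int(ts.timestamp() * 1000)  # ms since epoch, as ccxt expects
        patches.extend(binance.fetch_ohlcv(symbol, timeframe='1m', since=since, limit=1))
        time.sleep(0.5)  # stay under the rate limit

    patch_df = pd.DataFrame(patches, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])
    patch_df['Date'] = pd.to_datetime(patch_df['Date'], unit='ms')

    merged = pd.concat([df, patch_df]).sort_values('Date')
    return merged.drop_duplicates(subset='Date').reset_index(drop=True)

df = fill_gaps_from_binance(df)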

In the above code, I made a function to find the timestamps of all the missing dates, fetch them from Binance, and insert the missing pieces into the original dataframe. Keep in mind that only popular tickers will have data available from more than one exchange.

A sanity check to make sure the function works:
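For example, re-running the earlier continuity check (a sketch):

# after patching, every gap between consecutive candles should be exactly one minute
diffs = df['Date'].diff().dropna()
assert (diffs == pd.Timedelta(minutes=1)).all()
diffs.value_counts()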

Perfect, now the interval between every pair of consecutive rows is 1 minute long!

There are a few more things to do before moving on to data exploration. The first is to check the datatypes with df.dtypes; in our case they should all be numeric, and if not, we’d have to find and destroy the non-numbers. The second is to look at some basic statistics for all the columns with df.describe(), which should look something like the following:
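Both checks fit in a few lines; the pd.to_numeric coercion for stray strings is an assumption about how you might handle non-numbers if they show up:

print(df.dtypes)  # everything except Date should come back as a numeric dtype

# if anything slipped through as an object column, coerce it and make sure nothing was lost
numeric_cols = ['Open', 'High', 'Low', 'Close', 'Volume']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
assert df[numeric_cols].notna().all().all(), 'non-numeric values found'

df.describe()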

Here we’re just looking for suspiciously low or extremely high numbers in the min, max, and standard deviation, and everything seems fine except the Volume column, which looks very suspicious. The 1-minute volume for ethereum ranges from 0.3 ETH to 6756 ETH per minute; 6756 ETH is the equivalent of roughly $20 million traded in a single minute. Perhaps someone got a little trigger happy with the market orders. Then again, we do know that a few billion dollars in volume per day is not uncommon for ETH, so that sort of volume per minute is indeed possible. Let’s explore the data some more.

We already checked for continuity in the time column. Continuity in price can be validated visually.
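A quick line plot is enough; matplotlib via pandas’ .plot is an assumption here, and any plotting library will do:

import matplotlib.pyplot as plt

df.plot(x='Date', y='Close', figsize=(12, 4), title='ETH/USD close, 1-minute bars')
plt.show()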

Plotting the price data tells us that the prices are indeed continuous and there aren’t any gaps.
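The same one-liner, pointed at the Volume column:

df.plot(x='Date', y='Volume', figsize=(12, 4), title='ETH/USD 1-minute volume')
plt.show()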

Plotting the volume helps us to see exactly when the Volume anomalies happened. For further validation, we could plot these against data from other exchanges to ensure the anomalies happen in the same spots on the chart.
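One way to do that, as a sketch: take the largest spike, grab the same window from Binance, and overlay the two. The ETH/USDT symbol, the 2-hour window, and treating the max as "the" anomaly are all assumptions:

# locate the biggest volume spike and a window around it
spike_time = df.loc[df['Volume'].idxmax(), 'Date']
window = df[(df['Date'] >= spike_time - pd.Timedelta(hours=2)) &
            (df['Date'] <= spike_time + pd.Timedelta(hours=2))]

# pull the same minutes from Binance for comparison
binance = ccxt.binance()
since = int(window['Date'].min().timestamp() * 1000)
ref = binance.fetch_ohlcv('ETH/USDT', timeframe='1m', since=since, limit=len(window))
ref = pd.DataFrame(ref, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])
ref['Date'] = pd.to_datetime(ref['Date'], unit='ms')

ax = window.plot(x='Date', y='Volume', figsize=(12, 4), label='Coinbase Pro')
ref.plot(x='Date', y='Volume', ax=ax, label='Binance')
plt.show()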

As you can see, getting good data can be a challenge. In this post, I’ve shown that even popular APIs such as Coinbase Pro’s can hand you incomplete data. Getting proper data is the first step in any downstream modeling effort, yet it’s often overlooked.

If you’d like to know how to use this data to start creating your own automated trading system, check out some of my other posts.
