How to Collect Data for Cryptocurrency Algorithmic Trading, and What to Collect?

Zihan Guo
Data Alchemist
Aug 9, 2019 · 7 min read

When you Google “how to get crypto data”, most of the articles you find focus on selling data services. For example, Kaiko offers a €699-a-month subscription with at most 3 months of historical data.

However, what if you want to build a crypto trading bot from scratch? How do you collect, clean, and store your data? How can you make sure that when your machine crashes or your internet goes down, your data storage pipeline is not interrupted?

This article focuses on the following objectives:

  • How to collect data to start an algorithmic trading bot on cryptocurrency: the general architecture of building an MVP (minimum viable product) data collection and storage pipeline.
  • What data to collect?
  • Some safe practices in the data collection pipeline.

This article doesn’t:

  • share code on the implementation details. By now, there are hundreds if not thousands of cryptocurrency exchanges that you can hook up to and collect data from. The objective is to follow the doctrine: “Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime.” i.e. I am lazy…
  • offer a tutorial on registering an AWS account or launching an EC2 instance. The only reason I am writing this article is that I haven’t found another one that talks about how to collect crypto data from the ground up. Besides, tutorials on that setup are easily found on Google.

Pre-Architecture Setups

Unless you are blessed with an existing data center at hand, cloud computing offers a cheaper and more immediate solution than acquiring and managing servers yourself. For your convenience, here is a link on how to create and launch a new EC2 instance on AWS.

My own drawing, limited copyright

Once you have the EC2 instance running, use Git to version-control your codebase. If you don’t have Git experience, it is critical to understand why version control matters and how to do it properly; use this link as a reference point, as it gives a comprehensive view of Git.

If you are a vim or emacs master, you can skip this section, because you can develop code right on the EC2 instance. However, if you prefer code editors like PyCharm, Visual Studio, or Sublime, you might want to set up rsync to automatically transfer local copies to the remote EC2 instance. Alternatively, you can set up SFTP or FTPS in your editor; I am including PyCharm’s setup here, but other editors’ setups can easily be found as well.

What data to collect?

Now that the basics are covered, how do you actually get data? And even before that, what kinds of data do you want to collect?

There are four essential types of financial data: [1]

  • fundamental data
  • market data
  • analytics
  • alternative data

Cryptocurrency doesn’t have fundamental data in the traditional sense: the whole idea of decentralization is to not have a centralized entity, so it is difficult to assess the intrinsic value a “company” holds.

However, there are still fundamental metrics available that might tell us something about its intrinsic value. We will explain this later.

I really like Buff Pelz Dormeier’s explanation of the difference between fundamental and technical analysis, so I am going to share it here:

An analogy can be drawn between a fundamentalist and a technician who both examine a high-performance automobile. The fundamentalist looks under the hood, kicks the tires, and inspects the frame — the physical aspect of the car. The technician does not look under the hood. Rather, he evaluates how the car performs under a set of conditions, such as turning, accelerating, and shifting. The fundamentalist examining the engine notices a potential flaw in the engine design. Similarly, when the gauges exceed the threshold of the expected parameters, the technician is led to the same conclusion as the fundamentalist, but without a physical inspection of the engine.

— Investing with Volume Analysis by Buff Pelz Dormeier, CMT [2]

So the fundamental metrics on a cryptocurrency must “look under the hood” and inspect the internal mechanism of the “automobile”.

Here are some suggested ways to research and collect that data: read the cryptocurrency’s original whitepaper to understand what functionality it provides, then research the existing competitors on the market.

Credit to CoinDesk

However, in the end, we want to do everything in code. Reading and researching doesn’t tell us how to do this programmatically. What are other ways we can get fundamental metrics?

One way to measure the value a certain cryptocurrency provides is to look at its developer community. How many active commits have been pushed to master? How many pull requests were created? What is the speed of development? How about the social community: how many active conversations are going on across Telegram, Discord, Reddit, Twitter, Facebook groups, LinkedIn groups, etc.? What about the network transaction volume and market capitalization? If no one is using it through on-chain transactions, where does the transaction value go? How about the gas fee? All of these questions can be turned into quantifiable metrics and recorded into a database for later analysis. I have included CoinDesk’s spider chart, which shows ETH’s and LTC’s overall fundamental metrics compared with BTC.
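As a concrete sketch of turning developer activity into a number, here is a minimal commit-velocity metric. The sample payload mimics the shape of GitHub’s commits API response (in production you would fetch the real thing over HTTP); the repository and dates are made up for illustration.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical sample shaped like GitHub's /repos/{owner}/{repo}/commits
# response; in production you would fetch it with urllib.request from
# e.g. https://api.github.com/repos/<owner>/<repo>/commits
SAMPLE_COMMITS = json.loads("""
[
  {"commit": {"author": {"date": "2019-08-01T12:00:00Z"}}},
  {"commit": {"author": {"date": "2019-07-20T09:30:00Z"}}},
  {"commit": {"author": {"date": "2019-05-02T18:45:00Z"}}}
]
""")

def commit_velocity(commits, now, days=30):
    """Count commits whose author date falls within the last `days` days."""
    cutoff = now - timedelta(days=days)
    count = 0
    for c in commits:
        ts = datetime.strptime(
            c["commit"]["author"]["date"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if ts >= cutoff:
            count += 1
    return count

now = datetime(2019, 8, 9, tzinfo=timezone.utc)
print(commit_velocity(SAMPLE_COMMITS, now))  # 2 commits in the last 30 days
```

Record this number daily per repository and you get a time series of development speed, ready to join against price data.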

Market data usually consists of prices, volumes, interest rates, fees, quotes, trades, and announcements (for system maintenance or newly added trading assets). Collecting this data requires consistent, steady work, as it involves reading and exploring the documentation on each exchange’s API website.
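Because every exchange names its fields differently, a normalization layer pays for itself early. The sketch below maps two hypothetical trade payloads onto one flat schema; the key names are assumptions, so match them against each exchange’s actual API documentation.

```python
from datetime import datetime, timezone

def normalize_trade(raw, exchange):
    """Map exchange-specific trade payloads onto one flat schema so every
    venue lands in the same table. Field names below are illustrative --
    check each exchange's API docs for the real keys."""
    if exchange == "binance":  # hypothetical payload shape
        return {
            "exchange": exchange,
            "symbol": raw["s"],
            "price": float(raw["p"]),
            "qty": float(raw["q"]),
            "ts": datetime.fromtimestamp(raw["T"] / 1000, tz=timezone.utc),
        }
    if exchange == "coinbase":  # hypothetical payload shape
        return {
            "exchange": exchange,
            "symbol": raw["product_id"],
            "price": float(raw["price"]),
            "qty": float(raw["size"]),
            "ts": datetime.strptime(
                raw["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
            ).replace(tzinfo=timezone.utc),
        }
    raise ValueError(f"unknown exchange: {exchange}")

row = normalize_trade(
    {"s": "BTCUSDT", "p": "11800.5", "q": "0.25", "T": 1565337600000},
    "binance",
)
print(row["symbol"], row["price"], row["qty"])
```

Adding a new exchange then means writing one more branch (or mapping table) instead of touching every downstream consumer.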

Analytics in crypto is abundant: there are numerous signal-providing services online that market their signal as the golden one. One use of this analytics data is as a confirmation signal in a volatile market, similar to a voting mechanism. Alternatively, we can use it to trade against the trend: if the majority of the signals are predicting LONG, then who is selling? If no one is selling, that sounds very tulip to me.
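The voting mechanism mentioned above can be sketched in a few lines. The signal labels and threshold are assumptions for illustration, not any particular provider’s format.

```python
from collections import Counter

def vote(signals, threshold=0.6):
    """Combine third-party signals ('LONG'/'SHORT'/'FLAT') by majority vote;
    return 'FLAT' unless one side clears `threshold` of the votes.
    A contrarian strategy would simply invert the winning side."""
    if not signals:
        return "FLAT"
    side, n = Counter(signals).most_common(1)[0]
    return side if n / len(signals) >= threshold else "FLAT"

print(vote(["LONG", "LONG", "LONG", "SHORT", "LONG"]))  # LONG (4/5 = 0.8)
print(vote(["LONG", "SHORT", "LONG", "SHORT"]))         # FLAT (no consensus)
```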

Qokka AI: https://cryptoqokka.com/

Alternative data is usually the differentiating factor between a small and a more mature trading group. The first three kinds of data are fairly easy to collect if you put in enough effort. The last kind, however, is by its name (alternative) vast. Many hypotheses can be formed:

Social media sentiment (Twitter, Reddit, etc.). Google search results. Friends of mine have built a WeChat bot to collect sentiment from WeChat groups. If you don’t want to build a sentiment-score pipeline from the ground up, Qokka AI “collects and analyzes social data at large scale with machine learning and natural language processing” [4] and provides sentiment scores for data from various social media such as Twitter and Reddit, and even from Telegram and Discord. Whale alerts on large on-chain transactions, side-news from friends that a new token will soon launch, and the stories go on and on, endlessly.
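Once per-source sentiment scores exist (whether from your own pipeline or a provider like Qokka AI), you still have to combine them. A minimal sketch, assuming each source reports a mean score in [-1, 1] plus a message count; the platform names and numbers are made up.

```python
def aggregate_sentiment(sources):
    """Volume-weighted average of per-source sentiment scores in [-1, 1].
    `sources` maps a platform name to (mean_score, message_count); the
    platform names and scores here are hypothetical."""
    total = sum(n for _, n in sources.values())
    if total == 0:
        return 0.0
    return sum(score * n for score, n in sources.values()) / total

sources = {"twitter": (0.2, 500), "reddit": (-0.4, 300), "telegram": (0.6, 200)}
print(round(aggregate_sentiment(sources), 3))  # 0.1
```

Weighting by message volume keeps one noisy low-traffic channel from dominating the blended score.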

Back to the architecture: the EC2 instance is launched. Now what? For market data collection, you probably want to use WebSocket instead of REST for the advantages listed here, but the key takeaway is that you send fewer requests and receive more (and faster) streaming data.
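A minimal sketch of the WebSocket approach, using the third-party `websockets` package. The URL and the tick’s JSON key names are placeholders, so substitute whatever your exchange’s stream documentation specifies.

```python
import asyncio
import json

def parse_tick(message):
    """Pull (symbol, price) out of a JSON tick; the key names are an
    assumption -- match them to your exchange's stream documentation."""
    tick = json.loads(message)
    return tick["symbol"], float(tick["price"])

async def stream_trades(url):
    """Keep one long-lived socket open instead of polling REST endpoints.
    Requires the third-party `websockets` package (pip install websockets);
    imported inside the function so the rest of the file stays offline."""
    import websockets
    async with websockets.connect(url) as ws:
        async for message in ws:
            symbol, price = parse_tick(message)
            print(symbol, price)

# To run against a real feed (the URL below is a placeholder):
# asyncio.run(stream_trades("wss://example-exchange.com/ws/trades"))
print(parse_tick('{"symbol": "BTC-USD", "price": "11800.5"}'))
```

Keeping the parsing separate from the connection logic also makes the message handling easy to unit-test without a live socket.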

Now you have the data sitting in your process memory; what do you do with it? We want to periodically transfer this data to an S3 bucket, ideally without occupying the instance’s disk space. If you are coding in Python, consider using the io module to dump data directly from memory to the S3 bucket.
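A sketch of the memory-to-S3 step: serialize records into an in-memory gzip buffer with `io.BytesIO`, then hand the buffer to boto3’s `upload_fileobj`. The bucket and key names are placeholders, and the upload function is not called here.

```python
import gzip
import io
import json

def serialize_records(records):
    """Gzip a list of dicts into an in-memory buffer (newline-delimited
    JSON) -- nothing ever touches the instance disk."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for rec in records:
            gz.write((json.dumps(rec) + "\n").encode("utf-8"))
    buf.seek(0)
    return buf

def upload_to_s3(buf, bucket, key):
    """Ship the buffer straight to S3; bucket/key names are placeholders."""
    import boto3  # pip install boto3
    boto3.client("s3").upload_fileobj(buf, bucket, key)

buf = serialize_records([{"price": 11800.5, "qty": 0.25}])
# upload_to_s3(buf, "my-market-data", "trades/2019-08-09.json.gz")

# Sanity check: the buffer round-trips back to the original records.
restored = [json.loads(l) for l in gzip.decompress(buf.read()).splitlines()]
print(restored)
```

Flushing a buffer like this on a timer (say, every few minutes) keeps memory bounded while avoiding any temporary files on the instance.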

In the long term, we want to periodically transfer very old data to S3 Infrequent Access storage or S3 Glacier to save money.
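Rather than moving objects by hand, S3 lifecycle rules can do the tiering automatically. A sketch with boto3; the rule ID, prefix, bucket name, and day thresholds are assumptions to adjust for your retention needs.

```python
# One lifecycle rule that moves objects to Infrequent Access after 30 days
# and to Glacier after 90; the ID, prefix, and thresholds are placeholders.
LIFECYCLE_RULE = {
    "Rules": [{
        "ID": "archive-old-ticks",
        "Filter": {"Prefix": "trades/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
    }]
}

def apply_lifecycle(bucket):
    """Attach the rule to a bucket with boto3 (not called at import time)."""
    import boto3  # pip install boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_RULE)

print([t["StorageClass"] for t in LIFECYCLE_RULE["Rules"][0]["Transitions"]])
```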

My self-drawing, limited copyright

Also, if you want to set up a database, you might want to consider EBS for persistent data storage. Just be aware that an EBS volume can only be mounted to one instance. For the sake of the MVP, we won’t go in-depth into the different data storage options here. If you are an expert in Google Cloud Platform, Azure, or Alibaba Cloud, use the equivalent services to do fast prototyping.

Ending Note

We are a private, part-time cryptocurrency coding group. Members include a current quantitative trader from JP Morgan, a former OMM quant from Citi, a machine learning infrastructure engineer from LinkedIn, and myself.

Our objective is to learn and grow expertise in algorithmic trading in the cryptocurrency space. We do not belong to any entity; this is just a hobby-oriented group.

However, if you are interested in getting involved, we are very friendly. Please drop me a message here, and I’m happy to chat.

Reference

[1] Lopez De Prado: Advances in Financial Machine Learning, pg.24

[2] Buff Pelz Dormeier, CMT: Investing with Volume Analysis, pg.10–11

[3] Diagram Credit: https://www.coindesk.com/data

[4] Qokka AI: https://cryptoqokka.com/
