Machine learning at Cindicator: pipeline and developments

Published in Cindicator · 7 min read · Sep 4, 2018

In our recent update on the accuracy of Hybrid Intelligence predictive analytics, we briefly outlined the direction in which our machine learning team is moving. In this post we present a more detailed overview of what the ML team is working on.

1. Collective intelligence + ML = Hybrid Intelligence

Machine learning starts with data. We receive our data from our amazing community: almost 115,000 intellectual investors analyse the market and make forecasts on the Cindicator platform. But to collect the right data you need to ask the right questions. That’s where the ML magic begins.

Before asking any questions about a particular asset movement, our analytical team runs a Monte Carlo simulation more than 10,000 times. The machine learning team, together with Cindicator’s developers, has built a very handy service for this and an API that can be used by different Cindicator platforms. This allows us to determine the prior probability of the event and gives us a starting point.

Our method is currently based on the daily historical volatility of the asset, and the ML team is constantly working on extensions of it. The simulation tells us how probable it is that the event in question will occur, which lets the analytics team keep the difficulty of questions at a roughly similar level.

The picture above visualises an example of a Monte Carlo simulation. Suppose we want to ask: “The current price of Bitcoin is USD 7,190. Will it rise to USD 8,000 within 10 days?” The current price is shown as a yellow line (USD 7,190) and the target price as a blue line (USD 8,000). As you can see, different simulations give different results. After grouping all the results together, we determine that the prior probability of this event is 31%, which makes it relatively unlikely. In this case the analysts can adjust the target price or target value of the question. This prior probability can also be used later in the machine learning platform.
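As a rough illustration of how such a prior could be estimated, here is a minimal sketch. It assumes zero-drift, normally distributed daily log returns scaled by the historical daily volatility, and it checks the price only once per simulated day. The function name, the 4% volatility figure and the random seed are hypothetical choices for the example, not values from our production service.

```python
import numpy as np

def prior_probability(current_price, target_price, daily_vol, horizon_days,
                      n_sims=10_000, seed=0):
    """Estimate the chance that the price reaches target_price within horizon_days."""
    rng = np.random.default_rng(seed)
    # Assumption: daily log returns are i.i.d. normal with zero mean.
    log_returns = rng.normal(0.0, daily_vol, size=(n_sims, horizon_days))
    # Each row is one simulated price path, sampled once per day.
    paths = current_price * np.exp(np.cumsum(log_returns, axis=1))
    if target_price >= current_price:
        hit = paths.max(axis=1) >= target_price   # upward target touched
    else:
        hit = paths.min(axis=1) <= target_price   # downward target touched
    # The share of paths that hit the target is the prior probability.
    return float(hit.mean())

# Hypothetical inputs: the 4% daily volatility is made up for illustration.
p = prior_probability(7190, 8000, daily_vol=0.04, horizon_days=10)
print(f"prior probability: {p:.2f}")
```

If the estimate comes out too far from the desired difficulty, the target price can be moved and the simulation rerun.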

2. ML problems to be solved

So, we’ve asked the questions and collected a lot of answers. Now what do we do with this data?

From the machine learning perspective, we have three main problems to be solved:

Classification problem. This is when we need to predict whether a given event will occur. In addition to the binary answer, we need to provide a probability between 0 and 1;
Regression problem. This is when we need to predict the future price of the asset, so the output is a continuous variable;
Clustering problem. We scan our user base to identify bots and unfair behaviour.

The simplest approach would be to just calculate the median answer. That was the approach used by English Victorian-era statistician Francis Galton in his famous observation during a fairground game, where villagers guessed the weight of an ox.

Yet early on we understood that the accuracy could be significantly improved by taking into account the past performance of each analyst. Our data is already quite rich and it becomes more valuable as analysts build on their track records.

We have now devised over 30 different models for various types of questions. Here are some examples of the approaches we use.

1. Condorcet’s theorem-based model

We sort users by accuracy based on their previous performance. For any particular event, a model based on Condorcet’s theorem then takes into account only the users with a proven record of being correct more than 50% of the time. The result is the median of the answers from just these users.
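The filtering step can be sketched in a few lines. This is an illustrative version only: the function name, the dictionary layout and the sample users are hypothetical, and each user's accuracy is assumed to be a precomputed share of past correct answers.

```python
from statistics import median

def condorcet_median(forecasts, accuracy, threshold=0.5):
    """Median forecast of users whose past accuracy is strictly above threshold.

    forecasts: {user_id: submitted probability}
    accuracy:  {user_id: share of past answers that were correct}
    """
    selected = [p for user, p in forecasts.items()
                if accuracy.get(user, 0.0) > threshold]
    if not selected:
        return None  # nobody has a proven record yet
    return median(selected)

# Hypothetical users: bob and dan are filtered out (accuracy not above 50%).
forecasts = {"ann": 0.8, "bob": 0.3, "cat": 0.6, "dan": 0.1, "eve": 0.7}
accuracy = {"ann": 0.62, "bob": 0.48, "cat": 0.55, "dan": 0.5, "eve": 0.71}
print(condorcet_median(forecasts, accuracy))  # prints 0.7
```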

2. Bayes’ theorem-based model

For each user who has participated in a particular event, we calculate their prior probability of answering correctly as the ratio of their previous correct answers. Then, based on the answer this person has submitted, we calculate the conditional probability (e.g. given that the person has answered >0.5, what is the probability that they are correct?). If the conditional probability crosses a certain threshold, we can invert this user’s answer. We perform these calculations for every user who has participated in the event, and at the end the model returns the median answer.
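One possible reading of this procedure is sketched below. It estimates the conditional probability directly as an empirical frequency: the number of past answers that were on the same side and correct, divided by the number of past answers on that side, which is what Bayes' rule gives when every term is estimated from counts. The 0.4 inversion threshold, the function names and the sample data are assumptions made for the example.

```python
from statistics import median

def conditional_accuracy(history, said_yes):
    """P(correct | user answered on this side), estimated from past events.

    history: list of (answer, outcome) pairs; answer in [0, 1], outcome a bool.
    Returns None when the user has no past answers on that side.
    """
    same_side = [(a, o) for a, o in history if (a > 0.5) == said_yes]
    if not same_side:
        return None
    correct = sum((a > 0.5) == o for a, o in same_side)
    return correct / len(same_side)

def bayes_median(answers, histories, invert_below=0.4):
    """Invert answers from users who are reliably wrong, then take the median."""
    adjusted = []
    for user, answer in answers.items():
        acc = conditional_accuracy(histories.get(user, []), answer > 0.5)
        if acc is not None and acc < invert_below:
            answer = 1.0 - answer  # hypothetical rule: flip a contrarian signal
        adjusted.append(answer)
    return median(adjusted)

# Hypothetical data: bob has been wrong every time he answered above 0.5.
answers = {"ann": 0.9, "bob": 0.75, "cat": 0.2}
histories = {
    "ann": [(0.7, True), (0.8, True), (0.3, False)],
    "bob": [(0.9, False), (0.8, False), (0.7, False), (0.2, False)],
}
print(bayes_median(answers, histories))  # prints 0.25
```

Here bob's 0.75 is inverted to 0.25, ann's answer is kept, and cat has no history and so is left unchanged.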

3. Different ranking models

We use many different approaches to distinguish top analysts from the general crowd. For example, different groups of users are better at predicting different assets. Or, in different market regimes, some groups of users can be more accurate than others.

These models are pretty simple and they create the second layer on top of user answers in our machine learning platform.

3. Generate final prediction

We now have a pool of several dozen models, each with a different approach at its core and different parameters. To generate the final prediction we combine them into a bigger ensemble model. For the ensemble we use a feedforward neural network.

The overall structure looks like this:

In this architecture the information moves only in one direction. Forecasts from the collective intelligence platform go to different ML models. The outputs from each model become inputs for the neural network. The final outputs are indicators that are sent via the Telegram bot to CND token holders.
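To make the data flow concrete, here is a minimal sketch of a forward pass through a small feedforward network that takes one input per model. The layer sizes, the activation functions and the random, untrained weights are all assumptions for illustration. A real ensemble would be trained on resolved questions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class EnsembleNet:
    """Tiny feedforward net: model outputs -> hidden layer -> one probability."""

    def __init__(self, n_models, n_hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights, used here only to show the data flow.
        self.w1 = rng.normal(0.0, 0.5, size=(n_models, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def predict(self, model_outputs):
        # Accept one row of model outputs or a batch of rows.
        x = np.atleast_2d(np.asarray(model_outputs, dtype=float))
        hidden = np.tanh(x @ self.w1 + self.b1)   # information flows forward only
        return sigmoid(hidden @ self.w2 + self.b2).ravel()

# Hypothetical outputs of three second-layer models for one question.
net = EnsembleNet(n_models=3)
print(net.predict([0.7, 0.25, 0.6]))
```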

The example above shows the simplest graph of models, but our machine learning platform can create any graph as long as it is directed and acyclic. Because the crypto market is highly volatile, our machine learning pipeline must be able to react to market changes and remain flexible. Our pipeline is basically a graph with dynamic nodes, and the structure of the final model can vary depending on market conditions, on the asset, or whenever we want it to change.
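A directed acyclic graph of models can be evaluated by visiting nodes in topological order, so that every model runs only after its inputs are ready. The sketch below uses the standard library's graphlib module (Python 3.9+). The node names, the toy models and the run_pipeline helper are hypothetical and not part of our platform.

```python
from graphlib import TopologicalSorter
from statistics import median

def run_pipeline(nodes, deps, forecasts, source="forecasts"):
    """Evaluate a DAG of models.

    nodes: {name: function(list_of_upstream_results) -> result}
    deps:  {name: [names of upstream nodes]}
    Raises graphlib.CycleError if deps contains a cycle.
    """
    results = {source: forecasts}
    for name in TopologicalSorter(deps).static_order():
        if name == source:
            continue  # raw forecasts are already available
        inputs = [results[d] for d in deps[name]]
        results[name] = nodes[name](inputs)
    return results

# Hypothetical three-node graph: two simple models feed one combiner.
nodes = {
    "median": lambda ins: median(ins[0]),
    "trimmed": lambda ins: median(sorted(ins[0])[1:-1]),
    "ensemble": lambda ins: sum(ins) / len(ins),
}
deps = {
    "median": ["forecasts"],
    "trimmed": ["forecasts"],
    "ensemble": ["median", "trimmed"],
}
results = run_pipeline(nodes, deps, [0.2, 0.5, 0.75, 1.0, 0.25])
print(results["ensemble"])  # prints 0.5
```

Changing the structure of the final model then amounts to editing the deps dictionary, for example dropping a node for a particular asset or market regime.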

4. What’s currently in development

The ML department is actively researching new directions and approaches we can apply to further increase the accuracy of Hybrid Intelligence predictive analytics.

1. Sentiment analysis

We are testing different approaches to monitoring Twitter in real time in order to predict black swan events in crypto markets. The idea is to use natural language processing (NLP) to quickly identify important news that can move the market. This would help us to anticipate sharp increases in volatility.

We have already built a fully functioning NLP platform for data collection and NLP analysis. The platform allows us to collect real-time data, monitor certain news events and users, and perform real-time sentiment analysis. This is a very important direction for ML development and we are working on this on a daily basis.

2. Pure ML for market data

We have developed several models that are currently at the testing stage to predict asset prices based purely on market data without using collective forecasts. The idea is to apply traditional technical analysis techniques using market data as an input.

We have already built several models for pattern recognition based on technical analysis and several classical models for time series forecasting, and are testing several deep learning approaches for stock predictions. These models will help us to correct the cognitive biases of analysts and capture market changes faster.

3. Different neural network architectures

The financial market is a complex system with a tremendous number of connections and causal relationships. It’s impossible to create a universal formula that describes market behaviour. What we can do is build a model that approximates this behaviour. Without knowing all the variables, we try to recover the underlying distribution and use it to predict the market direction. Neural networks are currently among the most powerful universal approximators available, and they let us capture complex, non-linear patterns in data.

Our research is moving toward neural network architectures that can take different kinds of input (not only analyst forecasts, but also real-time market data and social media sentiment) and combine them into an accurate prediction. Interestingly enough, for such tasks we can use convolutional neural networks, generative adversarial networks (GANs) or autoencoders.

Every week in his columns Weekend of a Data Scientist and GAN of the Week, ML Team Lead Alex Osipenko shares examples of how some of these and other interesting models work.

If you are interested in learning more about the approaches we use, we recommend browsing the following references. We are always open to discussions in our chat on Discord. Finally, if you have any suggestions or ideas, feel free to comment or leave a response under this article.