Translating Machine Learning into Business — Part 1: Deep Learning & the Stock Market

Patrick Baginski
Coinmonks
14 min read · Aug 13, 2018


As you read this article, I will give you three warnings. If you continue reading past them, all hope is lost, and you have successfully identified yourself as a geek like me.

https://www.kdnuggets.com/2015/n41.html

Here goes my first warning: if you read that title and thought "Here we go, another tutorial on how to implement Recurrent Neural Networks for time series prediction in Python using TensorFlow," then you were almost right. This article is indeed about neural networks and time series prediction, only it is not about the intricacies of artificial intelligence engineering. What you're about to get into is the interpretation of deep learning in a business context, for people who do not have a background in Math, Statistics or Computer Science. That is not to say previous exposure to these topics isn't helpful, but it isn't a prerequisite here. I can already hear the haters yell "Truly understanding deep learning requires true understanding of its disciplines — Can't know if you can't code it," and they are right. However, the intention of this 4-part series on translating machine learning is not to educate others in its technical details; it is to spark Data Scientists' and Non-Data Scientists' interest in bridging the gap between science and business and, maybe, in asking simple questions like "So what?" as a daily component of our work. It certainly has inspired me to continue down the road to more fact-based business management (What the heck is "fact-based business management," you ask? For now, let's just simplify it as using data-driven decision making and fast innovation cycles to keep the business and its people happy).

Throughout the next four articles, I will try to shed light on what the key questions behind some infamous components of Machine Learning are and how they translate into crucial business questions. As with strategy and business in general, the same questions may carry different interpretations depending on the context they are applied to, so we will use practical examples to illustrate them. In this part, we will talk about neural networks and the prediction of US equity prices, as I've warned you above. In the second part, we will talk about Reinforcement Learning. The third part will be about a decades-old method known to Data Scientists and business folks alike as "Exploratory Data Analysis," which has only gained in importance given the increasing complexity of the economic environment firms operate in. In the last, and probably least technical, part, I will talk about the route to scientific business management. This is an idea that I've pondered for a while but never really got to articulate properly until now, and ML is making the concept more important than ever. Or, as Mark Cuban says:

“Artificial Intelligence, deep learning, machine learning — whatever you’re doing if you don’t understand it — learn it. Because otherwise you’re going to be a dinosaur within 3 years.”

That was a year ago. We have only two years left before we end up in the Jurassic Park of Data Science, so we'd better get started.

Translating Machine Learning into business strategy is not an easy task; in fact, it is an active field of research on both ends of the spectrum, from science to management. While ML researchers are currently very involved in making models and their results more human-interpretable (look here), economists are investigating the intersection of behavioral psychology and human decision-making (here). And while I have yet to come across a true democratization of data science (here; despite this common claim by many, the field still seems reserved for people with extensive knowledge in CS, Statistics or Math), learning data science has surely become easier than ever. My personal belief is that the best way to learn how to derive business insights from ML is by discussing the complex relations between the two with the community.

My second warning: in the following we will talk about stocks and neural networks. While it is not a prerequisite to understand their inner workings, I assume that if you've made it this far into the article, you have a basic understanding of what they are and how they work. For simplicity, let's look at a human-readable definition of neural nets for us Non-Data Scientists:

“Neural networks are a class of machine learning models that predict an outcome based on a sequence of calculations that are arranged very loosely following the architecture of the human brain.”

Let's exemplify this using the stock market. With its typical attributes (micro and macro swings in the price curve, for example), it is a commonly chosen target of deep learning tutorials and an ongoing challenge for data scientists. However, stock prices are highly unpredictable and volatile, with little ability even for modern deep learning architectures to consistently and correctly predict more than a day into the future. Burton Malkiel argues in his 1973 book "A Random Walk Down Wall Street" that if the market is truly efficient and a share price reflects all factors as soon as they're made public, then a blindfolded monkey throwing darts at a newspaper stock listing should do as well as any investment professional. Back to our neural network. Below is a high-level example of the structure of a simple neural network with a single perceptron.

https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/
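To make that picture concrete, here is a minimal sketch (plain numpy, with made-up feature values and weights; in reality the weights would be learned during training) of the computation a single perceptron performs:

```python
import numpy as np

# Toy inputs for one trading day: yesterday's S&P 500 close, the price of
# US oil, and a sentiment score scraped from analyst blogs (values made up).
x = np.array([2840.0, 67.2, 0.3])

# Weights and bias that training would normally determine (also made up).
w = np.array([0.99, 0.05, 12.0])
b = 1.5

# Forward pass: information flows one way, from the inputs through the
# perceptron to the predicted price.
predicted_price = np.dot(w, x) + b
print(predicted_price)
```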

In our stock price world, this means that the inputs (the x's) are our information, such as prices, indices, the price of US oil, or even opinions and sentiment scraped from analysts writing on their blogs. The output would be the stock price on a certain day, and the perceptron holds a few statistical and mathematical operations for inferring said stock price from said inputs. Information in this type of network flows "forward" only. What this means is that when we feed the model information such as indicators, press releases or news headlines from the economy section of different papers, this information is processed in the perceptron to arrive at the output, the "predicted stock price." However, the "predicted stock price" cannot flow back into the perceptron as an input, and neither can the information we initially fed the perceptron be "used" again (the press releases and news we input, for example). Since we are, however, talking about "recurrent" neural networks, there is a little more to it than just the above. Below is an example of the architecture of a typical Recurrent Neural Network, the kind widely used for time series prediction, or in our case, for trying to predict stock prices.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

This network shows its "recurrence" in the loop that you can see, with a block now replacing the perceptron, Xt being the input (e.g. our stock information) and Ht the output (e.g. our stock price predictions).
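Here is a hedged sketch of that loop in plain numpy, with toy dimensions and made-up numbers: the output from one step is fed back in alongside the next step's inputs.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The previous output h_prev flows back in alongside today's input x_t;
    # this loop is the "recurrence" in a recurrent neural network.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3))   # 3 input features per day, state of size 4
W_h = rng.normal(size=(4, 4))
b = np.zeros(4)

h = np.zeros(4)                        # no memory before the first day
for x_t in rng.normal(size=(5, 3)):    # five days of made-up inputs
    h = rnn_step(x_t, h, W_x, W_h, b)  # Xt goes in, Ht comes out and loops back
print(h)
```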

The difference between this and a more specialized type of RNN, the LSTM network, is that the LSTM has the ability to "store" long-term information, something that the standard neural net cannot do, and that even the standard recurrent neural net struggles with (hence the special Long Short-Term Memory version). This ability is built into the perceptron-like structure, which in the LSTM case is called an LSTM block. The other difference between a plain-vanilla neural net and a recurrent one is that information in the network can flow "backwards", or, in simpler words, the input information we feed the network can be reused by the network, and the output of a block can equally be used by the block as an input. Now, if this were just your average RNN stock price prediction tutorial, we would proceed to talk all about the intricacies of this block: block state, activation functions, bias, weights, forget gate, output gate, input gate, and so on. And in fact, you can check back here soon for the Medium post I am writing that walks through the TensorFlow code for this topic. In this post, however, I want to focus our attention on what these components mean from a business perspective and how a non-technical person interested in the stock market would think about them in their decision-making process. For this purpose, I will give a brief, human-readable description of what each one is and then provide some examples of questions that influence this component and, ultimately, the prediction of a stock or an index.

https://www.semanticscholar.org/paper/LSTM%3A-A-Search-Space-Odyssey-Greff-Srivastava/0be102aa23582c98c357fbf3fcdbd1b6442484c9

While the above typical LSTM architecture may seem intimidating at first, let's go through its components one by one. So how does stock price prediction really work in a deep learning context such as this one?

Inside the block we have information flows, statistical and mathematical operations, as well as inputs and outputs. We can see that information such as stock prices, indices, press releases, news, sentiment analysis and more can flow into the block as an input (the outside arrows coming into the box). From a business perspective, this is probably the single most important influence you can have on the prediction of the stock market. The business view of what kinds of information to use to "train" the network (more on this later) is almost a philosophical one, but the data scientist's perspective is to "let the network decide what information is relevant." That is, if we have a dataset consisting of daily S&P 500 data from the last 5 years and the news headlines from every day of those 5 years, how do you decide whether either piece of information is relevant to predicting the S&P 500 tomorrow? Intuitively one would say "Of course, history is the greatest teacher. And why would Elon Musk losing his temper again via Twitter not influence the price of Tesla tomorrow?" While this is not entirely wrong, machine learning research has shown that humans are not exactly great at "choosing" the right information to build a model, at least in comparison to letting machine learning help humans choose it.

So that is what the business people would think right now: let's gather all the high-level, relevant-seeming information such as prices, sentiment, news, company data, etc. and use these as inputs to predict the S&P 500 tomorrow. What the data scientist will make of this is to build the network in such a way that certain LSTM blocks (like the one above) are chosen by the model to be more or less influential on the S&P 500 price tomorrow. This can happen in multiple ways, such as assigning weights to each of the operations happening in the block or, more radically, fully deactivating ("dropout") a block. Note that weights are also assigned to the information flows outside of the block, and they are critical to the process of the network "learning" how to predict the stock price. We'll get into that later when we touch upon "back propagation." For now, let's just say that as the model "learns," it updates the weights at the arrows and inside the block.

Next, a crucial component of a block is the set of activation functions (those Greek symbols in the LSTM block). There are many types of activation functions, and they are a component of all neural networks. While their mathematical composition is beyond the scope of this article, let us briefly discuss what they do. You can think of them as gatekeepers at the different gates where the LSTM block lets information in or blocks it. Here's a general explanation of what they do:

"Activation functions basically decide whether a block should be activated or not: whether the information that the block is receiving is relevant for the prediction at hand or should be ignored."
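As a toy illustration (plain numpy, made-up numbers), the sigmoid function commonly used in these gates does exactly this kind of gatekeeping by scaling information towards 0 (blocked) or 1 (passed through):

```python
import numpy as np

def sigmoid(z):
    # Squashes any number into (0, 1): near 0 means "block this information",
    # near 1 means "let it all through".
    return 1.0 / (1.0 + np.exp(-z))

gate = sigmoid(np.array([-4.0, 0.0, 4.0]))    # roughly [0.02, 0.50, 0.98]
incoming_information = np.array([1.0, 1.0, 1.0])
print(gate * incoming_information)            # how much of each signal passes
```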

This purpose is fulfilled at the different positions in the block where information can enter. The information that enters a block can be one of several things:

  • Input: these are our defined data features, such as index prices, news, press releases, stocks, etc.
  • Previous-block output: this is the output of a previous block (in the case of a multi-layered LSTM, which they practically all are; we'll get to that a little further down).
  • Previous-block memory (the "recurrent" arrows in our picture): this is also an output of the previous block, however, for now let's call it "untreated" output.

Forget gate — This gate takes two inputs: the information we feed the network (e.g. stock prices, news, press releases, …) and the output of the previous block (e.g. depicting the relevance of stock prices, news and press releases to our S&P 500 price one step earlier). The activation function at this gate determines whether this information is more or less relevant in the context of the current block.

Input gate — Similar to the forget gate, this gate takes in the same information, however, it is responsible for transforming this information into the relevant part that the block should keep in its "memory".

Output gate — This gate again takes as input the same two pieces of information as the other two gates, but it also takes the current state of the block's memory (the portion the forget gate and input gate transformed) as an input. It does so to amplify the filtered, relevant information from the inputs, or, in other words, to make sure that only the truly relevant piece of information comes back out of the block.
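To tie the three gates together, here is a simplified sketch of a single LSTM block step in plain numpy. The dimensions and inputs are made up and the weights are random rather than learned, but the flow of forget, input and output gates is the same as described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # x_t: today's inputs; h_prev: the previous block's output;
    # c_prev: the previous block's "untreated" memory.
    z = W @ np.concatenate([x_t, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)             # forget gate: how much old memory to keep
    i = sigmoid(i)             # input gate: how much new info to store
    o = sigmoid(o)             # output gate: how much memory to reveal
    g = np.tanh(g)             # candidate new memory
    c_t = f * c_prev + i * g   # updated block memory
    h_t = o * np.tanh(c_t)     # filtered, relevant output of the block
    return h_t, c_t

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 7))   # 3 input features + state of size 4, times 4 gates
b = np.zeros(16)

h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):    # five days of made-up inputs
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```

In a real framework this single step is simply repeated once per time step, exactly like the loop in the recurrent example earlier.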

Alright, this was a lot of information and you might feel like it isn't clear yet what the LSTM block really does. Let me try to summarize all of the above in simpler language before we move to the grand finale:

The LSTM block takes in new information (e.g. news, stock prices, press releases, …) and/or the information of a previous block (e.g. the news, stock prices and press releases the previous block looked at and judged for relevance), filters out irrelevant information, memorizes the relevant portions, and outputs not only the newly relevant information but also a combination of this with what the previous block considered relevant. In other words, it takes the previously memorized information, adds new information to it, decides whether it's still relevant, keeps the relevant portion of it and spits out a combination of all relevant information so far.

Phew. We’re almost done. We’re just missing one crucial piece of information. Here goes my last warning: If you finish this article, chances are you’re going to spend much more of your time on LSTMs than I have taken from you with this one here (I do appreciate your perseverance!).

We spoke a lot about time series in the beginning and about why LSTMs are so well suited to time series modeling. At this point you might already have an idea why; still, let's unpack both why LSTMs are so great at this and why businesses struggle to understand them for lack of interpretability. The figure below shows the architecture of an LSTM network with multiple blocks.

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/

This one has only three blocks, but what we can see is that the first block takes in information from, say, two days ago. The next one then takes information from yesterday, the next one from today, and so on. This way, the network "transforms inputs into a predicted stock price." Each block takes the inputs and transforms them to match what the stock price was two days ago, yesterday, or today. So, to go with the LSTM block explanation above, the first block takes the S&P 500 price from two days ago in addition to the information we pass it (news, press, weather, you name it), and then strips out the irrelevant information that doesn't help it "understand" the S&P 500 price on that day. The next block then takes this as an input, in addition to the further information we pass it, and repeats the same process. This way, the network learns what information is relevant to the price of the S&P 500. This is also one way in which the data scientist lets the network decide what information is relevant, instead of going by intuition. Lastly, we could then have the network tell us the S&P 500 price movement for tomorrow, using the information it just learnt from the previous blocks. But how does the network "know" if its prediction was accurate? This is where a principle called back propagation comes in handy. In layman's terms, back propagation means the following:

The network compares its predicted stock price to the actual stock price on a certain day by looking at how wrong it was (the "error"). "Wrong" in this case means it compares how many times it correctly predicted the stock movement to how many times it predicted it incorrectly. Then, it "back propagates" this information into the memory blocks and updates the weights attached to each piece of information we pass the network, as well as the weights inside the activation functions, in order to minimize this error. It does so until we tell it to stop or, alternatively, until there is no further improvement in the error. The process of updating the weights through back propagation is arguably the least intuitive mathematical operation in a neural network. Its inner workings are beyond the scope of this article, but you can find more information (both in math and code) here.
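For readers who want to see where these pieces go in code, below is a hedged sketch using TensorFlow's Keras API. The data is random stand-in data with invented dimensions (windows of 30 trading days, 6 features each) rather than real market data; compiling with a loss function and calling fit is what triggers the forward pass, error measurement and back propagation described above, and the EarlyStopping callback implements the "stop when there is no further improvement" idea.

```python
import numpy as np
import tensorflow as tf

# Made-up stand-in data: 500 samples, each a window of 30 trading days with
# 6 features per day (index close, oil price, sentiment scores, ...); the
# target is the next day's S&P 500 close. Real data would replace this.
X = np.random.rand(500, 30, 6).astype("float32")
y = np.random.rand(500, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(30, 6)),  # one layer of LSTM blocks
    tf.keras.layers.Dense(1),                       # tomorrow's predicted price
])

# "Learning" here means back propagation: compare the prediction to the actual
# price (mean squared error), push the error back through the network and
# update the weights until the error stops improving.
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)], verbose=0)
```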

Alright. You made it! Now why is this so hard to interpret? Because what we used here is a very simple depiction of an LSTM, and in fact it would perform very poorly at predicting the direction of the S&P 500 tomorrow. In reality, powerful LSTM networks look more like the one below:

https://stats.stackexchange.com/questions/304585/what-are-blocks-of-an-lstm

Now this is still very easy to interpret if we're only looking at the input nodes (our data on stocks, news, weather, press, Elon Musk). But it becomes very challenging once you're a few layers down the path. Imagine you find yourself at hidden layer #2 in this example. The information an LSTM block takes in here from a previous block is already a heavily transformed version of the information the previous block took in. It is no longer identifiable as "News", "Press" or "Price". Modern LSTM networks can stack many such layers, making it virtually impossible to bring the workings of a later-stage memory block into a human-readable format. For a technical perspective on this, look at Andrej Karpathy's famous blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". Or, as my professor in Data Science at Cornell Tech put it:

“The biggest challenge for AI researchers is the lack of human-readable output of what is happening in heavily convoluted networks — We just don’t know why the machine chose an information to be relevant over another if it’s too deep in the network”.
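To give a feel for how quickly that depth builds up, here is a minimal stacked-LSTM sketch in TensorFlow's Keras API (the layer sizes and input shape are made up for illustration):

```python
import tensorflow as tf

# A deeper, stacked LSTM: every layer after the first consumes the already
# transformed output of the layer below, not the raw "News"/"Press"/"Price"
# inputs, which is exactly where human interpretability starts to fade.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(30, 6)),
    tf.keras.layers.LSTM(64, return_sequences=True),   # hidden layer #2
    tf.keras.layers.LSTM(64),                          # hidden layer #3
    tf.keras.layers.Dense(1),
])
model.summary()
```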

I hope you liked this post. If you have comments, questions or other interesting perspectives on how to bring business closer to data science, use the comment section below; I am very active there.
