## LSTM IN QUANTUM PHYSICS | TOWARDS AI

# What do Bidirectional LSTM Neural Networks have to do with Top Quarks?

## And how it turned out that looking at a sequence of vectors in four dimensions from two opposite sides was the key to solving a decades-old problem

In a recent paper *Bidirectional Long Short-Term Memory (BLSTM) neural networks for reconstruction of top-quark pair decay kinematics* (preprint: arXiv:1909.01144), my summer student Fardin explored a number of techniques to reconstruct the decay chain of a fundamental particle called the top quark, which is abundantly produced at the LHC. This particle decays preferentially into a *W* boson and a *bottom* quark (*t*→*Wb*). The W boson, in turn, can decay into a pair of quarks (*W*→*qq*’) in two-thirds of the cases, or into a charged lepton and a neutrino (*W*→*lv*) in the remaining one-third of the cases. The most common process leading to the production of top quarks is the so-called *pair production*, which means that the result of a collision among the constituents of the proton (valence quarks and gluons) is a particle-antiparticle pair of top quarks. Hence, in about 4/9 of the cases, at the end of the decay chain there are two *b* quarks, two light (*u*/*c*/*s*) quarks, one charged lepton (*e*/μ) and a neutrino (*v*). This process is also called the *semi-leptonic decay of a top-antitop pair* (*tt*→*WbWb*→*bqq’blv*).

Note that the quarks themselves cannot be observed experimentally: they appear in the apparatus as collimated sprays of particles called, in jargon, *jets*. Sometimes an additional jet appears, created by the emission of a gluon (the analogue of a photon for the strong force).

The problem is how to assign these six objects uniquely to each top quark. Traditionally, this is done by finding the permutation that minimizes an objective function called χ2 (“chi-squared”). Since a neural network is a universal approximator, we argued that some kind of deep network may be able to do the heavy lifting. Fardin concluded that the best candidate is in fact a stack of neural networks with memory, called bidirectional long short-term memory networks.
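To make the traditional approach concrete, here is a minimal sketch of a χ2-based assignment for the hadronic side of the event. It is *not* the χ2 used in the paper: the mass resolutions, the jet four-vectors, and the restriction to one top are all illustrative assumptions, but it shows the idea of scanning permutations and keeping the one that best matches the known W and top masses.

```python
from itertools import permutations

# Illustrative world-average masses and assumed resolutions (GeV)
M_W, M_TOP = 80.4, 172.5
SIGMA_W, SIGMA_TOP = 10.0, 15.0

def invariant_mass(vectors):
    """Invariant mass of a sum of four-vectors given as (E, px, py, pz)."""
    e = sum(v[0] for v in vectors)
    px = sum(v[1] for v in vectors)
    py = sum(v[2] for v in vectors)
    pz = sum(v[3] for v in vectors)
    return max(e**2 - px**2 - py**2 - pz**2, 0.0) ** 0.5

def chi2(b_jet, q1, q2):
    """Chi-squared of the hypothesis (b_jet, q1, q2) -> hadronic top."""
    m_w = invariant_mass([q1, q2])
    m_top = invariant_mass([b_jet, q1, q2])
    return ((m_w - M_W) / SIGMA_W) ** 2 + ((m_top - M_TOP) / SIGMA_TOP) ** 2

# Made-up jets: a hard b-like jet, two jets whose pair mass is ~80.4 GeV,
# and one extra "gluon" jet that should be left out by the minimization.
jets = [
    (144.85, 0.0, 0.0, 144.85),
    (40.2, 40.2, 0.0, 0.0),
    (40.2, -40.2, 0.0, 0.0),
    (30.0, 10.0, 20.0, 15.0),
]

# Scan all assignments (b, q1, q2); q1 < q2 removes the double counting
# from swapping the two light-quark jets.
best = min(
    ((b, q1, q2) for b, q1, q2 in permutations(range(len(jets)), 3) if q1 < q2),
    key=lambda idx: chi2(*(jets[i] for i in idx)),
)
```

With these toy numbers the minimization correctly picks the first three jets and discards the extra one; in a real event, resolution effects and wrong-but-plausible pairings make this much harder, which is what motivates the neural-network approach.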

# Neural Networks with Memory

Recurrent neural networks (RNNs) address the problem of keeping track of sequences by re-processing information at each step. In fact, the dimension along which they are “unrolled” does not have to represent actual time, but it’s a nice way to think of what’s going on.

The basic RNN unit has an input (*x*), a hidden state (*h*) and an output (*y*). The hidden state is initialized with a vector of all zeros, random numbers, or another vector (keep reading). The name of the game is to find the correct set of parameters, represented by matrices *Wx*, *Wh* and *Wy*, that let the network perform the required operation. In mathematical terms, at a time step *t*, the hidden state *h*(*t*) is:

*h*(*t*) = σh( *Wx* · *x*(*t*) + *Wh* · *h*(*t*-1) )

where σh is the activation function (typically the hyperbolic tangent *tanh*) that introduces a non-linearity. I assume implicitly that a bias *b* can be added to the argument of the sigmoid (or *tanh*) function. The output will be:

*y*(*t*) = σy( *Wy* · *h*(*t*) )

where σy is another activation function (usually the same as above). The network keeps track of the hidden states [ *h*(0), *h*(1), … *h*(*t*) ], and the user can decide whether to output just the last one or the whole sequence. The state of one RNN cell can also be used to initialize the hidden state of another cell, as is done in certain models such as the encoder-decoder network.
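The recurrence above fits in a few lines of NumPy. This is a bare sketch, not the paper's network: the weights are random, the shapes are illustrative (four-dimensional inputs, in the spirit of four-momenta), and *tanh* is used for both activations.

```python
import numpy as np

# Vanilla RNN forward pass:
#   h(t) = tanh(Wx · x(t) + Wh · h(t-1) + b)
#   y(t) = tanh(Wy · h(t))
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2   # illustrative sizes

Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Wy = rng.normal(scale=0.1, size=(n_out, n_hidden))
b = np.zeros(n_hidden)

def rnn_forward(xs, h0=None):
    """Return the list of hidden states and the output at the last step."""
    h = np.zeros(n_hidden) if h0 is None else h0
    states = []
    for x in xs:                       # unroll over the sequence
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    y = np.tanh(Wy @ states[-1])       # output from the final hidden state
    return states, y

sequence = rng.normal(size=(6, n_in))  # a made-up sequence of six vectors
states, y = rnn_forward(sequence)
```

Note that `rnn_forward` accepts an optional `h0`, mirroring the point above about initializing one cell's hidden state from another.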

In practical applications, it turned out that this basic kind of RNN struggles to retrieve information from long sequences. To cope with the problem of the vanishing gradient, an improved type of RNN called Long Short-Term Memory (LSTM) was introduced by S. Hochreiter and J. Schmidhuber in 1997. The key to solving the problem was to increase the complexity of the inner workings of the RNN cell by adding a state vector *C* and a number of gates, *i.e.* operations that control the information flow. A gate is a non-linear function (usually a *sigmoid*) followed by a multiplication. An LSTM has three gates:

**Forget gate layer**: applied to the concatenation of the input at the current time step *t* and the hidden state at the previous step, *i.e.* *z*(*t*) = [ *x*(*t*), *h*(*t*-1) ]:

*f*(*t*) = σ( *Wf* · *z*(*t*) )

Since the output is a number between 0 and 1 for each element, it controls the amount of information to be retained from the previous time step *t*-1

**Input gate layer**: similar to the forget gate, it controls which elements of the state vector *C* have to be updated:

*i*(*t*) = σ( *Wi* · *z*(*t*) )

With these functions, the state *C* is updated according to the following formula:

*C*(*t*) = *f*(*t*) ⊙ *C*(*t*-1) + *i*(*t*) ⊙ *C̃*(*t*)

where *C̃*(*t*) = tanh( *Wc* · *z*(*t*) ) is a vector of candidate values computed from the current input and the previous hidden state.

In other words, the state at the time step *t* depends on the state at the previous time step *t*-1 and on the “important” information that is presented at the time *t*.

**Output gate layer**: Finally, the hidden state at the time step *t* is computed, and the output is provided if *t* is also the final time step (*i.e.* the last element of the input sequence):

*o*(*t*) = σ( *Wo* · *z*(*t*) )

*h*(*t*) = *o*(*t*) ⊙ tanh( *C*(*t*) )

If the input sequence has only one element, there will be only one hidden state *h*. Instead, if the input has *N* elements, the hidden states form a collection of all *N* of them, and the output corresponds to the last one ( *O* = *h*[*N*] ). In common implementations, an LSTM cell can return handles to the state *C*, the list of intermediate hidden states *h*, and the final output *O*.
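Putting the three gates together, a single LSTM step can be sketched as follows. Again this is an illustration with random weights and made-up sizes, with one weight matrix per gate acting on the concatenated vector *z*(*t*) = [ *x*(*t*), *h*(*t*-1) ] as described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8
n_z = n_in + n_hidden                 # size of z(t) = [x(t), h(t-1)]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix and bias per gate, plus one pair for the candidate state
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hidden, n_z)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(n_hidden) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])   # z(t) = [x(t), h(t-1)]
    f = sigmoid(Wf @ z + bf)          # forget gate
    i = sigmoid(Wi @ z + bi)          # input gate
    c_tilde = np.tanh(Wc @ z + bc)    # candidate state values
    c = f * c_prev + i * c_tilde      # C(t) = f ⊙ C(t-1) + i ⊙ C̃(t)
    o = sigmoid(Wo @ z + bo)          # output gate
    h = o * np.tanh(c)                # h(t) = o ⊙ tanh(C(t))
    return h, c

# Run the cell over a made-up sequence of five input vectors
h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c)
```

Collecting `h` at each iteration would give the list of intermediate hidden states mentioned above, with the final `h` playing the role of the output *O*.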

In problems where all time steps of the input sequence are available, **Bidirectional LSTMs** train two recurrent networks instead of one. The first is trained on the input sequence as-is, and the second on a time-reversed copy of it. This can provide additional context and hopefully increase the accuracy of the network.
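The bidirectional idea can be sketched independently of the cell type: run the same kind of recurrent update left-to-right and right-to-left, then concatenate the two hidden states at each step. A plain *tanh* RNN stands in here for the LSTM to keep the example short; weights and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden = 4, 8

# Independent weights for the forward and backward passes
Wx_f, Wx_b = (rng.normal(scale=0.1, size=(n_hidden, n_in)) for _ in range(2))
Wh_f, Wh_b = (rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for _ in range(2))

def run(xs, Wx, Wh):
    """Return the hidden state at every step of one directional pass."""
    h, states = np.zeros(n_hidden), []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

xs = rng.normal(size=(6, n_in))           # a made-up six-step sequence
fwd = run(xs, Wx_f, Wh_f)                 # left-to-right pass
bwd = run(xs[::-1], Wx_b, Wh_b)[::-1]     # right-to-left pass, re-aligned
merged = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
# each merged state now carries context from both ends of the sequence
```

The re-alignment step (`[::-1]` after the backward pass) is what guarantees that `merged[t]` combines everything before *and* after step *t*.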

To come back to Fardin’s discovery: we noticed that the input sequence is fully known in advance, unlike, for example, in speech recognition, where the input flows with time. It made a lot of sense, then, to analyze the sequence from both ends at the same time. While the result is still preliminary, it demonstrated the ability of this kind of network to learn complex kinematics with intermediate particle decays. Possible extensions of this method involve the attention mechanism and transformer networks — but that’s set aside for a later post!