Generative AI: The Generality of Transformers, and Applications Beyond Language

Steel Perlot
15 min read · Oct 21, 2023


Nicholas E. Sherman (StarX), Ioannis Papapanagiotou (Steel Perlot), Rustin Domingos (StarX)

Introduction

The massive success of generative artificial intelligence (AI) models, such as ChatGPT and DALL-E, has transformed the landscape of AI and attracted the attention of a wide range of people beyond the field. One key breakthrough was the transformer architecture, introduced for the task of language translation in 2017. The following year, the transformer architecture was used by OpenAI in the famous Generative Pre-trained Transformer (GPT) model, achieving state-of-the-art (SoTA) performance on a wide variety of natural language processing (NLP) tasks. One striking result of this work was that the predominant training task for GPT was next token prediction, yet with a fine-tuning process requiring far less compute, the model achieved SoTA results on other tasks such as commonsense reasoning (Stories Cloze Test), question answering (RACE), and textual entailment (MultiNLI). Since then, there has been an explosion of large language models utilizing the transformer architecture, such as BERT, BART, PaLM, Chinchilla, LLaMA, Llama 2, and many others. There is also a large and growing number of open source transformer models, including some of the models listed above, thanks to the Hugging Face community.
The transformer revolutionized the field of NLP and has become the default architecture for NLP tasks, analogous to the impact of convolutional neural networks (CNNs) on the field of machine vision.

This success of the transformer on NLP tasks has inspired the use of the framework for AI tasks beyond natural language. Perhaps the most notable example is the vision transformer (ViT) by Google, which achieved SoTA results on image classification tasks. The model architecture was a slight modification of the BERT model for NLP, adapted to handle images instead. More recently, the transformer architecture has been used to process video data, and a review summarizing the state of this work can be found here.

The massive success of transformers in recent years, from NLP to vision and beyond, has made us wonder about the flexibility of this framework and where else this technology could be applied. Here, we will discuss the underlying framework of tokenization and vector embeddings that enables the transformer, and discuss the generality of this framework. We will then discuss how this framework could be applied to problems in other domains, such as stock price prediction and simulating quantum systems.

Tokenization

One of the first steps in the text generation task is known as tokenization. Language is a natural format for humans to process information, but this is not the case for computers. The idea behind tokenization is to say that language consists of two parts. First, that a language has a vocabulary, which is a set of words that encode semantic meaning. Second, that language is a sequence of words from the vocabulary. So the process of tokenization is to enumerate a set of words, and then we can construct sentences or paragraphs as a list of indices, where each index corresponds to a word. One common technique to enumerate a vocabulary is the byte pair encoding algorithm, but we will just assume we have a vocabulary for the sake of this discussion.
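As a minimal sketch in Python, using a hypothetical seven-word vocabulary consistent with the indices used below (the word at index 3 is an arbitrary filler):

```python
# A toy vocabulary; index 3 ("my") is a filler choice, the rest follow the
# indices used in this article.
vocab = {"I": 0, "sell": 1, "own": 2, "my": 3, "dog": 4, "the": 5, "a": 6}

def tokenize(sentence: str) -> list[int]:
    """Map each word of a sentence to its index in the vocabulary."""
    return [vocab[word] for word in sentence.split()]

print(tokenize("I own a dog"))  # [0, 2, 6, 4]
```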

Then, the sentence “I own a dog” would take the form [0, 2, 6, 4]. In the context of language, fractional indices have no meaning, and a more appropriate way to encode this sentence is with one-hot vectors. This leads to the following
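(Here we assume a vocabulary of seven words, indexed 0 through 6.)

$$
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
$$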

where the row denotes the position within the sentence, and the column denotes the vocabulary element index. So the third row of this matrix says that the vocabulary element with index 6 is at the third position in the sentence.

Tokenization for Images

The tokenization process is fairly straightforward in the context of language, but this is not always the case. Another common example is tokenizing an image. Generalizing the idea from language, we wish to define a vocabulary for images. One of the early approaches to image tokenization for use with transformers appeared in the work of Dosovitskiy et al. In this work, they divide an image into patches, where each patch is treated as a token in the language of images. The idea of defining patches is illustrated in the following

Figure 1: An illustration of the tokenization process of an image as described by Dosovitskiy et al. The first row is the original image, the second row illustrates the introduction of patches, and the last row illustrates flattening the patches into a one-dimensional sequence.

At the top is an image that we want to encode, and then we divide the image into patches of equal size. Lastly, we append the second row of patches to the first row, creating an ordered sequence of these patches. In this example, each patch is the equivalent of a token, and the full image is now an ordered sequence of these patches.

To encode these patches in a language that the computer understands, we can define a pixel grid and assign an RGB value to each pixel. If the patch has a width of W pixels and a height of H pixels, then the patch is a tensor X with shape (W, H, 3). We can then flatten this tensor into a single vector of length 3 * W * H. This is analogous to working with text, with each flattened patch playing the role of a token.
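A rough sketch of this patch tokenization, where the 224 × 224 image and 16-pixel patch size are illustrative choices:

```python
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, 3) image into non-overlapping patches and flatten each
    patch into a vector of length 3 * patch * patch."""
    H, W, _ = image.shape
    rows, cols = H // patch, W // patch
    tokens = []
    for r in range(rows):
        for c in range(cols):
            tile = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch, :]
            tokens.append(tile.reshape(-1))   # flatten the patch to a 1-D vector
    return np.stack(tokens)                   # shape: (num_patches, 3 * patch * patch)

tokens = image_to_patch_tokens(np.random.rand(224, 224, 3))
print(tokens.shape)  # (196, 768)
```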

In the original work, this tokenization scheme was employed for the task of image classification. There is a subtlety when trying to use this approach for image generation, though. In the case of language, the tokens are fundamentally discrete objects. If we look at the vocabulary from the language example and ask what the 4.5th word is, the question does not make sense. The word with index 4 is “dog” and the word with index 5 is “the”, and a word somewhere between these two is nonsensical. To remedy this, the vocabulary is encoded as one-hot vectors. For example, the word “I” is encoded as [1,0,0,0,0,0,0], and the word “sell” as [0,1,0,0,0,0,0]. So if the model predicts the next token to be [0.75, 0.25, 0, 0, 0, 0, 0], this leads to the interpretation that the model predicts the word “I” with a probability of 75%, and “sell” with 25%. This is what enables text generation: when generating a sentence, you can sample this probability distribution for the next token, producing a non-deterministic result and new sentences each time.
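For instance, sampling the next token from such a predicted distribution could look like the following, reusing the hypothetical seven-word vocabulary from earlier:

```python
import numpy as np

vocab_words = ["I", "sell", "own", "my", "dog", "the", "a"]   # toy vocabulary from earlier
probs = np.array([0.75, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0])       # predicted next-token distribution

# Sampling (rather than always taking the most likely token) is what makes
# generation non-deterministic.
next_token = np.random.choice(len(vocab_words), p=probs)
print(vocab_words[next_token])   # "I" roughly 75% of the time, "sell" roughly 25%
```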

For the image tokenization process described here, this is not the case. The point of Dosovitskiy et al. was not to build an image generator but an image classifier; we simply want to highlight this important distinction for generative modelling. To remedy this, one approach is to encode an intensity score as one-hots rather than as a continuous value, analogous to what is done by Chen et al. In that approach, the authors treat a single pixel as a token rather than a full patch. This is analogous to saying that an image is the equivalent of a sentence, and the vocabulary is defined over the pixel intensities.

The optimal tokenization strategy for image generation is still an active area of research, with less consensus than in the case of natural language. One strategy is to first downsample the image to a lower resolution, limiting the amount of data needed to store it. Other strategies include using a CNN to define a finite set of features and treating that set of features as a “code-book”, or vocabulary, for the images. Other frameworks, such as generative adversarial networks (GANs) and diffusion-based models, are also commonly used for image generation.

Vector Embeddings

The final step in preprocessing the data for a transformer is to define a vector embedding of the data. This mapping solves two problems for large language models. First, it enables semantic clustering of the vocabulary words. The original enumeration of the vocabulary is arbitrary, and words with indices near each other do not have to have a similar meaning. Second, the attention mechanism is an operation on sets, and so it is unaware of the ordering of the words in the sentence. The idea behind a vector embedding is to define a vector space of dimension d, which encodes both the position of the word in the sentence and its semantic meaning. Mathematically, we can encode the sentence “I own a dog”, given by the matrix shown above in the tokenization section, as
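$$
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
\otimes
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
$$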

where the first tensor represents the vocabulary elements, and the second tensor represents the positional encoding. The positional encoding is trivially the identity matrix, because we encoded the first word in the sentence as the first row, the second word as the second row, and so on. Now, the vector embedding can be thought of as a mapping of both the semantic meaning and the positional encoding to the same space. Then, under an embedding E, this sentence becomes
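$$
E(\text{``I own a dog''}) =
\begin{pmatrix}
s_0 + p_0 \\
s_1 + p_1 \\
s_2 + p_2 \\
s_3 + p_3
\end{pmatrix}
$$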

where the vectors sᵢ represent the semantic meaning of each word, and pᵢ represents their position in the sentence. We note that both the semantic embedding and the positional embedding live in the same space, and the vector embedding for a particular word encodes both. Note that the multiplication structure of the sentence prior to the embedding is not necessary to enable the embedding. If we want to label the tokens in the sentence with certain properties, as long as we have a mapping from those properties to the embedded space, we can add the results together to get a single vector containing all the relevant information.

A common example is encoding two sentences at once, where we add a property stating which sentence a particular token belongs to. If we have the two sentences “I own a dog” and “My dog is brown”, we could define a property stating where the first sentence ends and the second one begins. This could look like [0,0,0,0,1,1,1,1], where the label 0 means the first sentence, and the label 1 means the second sentence. We can map this vector to the embedded space and add the result to the other embeddings, and now each vector also contains information about which sentence it comes from. The idea is that if we wish to attach additional labels to our data, beyond just the position and the vocabulary word, these can be encoded during the vector embedding step as well. This is one way art styles can be encoded in image generation models.
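A minimal sketch of this additive embedding scheme, where the table sizes, the embedding dimension, and the placeholder ids for the second sentence are all illustrative choices:

```python
import numpy as np

vocab_size, max_len, num_segments, d = 7, 16, 2, 32

# Randomly initialized embedding tables; in a real model these are learned.
token_emb    = np.random.randn(vocab_size, d)
position_emb = np.random.randn(max_len, d)
segment_emb  = np.random.randn(num_segments, d)

token_ids   = [0, 2, 6, 4, 3, 4, 5, 1]   # "I own a dog" plus placeholder ids for a second sentence
segment_ids = [0, 0, 0, 0, 1, 1, 1, 1]   # which sentence each token belongs to
positions   = list(range(len(token_ids)))

# Each token's final embedding is the sum of its semantic, positional,
# and segment embeddings, all living in the same d-dimensional space.
embeddings = (token_emb[token_ids]
              + position_emb[positions]
              + segment_emb[segment_ids])
print(embeddings.shape)   # (8, 32)
```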

Geometric Interpretation

Now, let us look at the task of predicting the next token. If we have the sentence “My favorite book is …” and we want to predict the next token, this task has a geometric structure in the embedding space. This space encodes both the position and the semantic meaning of each word, which means that a subset of this vector space represents the position and another subset represents the semantic meaning. For simplicity, let us imagine that the embedding space is three-dimensional, with two dimensions representing the semantic meaning and the third representing the positional encoding. This would roughly look like the following

Figure 2: Cartoon of a three-dimensional vector embedding, illustrating that semantic meaning and positional encodings are represented in the same space.

The true embedding space is more complex, but the concept is the same. It need not be the case that the positional space is orthogonal to the semantic space, for example, but this illustration provides some intuition. Then, the task of predicting the next word in a sentence can be illustrated by the following image.

Figure 3: Illustration of the next word prediction task in the vector embedding space. The vertical axis represents the positional encoding, while the plane represents the semantic meaning of each word.

Here, the sheets 0 through 3 represent the semantic meaning of the first four words, and the distance between them encodes the information about the position of the words. For each word, the red line shows its semantic meaning, the blue dot illustrates the encoding of its position, and the purple vector is their sum. On the final prediction sheet, the range of points illustrates the probabilistic nature of the text generation task. As the training process improves, ideally, the spread of predictions converges to a single point.

If we represent the vector vᵢ as the embedded vector for the iᵗʰ word in the sentence, then what the transformer enables is the ability to infer the vector vₜ, given the sequence v₀, … , vₜ₋₁. If we think of the sequential index as a discrete time label, as in tₙ = t₀ + nΔt, then the transformer can be thought of as a differential equation solver that advances the vector by one time step Δt. So given v(t), it will generate v(t + Δt) with some probability. Imagine we threw a ball in the air and plotted its (x, y) position as a function of time. Such a curve would have a structure similar to the image above, if position were replaced by time and the semantic meaning were replaced by the (x, y) coordinates. This provides another interpretation of the text generation problem: finding the solution of a stochastic differential equation and performing a small time step. It also points to another class of problems the transformer architecture is equipped to handle, namely numerical simulation of differential equations.
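In pseudocode terms, generation under this interpretation is just repeated time-stepping; here `model` and `sample` are hypothetical stand-ins for a trained sequence model and a sampling routine:

```python
# Hypothetical sketch: autoregressive generation viewed as repeated time-stepping.
# `model` maps the sequence v_0 ... v_{t-1} to a distribution over the next vector,
# and `sample` draws one outcome from that distribution; both are assumed, not real APIs.
def generate(model, sample, prompt, num_steps):
    trajectory = list(prompt)                    # v(t_0), ..., v(t_k): the prompt
    for _ in range(num_steps):
        distribution = model(trajectory)         # p( v(t + Δt) | v(≤ t) )
        trajectory.append(sample(distribution))  # advance by one time step Δt
    return trajectory
```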

Applications

Stock Prices

We can imagine applying a transformer to predict future stock prices. As an example, we can try to predict the next candlestick, given a sequence of candlesticks. Here is an example of a candlestick chart for Apple (AAPL) stock from TradingView

Figure 4: An example of a candlestick chart for the stock price of Apple (AAPL). The chart is from TradingView.

For each time step, the candlestick tells us four pieces of information: the open price (O), the highest price (H), the lowest price (L), and the closing price (C). Each candlestick on this plot represents a 15-minute window: O is the market price at the beginning of the window, C is the price at the end of the window, L is the lowest price during that window, and H is the highest. A line is drawn connecting L to H, and a rectangle is drawn connecting O to C. If O > C the candlestick is red, and if O < C it is green. To tokenize this data, we can simply record these four values for each time interval, so the tokenized data for N candlesticks would look like
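$$
\begin{pmatrix}
O_1 & H_1 & L_1 & C_1 \\
O_2 & H_2 & L_2 & C_2 \\
\vdots & \vdots & \vdots & \vdots \\
O_N & H_N & L_N & C_N
\end{pmatrix}
$$

where row n contains the four values for the nᵗʰ candlestick.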

We could use smaller or larger candlestick intervals, depending on the questions we are interested in. We could also just use the market price over time for simplicity, but then the total amount of data, especially for high-volume stocks, may be unreasonably large. Once tokenized, the data could be fed into a vector embedding that encodes both these tokens and the position of each token.
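As a rough sketch, building these candlestick tokens from a raw price series could look like the following, where the synthetic once-per-minute prices and the 15-minute window are purely illustrative:

```python
import numpy as np

def candlestick_tokens(prices: np.ndarray, window: int) -> np.ndarray:
    """Group a 1-D price series into windows and record (O, H, L, C) for each window."""
    num_windows = len(prices) // window
    tokens = []
    for i in range(num_windows):
        chunk = prices[i * window:(i + 1) * window]
        tokens.append([chunk[0], chunk.max(), chunk.min(), chunk[-1]])
    return np.array(tokens)                      # shape: (N, 4), one row per candlestick

prices = 180 + 0.05 * np.cumsum(np.random.randn(60 * 15))   # synthetic once-per-minute prices
tokens = candlestick_tokens(prices, window=15)               # 15-minute candlesticks
print(tokens.shape)   # (60, 4)
```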

Similar to the discussion about image tokenization, this encoding leads to a regression-style problem: the prediction will be a single value for each property of the next candlestick, rather than a probability distribution over all possible candlesticks.

If we want a probability distribution, which gives higher-resolution information about the likelihood of future stock prices, then we want to encode stock prices as one-hots. If we do this naively, however, it will produce an effective vocabulary over stock prices that is potentially enormous. Possible stock prices range from penny stocks to stocks such as Berkshire Hathaway, trading at around $500,000. If we encode one-hots at the level of cents, then there are 100 possible values for every dollar, leading to an astronomically large vocabulary size for a transformer. State-of-the-art transformers can handle a vocabulary size on the order of 50,000–100,000. One approach is to map each stock to discrete levels, scaled by its typical trading values, with a different mapping for each stock. So for Berkshire Hathaway, perhaps each one-hot encodes a $100 interval, leading to a vocabulary of at least 5,000 words to cover the range from $0 to $500,000. The issue is that the model will then only be sensitive to changes of at least $100, providing low-resolution information. We propose a better alternative inspired by the laws of physics.

Imagine a particle whose position is given by r(t), and suppose we want to know its location r(t+Δt). If Δt is sufficiently small, then we can write
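$$
r(t + \Delta t) \approx r(t) + \Delta t \, \frac{dr(t)}{dt}
$$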

where the fraction at the end is the velocity of the particle. This means that if the time step is sufficiently small, then the set of possible positions the particle could be at in the next time step is constrained by its velocity. There is a universal speed limit for particles, given by the speed of light, meaning that the position r(t+Δt) must remain close to the position r(t). In the context of stocks, this means that if Berkshire Hathaway is trading at $500,000 now, it is very unlikely to be trading at $10 fifteen minutes from now. At smaller time frames it becomes even more unlikely. So, what we can do is define one-hot vectors that only encode trading prices in the vicinity of the current trading price, and have the transformer predict the probabilities of only those prices. We can look at past prices to estimate the maximum “velocity” of the trading price, which helps define the size of the neighborhood around the current price. If, in a 15-minute window, the price has never changed by more than 1%, then a window encoding, say, 2% up or down should be sufficient. In that case, if the current price were $500,000, we can say with high likelihood that the price in the next time interval will be between $490,000 and $510,000. We can then divide this spread into, say, 1,000 one-hot vectors, and predict with probabilities where the price will be for the next candlestick.
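A minimal sketch of this relative binning scheme, using the ±2% window and 1,000 bins from the example above (everything else is illustrative):

```python
import numpy as np

def price_to_one_hot(next_price: float, current_price: float,
                     pct_window: float = 0.02, num_bins: int = 1000) -> np.ndarray:
    """Encode a price as a one-hot over bins centered on the current price."""
    low  = current_price * (1 - pct_window)
    high = current_price * (1 + pct_window)
    # Prices outside the window are clipped into the edge bins.
    idx = int(np.clip((next_price - low) / (high - low) * num_bins, 0, num_bins - 1))
    one_hot = np.zeros(num_bins)
    one_hot[idx] = 1.0
    return one_hot

# The Berkshire Hathaway example from the text: current price $500,000, window of ±2%.
encoded = price_to_one_hot(next_price=500_750.0, current_price=500_000.0)
print(encoded.argmax())   # which of the 1,000 bins the next price falls into
```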

We could also add an additional property, such as a label for the specific stock, in this case AAPL. If we look at multiple stocks, such as Amazon (AMZN), Meta (META), and Tesla (TSLA), we could enumerate them as
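AAPL → 0, AMZN → 1, META → 2, TSLA → 3 (the particular ordering is arbitrary)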

then for the Apple chart, the stock id mapping could take the form
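$$
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0
\end{pmatrix}
$$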

where the row labels the stock, and the column labels which candlestick we are looking at. This would allow the model to look at multiple stocks at the same time. Moreover, after training, different stocks with similar candlestick charts will cluster together under the stock id embedding. This would provide additional information about how correlated two different markets are.

With such a model, we could then generate possible future candlestick charts from a sequence of candlesticks for a given stock. It is likely that the chart alone does not contain all the necessary information to predict future prices, but this would provide a model to simulate future stock price trajectories. Such a model could then be run a large number of times to get statistics about how likely the price is to increase or decrease, and by how much.

Quantum Simulations

The differential equation solver interpretation of the transformer could be especially useful in simulating quantum systems, where calculating the wavefunction ψ(t + Δt) from ψ(t) is very expensive. In quantum simulations, the wavefunction is given by a large vector ψ(t), and the time evolution is given by
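$$
\psi(t + \Delta t) = e^{-i H \Delta t / \hbar}\, \psi(t)
$$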

where H is an operator (matrix) called the Hamiltonian, which encodes the interactions between the particles in the system. So performing even a small time step exactly requires the full diagonalization of a large matrix, which demands a large amount of compute. However, if we performed these calculations for multiple time steps to obtain the data ψ(t₀), ψ(t₁),…, ψ(tₙ), then we could treat each wavefunction as a token, and the sequence of wavefunctions as a sentence. If we do this for many starting wavefunctions, this would define the corpus of text used to learn the language of the Hamiltonian H. Once trained, if the inference cost of the transformer is lower than the cost of performing the time evolution exactly, this could provide an efficient approach to simulations in the field of quantum dynamics.
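As a rough sketch of how the training data for such a model could be generated, with a random Hermitian matrix standing in for a physical Hamiltonian and ħ set to 1:

```python
import numpy as np
from scipy.linalg import expm

dim, dt, num_steps = 64, 0.01, 100

# A random Hermitian matrix standing in for a physical Hamiltonian (with ħ = 1).
A = np.random.randn(dim, dim) + 1j * np.random.randn(dim, dim)
H = (A + A.conj().T) / 2

# Exact single-step propagator; this matrix exponential is the expensive part.
U = expm(-1j * H * dt)

# Start from a random normalized wavefunction and record the trajectory.
psi = np.random.randn(dim) + 1j * np.random.randn(dim)
psi /= np.linalg.norm(psi)

trajectory = [psi]
for _ in range(num_steps):
    psi = U @ psi
    trajectory.append(psi)

# Each wavefunction is a "token"; the whole trajectory is a "sentence" in the
# language defined by this Hamiltonian.
sequence = np.stack(trajectory)   # shape: (num_steps + 1, dim)
print(sequence.shape)
```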

An interesting consequence of such a model is that the transformer would now be encoding the Hamiltonian, which is a well-studied object in quantum physics, possibly giving theoretical insights into how the transformer works. A similar connection has been made between tensor network quantum states and deep convolutional networks, providing an understanding of the network's expressive ability in terms of the quantum entanglement of the corresponding wavefunction.

Conclusion

Generative AI is a revolutionary technology that is still in its early stages. We have seen that the task of predicting the next word in a sentence has the same structure as solving a differential equation with a finite time step. The success of the transformer, together with this geometric interpretation, suggests that the applicability of these large language models extends far beyond natural language. This includes a wide range of possibilities such as predicting stock prices, weather forecasting, quantum mechanical simulations, and beyond. We encourage you to think outside the box and see how this technology can be applied in areas that have yet to be explored.

If you are excited about this work, please apply to one of our open positions.
