Equinox Architecture: Divide and Compute

Dakshish Singh
6 min read · Jul 5, 2024

--

Abstract:

Equinox Architecture introduces the concept of “Divide and Compute”. It performs sequential processing and generation in O(n) time and, when parallelized, in O(log n) time. It uses a tree-structured model to process a sequence with neural networks.

I am just a 15-year-old student, so there will be a lot of errors; I want to know whether this is worth anything. You can email me at: dakshishsingh12@gmail.com

I would like to write a paper with your help. If you find anything of value, or have any doubts, please email me.

Toy problem:

Let’s try to sum n (here 8) numbers using a neural network. We will train a single neural network that can sum two numbers. We will divide our sequence into pairs of 2 numbers and feed each pair into the network, which will generate another, smaller sequence. We repeat this process until the problem is solved (here: until we are left with a single number, which should be the sum of all the inputs).

Here, a circle represents a neural network that can add two numbers. Using that one network and this concept, we can sum any sequence of 2^n numbers.
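To make the idea concrete, here is a minimal sketch of the toy problem in PyTorch: a tiny network learns to add two numbers, and the same network is then applied level by level over the tree. The layer sizes, training range, and training loop are illustrative choices of mine, not a prescribed setup.

```python
import torch
import torch.nn as nn

# Tiny network that learns to add two numbers (illustrative sizes).
adder = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(adder.parameters(), lr=1e-2)

# Train on random pairs: the target is simply a + b.
for step in range(3000):
    pairs = torch.rand(256, 2) * 50          # pairs in [0, 50) so intermediate sums stay in range
    target = pairs.sum(dim=1, keepdim=True)  # a + b
    loss = nn.functional.mse_loss(adder(pairs), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

@torch.no_grad()
def tree_sum(numbers):
    """Divide and compute: repeatedly sum adjacent pairs with the same network."""
    xs = torch.tensor(numbers, dtype=torch.float32).view(-1, 1)
    while xs.shape[0] > 1:
        pairs = xs.view(-1, 2)               # group into pairs of 2
        xs = adder(pairs)                    # one level of the tree
    return xs.item()

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))    # should be close to 36
```

Each pass through the while loop is one level of the tree, so 8 numbers take 3 levels; with parallel hardware the 4, then 2, then 1 additions of each level can run at the same time, which is where the O(log n) depth comes from.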

We can apply this concept to train an LLM.

Here, a, b, c, d, … are the embeddings of our token sequence.

Q) Why should it work?

A) It’s the neural network’s job to find a way to read the sequence in pairs rather than word by word. The advantage is that the distance between any two tokens is at most 2 * log n: each token is at most log n steps from the root of the tree, so for 8 tokens any two are at most 6 steps apart, versus up to 7 steps in a recurrent network, where distance grows linearly. That could allow for better exchange of contextual information. And since you only need to train a single network (in theory, for now), you can easily train very large neural networks.

Q) Why should it not work?

A) Maybe language simply can’t be read in pairs of tokens. If that problem exists, then perhaps a larger network could solve it using multi-head, multi-layer, and MLP components, just like transformers. But most likely, the neural network will find a way.

If you don’t have 2^n tokens to form complete pairs, you can simply pad with a special token or a zero vector, and the neural network will learn to handle this case.

The network is also not limited to pairs: it can take 4, 8, 16, etc. vectors as input and generate 1, 2, 3, 4, … vectors as output.
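As a hedged sketch of that padding step (the helper name, the zero-vector default, and the tensor shapes are my own assumptions, not from the original):

```python
import torch

def pad_to_power(embeddings, k, pad_vector=None):
    """Pad a (seq_len, dim) tensor of embeddings with a 0-vector (or a special
    token's embedding) so that seq_len becomes a power of k and every group is full."""
    seq_len, dim = embeddings.shape
    target = 1
    while target < seq_len:              # next power of k that fits the sequence
        target *= k
    if pad_vector is None:
        pad_vector = torch.zeros(dim)    # 0-vector padding by default
    pad = pad_vector.expand(target - seq_len, dim)
    return torch.cat([embeddings, pad], dim=0)

x = torch.randn(6, 32)                   # 6 token embeddings of size 32
print(pad_to_power(x, k=2).shape)        # torch.Size([8, 32]): complete pairs
print(pad_to_power(x, k=4).shape)        # torch.Size([16, 32]): complete groups of 4
```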

Training an LLM without training on predicting the next token:

Now we will train an LLM that uses the Equinox Architecture as a sequential processor. For this project we make one big change: the single neural network of the Equinox Architecture will be trained as an autoencoder that takes 2 vectors as input and compresses them into an embedding 1 vector long. By “vector” we mean an embedding with the same length as our encoded token. Each layer has a different network, but within a layer a single network handles all of that layer’s sequencing work. So the Equinox part acts as a compressor, not like a transformer.
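Here is a minimal sketch of how I read that description, assuming PyTorch, a branching factor of 2, and a GPT-2-sized embedding of 768; the layer widths and the MSE reconstruction loss for the autoencoder pre-training are my assumptions, not the exact setup used.

```python
import torch
import torch.nn as nn

DIM = 768  # assumed embedding size (GPT-2 style); the real value may differ

class EquinoxLayer(nn.Module):
    """One layer: a single network compresses every group of k vectors into 1."""
    def __init__(self, dim, k=2):
        super().__init__()
        self.k = k
        self.encoder = nn.Sequential(nn.Linear(k * dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))
        # decoder used only while pre-training the layer as an autoencoder
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, k * dim))

    def forward(self, x):                    # x: (batch, seq, dim), seq % k == 0
        b, s, d = x.shape
        groups = x.reshape(b, s // self.k, self.k * d)
        return self.encoder(groups)          # (batch, seq / k, dim)

    def reconstruction_loss(self, x):
        b, s, d = x.shape
        groups = x.reshape(b, s // self.k, self.k * d)
        recon = self.decoder(self.encoder(groups))
        return nn.functional.mse_loss(recon, groups)

# An Equinox block with L layers compresses k**L token embeddings into 1.
layers = nn.ModuleList([EquinoxLayer(DIM, k=2) for _ in range(3)])  # 2**3 = 8 tokens
x = torch.randn(4, 8, DIM)
for layer in layers:
    x = layer(x)
print(x.shape)  # torch.Size([4, 1, 768]): one context embedding per example
```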

I have trained 2 generations of LLMs. Both generations were trained on PTB-text-only and tested on its test set, and we compare them with distil-GPT. My LLMs and distil-GPT use the same GPT-2 vocabulary and encoding scheme, and my models (40–45 million parameters) are about half the size of distil-GPT (88 million) in parameter count.

A figure of our model looks like this:

The Transformer block takes an embedding and outputs the embedding of the next token; it is just a plain neural network. The predictor block tries to predict the next token from that generated embedding. All parts are trained individually, one by one: first the Equinox block, then the Transformer block, and then the predictor block.
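A rough sketch of the whole pipeline, under the same assumptions as the sketch above (the module shapes, staging comments, and greedy decoding step are illustrative, not the exact implementation):

```python
import torch
import torch.nn as nn

DIM, VOCAB = 768, 50257          # assumed GPT-2 embedding size and vocab size

# Stage 1: the Equinox block (stack of EquinoxLayer from the sketch above)
#          compresses the context tokens into a single context embedding.
# Stage 2: the "Transformer block" is a plain network mapping that context
#          embedding to an embedding for the next token.
transformer_block = nn.Sequential(nn.Linear(DIM, 4 * DIM), nn.GELU(),
                                  nn.Linear(4 * DIM, DIM))
# Stage 3: the predictor block turns the predicted embedding into vocab logits.
predictor = nn.Linear(DIM, VOCAB)

def predict_next_token(equinox_layers, token_embeddings):
    """Full inference path: compress the context, predict an embedding, score the vocab."""
    x = token_embeddings                       # (batch, k**L, DIM)
    for layer in equinox_layers:               # Equinox block
        x = layer(x)
    context = x.squeeze(1)                     # (batch, DIM)
    next_embedding = transformer_block(context)
    logits = predictor(next_embedding)         # (batch, VOCAB)
    return logits.argmax(dim=-1)               # greedy next-token id

# Training is staged: first the Equinox layers (autoencoder reconstruction),
# then the Transformer block (match the true next token's embedding),
# then the predictor (cross-entropy on the next token id).
```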

Results of Generation 1:

Perplexity (negative log-likelihood in parentheses) when predicting the nth token:

• 3rd token: 142.448708 (4.95)

• 5th token: 1715.15373 (7.44)

• 9th token: 109365.979 (11.60)

Size of model (parameter counts):

• 3rd token: Equinox 1,180,416, decoder 40,492,801

• 5th token: Equinox 2,360,832, decoder 40,492,801

• 9th token: Equinox 3,541,248, decoder 40,492,801

Generation 2:

In Gen 2, I experimented with a network that could take more than 2 tokens at a time and was trained in a better manner. The size of the models didn’t change much.

Note: On average, a single example contains only 25–26 tokens. To train our model with a higher number of tokens, we treat the whole dataset as a single example. From that single example we select some part of it and try to predict the next token. So our model receives sentences from several different examples as input and tries to predict the next token of the last sentence.
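A small sketch of that sampling scheme, assuming the tokens of the whole dataset have already been concatenated into one Python list (the function name and the number of samples are mine):

```python
import random
import torch

def make_windows(token_ids, context_len, n_examples=1024):
    """Treat the whole dataset as one long token stream and cut out
    (context, next-token) pairs from random positions in it."""
    inputs, targets = [], []
    for _ in range(n_examples):
        start = random.randrange(len(token_ids) - context_len)
        window = token_ids[start:start + context_len + 1]
        inputs.append(window[:-1])    # context tokens, may span several sentences
        targets.append(window[-1])    # the next token to predict
    return torch.tensor(inputs), torch.tensor(targets)

# e.g. context_len = 16 for a model whose Equinox block compresses 16 tokens
```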

Results of our model:

The columns give the number of vectors received by the single neural network of the Equinox block, and the rows give the number of layers in the Equinox block. The number of tokens fed into our LLM is therefore (vectors per network) raised to the power of (number of layers); for example, the second-column, second-row network receives 16 (4²) tokens as input.
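Since the results table itself is an image, here is a tiny illustration of that formula; the k values below are only examples, not necessarily the columns of the actual table:

```python
# Tokens of context = (vectors per Equinox network) ** (number of Equinox layers).
for k in (2, 4, 8):                 # vectors taken by the single network (illustrative)
    for layers in (1, 2, 3):        # depth of the Equinox block
        print(f"{k} vectors, {layers} layers -> {k ** layers} tokens of context")
# e.g. 4 vectors with 2 layers -> 16 tokens, matching the example in the text.
```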

You will see that as you increase the number of tokens, the perplexity increases, likely because the Equinox block was trained as an autoencoder, and possibly because of the way we train the model.

Now here is the perplexity of distil-GPT on different token counts:

Format: context length -> perplexity

64 -> 58.5089

32 -> 63.2031

16 -> 73.8640

8 -> 101

4 -> 185.0610

2 -> 419.9217

The Gen 2 model performs better at low context lengths (2 and 4 tokens) and is comparable at 8 tokens. So it performs better at short context even though it has about half the parameter count and was not trained as a true LLM.
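The article does not spell out the evaluation recipe, but a plausible way to measure distil-GPT’s perplexity at a fixed context length with Hugging Face Transformers looks roughly like this (the non-overlapping windowing is my assumption):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

@torch.no_grad()
def perplexity_at(text, context_len):
    """Average next-token loss over non-overlapping windows of context_len tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    losses = []
    for start in range(0, len(ids) - context_len, context_len):
        window = ids[start:start + context_len].unsqueeze(0)
        out = model(window, labels=window)      # labels are shifted internally
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

# print(perplexity_at(ptb_test_text, 16))   # ptb_test_text: the PTB test split as a string
```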

1st column model: https://drive.google.com/drive/folders/11fE1ec5iEmlA_7ZzUaaROp_8QwmZ4LiF?usp=drive_link

2nd column model: https://drive.google.com/drive/folders/1HtH3FGtrEesBqy5USY7IU1FGrqiwJ0dv?usp=drive_link

3rd column model: https://drive.google.com/drive/folders/1Isii4X9nqoU4pF2m0Uv9jgkndU2jPfym?usp=drive_link

4th column model: https://drive.google.com/drive/folders/1YZzLj61ImeYP_gsXJnrKKHiLgCSCSBIL?usp=drive_link

5th column model: https://drive.google.com/drive/folders/12T4QrfRi0iXyRGURQD-_04y5ykEprLl7?usp=drive_link

How it could be applied to image generation:

This architecture has the most potential in image generation. We feed in an embedding of the image we want to generate; a neural network then generates 2 embeddings that, say, represent two parts of the image (top and bottom). We repeat this process until we have an embedding for each pixel (or group of pixels), and another neural network then generates the pixel values from those embeddings.
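A minimal sketch of that expansion tree, assuming a single expander network shared across levels (per-level networks, as in the LLM, would also fit) and an RGB output head; all sizes are illustrative:

```python
import torch
import torch.nn as nn

DIM = 256   # assumed embedding size

# Expander: splits one region embedding into 2 child embeddings
# (e.g. the top half and the bottom half of that region).
expander = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, 2 * DIM))
to_pixel = nn.Linear(DIM, 3)   # final network: leaf embedding -> RGB value

def generate(image_embedding, levels):
    x = image_embedding.unsqueeze(0)            # (1, DIM): the whole image
    for _ in range(levels):                     # each level doubles the number of parts
        x = expander(x).reshape(-1, DIM)        # (2 * parts, DIM)
    return to_pixel(x)                          # one colour per leaf region

pixels = generate(torch.randn(DIM), levels=4)   # 2**4 = 16 region colours
print(pixels.shape)                             # torch.Size([16, 3])
```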
