How to Bring Kandinsky to Life Using Rotation Matrices for Video Generation — Splitter (Part 2)

Mike Puzitsky
9 min read · Jul 20, 2024


In the first part, I introduced the idea of using rotation matrices for video generation, an approach initially based on intuition and later partly formalized after a preliminary dive into group theory. I was then ready to move on to solving the task with machine learning.

Hypothesis

The methodology was based on an approach used in video codecs:

  • Keyframe
  • An algorithm describing changes from the keyframe to a certain depth

I-Frame

The widely used I-frame approach, which stores a keyframe plus subsequent frames of changes, suggests that motion information should be conveyed to the model in the form of changes, that is, displacements.

The next slide schematically presents two latent spaces of different modalities.

Latent Spaces of Different Modalities

In addition to the main vectors, there are change vectors. Since previous experiments were based on changes, a hypothesis emerged.

A model can be trained to predict the vector for generating the i-th frame by providing it, through the loss function, with information only about the changes relative to the 0-th frame.

Loss Function

Previous experiments showed that it makes sense to build neural network training around changes. The changes themselves can vary: from the perspective of the vector space, they can be changes in the vector's angle or in its length. Changes in the pixels of the frames, after passing through the encoder, are reflected in the image embeddings. Therefore, for training a conditional model on these small changes in frame embeddings, a combination of two error functions was chosen as the main approach.

Based on preliminary experiments and the analysis of vector changes in the latent space, it was decided to use a combined approach, including two different loss metrics: CosineEmbeddingLoss and MSELoss.

The slide below presents the loss function, connecting changes in the spaces through angular and metric data.

Combined Loss Function

At first glance it is a simple formula, but it has a key feature: the output of the loss function used for gradient calculation is not a scalar but a vector with the dimensionality of the latent space. Essentially, during training we work not with averaged gradients at a point but with a gradient field; we learn from changes in different spaces, i.e., from the gradient of the field of changes.
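As a sketch, such a combined loss could look like this in PyTorch. The weights of the two terms and the way the unreduced tensor is backpropagated are my assumptions for illustration, not the project's exact code.

```python
import torch
import torch.nn as nn

# Per-sample angular term and per-element metric term, deliberately left unreduced.
cosine_loss = nn.CosineEmbeddingLoss(reduction="none")
mse_loss = nn.MSELoss(reduction="none")

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """pred, target: (batch, dim) frame embeddings; returns a (batch, dim) loss tensor."""
    ones = torch.ones(pred.shape[0], device=pred.device)
    angular = cosine_loss(pred, target, ones).unsqueeze(1)   # (batch, 1)
    metric = mse_loss(pred, target)                          # (batch, dim)
    return alpha * angular + beta * metric                   # vector-valued loss

# A non-scalar loss can be backpropagated by supplying an explicit upstream gradient:
# loss = combined_loss(pred, target); loss.backward(torch.ones_like(loss))
```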

Splitter — What Is It?

To implement the training algorithm, I decided to use the Kandinsky 2.2 model, which I had already used for the initial tests (see Part 1). Kandinsky 2.2 is built on an approach similar to unCLIP, which is used in DALL-E 2 and in some versions of Stable Diffusion. More precisely, Kandinsky 2.2 uses the diffusion_mapping approach to transform high-dimensional text embeddings into latent embeddings while preserving geometric properties and connectivity. The resulting embeddings have the same size as the embeddings produced by the image encoder.

Kandinsky 2.2

Generation in the Kandinsky 2.2 diffusion model is based on an image-to-image principle. The second part of the model, the Decoder, contains a UNet image-to-image diffusion model and a MoVQ model that turns the latents into the final, higher-resolution image. The first part includes the Prior model, which is trained to bring the unCLIP text embeddings and the image embeddings closer together. These embeddings have high cosine similarity, and the UNet in the decoder is trained to reconstruct images from noise conditioned on the image embeddings.

This structure of Kandinsky 2.2 is very advantageous for tasks where the continuity and dynamics of scene changes must be maintained: it gives finer control over turning text into a sequence of slightly changing frames, which makes it extremely useful for experimenting with my approach.

The similarity between the unCLIP text embeddings and the image embeddings also makes it possible to combine images and texts to create new images with Kandinsky 2.2. The model's structure, the possibility of training and using its modules separately, and the quality of image generation made it a convenient test bed for my experiments. To build a composite model from Kandinsky 2.2 that can generate video sequences, I added a new module, which I named "Splitter".
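For context, the two stages can be used separately through the diffusers library roughly as follows. This is a generic sketch using what I believe are the public Kandinsky 2.2 checkpoints, not code from the project; prompts and parameters are illustrative.

```python
import torch
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline

# Prior: text prompt -> unCLIP-style image embedding
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
image_embeds, negative_embeds = prior("a dancing couple", guidance_scale=4.0).to_tuple()

# Decoder: image embedding -> image
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")
image = decoder(image_embeds=image_embeds, negative_image_embeds=negative_embeds,
                height=512, width=512, num_inference_steps=50).images[0]
```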

Inserting the Splitter Model into Kandinsky 2.2

To adapt it for the Text2Video task, a Splitter model is applied between the Prior and the Decoder. The Splitter model modifies the vectors used for generating images in the Decoder.

The Splitter takes the following inputs:

  • The ordinal number of the predicted vector
  • The full text vectors from the CLIP-ViT-G model used in Kandinsky
  • The initial vector from the Prior model

Basic Training Scenario for Splitter

If the proposed scenario works, the initial results should be observable even on a simple model.

The Splitter has a straightforward configuration consisting of input embedding layers, followed by a cascade of linear down-sampling layers, regularization layers, and non-linearity layers. The output is the predicted modified embedding, which can subsequently be used to generate an image with the Kandinsky 2.2 decoder model.
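As an illustration, a minimal Splitter of this kind might look as follows. The layer sizes, the flattening of the text vectors, and the way the frame number is embedded are my assumptions; only the overall structure (embedding layers plus a cascade of linear, normalization, and non-linearity layers) follows the description above.

```python
import torch
import torch.nn as nn

class Splitter(nn.Module):
    """Sketch: predicts a modified image embedding for frame i (sizes are assumptions)."""
    def __init__(self, emb_dim: int = 1280, text_len: int = 77, max_frames: int = 128):
        super().__init__()
        self.step_emb = nn.Embedding(max_frames, emb_dim)        # ordinal number of the frame
        self.text_proj = nn.Linear(text_len * emb_dim, emb_dim)  # compress full text vectors
        self.net = nn.Sequential(                                # cascade of down-sampling blocks
            nn.Linear(3 * emb_dim, 2 * emb_dim),
            nn.LayerNorm(2 * emb_dim), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(2 * emb_dim, emb_dim),
        )

    def forward(self, step, text_tokens, prior_emb):
        # step: (batch,) frame indices; text_tokens: (batch, text_len, emb_dim);
        # prior_emb: (batch, emb_dim) unCLIP vector from the Prior model.
        x = torch.cat([self.step_emb(step), self.text_proj(text_tokens.flatten(1)),
                       prior_emb], dim=-1)
        return self.net(x)   # modified embedding for the Kandinsky 2.2 decoder
```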

The training scenario is custom-built, and I will discuss it further below. All training is done on a T4 GPU, which matters given the volume of experiments that had to be run. During the training of the Splitter model, everything stays within the latent space of the embeddings.

Dataset

To create the necessary training data, the simple and convenient TGIF dataset was used.

Pros:

  • Convenience
  • Variety
  • Large number of clips
  • Clip lengths within 100 frames

Cons:

  • Mostly low resolution
  • Some clips are partially static
  • Short textual descriptions

To filter out problematic data, a separate script was used to prepare the dataset for future training of the Splitter model. The data generated by the script represents a PyTorch dataset of already vectorized data. Due to limited resources, 200 filtered and vectorized clips were automatically selected by the script for initial test training.

Initial Training and Testing

Training is conducted by randomly selecting a batch of frames from a clip if the clip is longer than the batch size; otherwise all of its frames are taken. The frames within a batch are shuffled, and the clips themselves are reshuffled in every epoch.
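A small sketch of this sampling rule, with illustrative function and variable names:

```python
import random

def sample_frame_indices(num_frames: int, batch_size: int) -> list[int]:
    """Random batch of frame indices from one clip, shuffled (sketch of the rule above)."""
    indices = list(range(num_frames))
    if num_frames > batch_size:
        indices = random.sample(indices, batch_size)  # random subset of frames
    random.shuffle(indices)                           # frames are shuffled within the batch
    return indices

# The clips themselves are reshuffled at the start of every epoch, e.g.:
# random.shuffle(clips)
```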

Model Inference Scheme Based on the Generated Frame Number

The initial results of this simple approach, which encouraged further research, are especially noticeable in the clip with the dancing couple: despite the instability of the background and clothing, there is a complex relationship in their joint movements.

Searching for Improvements

It seems like there is a lot of data, with each clip containing between 25 and 100 frames, but there is only one text description for all the frames.

If you play the clip backward, the same description often applies. We will train the model in both directions, from the initial frame to the final frame, by providing the model with a direction label as input.

Adding the Direction Label to the Dataset and Model
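A sketch of how the direction label can double the training pairs for one clip; the tuple layout and names are assumptions for illustration.

```python
def directional_samples(frames):
    """frames: (num_frames, dim) tensor of frame embeddings for one clip.
    Returns (step, direction, key_frame, target_frame) tuples for both playback directions."""
    n = len(frames)
    forward = [(i, 0, frames[0], frames[i]) for i in range(n)]          # direction label 0
    backward = [(i, 1, frames[-1], frames[n - 1 - i]) for i in range(n)]  # direction label 1
    return forward + backward
```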

Another Hypothesis — SCISSORS

New Starting Frame

Frames can be trimmed from the beginning or the end of a clip, and the new starting frame can be used. Often, the same description will still apply.

However, for correct operation and data enrichment, we need to modify the initial text vectors using rotation matrices obtained from the vectors of the original starting frame and the new starting frame.

Applying Rotation Matrices for Augmentation

Thus, rotation matrices are introduced into the training process. To enable their use during training, a script was implemented that computes and applies them directly on tensors, so that all calculations run on the GPU. In the first part of the article, where there was no training, the rotation matrices were implemented with numpy.

Calculating Rotation Matrices

We compute the rotation matrix from the old and new keyframe embeddings and apply it to the initial unCLIP vector to obtain the modified unCLIP vector. We do the same for the full text vectors, using the rotation matrix obtained from the old and new unCLIP vectors.
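As a sketch, here is one standard way to build such a matrix on the GPU with torch: a rotation that acts only in the plane spanned by the two embeddings. Whether this matches the exact construction used in the project is an assumption.

```python
import torch

def rotation_matrix(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Returns R (dim x dim) that rotates the direction of u onto the direction of v,
    acting only in the plane spanned by u and v (illustrative sketch)."""
    u = u / (u.norm() + eps)
    v = v / (v.norm() + eps)
    cos_t = torch.clamp(u @ v, -1.0, 1.0)
    w = v - cos_t * u                      # component of v orthogonal to u
    if w.norm() < eps:                     # (anti)parallel vectors: no unique rotation plane
        return torch.eye(u.shape[0], device=u.device, dtype=u.dtype)
    w = w / w.norm()
    sin_t = torch.sqrt(torch.clamp(1.0 - cos_t ** 2, min=0.0))
    eye = torch.eye(u.shape[0], device=u.device, dtype=u.dtype)
    return (eye + (cos_t - 1.0) * (torch.outer(u, u) + torch.outer(w, w))
            + sin_t * (torch.outer(w, u) - torch.outer(u, w)))

# Usage sketch (hypothetical names): carry the keyframe change into the unCLIP vector,
# then carry the unCLIP change into the full text vectors (text_len x dim).
# R_img = rotation_matrix(old_frame_emb, new_frame_emb)
# new_unclip = R_img @ unclip_vec
# R_txt = rotation_matrix(unclip_vec, new_unclip)
# new_text_vectors = text_vectors @ R_txt.T
```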

Since training uses random frames from the clip, the index of the "new zero" frame must be tracked on the fly, and only frames that come after it in the random selection can be used; frames before the "new zero" frame are used for training in the reversed direction. A class was added to handle the ordering of frames, their shuffling, and the tracking of a frame's position within the batch after shuffling. Carefully tracking the position and the distance from the keyframe or pseudo-keyframe to each frame in a random batch is critical: any mistake leads to averaging and "freezing" instead of learning the specific features of the changes.

Example After Initial Training Supplementation

Markov Chain

The output of the Splitter model is a vector of the same nature as its input. This led to the idea of an additional regression step during training: making a prediction from a prediction by feeding the predicted embedding back into the Splitter model.

Objective:

  • Enrich the training data.
  • Teach the model to produce vectors deeper into the sequence.

Autoregressive Step

Rotation matrices are also applied here to transfer changes into the text space from the initial starting frame to the predicted starting frame.
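Schematically, and reusing the hypothetical names from the sketches above (with the direction label omitted for brevity), the extra step could look like this:

```python
# Sketch of the extra regression step for a single clip (batch of one), reusing the
# hypothetical `splitter` and `rotation_matrix` from the earlier sketches.
pred_1 = splitter(step, text_vectors, prior_emb)                 # normal prediction

# Carry the change "initial starting vector -> predicted vector" into the text space.
R = rotation_matrix(prior_emb.squeeze(0), pred_1.detach().squeeze(0))
text_vectors_rot = text_vectors @ R.T

pred_2 = splitter(step, text_vectors_rot, pred_1)                # prediction from a prediction
```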

Model’s Autoregressive Capabilities

The slide above presents examples of generations from vectors obtained by the model autoregressively, depending on how the model was trained.

It is evident that the model’s capabilities vary depending on whether the training included the step with rotation matrices and the regression step.

Training Trainer

A schematic representation of the custom trainer, which combines:

  • Working with a random small sample of clips in each epoch for more robust and generalized model training.
  • Random batch of frames from the clip for robustness to heterogeneity in the frames.
  • Regular training step in both directions.
  • Training step with scissors in both directions.
  • Autoregressive step in both directions.
  • Implementation of rotation matrices for GPU computation.
  • Periodic model updates with the best weights based on different components of the loss function to avoid getting stuck in a local minimum.
  • Each saved model checkpoint includes the training history, allowing for the use of statistics when retraining with modified parameters.

Advanced Training Trainer for Splitter Model

The constructed trainer allows for comfortable training and fine-tuning of the model on large volumes of data using a T4 GPU. The length of the clips can vary. For training, 500 examples were selected.

Comparison of Trained Weights of Splitter

Comparative Generations with the Same Seed

Comparative Generations with the Same Noise Based on Vectors Obtained from Models with Different Training Types and Their Combinations

It is quite noticeable that additional training steps, whether with rotation matrices or with the regression step, add extra information to the model, making the generation more interesting.

Interesting Results

Here are some interesting examples of generation from vectors predicted by step number from the starting vector. They show fairly coherent frames of the generated video sequence.

Steps Generation

The slide shows further examples of autoregressive generations from the differently trained models; as before, the results clearly depend on whether training included the rotation-matrix step and the regression step.

Autoregressive Generation

The research from this stage is also presented in my repository.

This was the pre-defense stage, and based on its results, I have already delved into the complexities of the Splitter model itself to understand how to improve its generative capabilities. To be continued.
