How to Bring Kandinsky to Life Using Rotation Matrices for Video Generation (Part 1)

Mike Puzitsky
8 min read · Jul 20, 2024


This publication on the text-to-video task is based on my recent master’s thesis at MIPT, and it is my first article.

Initially, the topic of my master's thesis was formulated as generating video from textual descriptions with Stable Diffusion. I started working on this project in early October 2023. At that time, there were still few publications on video generation and little ready-made code for the text-to-video task. Two projects stood out to me: Text2Video-Zero and AnimateDiff.

Text2Video-Zero

The Text2Video-Zero project already had a paper from March 2023, in which the authors proposed adding a temporal axis to the U-Net diffusion model and training the model to generate a batch of frames simultaneously, using sequential images from videos for training. This approach seemed quite logical.

AnimateDiff

The AnimateDiff project was announced on the Stable Diffusion website. It described the team's approaches to video generation: manipulating noise and adding LoRA adapters to the U-Net of pre-trained text-to-image models. These adapters were trained, with the main U-Net layers frozen, to account for changes across consecutive frames.

I was interested in finding my own approach, and overall, my immersion in diffusion models was still in its early stages at that time. Therefore, I decided to start with the simplest approach.

What is the problem with generating video from text?

Unlike generating a single image, we need to obtain a series of closely related images with small changes dictated by the text itself.

As can be seen from the slide, this is not an obvious solution.

Essentially, in the Russian phrase "мама моет раму" ("mom washes the window frame"), the verb "моет" ("washes") should drive the related changes, while the nouns "мама" ("mom") and "раму" ("frame") should not cause the images to change. But how can this be achieved?

Pixel and CLIP Spaces in Video

Pixel Changes Between Frames

First, I looked at the pixel space. The slide shows both the frames and the changes between them. By subtracting the changes, you can go from the last frame back to the first one.
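This round trip through the deltas can be sketched in a few lines of NumPy. The frames here are synthetic stand-ins, since the video from the slide is not available:

```python
import numpy as np

# Synthetic stand-in for a short clip: 5 frames of 4x4 grayscale pixels.
rng = np.random.default_rng(0)
frames = rng.random((5, 4, 4))

# Per-step pixel changes between consecutive frames.
deltas = frames[1:] - frames[:-1]

# Subtracting all the deltas from the last frame recovers the first one.
reconstructed_first = frames[-1] - deltas.sum(axis=0)
print(np.allclose(reconstructed_first, frames[0]))  # True
```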

Next, I became interested in what happens to the CLIP vectors of the frames and the flattened vectors of pixel matrices (flat vectors). Specifically, I wanted to see how similar these types of vectors are between different frames. To do this, I simply built correlation matrices of the frames with each other.
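A minimal version of this check, using random stand-ins for the flattened frame vectors (the real experiment used actual video frames and a CLIP image encoder):

```python
import numpy as np

def frame_correlation(vectors: np.ndarray) -> np.ndarray:
    """Pearson correlation matrix between row vectors (one row per frame)."""
    return np.corrcoef(vectors)

rng = np.random.default_rng(42)
base = rng.random(512)
# Simulate 8 frames that drift slowly away from the first one.
frames = np.stack([base + 0.05 * i * rng.random(512) for i in range(8)])

corr = frame_correlation(frames)
print(corr.shape)  # (8, 8)
print(np.allclose(np.diag(corr), 1.0))  # True: each frame correlates perfectly with itself
```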

Frame-to-Frame Correlation

The correlation matrix of the pixel space shows a wider range of correlation values, which fall off noticeably faster than in the CLIP space. This indicates that the pixel space is more sensitive to minor changes between frames.

The correlation matrix of the CLIP space shows more stable and smoother transitions between neighboring frames. This suggests that CLIP embeddings abstract high-level information from the images, while the pixel space is more sensitive to low-level details.

However, if we look at the changes between only adjacent frames, the relationship between the two spaces in transitions becomes apparent. Both spaces capture the main changes in the video, but with different degrees of sensitivity to details.

I hypothesized that a drift tensor could be formed based on the correlation matrix of the CLIP space of images. This tensor could then be applied to the text vectors or to the initial noise itself.

Formation of the Drift Tensor

The CLIP model was trained to bring the vectors of the text space and the image space closer together. Thus, it was reasonable to expect something meaningful from such manipulations.

Next, I formulated my hypothesis, which I based my research on: Controlled changes in text embeddings can lead to small changes in generated images, forming a video sequence.

Initial Experiments

At first, I decided to experiment with modifying the noise: from generation to generation, I applied a matrix close to the identity, while preserving the initial noise by fixing the seed.
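A sketch of this setup: the seed fixes the initial noise, and a matrix close to the identity nudges it slightly on every generation. The dimension and perturbation scale here are illustrative, not the parameters from the experiment:

```python
import numpy as np

d = 16                                # toy latent dimension
rng = np.random.default_rng(seed=0)   # fixed seed -> same initial noise every run
z = rng.standard_normal(d)

# A matrix close to the identity: I plus a small random perturbation.
eps = 0.01
M = np.eye(d) + eps * np.random.default_rng(1).standard_normal((d, d))

z_prev = z.copy()
z = M @ z                             # one generation-to-generation drift step

# The relative change is small, on the order of eps.
print(np.linalg.norm(z - z_prev) / np.linalg.norm(z_prev))
```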

Conditional Example of Noise Drift Tensor Formation

With suitable matrix parameters, small changes of exactly this kind began to appear from generation to generation. This clearly hinted that the idea was not useless.

Noise Modification in SD 1.4

How to Transition to Texts?

From an NLP course, I remembered that the Word2Vec embedding space also has geometric structure: relationships between words show up as vector offsets and angular similarity between embedding vectors.

latent space W2V

Ultimately, the concept of embedding rotation began to suggest itself. This immediately led me to Rodrigues' rotation formula for 3D space.

Rodrigues' rotation formula and its skew-symmetric matrix
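For reference, the formula reads (with k the unit rotation axis, α the rotation angle, and K the skew-symmetric cross-product matrix of k):

```latex
R = I + (\sin\alpha)\,K + (1 - \cos\alpha)\,K^2,
\qquad
K = \begin{pmatrix}
0 & -k_z & k_y \\
k_z & 0 & -k_x \\
-k_y & k_x & 0
\end{pmatrix}
```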

It is widely used in 3D graphics, robotics, and many other fields; the formula itself is also known as the rotation matrix. One might ask: what do 3D space and object rotations have to do with this? I am frequently asked this question. Much of what I write below is an attempt to answer it, but I was driven simply by the intuition that transformations in spaces of any dimensionality should follow common principles and have invariants and laws of transformation.

Rotation Matrices in Multidimensional Space

To transition to N-dimensional space, one needs to delve into group theory. Skew-symmetric rotation generators form a Lie algebra, and the exponential map takes them to the Lie group of rotations. A multidimensional rotation can be decomposed into a product of two-dimensional (planar) rotations; rotations within a fixed plane commute (forming an Abelian subgroup), and each planar rotation affects only the two dimensions of its plane, leaving the others unchanged.
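This can be checked numerically: exponentiating a skew-symmetric generator always yields a proper rotation. The sketch below uses a plain truncated Taylor series for the matrix exponential (scipy.linalg.expm would do the same) on a generator from so(5):

```python
import numpy as np

def expm_taylor(A: np.ndarray, terms: int = 30) -> np.ndarray:
    """Matrix exponential via truncated Taylor series (fine for small ||A||)."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = 0.5 * (B - B.T)              # skew-symmetric generator in the Lie algebra so(5)

R = expm_taylor(A)               # exponential map: so(5) -> SO(5)
print(np.allclose(R.T @ R, np.eye(5)))    # True: R is orthogonal
print(np.isclose(np.linalg.det(R), 1.0))  # True: det +1, a proper rotation
```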

Screenshots from an Introductory Course on Group Theory in Russian

When working with generators A in the form of skew-symmetric matrices, the exponential function is the basis for a smooth transition from algebra to geometry, in particular from infinitesimal transformations to finite increments. This matters in physics and engineering, where abrupt changes can lead to undesirable behavior, such as mechanical failures or unrealistic animation in graphics.

where n1 and n2 are n-dimensional orthogonal unit vectors

The exponential representation of the rotation matrix through Taylor series expansion and regrouping of terms leads to the formula for the N-dimensional rotation matrix. The slide presents the main formula for the rotation matrix between two vectors in multidimensional space, consisting of three terms.

I is the identity matrix, α is the angle of rotation (slide: "Rotation in high dimensions")

In order:

  • Identity matrix: this term ensures that the components of the vector orthogonal to the rotation plane are not affected by the rotation.
  • Skew-symmetric term: this term creates the rotation within the plane spanned by the vectors n1 and n2; it is responsible for the actual effect of rotation.
  • Symmetric term: this term adjusts the components lying in the rotation plane, scaling their contribution so that the result remains a pure rotation.
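The three terms translate directly into NumPy. The sketch below is my own implementation of the slide's formula (not code from the thesis): it builds the rotation taking one unit vector to another inside their shared plane, leaving the orthogonal complement untouched.

```python
import numpy as np

def rotation_between(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """N-D rotation matrix taking the direction of a to the direction of b.

    Assumes a and b are not parallel (otherwise n2 is undefined).
    """
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Orthonormal basis (n1, n2) of the rotation plane.
    n1 = a
    n2 = b - (b @ n1) * n1
    n2 = n2 / np.linalg.norm(n2)
    alpha = np.arccos(np.clip(a @ b, -1.0, 1.0))
    identity = np.eye(a.size)
    skew = np.outer(n2, n1) - np.outer(n1, n2)  # in-plane rotation generator
    sym = np.outer(n1, n1) + np.outer(n2, n2)   # projector onto the rotation plane
    return identity + np.sin(alpha) * skew + (np.cos(alpha) - 1.0) * sym

rng = np.random.default_rng(7)
a, b = rng.standard_normal(768), rng.standard_normal(768)  # CLIP-sized vectors
R = rotation_between(a, b)
print(np.allclose(R @ (a / np.linalg.norm(a)), b / np.linalg.norm(b)))  # True
```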

Experiments with Rotation Matrices

Discovering the formula for N-dimensional rotations laid the foundation for the first experiments with rotating text embeddings.

Applying Rotation Matrices to Embeddings

The slide schematically shows how changes in the text vectors used for generation in the diffusion model are transmitted through a rotation matrix obtained from the vectors corresponding to the frames. For testing, the (i+1)-th vector is obtained by adding to the i-th vector a small increment, given by the product of the rotation matrix and that same i-th vector, with the size of the contribution controlled by a coefficient g. This is similar to perturbation theory.
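As I read the slide, the update is e_{i+1} = e_i + g·(R·e_i − e_i); the exact form of the increment is my reconstruction, not code from the thesis. A toy 2-D sketch with a planar rotation standing in for R:

```python
import numpy as np

def drift_step(e: np.ndarray, R: np.ndarray, g: float) -> np.ndarray:
    """One perturbation step: blend the embedding with its rotated version.

    With g = 1 the step reduces to a full rotation R @ e;
    with g = 0 it leaves e unchanged.
    """
    return e + g * (R @ e - e)

# A 10-degree planar rotation stands in for the rotation matrix
# derived from neighboring video frames.
theta = np.deg2rad(10)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
e = np.array([1.0, 0.0])

e_next = drift_step(e, R, g=0.1)
# True: the perturbed step is much smaller than a full rotation.
print(np.linalg.norm(e_next - e) < np.linalg.norm(R @ e - e))
```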

The initial experiments were conducted on Stable Diffusion models 1.4 and 1.5. A video clip was taken, and then a rotation matrix was calculated from the CLIP vectors of neighboring frames, which was then applied to the text embedding before generating the image. The slide presents successful examples.

Experiments with SD 1.4 and 1.5

Next, I became interested in the Kandinsky 2.2 model, which is built on the unCLIP approach. It includes a diffusion model that effectively works as an image-to-image model. The textual information is concentrated in the Prior model, which learns to map projections of text vectors to vectors as close as possible to image vectors. A similar structure is found in DALL·E 2.

In Kandinsky 2.2, the embeddings after the Prior model have a higher similarity to image embeddings, which theoretically should work better with the multidimensional rotation approach.

Generation with Kandinsky 2.2 Using Rotation Matrices

The following slides present some examples of generations by the Kandinsky 2.2 Decoder from modified embeddings of the Prior model using rotation matrices derived from changes in an external video sequence.

Influence of External Video Sequence on Embeddings for Image Generation
Influence of External Video Sequence on Embeddings for Image Generation

Experiments with applying rotation matrices and the Kandinsky 2.2 model were conducted with text, noise, and in a combined manner.

On the right, noise modification from frame to frame based on the previous generation.

More Examples of Video Sequence Generation

Conclusion

The experiments with rotation matrices demonstrated that information about changes can be transferred between the latent spaces of different modalities.

The research showed that:

  • Control of generations through rotation matrices is possible.
  • Managing changes can become the basis for developing a machine learning methodology.

The results inspired me as they pointed out the directions for my future research. I will discuss these results in the next part.

You can listen to my presentation on the above material on my channel, which I have just started to fill with content. The research from this stage is presented in my repository.

I welcome questions and comments both on the topic and on the style of presentation, as I am still learning to write articles. To be continued.
