A moving Mona Lisa?!

Anisha Yadav · Published in Predict · Mar 2, 2022 · 4 min read

A breakdown of Artificial Intelligence and its role in generating a photorealistic moving portrait

The "living" portrait

A painting created in the Renaissance revolutionized the way artists approached realism: the understanding of the skull beneath the skin. Produced by Leonardo da Vinci, the Mona Lisa is a scarily realistic painting that, even five centuries later, draws thousands of spectators each day. And while the recreation of Lisa Gherardini's smile will remain unmatched, what if there were a way to bring even more life to the painting? Can we make the Mona Lisa move and talk?

It turns out it's not too difficult; the short answer: Artificial Intelligence. Researchers in Moscow, working under Samsung AI, have developed a way to use deep learning algorithms known as artificial neural networks (ANNs) to create hyper-realistic synthetic images.

Neural Networks: Back to the basics

A neural network can be understood as an artificial way to reflect the behaviour of the human brain: it replicates the way biological neurones send signals to one another to compute information. ANNs are organized into three kinds of layers: an input layer, one or more hidden layers, and an output layer. Each layer is made up of nodes, artificial neurons that connect to the next layer and relay signals.

input-output structure of an artificial neurone

Each node has an associated set of weights and a threshold. Once the input layer is identified, weights (wᵢ) are allocated to the input variables (xᵢ) according to the role each plays in decision-making. Next, each input goes through a simple mathematical operation in which the input signal is multiplied by its weight (wᵢxᵢ).

The weight plays a vital role in this process, as its polarity (negative or positive) and strength affect the input's importance. The overall influence of the inputs is then determined by summing the weighted values (wᵢxᵢ) and adding the bias (b), a constant term that shifts the result.

The summation looks like this: ∑ wᵢxᵢ + b

The summation value is then passed through an activation function (f), so the node's output looks like this: f(∑ wᵢxᵢ + b)

We can understand the activation function as the rule that decides whether a node fires. Each node has a minimum threshold: if the value of f(∑ wᵢxᵢ + b) surpasses it, the node is activated, and the current layer's output becomes the input of the subsequent layer.

The whole process then repeats until the final layer.
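
To make the arithmetic concrete, here is a minimal Python sketch of a single artificial neurone. The input values, weights, and bias below are arbitrary illustrative numbers, and the sigmoid is just one common choice of activation function:

```python
import numpy as np

def sigmoid(z):
    """A common activation function that squashes z into the range (0, 1)."""
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    """One artificial neurone: weighted sum of inputs plus bias,
    passed through an activation function."""
    z = np.dot(w, x) + b   # the summation: sum of w_i * x_i, plus bias b
    return sigmoid(z)      # the activation: f(sum of w_i * x_i + b)

# Arbitrary illustrative values:
x = np.array([0.5, -1.0, 2.0])   # inputs x_i
w = np.array([0.8, 0.2, -0.4])   # weights w_i (sign and size set each input's influence)
b = 0.1                          # bias, a constant offset

print(neuron(x, w, b))  # this output becomes an input to the next layer
```

Stack many of these neurones side by side and you have a layer; chain layers together and you have the full network described above.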

Convolutional Neural Networks: Let's dive deeper

Making the Mona Lisa come to life uses a more specialized type of ANN known as a convolutional neural network (ConvNet), which differs from a basic ANN in that it is designed for image, video, and speech recognition.

A breakdown of ConvNets and the layers within

ConvNets have three layers:

  • Convolutional layer, which works on the input data using a filter (kernel) to produce a feature map
  • Pooling layer
  • Fully-connected (FC) layer

Let us assume that the input is a picture of a zebra. The picture is a colour image made up of a 3D matrix of pixels, meaning the image has height, width, and depth (the colour channels). The first two layers of a ConvNet perform feature extraction. This is done using a feature detector known as a kernel: a small 2D array of values that slides across the image. At each position, the pixels under the kernel are multiplied by the kernel's values and summed, converting that area of the input into one entry of an output array called a feature map.
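
To sketch what a single convolution step does, here is a small Python example (one channel, stride 1, no padding). The 3 × 3 kernel below is an illustrative vertical-edge detector, not anything taken from the Mona Lisa research:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a 2D kernel across a 2D image (single channel, stride 1,
    no padding), summing the elementwise products at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    feature_map = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(feature_map.shape[0]):
        for j in range(feature_map.shape[1]):
            patch = image[i:i + kh, j:j + kw]           # pixels under the kernel
            feature_map[i, j] = np.sum(patch * kernel)  # multiply and sum
    return feature_map

# A toy 6x6 single-channel "image" and a 3x3 vertical-edge kernel:
image = np.random.rand(6, 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

print(convolve2d(image, kernel).shape)  # (4, 4): one response per kernel position
```

Each entry of the feature map is large wherever the image patch resembles the kernel's pattern, which is how a kernel "finds" a feature.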

For example, as seen in the diagram, the kernel tuned to the pixels of the zebra's leg (a feature of the zebra that needs to be extracted) produces a strong response in the output. As each layer's output is given to the next layer, the extracted features become progressively more complex: the algorithm begins to identify finer details and more prominent features. The third layer of a ConvNet maps the extracted features into the final output: the image is flattened into a column vector and fed into the basic artificial neural network I discussed earlier.
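
For intuition on how these three layer types stack together, here is a minimal model sketch using the Keras API. The layer counts, filter sizes, and the 64 × 64 input shape are arbitrary illustrative choices, not the architecture behind the Mona Lisa animations:

```python
import tensorflow as tf
from tensorflow.keras import layers

# A minimal ConvNet stacking the three layer types described above.
# All sizes here are illustrative, not taken from the Samsung AI work.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),             # colour image: height, width, depth
    layers.Conv2D(16, (3, 3), activation="relu"),  # convolutional layer: learns feature maps
    layers.MaxPooling2D((2, 2)),                   # pooling layer: downsamples the feature maps
    layers.Conv2D(32, (3, 3), activation="relu"),  # deeper layer extracts more complex features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # flatten feature maps into a column vector
    layers.Dense(10, activation="softmax"),        # fully-connected (FC) layer: final output
])
model.summary()
```

Notice how each convolution-plus-pooling pair shrinks the spatial dimensions while building richer feature maps, before Flatten hands everything to the fully-connected layer.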

The ConvNets in the Mona Lisa's case were used to "learn" and "map" human facial expressions and movements from three sets of data, resulting in three very different yet realistic animations of the Mona Lisa talking.

What's next?

Right now, seeing the Mona Lisa come to life is just one of the many things that neural networks are capable of. In the future, we can expect neural networks to further the art world by composing music, shine in the medical industry by helping humans self-diagnose, and even create hauntingly realistic deepfake news-reporters!
