Neural Style Transfer: Merging Art and AI to Create Masterpieces

9 min readSep 29, 2024

Introduction: The Intersection of Art and AI

Art has always been about expressing emotion, telling stories, and capturing the world in a unique way. Every painting style — from the delicate strokes of line art to the bold forms of cubism — represents a distinct artistic vision. But what if we could merge these styles with the content of a completely different image? Enter neural style transfer, a fascinating application of deep learning that allows us to synthesize new images by blending the content of one image with the style of another. This technology not only deepens our understanding of how AI perceives images but also opens a new frontier in the world of digital art.

In this article, we’ll dive deep into neural style transfer, how it works, why it uses mathematical tools like the Gram matrix, and provide examples of how it can transform images into famous artistic styles like Line Art, Impressionism, Pointillism, Art Nouveau, and Cubism. We’ll also walk through the technical nuances of the algorithm, breaking down the underlying code to show you how this stunning process unfolds.

What is Neural Style Transfer?

Neural style transfer is a method that enables a deep neural network to create a new image by transferring the artistic style of one image (usually a painting) onto the content of another (typically a photograph). This technique is rooted in Convolutional Neural Networks (CNNs), which are models widely used in computer vision tasks such as object detection and classification.

At the heart of CNNs is the ability to break down an image into different layers of abstraction. Lower layers in the network focus on basic features such as edges and textures, while higher layers capture more abstract concepts like objects and their arrangement in space.

Content Representation:

The content of an image is captured by the higher layers of the CNN. These layers are responsible for recognizing objects, shapes, and spatial configurations in the image, without focusing on fine textures or colors.

Style Representation:

The style of an image, on the other hand, can be captured by the correlations between the different feature maps within the network’s layers. This is where the Gram matrix comes in — by computing the correlations between features in different layers, we can capture the texture, color, and overall aesthetic of the image, i.e., its style.

Why the Gram Matrix?

The magic behind style transfer lies in the Gram matrix, which is used to represent the style of an image. The Gram matrix captures the spatial correlations between different feature maps, essentially summarizing the texture and patterns present in the image.

Here’s how it works:

Let’s say we have an image that passes through a particular layer of a CNN. This layer will produce multiple feature maps (i.e., filtered versions of the image). The Gram matrix computes the pairwise inner products between these feature maps.

Mathematically, the Gram matrix for a layer l is computed as:

These correlations capture the texture and color palette of the style image, abstracting away the content and focusing purely on aesthetic features. By minimizing the difference between the Gram matrices of the style image and the generated image, we ensure that the style of the generated image closely matches that of the style image.

How Neural Style Transfer Works: The Core Algorithm

To create a new image that combines the content of one image and the style of another, we define a total loss function that is a combination of two distinct losses:

Content Loss:

The content loss ensures that the generated image retains the structure and arrangement of objects from the content image. It is computed as the difference between the feature maps of the content image and the generated image at higher layers of the network.

Style Loss:

The style loss ensures that the generated image adopts the textures, colors, and patterns from the style image. This is computed as the difference between the Gram matrices of the style image and the generated image across several layers.

Total Loss Function:

The total loss is a weighted combination of the content loss and the style loss:

Where:

α and β control the balance between content and style. By adjusting these parameters, we can emphasize one over the other to produce different effects in the final image.

Once the total loss is defined, we use gradient descent to minimize the loss, iteratively updating the generated image so that it simultaneously resembles the content of one image and the style of another.

The Power of VGG Networks in Style Transfer

The CNN model most commonly used for neural style transfer is the VGG-19 network, a pre-trained network known for its strong performance on image classification tasks. VGG-19 is ideal for style transfer because it has been trained to recognize a wide variety of objects, and its hierarchical feature maps capture the content and style of images in a highly structured manner.

In particular, the lower layers of VGG-19 capture the fine textures and patterns (useful for style representation), while the higher layers capture the arrangement of objects (useful for content representation). This makes VGG-19 a perfect choice for combining content and style in a meaningful way.

Step-by-Step Implementation

Let’s now walk through the code that implements neural style transfer. The basic steps are as follows:

Step 1: Import Libraries and Define Necessary Functions

First, we need to import the necessary libraries and define any helper functions we’ll use throughout the code.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image
import copy

Step 2: Load and Preprocess Images

We need to load the content image and style image, and preprocess them to be compatible with the VGG19 network.

# Desired size of the output image
imsize = 512 if torch.cuda.is_available() else 128  # Use small size if no GPU

loader = transforms.Compose([
    transforms.Resize(imsize),
    transforms.ToTensor()
])

def image_loader(image_name):
    image = Image.open(image_name)
    image = loader(image).unsqueeze(0)  # Add batch dimension
    return image.to(device, torch.float)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load images
content_img = image_loader("path_to_content_image.jpg")
style_img = image_loader("path_to_style_image.jpg")

Step 3: Define Content and Style Loss Classes

We create custom classes to compute the content loss and style loss at each layer.

class ContentLoss(nn.Module):
    def __init__(self, target,):
        super(ContentLoss, self).__init__()
        # Detach the target content from the graph
        self.target = target.detach()
        self.loss = 0

    def forward(self, input):
        # Compute content loss
        self.loss = nn.functional.mse_loss(input, self.target)
        return input

Similarly, for the style loss:

def gram_matrix(input):
    batch_size, feature_maps, h, w = input.size()
    features = input.view(batch_size * feature_maps, h * w)
    G = torch.mm(features, features.t())  # Compute Gram matrix
    # Normalize the Gram matrix
    return G.div(batch_size * feature_maps * h * w)

class StyleLoss(nn.Module):
    def __init__(self, target_feature):
        super(StyleLoss, self).__init__()
        # Compute and detach the target Gram matrix
        self.target = gram_matrix(target_feature).detach()
        self.loss = 0

    def forward(self, input):
        # Compute style loss
        G = gram_matrix(input)
        self.loss = nn.functional.mse_loss(G, self.target)
        return input

Step 4: Load the Pre-trained VGG19 Network

We use the pre-trained VGG19 model and extract features from it.

cnn = models.vgg19(pretrained=True).features.to(device).eval()

Step 5: Normalize the Input Image

The VGG networks expect images normalized in a specific way.

# VGG networks are trained on images with each channel normalized
# mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]
normalization_mean = torch.tensor([0.485, 0.456, 0.406]).to(device)
normalization_std = torch.tensor([0.229, 0.224, 0.225]).to(device)

class Normalization(nn.Module):
    def __init__(self, mean, std):
        super(Normalization, self).__init__()
        # Reshape mean and std for broadcasting
        self.mean = mean.view(-1, 1, 1)
        self.std = std.view(-1, 1, 1)

    def forward(self, img):
        # Normalize the image
        return (img - self.mean) / self.std

Step 6: Build the Style Transfer Model

We construct a new model that includes the normalization module, content loss modules, and style loss modules at appropriate layers.

# Specify layers to use for content and style losses
content_layers = ['conv_4']
style_layers = ['conv_1', 'conv_2', 'conv_3', 'conv_4', 'conv_5']

def get_style_model_and_losses(cnn, normalization_mean, normalization_std,
                               style_img, content_img,
                               content_layers=content_layers,
                               style_layers=style_layers):
    # Copy the CNN to avoid modifying the original
    cnn = copy.deepcopy(cnn)

    # Create the normalization module
    normalization = Normalization(normalization_mean, normalization_std).to(device)

    # Lists to hold content and style loss modules
    content_losses = []
    style_losses = []

    # Sequential model to add modules to
    model = nn.Sequential(normalization)

    i = 0  # Incremental index to track layers
    for layer in cnn.children():
        if isinstance(layer, nn.Conv2d):
            i += 1
            name = f'conv_{i}'
        elif isinstance(layer, nn.ReLU):
            name = f'relu_{i}'
            # Replace in-place ReLU with out-of-place ReLU
            layer = nn.ReLU(inplace=False)
        elif isinstance(layer, nn.MaxPool2d):
            name = f'pool_{i}'
        elif isinstance(layer, nn.BatchNorm2d):
            name = f'bn_{i}'
        else:
            raise RuntimeError(f'Unrecognized layer: {layer.__class__.__name__}')

        # Add the layer to the model
        model.add_module(name, layer)

        # Add style loss module if at the correct layer
        if name in style_layers:
            target_feature = model(style_img).detach()
            style_loss = StyleLoss(target_feature)
            model.add_module(f'style_loss_{i}', style_loss)
            style_losses.append(style_loss)

        # Add content loss module if at the correct layer
        if name in content_layers:
            target = model(content_img).detach()
            content_loss = ContentLoss(target)
            model.add_module(f'content_loss_{i}', content_loss)
            content_losses.append(content_loss)

    # Trim the model after the last content and style loss layers
    for i in range(len(model) -1, -1, -1):
        if isinstance(model[i], (ContentLoss, StyleLoss)):
            break
    model = model[:(i+1)]

    return model, style_losses, content_losses

Step 7: Input Image Initialization

We need an initial image to optimize. We can start with the content image or a random noise image.

input_img = content_img.clone()
# Or start from random noise:
# input_img = torch.randn(content_img.data.size(), device=device)

Step 8: Define the Optimizer

We use an optimizer to adjust the pixels of the input image to minimize the total loss.

# We want to optimize the input image pixels
input_img.requires_grad_(True)
model, style_losses, content_losses = get_style_model_and_losses(
    cnn, normalization_mean, normalization_std, style_img, content_img
)

optimizer = optim.Adam([input_img])

Step 9: Run the Style Transfer

We perform the optimization loop, updating the input image to minimize the total loss.

num_steps = 300
style_weight = 1e6  # β
content_weight = 1  # α

run = [0]
while run[0] <= num_steps:

    def closure():
        # Correct the values of the updated input image
        input_img.data.clamp_(0, 1)

        optimizer.zero_grad()
        model(input_img)
        style_score = 0
        content_score = 0

        # Compute style and content losses
        for sl in style_losses:
            style_score += sl.loss
        for cl in content_losses:
            content_score += cl.loss

        # Total loss
        loss = style_weight * style_score + content_weight * content_score
        loss.backward()

        run[0] += 1
        if run[0] % 50 == 0:
            print(f"Step {run[0]}:")
            print(f"Style Loss : {style_score.item()} Content Loss: {content_score.item()}")

        return style_score + content_score

    optimizer.step(closure)

# Clamp the values to be between 0 and 1
input_img.data.clamp_(0, 1)

Step 10: Display and Save the Result

Finally, we can display and save the generated image.

images = [
    to_pil_image(style_img.detach().cpu().squeeze()),
    to_pil_image(content_img.detach().cpu().squeeze()),
    to_pil_image(output.detach().cpu().squeeze()),
]
titles = ["Style", "Content", "Output"]
from mpl_toolkits.axes_grid1 import ImageGrid
# plt.figure(figsize=(10,6))
fig = plt.figure(figsize=(10, 6))
grid = ImageGrid(
    fig,
    111,  # similar to subplot(111)
    nrows_ncols=(1, 3),  # creates 2x2 grid of axes
    axes_pad=0.1,  # pad between axes in inch.
)

for ax, im, title in zip(grid, images, titles):
    # Iterating over the grid returns the Axes.
    ax.imshow(im)
    ax.axis("off")
    ax.title.set_text(title)

plt.show()

Style Transfer in Action: Artistic Examples

Let’s now see neural style transfer in action! Below are examples of transforming a content image into different painting styles.

Nuances and Additional Details

Adjusting Style and Content Weights (α and β)

By modifying the values of style_weight and content_weight, you can control the emphasis on style or content in the final image.

Higher style_weight (β): The generated image will more closely resemble the style image, possibly at the expense of content structure.
Higher content_weight (α): The generated image will retain more of the content image’s structure but may not fully capture the style.

Layer Selection Impact

The choice of layers in content_layers and style_layers significantly affects the outcome.

Content Layers: Typically higher layers (e.g., ‘conv_4’) capture more abstract representations of the image content.
Style Layers: Using multiple layers from lower to higher layers captures style at different scales.

References and Acknowledgments

This work is inspired by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge’s groundbreaking paper on neural style transfer, titled “A Neural Algorithm of Artistic Style” [arXiv:1508.06576v2].