Automated Storyboard Generator for Advertisement Campaigns

11 min read · Feb 19, 2024

In the dynamic world of digital advertising, the rapid evolution of machine learning, natural language processing (NLP), and computer vision technologies has fundamentally transformed the way we create engaging and impactful campaigns. These cutting-edge tools now enable advertisers to decode intricate data, seamlessly blend textual concepts with visual narratives, and craft ads that truly resonate with their audiences.

This transformative shift in advertising reflects a deep understanding of the potential of technology to streamline and elevate the ad creation process. A key focus of this endeavor is the development of a machine learning solution that can automatically translate textual ad concepts and asset descriptions into visually compelling storyboards. This innovative solution will analyze provided concepts and assets, generate relevant visual and textual elements, and seamlessly weave them together into coherent ad frames. The ultimate goal? To create a comprehensive storyboard that captures the essence of the proposed ad campaign.

To achieve this, we’ll be leveraging a powerful combination of natural language processing and computer vision techniques. Our solution will generate images based on textual descriptions, analyze the generated text to extract entities, keywords, and sentiments, and expertly compose these elements into individual ad frames. Each frame will effectively encapsulate the core message of the advertisement campaign, ensuring a cohesive and impactful narrative.

This solution will not only be highly effective but also scalable, capable of handling a large volume of requests. It will be deployed as a service that can seamlessly integrate into existing infrastructure and will be continuously updated to incorporate new advertisement concepts and assets. With this approach, we aim to revolutionize the way advertisers bring their campaigns to life, leveraging the power of technology to create truly compelling and memorable ads.

Shortened Technology Overview

Data Processing Libraries: Libraries like Pandas, NumPy, and Scikit-learn enable efficient handling of large datasets, improving model performance through preprocessing and feature engineering.

Model Evaluation Libraries: Tools like TensorFlow Model Analysis (TFMA) and Scikit-learn provide crucial metrics and visualization for assessing model accuracy and generalizability.

Visualization Libraries: Matplotlib and Seaborn offer customization options for creating impactful visualizations from generated images and storyboards.

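NLP and Text-Extraction Libraries: Tesseract (an OCR engine) and OpenCV help extract text from existing ad images, while TensorFlow-based models support NLP tasks like named entity recognition, which is vital for creating effective visual and textual ad components.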

Computer Vision Libraries: OpenCV and YOLO empower image processing and analysis, enabling tasks like image segmentation and object detection for tailored visual assets.

Image and Text Generation Models: Tools like Automatic1111, Fooocus, Stable Diffusion, DALL-E, and AttnGAN use deep learning models to generate high-quality images from textual prompts, aligning with brand identity and advertising objectives.

Machine Learning Frameworks: Industry-standard frameworks like TensorFlow, PyTorch, and Keras simplify the development and training of machine learning models for image generation and text-to-image synthesis.

Project Blueprint

Identify the Importance of Realism in Advertising: Research and understand the significance of realism in advertising, including its impact on consumer perception and brand credibility.

Utilize Image Generation Models and APIs: Explore and implement image generation models and APIs to create high-quality images that align with advertising concepts.

Explore Text Generation Strategies: Investigate and experiment with various text generation strategies, including algorithms and NLP techniques, to create engaging textual content for advertisements.

Compose Aesthetic Ad Frames: Develop a method to arrange visual elements, such as images and text, in a visually appealing and coherent manner, considering factors like color, typography, and layout.

Build an Engaging Storyboard: Create a storyboard that outlines the sequence of scenes and key elements of the advertisement, ensuring it captures the attention of the target audience and communicates the brand’s message effectively.

Research Real-World Applications and Success Stories: Study real-world applications and success stories of advertising campaigns that successfully utilized image and text generation strategies, drawing insights and inspiration for future projects.

Project Implementation

Exploring Data

As we delve into the data exploration journey, our first steps revolve around ensuring data quality. This involves addressing missing values and identifying outliers, ensuring the integrity of our dataset. Summary statistics, such as mean, median, variance, and standard deviation, help us grasp the distribution of our data.
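To make these checks concrete, here is a minimal sketch using only Python's standard library. The CTR values are made up for illustration, and the 1.5 × IQR outlier rule is one common convention rather than the exact method used in this project.

```python
import statistics

def summarize(values):
    """Return basic summary statistics, skipping missing (None) entries."""
    clean = [v for v in values if v is not None]
    return {
        "count": len(clean),
        "missing": len(values) - len(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "variance": statistics.variance(clean),  # sample variance
        "stdev": statistics.stdev(clean),        # sample standard deviation
    }

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    clean = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < low or v > high]

# Made-up CTR values (in percent) with one missing entry and one extreme value
ctr = [0.10, 0.12, None, 0.09, 0.11, 0.95, 0.10]
stats = summarize(ctr)
outliers = iqr_outliers(ctr)
print(stats)
print(outliers)
```

On a real dataset the same checks are usually done with Pandas, but the logic is the same: count what is missing, summarize what remains, and flag values that sit far from the bulk of the distribution.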

The next phase, univariate analysis, entails examining each variable’s distribution independently. This is followed by bivariate analysis, where we explore relationships between pairs of variables. Techniques like scatter plots, correlation matrices, and cross-tabulations allow us to detect patterns and correlations among the variables.

Moving on to multivariate analysis, we employ methods like cluster analysis and principal component analysis to uncover relationships among multiple variables simultaneously. This helps us gain a holistic understanding of our data’s structure.

Our dataset comprises four columns per ad folder:

ad_id: Unique identifiers for each advertisement, crucial for tracking and managing individual ads within the dataset.

preview_link: Links or paths to previews of the advertisements, serving as visual representations of the ads for quality assurance or stakeholder showcase purposes.

ER (Engagement Rate): The percentage of people who see our ad and then engage with it. A higher ER indicates an ad that holds user attention and invites interaction.

CTR (Click-Through Rate): The percentage of people who see our ad and then click on it. A higher CTR indicates a more engaging and relevant ad, effectively capturing user attention.
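As a small illustration of the CTR definition, the sketch below computes CTR from raw counts and filters ads by a threshold. The `impressions` and `clicks` fields are hypothetical; our dataset stores the final rates rather than the raw counts.

```python
def ctr_percent(clicks, impressions):
    """Return clicks as a percentage of impressions (0.0 when there are none)."""
    if impressions <= 0:
        return 0.0
    return 100.0 * clicks / impressions

# Hypothetical raw counts; ad_id values are placeholders
ads = [
    {"ad_id": "ad_001", "impressions": 20_000, "clicks": 30},
    {"ad_id": "ad_002", "impressions": 10_000, "clicks": 5},
    {"ad_id": "ad_003", "impressions": 0, "clicks": 0},
]

ctr_by_ad = {ad["ad_id"]: ctr_percent(ad["clicks"], ad["impressions"]) for ad in ads}

# Keep only the ads whose CTR is at least 0.1%
above_threshold = [ad_id for ad_id, ctr in ctr_by_ad.items() if ctr >= 0.1]
print(ctr_by_ad)
print(above_threshold)
```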

Side-by-side KDE plots for the columns ‘ER’ and ‘CTR’

Upon closer examination, we’ve discovered some noteworthy insights:

539 ads in our dataset achieved an Engagement Rate (ER) of precisely 0.1%.

44 ads have generated at least 0.1% Click-Through Rate (CTR).

These metrics, ER and CTR, are pivotal for assessing the effectiveness of our mobile ad campaigns, offering valuable insights into user engagement and revenue generation. As we continue to analyze and visualize our data, we strive to unlock deeper insights that will guide our future campaign strategies and drive success.

Importance of Realism in Advertising

Automatic1111 is an open-source web interface for running Stable Diffusion models; it does not use DALL-E. DALL-E, developed by OpenAI, is a separate model designed specifically for generating images from textual prompts, and its original version is transformer-based. The model is trained on a vast dataset of images and their corresponding textual descriptions, enabling it to understand and translate textual prompts into visually coherent and contextually relevant images.

The key innovation of DALL-E lies in its ability to generate highly detailed and diverse images that go beyond the capabilities of traditional image generation models. Unlike previous models that rely on simple text-to-image mapping, DALL-E leverages the power of transformers to capture complex relationships between words and their visual representations. This allows the model to generate images that are not only accurate representations of the input text but also exhibit creativity and imagination.

Fooocus is an innovative image-generating software that we’ve selected based on its unique combination of features and capabilities. Drawing inspiration from the successful projects Stable Diffusion and Midjourney, Fooocus offers users an offline, open-source, and free platform. By learning from Stable Diffusion, it ensures accessibility and transparency, while Midjourney’s influence means manual tweaking is unnecessary, allowing users to focus solely on prompts and images. Fooocus represents a powerful and user-centric tool that aligns with our project’s goals and values.

Utilizing Image Generation Models and APIs

Fooocus lets users create images by simply providing a textual prompt. As described above, it pairs the offline, open-source, and free distribution of Stable Diffusion with Midjourney's approach of generating images without manual tweaking, so users can focus solely on their prompts and images.

The API for Fooocus follows common HTTP semantics and is compatible with various programming languages. To get started, users can use the provided code snippets or refer to the documentation for more information on how to authenticate, manage long-running requests, and more.

The input to the API consists of a textual prompt, a negative prompt (if needed), the desired image style, and other optional parameters such as performance level, guidance scale, sharpness, aspect ratio, number of images to generate, and more. Users can also provide control images or masks for the generated images.
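A request body built from these inputs might look like the sketch below. Every field name and default value here is a hypothetical placeholder chosen for illustration, not the documented API schema, so check the API documentation for the real names before sending anything.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class GenerationRequest:
    """Hypothetical request body; field names and defaults are illustrative only."""
    prompt: str
    negative_prompt: str = ""
    style: Optional[str] = None
    performance: str = "speed"
    guidance_scale: float = 7.0
    sharpness: float = 2.0
    aspect_ratio: str = "16:9"
    image_number: int = 1

    def to_json(self) -> str:
        if self.image_number < 1:
            raise ValueError("image_number must be at least 1")
        # Drop unset optional fields so the payload only carries chosen values
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps({"input": body}, indent=2)

request = GenerationRequest(
    prompt="A 'Play Now' button styled to resemble a LEGO brick",
    negative_prompt="blurry, low quality",
    image_number=2,
)
payload = request.to_json()
print(payload)
```

Keeping the request in a small dataclass makes it easy to validate values once and reuse the same structure for every frame in a storyboard.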

Prompt Engineering for AI Image Generation

In AI image generation, prompts play a crucial role in guiding the model toward the desired result. Achieving the intended output often takes several attempts, with the prompt refined after each one. Prompt engineering is the practice of optimizing these prompts so that the desired outputs are reached efficiently.

Word Count:
The more words used in a prompt, the less each word “weighs” or contributes to the final output.

A shorter prompt allows for clearer concepts, while a longer prompt allows for more specificity.

Original prompt:

"A 'Play Now' button, styled to resemble a LEGO brick, invites users to join the challenge on the LEGO website. The button is strategically placed to be easily noticeable and accessible, encouraging viewers to take immediate action."

Engineered prompt:

"An eye-catching 2D 'Play Now' button, meticulously designed with the familiar shape and color of a LEGO brick, serves as an inviting gateway for users to immerse themselves in the latest challenge hosted on the LEGO website. Positioned in a prominent location, the button is carefully placed to capture the viewer's attention and encourage immediate interaction. Its distinct appearance, reminiscent of the beloved building blocks, evokes a sense of nostalgia and playfulness, further compelling users to click and embark on the journey of creativity and fun offered by LEGO's online platform."

The output of the API includes the generated image file info, timings for the generation process, and a Boolean indicating whether the generated images contain NSFW concepts.

Composing Aesthetic Ad Frames

What is YOLOv7?

YOLO, short for “You Only Look Once,” is a family of real-time object detection algorithms first introduced by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in 2016. YOLOv7, released in 2022 by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, is a later version in that family. It stands out for its speed, accuracy, and generalization capabilities, making it a preferred choice for real-time computer vision applications.

Key Features of YOLOv7:

Mean Average Precision (mAP): At its release, YOLOv7 was reported by its authors to achieve higher mAP than other real-time object detectors running at comparable speeds.

High Detection Accuracy: It demonstrates superior detection accuracy with minimal background errors.

Generalization: YOLOv7 shows better generalization for new domains, making it suitable for diverse applications.

Open-Source: Its open-source nature encourages community contributions, refining the model over time.
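Since mAP comes up whenever detectors are compared, it helps to see its basic building block. A predicted box counts as correct when its Intersection over Union (IoU) with a ground-truth box reaches a threshold, commonly 0.5. The sketch below is a plain-Python illustration with made-up boxes, not code taken from the YOLOv7 repository.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Made-up boxes in pixel coordinates
predicted = (10, 10, 60, 60)
ground_truth = (20, 20, 70, 70)

score = iou(predicted, ground_truth)
is_true_positive = score >= 0.5  # 0.5 is a commonly used threshold
print(round(score, 3), is_true_positive)
```

Average precision is then computed per class from the precision and recall of these matches, and mAP is the mean across classes.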

YOLOv7 Architecture:

Convolutional Layers: Extract features and capture spatial information.

Batch Normalization: Stabilizes training by normalizing the output of each layer.

Dropout: Prevents overfitting by randomly ignoring neurons during training.
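To show what normalizing a layer's output means in practice, here is a minimal sketch of training-time batch normalization for a single feature across a batch. The scale `gamma` and shift `beta` are learned in a real network; here they are fixed, and the activations are made up. This is an illustration of the idea, not YOLOv7's implementation.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance,
    then apply a scale (gamma) and a shift (beta)."""
    mean = sum(values) / len(values)
    # Batch (population) variance, as used during training
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

# Made-up activations for one feature across a batch of four samples
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
print(normalized)
```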

Training YOLOv7 on a custom dataset

We categorized the assets in our top 100 ads by CTR and ER, since these categories are used later to locate the correct position of assets during frame generation.

Create Train and Validation sets

We split the images into training and validation sets. The script copies each image, together with its label file, into the matching directory; it does not create a separate test set. These directories are then used when training on the custom dataset.

import os
import shutil
import numpy as np

# Define the directories
data_dir = 'data'
train_dir = 'train'
val_dir = 'val'

# Create the output directories if they don't exist
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

# Get a list of all the image files in the data directory
files = [file for file in os.listdir(data_dir) if file.endswith(('.jpg', '.png'))]

# Shuffle the files in place so the split is random
np.random.shuffle(files)

# Split the files into train and val sets
split_point = int(len(files) * 0.8)  # 80% of data for training
train_files = files[:split_point]
val_files = files[split_point:]

def copy_with_label(file, target_dir):
    """Copy an image and, if it exists, its matching .txt label file."""
    shutil.copy(os.path.join(data_dir, file), target_dir)
    label_path = os.path.join(data_dir, os.path.splitext(file)[0] + '.txt')
    if os.path.exists(label_path):
        shutil.copy(label_path, target_dir)
    else:
        print(f"Warning: no label file found for {file}")

# Copy (not move) the files and their label files into the train and val directories
for file in train_files:
    copy_with_label(file, train_dir)

for file in val_files:
    copy_with_label(file, val_dir)

We utilized Python to partition our data, allocating 80% for training purposes and reserving the remaining 20% for validation.
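The `.txt` files copied alongside each image hold the annotations. Assuming they follow the standard YOLO text format, each line contains a class index followed by the box center, width, and height, all normalized by the image size. The helper below is a sketch of that conversion; the class index and box values are made up.

```python
def to_yolo_label(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into one YOLO label line:
    'class x_center y_center width height', with coordinates normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Made-up example: class 3, a 200x200 pixel box inside a 600x800 image
line = to_yolo_label(3, (100, 200, 300, 400), image_width=600, image_height=800)
print(line)
```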

Training the model

After setting the training parameters, we trained the model. We then used the trained model to classify the assets identified within a frame, which lets us predict the best placement for each asset when generating a frame.

Once the image has been analyzed and categorized as one of the 23 specified categories, we proceed with background removal to ensure the accurate merging of the identified objects.

from rembg import remove
from PIL import Image
import os

class BackgroundRemover:
    """Removes the background from a single image and saves the result as a PNG."""

    def __init__(self, input_path, output_directory):
        self.input_path = input_path
        self.output_directory = output_directory

    def remove_background(self):
        inp = Image.open(self.input_path)
        output = remove(inp)

        # Get the input filename without the extension
        input_filename = os.path.basename(self.input_path)
        input_filename_without_extension = os.path.splitext(input_filename)[0]

        # Create the output filename using the input filename
        output_filename = f"{input_filename_without_extension}_removed.png"

        # Make sure the output directory exists, then save the image there
        os.makedirs(self.output_directory, exist_ok=True)
        output_path = os.path.join(self.output_directory, output_filename)
        output.save(output_path)
        return output_path

Using the Vertical and Horizontal Positioning dictionaries, we determine the placement of our images when composing our storyboard. The dictionaries list the position preferences for each image category, ensuring the correct arrangement during the generation process.
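A minimal sketch of how such dictionaries can drive placement is shown below. The category names, the preferences assigned to them, and the `anchor_point` helper are all hypothetical stand-ins, since the actual dictionaries cover the 23 categories used in the project.

```python
# Hypothetical position preferences per asset category
VERTICAL_POSITIONING = {
    "logo": "top",
    "headline": "top",
    "product": "center",
    "cta_button": "bottom",
}
HORIZONTAL_POSITIONING = {
    "logo": "left",
    "headline": "center",
    "product": "center",
    "cta_button": "center",
}

def anchor_point(category, frame_size, asset_size):
    """Return the top-left (x, y) at which to paste an asset, based on the
    category's preferences. Unknown categories fall back to the center."""
    frame_w, frame_h = frame_size
    asset_w, asset_h = asset_size
    h_pref = HORIZONTAL_POSITIONING.get(category, "center")
    v_pref = VERTICAL_POSITIONING.get(category, "center")

    x_options = {"left": 0, "center": (frame_w - asset_w) // 2, "right": frame_w - asset_w}
    y_options = {"top": 0, "center": (frame_h - asset_h) // 2, "bottom": frame_h - asset_h}
    return x_options[h_pref], y_options[v_pref]

# A 200x80 call-to-action button on a 600x900 frame
cta_position = anchor_point("cta_button", (600, 900), (200, 80))
print(cta_position)
```

Keeping the preferences in plain dictionaries means a new category only needs two new entries, with no change to the placement logic.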

Storyboard generation

The storyboard function takes a list of PIL images and optional parameters for separation space, vertical padding, and background color. It returns the combined image as a new PIL image object. In our example, two images are combined horizontally, and the resulting image is saved as combined_image.png.
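The layout arithmetic behind a horizontal combination can be sketched without any imaging library. The function below is an illustrative assumption about how the canvas size and paste offsets could be computed from the image sizes; the name `horizontal_layout` and its parameters are hypothetical.

```python
def horizontal_layout(sizes, separation=20, vertical_padding=20):
    """Compute the canvas size and the top-left paste offset of each image
    when placing images side by side.

    sizes: list of (width, height) tuples, e.g. taken from each image's .size
    """
    if not sizes:
        raise ValueError("at least one image size is required")

    # Separation is applied before, between, and after the images
    canvas_width = sum(w for w, _ in sizes) + separation * (len(sizes) + 1)
    canvas_height = max(h for _, h in sizes) + 2 * vertical_padding

    offsets = []
    x = separation
    for width, height in sizes:
        y = (canvas_height - height) // 2  # center each image vertically
        offsets.append((x, y))
        x += width + separation
    return (canvas_width, canvas_height), offsets

canvas_size, offsets = horizontal_layout(
    [(300, 200), (100, 100)], separation=20, vertical_padding=10
)
print(canvas_size, offsets)
```

With Pillow, the canvas can then be created with `Image.new("RGB", canvas_size, background_color)` and each frame pasted at its offset with `canvas.paste(image, offset)`.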

Results and conclusion

Working Model for Image Identification: A model was trained using YOLOv7 to identify parts or images from specified categories within a given dataset. This model demonstrated high accuracy and generalization for real-time object detection tasks.

Prompt Engineering for Better Output: Utilizing prompt engineering, the AI model was guided to produce more refined and contextually relevant images based on textual prompts. This process involved optimizing prompts to efficiently generate desired outputs, enhancing the model’s performance.

Background Removal for Frame Assembly: A Background Remover class was developed to remove backgrounds from images, ensuring accurate merging of identified objects during frame assembly. This step was crucial for creating visually appealing and coherent ad frames.

Storyboard Generation Based on Generated Frames: The generated frames were used to create a storyboard that outlined the sequence of scenes and key elements of the advertisement. This storyboard effectively captured the essence of the proposed ad campaign, ensuring a cohesive and impactful narrative.

Future Work

Dynamic Asset Positioning: Enhance the frame generation process by creating a more dynamic positioning system for the generated assets. This could involve analyzing the content of each asset and its relationship with other assets to determine the best position within the frame.

Robust Resizing: Develop a more robust resizing mechanism by analyzing the ratio of the generated asset position compared to the desired position. This would ensure that the assets are resized appropriately to fit within the frame without compromising the overall composition.
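
One possible starting point for such a resizing mechanism is an aspect-ratio-preserving fit, sketched below. This is only an assumption about how the idea could be approached, not something implemented in the current project, and `fit_within` is a hypothetical helper.

```python
def fit_within(asset_size, target_size):
    """Scale asset_size so it fits inside target_size while keeping its aspect ratio."""
    asset_w, asset_h = asset_size
    target_w, target_h = target_size
    # The smaller ratio is the largest scale that still fits in both dimensions
    scale = min(target_w / asset_w, target_h / asset_h)
    return max(1, round(asset_w * scale)), max(1, round(asset_h * scale))

# A wide asset shrinks to fit a square slot without being distorted
resized = fit_within((400, 200), (100, 100))
print(resized)
```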