Creating Snapchat AR Lens with AI Video Generator and Lens Studio

Designium StoryHub · Feb 6, 2024


Introduction

Hi! I am Tomoya AKIYAMA (@AtomAkym), a CG Artist at Designium.

The pace of progress in Generative AI technologies is accelerating every day. At Designium, we are actively experimenting with applying Generative AI to various aspects of Augmented Reality (AR) in our Research and Development. One striking example of this approach is our project “Magical Forest”, a captivating experience that used ChatGPT to generate stories and Midjourney to create 3D textures and graphic design materials.

In our latest endeavor, we explored the use of an AI video generator with AR. Just as AI image generators craft new images and illustrations from prompts and reference images, AI video generators can generate videos from the same kinds of input. We integrated this tool into our creative process, crafting AR Lenses with Lens Studio, the development environment provided by Snapchat.

Snapchat AR Lens — “FaceFusion”

As a result, we created four lenses under the title “FaceFusion”: Japan <JP>, the United States <US>, France <FR>, and India <IN>. Users can immerse themselves in the AR experience and transform into iconic figures from each country by fusing their own faces with the illustrations. For example, the “Japan <JP>” lens playfully fuses the user’s face with illustrations of sumo wrestlers, maiko, or samurai.

Demo of the Snapchat Lens “FaceFusion”

The visual content, including the images and videos that serve as the foundational elements of these lenses, was generated entirely with Generative AI. Illustrations were crafted with DALL·E 3 in ChatGPT, and videos were then generated from them with Gen-2 in Runway. Along the way, we also used Adobe Firefly in Photoshop for additional refinements to achieve the desired outcome. Finally, we used Lens Studio to integrate these elements into the AR Lenses.

👉 If you’re interested, please try it on Snapchat with the Snapcodes below.

※ We developed four lenses. However, the submission for the lens "India <IN>"
was rejected. The reasons are explained at the end of the article.

🔼 Back to the Top

AI Tools — “DALL·E 3”, “Runway” and “Adobe Firefly”

Before delving into the production process, let’s take a quick look at each of the AI tools used in this project.

DALL·E 3

DALL·E 3 is the image generator built into ChatGPT. Unlike other AI generators such as Stable Diffusion and Midjourney, which require thoughtful consideration and experimentation with different prompts, DALL·E 3 lets users provide prompts in their preferred language and text format and generate images through dialogue. This simplifies the process, allowing creators to convey their ideas naturally and making image creation more intuitive.

Runway

As of January 2024, there are other AI video generators besides Runway, but Runway is the pioneer in this field. Runway provides a feature called “Text/Image to Video” that empowers users to create videos by entering prompts or specifying camera movements through the user interface. Users can also take existing illustrations and add movement to them, just as we did in this project.

Another noteworthy feature, “Frame Interpolation,” enables the creation of videos that seamlessly connect two distinct images; an in-depth explanation follows in a later chapter. Overall, I combined “Text/Image to Video” and “Frame Interpolation” to produce the materials for this project.

Adobe Firefly

The well-known software Adobe Photoshop also includes an AI image generator called “Adobe Firefly.” I used its “Generative Fill” feature to refine the illustrations generated by DALL·E 3. Even when a generated illustration has unnatural objects or colors in the background, or anomalies in the depiction of the human body, this feature can make precise corrections that blend seamlessly with the illustration, ensuring a more natural and aesthetically pleasing result. I used to rely on “Content-Aware Fill,” but “Generative Fill” blends corrections far more naturally into the image.

Next, I will introduce the production process using these three AI generators, with special emphasis on the use of DALL·E 3 and Runway.

🔼 Back to the Top

<Step 1> AI Generated Illustrations

First, I started generating illustrations with DALL·E 3, thinking in the following order.

Write Good Prompts

The image generation process with DALL·E 3 is no different from ordinary ChatGPT usage. If you search online, you can find various resources on how to write good prompts, and they converge on the same advice: when providing image generation prompts to ChatGPT, it’s always best to consider three aspects:

  • PURPOSE
  • REQUIREMENTS
  • RESTRICTIONS

Here, I’ll provide an example of generating an illustration of a soccer player to explain how framing prompts can generate an image that closely aligns with your imagination.

PURPOSE

In this project, the purpose is to “generate an illustration of a soccer player.” By expressing this purpose clearly, ChatGPT will automatically utilize DALL·E 3 for image generation as needed. If the prompt for generating an image is absent or unclear, ChatGPT might return text instead of an image. Therefore, it’s crucial to explicitly convey the “create an image” prompt to ensure accurate output.

REQUIREMENTS

I recommend considering the following five aspects in the prompt:

  • SUBJECT

It’s best to delve into the details of the person or object in the image, such as “A French soccer player standing”, “Wears the uniform”, “Holds the ball with both hands under the chin”, etc.

  • CAMERA ANGLES

Details of the camera angle show how the subject and background are framed, such as “Only showing the upper body of a soccer player” and “The soccer player’s face appears large in the center of the image.” When dealing with a person, you can specify options such as “Long shot”, “Full body”, “Upper body”, “Face close up” and generate the closest match.

  • BACKGROUND

You might want to specify the scenery behind the subject, such as “A soccer stadium filled with spectators for the background.”

  • COLOR TOUCH

Color Touch can be used to establish the overall atmosphere of the image, such as “Paint textures as detailed as live action movies.” In this case, the purpose is to generate an illustration, so even though we refer to live action movies, the result is still an illustration.

  • IMAGE SIZE

Image size refers to whether the image should be portrait (9:16, 1024 × 1792 px) or landscape (16:9, 1792 × 1024 px). If no specific prompt is given, the default output is a 1024 × 1024 px square image.

Can't Think of a Prompt...

※ If you're not sure what to include, consider completing only the aspects
that are clear to you and outputting them as is, rather than trying to
force-fill in every detail. The image you will get may not exactly match
your requirements, but it will highlight areas that need improvement.
Seeing images with identifiable improvements makes it easier to refine
prompts. I recommend giving it a try, even if it's just a random attempt.
You can respond to DALL·E 3 with any enhancements you make, or consider
modifying the original prompt for re-output.

RESTRICTIONS

If the illustration still doesn’t look as expected after adjusting the prompt, consider adding some specific restrictions, such as “Don’t cover the soccer player’s face with the ball.”

Considerations Before Adding Restrictions on Prompts

  • Use positive descriptions rather than negative sentences.

Even when I phrased a restriction negatively, such as “Don’t cover the soccer player’s face with the ball,” the results usually seemed logical. However, ChatGPT occasionally misunderstands negation, drops the negative meaning, and retains only the “hide face” part, which affects the image output.

Negative sentences are not inherently bad, but it is best to confirm their effectiveness through trial and error. This variance in interpretation likely stems from ChatGPT’s conversational layer sitting between you and the image model; Stable Diffusion and Midjourney, which take prompts directly, do not have this issue.
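Putting PURPOSE, REQUIREMENTS, and RESTRICTIONS together: this project worked entirely through the ChatGPT interface, but for readers who prefer to script generation, below is a minimal sketch using the OpenAI Python SDK. The prompt text is illustrative rather than the exact wording used in this project.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

prompt = (
    # PURPOSE: explicitly ask for an image
    "Create an illustration of a French soccer player. "
    # REQUIREMENTS: subject, camera angle, background, color touch
    "He stands wearing the national uniform, holding the ball with both "
    "hands under his chin. Show only the upper body, with the face large "
    "in the center of the frame. Behind him is a soccer stadium filled "
    "with spectators. Paint textures as detailed as a live-action movie. "
    # RESTRICTION: phrased positively rather than negatively
    "Keep the player's face fully visible above the ball."
)

response = client.images.generate(
    model="dall-e-3",
    prompt=prompt,
    size="1024x1792",  # portrait; use "1792x1024" for landscape
    n=1,
)
print(response.data[0].url)  # link to the generated image

The size parameter corresponds to the IMAGE SIZE aspect above; omitting it yields the default 1024 × 1024 square.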

🔼 Back to the Top

<Step 2> Correct AI Generated Images

Second, it’s important to touch on retouching with Photoshop. While there are numerous online resources, videos, and books detailing advanced techniques for enhancing image aesthetics, I’d like to share some insights specifically regarding “Generative Fill.”

Let me take the correction of the Napoleon illustration as an example. When generating images of Napoleon with DALL·E 3, various compositions emerged, including scenes of him on a rearing horse with his right hand raised (the famous painting “Napoleon Crossing the Alps,” which is what many people picture when they hear the name Napoleon). While this composition has its merits, my goal was a clearly visible face for use in the AR lens, which led to a series of trials and errors. The closest output I obtained is shown below, juxtaposed with the corrected image for comparison.

In that result (pictured left), Napoleon appears with his right hand posed, index finger raised. The overall composition, including costume and background, is ideal. However, the right hand is a roadblock for an AR lens that fits over the face, so I used Photoshop to remove it entirely (pictured right). Throughout the modification process, no color adjustments or transformations were made; I used only Selection and Generative Fill.

Select the right-hand area and run Generative Fill. You can use text to specify what should be generated, but if the goal is simply to blend the area seamlessly with the original image, you don’t need to input anything. Within a few seconds, the index finger was gone and, in its place, the face, hair, and background (excluding the shoulders and mouth) were redrawn to fit the original illustration. The texture of the hat still appeared slightly distorted where the finger had been; in that case, simply select the area to correct and apply Generative Fill again to harmonize it with the original image.

For other images, one attempt may not achieve the desired result: some parts may appear distorted, or extraneous objects may be generated. You can therefore make incremental adjustments, gradually approaching the ideal image, rather than selecting the entire area at once. The Napoleon illustration, for instance, could also have been corrected in three passes: first the part of the hat covered by the index finger, then the thumb on the face, and finally the rest of the background.

AI-generated images occasionally contain anomalies, such as an extra finger or a misshapen arm or hand. In such cases, it’s better to make a quick correction in Photoshop than to discard the image over minor flaws. Generative Fill is a good technique to keep in mind for this.

🔼 Back to the Top

<Step 3> Create Videos with AI Video Generator

Here I will explain how two features of Runway work.

Video Composition & Framing

Before delving into the details, it’s important to outline the predetermined length of this AR lens. Each shot lasts 5 seconds at 30 frames per second (30 fps), for a total of 150 frames. Five seconds is the length Snapchat recommends when using videos in a lens. I divided the time as shown below.

While there may be nuanced variations among lenses, the fundamental structure follows a consistent pattern (a quick check of the frame arithmetic appears after this list):

  • Three illustrations
  • Each illustration is animated using the “Text/Image to Video” feature (shown by the range of blue arrows, each containing 39 frames).
  • Different illustrations are seamlessly connected through the video generated by the “Frame Interpolation” feature (shown by the range of orange arrows, each containing 11 frames).
  • The “Frame Interpolation” feature is also used to connect the last and first illustrations to form a loop animation.
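Here is that frame arithmetic written out as a small Python sanity check; the constants are the values stated above, and the script itself is just an illustration I added.

# Verify that three illustration segments plus three transitions
# exactly fill the 5-second, 30 fps lens.
FPS = 30
DURATION_S = 5
ILLUSTRATION_FRAMES = 39  # blue arrows: "Text/Image to Video" segments
TRANSITION_FRAMES = 11    # orange arrows: "Frame Interpolation" segments
SEGMENTS = 3

total = SEGMENTS * (ILLUSTRATION_FRAMES + TRANSITION_FRAMES)
assert total == FPS * DURATION_S == 150
print(f"{SEGMENTS} x ({ILLUSTRATION_FRAMES} + {TRANSITION_FRAMES}) = {total} frames")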

Following these principles, we created AR lenses tailored to each country. Next, I will explain how we used these features in Runway.

Text/Image to Video

I used this feature to animate the illustrations generated by DALL·E 3. As outlined above, each illustration appears on screen for only about a second. Excessive movement might make the image unclear, especially if the face turns away from the camera during the animation. Taking these conditions into account, I kept the subject’s movements small and relied heavily on camera movement (like a dolly shot) to move smoothly toward or away from the subject. Achieving this in Runway is very easy, and recent updates have made it even easier.

The Operating Instructions

[ 1 ] ▶ Select “Text/Image to Video” from the Popular AI Magic Tools on the Runway dashboard homepage. You will then see the UI shown below. The following explanation proceeds using IMAGE + DESCRIPTION; the same steps apply when using IMAGE alone.

“Text/Image to Video” UI

[ 2 ] ▶ Drag and drop the image you want to animate. You can then either press the “Generate 4s” button right away or adjust the output settings before generating.

[ 3 ] ▶ In this project, I adjusted the “Camera Motion” settings before generating. Pressing the “Camera Motion” button brings up the following UI.

“Camera Motion” UI

[ 4 ] ▶ The goal is to highlight the characters depicted in the illustration, so I mainly used the Zoom in/out controls. The right choice of camera motion depends on the nature of the image and its purpose.

  • For showcasing a broad horizontal landscape, consider using Horizontal or Pan.
  • If you want to emphasize a vertical perspective or a high angle view, Vertical or Tilt will work well.
  • Zoom is effective when you want to draw attention to people, specific objects, or subjects.
  • If you aim for a stylish impression as part of a video, Roll can be an excellent choice.

[ 5 ] ▶ By pressing the “Motion Brush” button, you can specify which part of the image you want to animate. Moreover, if you choose IMAGE + DESCRIPTION, you have the option to input prompts to provide movement instructions to improve the accuracy of AI motion generation. It is recommended to use simple sentences rather than single words.

“Motion Brush” UI
“Motion Brush” Official Instructional Video

[ 6 ] ▶ After checking all adjustments, click the “Generate 4s” button in the lower right corner to generate the video. Below is an example of the output.

Output Video by Runway

The result was a 4-second video (120 frames at 30 fps), which I compressed to 39 frames and gave transparent faces to serve as material for the AR lens.
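The article does not detail the tool used for this retime, so purely as an illustration, here is one way to sample a 120-frame clip down to 39 evenly spaced frames in Python with OpenCV; the filenames are hypothetical, and the face-transparency step is handled separately.

import cv2
import numpy as np

cap = cv2.VideoCapture("napoleon_gen2.mp4")  # hypothetical Runway output
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

# Choose 39 evenly spaced frame indices out of the source frames.
keep = set(np.linspace(0, total - 1, 39).round().astype(int).tolist())

frames = []
for i in range(total):
    ok, frame = cap.read()
    if not ok:
        break
    if i in keep:
        frames.append(frame)
cap.release()

# Write the sampled frames back out at 30 fps.
h, w = frames[0].shape[:2]
out = cv2.VideoWriter("napoleon_39f.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), 30, (w, h))
for f in frames:
    out.write(f)
out.release()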

In my experience, material that focuses on characters tends to convert to video with great accuracy. Conversely, images of people at a distance or wearing complex outfits are more likely to develop issues during generation. Even when such discrepancies occur, they can be hard to spot in the video: in the Napoleon example above, the left shoulder actually gains extra armor in the output, but the difference is minimal and unnoticeable, so we used the video directly without corrections. The most disturbing results occur when the number of fingers suddenly increases. I expect this problem to diminish as accuracy improves, but caution is still required at this stage.

🔼 Back to the Top

Frame Interpolation

Frame Interpolation seamlessly connects the last frame of one video to the first frame of the next. The amazing thing about this feature is its ability to interpolate between any two images: even disparate images like an apple and a giant robot can be smoothly joined to generate a new video. Unlike traditional transitions such as dissolves or blackouts, this is a unique form of transition that only AI can achieve.

The creation process

[ 1 ] ▶ Prepare screenshots of the first and last frames of the videos to be connected, as shown below.

※ I want to connect three videos, so I need to prepare six images. 

※ Since I needed a sequence of still images, I exported the footage using
After Effects (BTW, I also used After Effects to create tracking data to
specify the face position in Lens Studio).
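For anyone without After Effects, the same six stills can be grabbed with a short Python/OpenCV script. This is only an illustrative alternative to the workflow above, and the file names are hypothetical.

import cv2

def save_first_and_last(video_path: str, prefix: str) -> None:
    """Save the first and last frames of a clip as PNG stills."""
    cap = cv2.VideoCapture(video_path)
    ok, first = cap.read()  # frame 0
    if ok:
        cv2.imwrite(f"{prefix}_first.png", first)
    last_index = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - 1
    cap.set(cv2.CAP_PROP_POS_FRAMES, last_index)  # jump to the final frame
    ok, last = cap.read()
    if ok:
        cv2.imwrite(f"{prefix}_last.png", last)
    cap.release()

# Three clips -> six still images, matching the setup above.
for name in ("lady", "napoleon", "soccer"):
    save_first_and_last(f"{name}_clip.mp4", name)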

[ 2 ] ▶ Select “Frame Interpolation” from the Popular AI Magic Tools on the Runway dashboard homepage.

[ 3 ] ▶ Drag and drop the six images to upload them; they will be arranged automatically from left to right. It’s important to ensure they’re in the correct order, as that order determines the order of morphing.

The uploaded images will be automatically arranged from left to right.
※ My order goes from the last frame of a lady in royal gowns to Napoleon 
to a soccer player and back to the first frame of the lady.

[ 4 ] ▶ Adjust the “Clip duration” and press “Generate”. In this case, the transition from one illustration to the next spans only 11 frames, which is very short, so I set the clip duration to about 5 seconds and extracted the section I needed afterward.

Video Generation Settings
※ Frame Interpolation does not consume credits, regardless of the length 
of time it spans. Therefore, it doesn't matter how many seconds you set
in the clip duration.

[ 5 ] ▶ As a result, a preview of the output will be rendered as shown below.

[ 6 ] ▶ If everything looks good, select “Export” from the settings menu shown below on the left. In the menu on the right, enter a file name and select the format and resolution to export the file to your Workspace.

※ Choose a clear and descriptive name for the file so you can easily 
identify it later.

※ Choose the format and resolution based on your plans. I usually choose
the highest quality format and resolution.

[ 7 ] ▶ After completing the steps above, follow the instructions to “Go to assets”. You will find the video you named earlier among the Assets; right-click to download it.

In this way, we successfully generated the frame-interpolation video. Next, we isolated the section that morphs from one illustration’s end frame to the next illustration’s start frame, then spliced that section between the clips generated with Text/Image to Video.

By following this process, the video of one illustration transitions seamlessly into the next, creating a continuous sequence. Repeating it three times connects every illustration video and forms a loop that restarts from the beginning.
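The article does not name the tool used for this final assembly, so purely as an illustration, here is how the 150-frame loop could be concatenated in Python with moviepy, alternating the three 39-frame illustration clips with the three 11-frame interpolation clips. All filenames are hypothetical.

from moviepy.editor import VideoFileClip, concatenate_videoclips

names = ["lady", "napoleon", "soccer"]
segments = []
for i, name in enumerate(names):
    nxt = names[(i + 1) % len(names)]
    segments.append(VideoFileClip(f"{name}_39f.mp4"))           # illustration animation
    segments.append(VideoFileClip(f"{name}_to_{nxt}_11f.mp4"))  # interpolation transition

loop = concatenate_videoclips(segments)  # 3 x (39 + 11) = 150 frames
loop.write_videofile("facefusion_loop.mp4", fps=30)

Because the last transition returns to the first illustration, the exported clip loops cleanly when played on repeat.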

🔼 Back to the Top

<STEP 4> Set Up in Lens Studio

I configured the materials prepared so far in Lens Studio and published them on Snapchat as AR Lenses. The Lens Studio workflow is not described here, as it does not involve generative AI, but I want to share a stumbling block I encountered during development.

Initially, the intention behind this AR Lens was to superimpose the facial image captured by the camera directly onto the illustrations, like the face-in-hole panels commonly seen at tourist attractions. However, this approach made the face appear somewhat detached, as if floating in the air, so we made improvements to blend it more naturally into the illustrations.

To achieve this, we masked the illustration’s face onto the user’s face and blended it seamlessly with the illustration video. As a result, it looks as though the face in the illustration itself is moving rather than the user’s real face being shown directly, providing a more immersive and less awkward experience when using the AR Lens.

Conclusion

After settling on a format, we can efficiently generate a multitude of materials with AI image and video generators and deploy them simultaneously. Moreover, in developing this prototype we glimpsed the emergence of “unique expression methods of Artificial Intelligence (AI).” While the main mission of AI is to optimize tasks and reduce labor costs, its ability to create novel modes of expression is equally fascinating. Although there are countless debates about the use of Generative AI, I remain steadfast in my belief that the field of AI technology and expression has unlimited potential. As my capabilities expand, I am committed to exploring and cultivating new avenues of creative expression.

🔼 Back to the Top

Regarding Rejection of “India <IN>” Lens Submission

I thought this article would end with “successfully created and published the AR Lenses,” but unfortunately the India <IN> lens was rejected after submission, preventing me from testing it. The lens was designed to offer an interactive experience by superimposing the user’s face onto three distinct personas: a woman adorned in a sari, a cricket player, and the revered Indian god Ganesha. Although the reason for the rejection was not specified, I can only guess that it was the use of Ganesha.

Ganesha is depicted in many forms and is even used in India on souvenirs and as ornaments decorating vehicles. Although I initially included Ganesha in the AR Lens because of this familiarity, I clearly did not fully consider his importance as a religious figure. This oversight may have come across as culturally insensitive, for which I deeply apologize. I share this rejection here as a point of reflection.

EDITORIAL NOTE

I am Mary Chin (Chi-Ping Chin), the writer and designer of the PR team at Designium. This is the second article about AI-assisted AR production. The emergence and development of more and more AI tools is indeed a fascinating contemporary phenomenon. Where will AI eventually lead us? Looking ahead to 2024, let’s keep moving forward! 🚀 If you’re interested, be sure to try the Generative AI tools featured in this article! 😉

CONTACT FORM

