AI in E-Commerce: Automated Video Generation

Rahul Bajaj
Walmart Global Tech Blog
6 min read · Apr 15, 2021

“The first step is to establish that something is possible; then probability will occur.”

~ Elon Musk

The quote comes from the engineer I admire the most: "The Real Tony Stark."

This is my second article in the AI track. Here's the link to my previous article if you haven't read it already. Yes! Yes! Yes! I am awestruck and fascinated by the magic and vast applications of Artificial Intelligence (and by the buzzword itself). But it's time we understood that the world is not Hogwarts (a Harry Potter reference), nor is AI a magic wand. We are all ordinary Muggles trying to use TensorFlow, scikit-learn, PyTorch, etc. as Felix Felicis (Liquid Luck).


Organizations these days are adding Machine Learning and Artificial Intelligence to their applications the way Salt Bae adds salt to his steak.

The role of AI isn't limited to the customer-facing side of the business; it can also be leveraged for better workflows within the organization.

The Top 5 commercial sectors employing AI are:

  1. Healthcare
  2. Education
  3. Marketing
  4. Retail and E-commerce
  5. Financial Market and Services

We’ll focus on AI in E-commerce in this article.

The use of artificial intelligence in online shopping is transforming the E-commerce industry by predicting shopping patterns based on the products that shoppers buy and when they buy them. This new wave of E-commerce is changing the way brand-customer engagement takes place as brands continue to refine and innovate their digital strategies. E-commerce is the fastest-growing and fastest-evolving channel and hence needs a machine not just to derive insights but also to provide recommendations on an ongoing basis.

Customer experience is not limited to hyper-personalization; it also takes into account the content quality of the item listings on the e-commerce platform. Generating videos that give buyers an immersive experience is one of the many applications: existing content for a given item can be repurposed to kindle a fresh outlook on the same product and boost its GMV (Gross Merchandise Value) potential.

You must be thinking, "Why video?", "What impact does it make?" and other such jibber jabber.


Visualization works from a human perspective because we respond to and process visual data better than any other type of data. In fact, it is often claimed that the human brain processes images 60,000 times faster than text, and that 90 percent of information transmitted to the brain is visual. Since we are visual by nature, we can use this trait to enhance data processing and organizational effectiveness.

To answer the question: yes, it does make a significant impact. The following pointers support the argument:

  • 96% of consumers find video helpful when making online purchasing decisions.
  • 79% of online shoppers would rather see a video to get information about a product than read the text on a page.
  • The right product video can increase conversions by over 80%.
  • 90% of customers say videos help them make buying decisions.
  • Video can also help reduce negative reviews, by clearly stating what the product offers and what it doesn’t.

In a holistic view, it improves the customer journey via seller listing quality, providing an enhanced shopping experience. Following are the different types of videos produced for offer listing on e-commerce platforms:

  1. Product highlights video
  2. Customer experience video
  3. Explainer video
  4. Comparison video

The amount of effort and cost involved in producing and curating such a video can be overwhelming. It can be difficult for a seller to go through the hassle of video production while keeping the result uniform and compliant with the standards of the e-commerce platform. Realizing this challenge, platforms now provide automated video generation as a service. They also provide the facility to upload seller-produced videos to enhance the customer experience.

Let's get to the part where AI comes into the picture.


Sellers expect the platform to generate a professional video that attracts buyers. The images and the description pointers provided by the seller are the constraints to be kept in mind while producing the video.

A finer customer experience would include upbeat background music that makes the online shopping session even more engaging. Adding an AI-generated voice-over not only makes the video more explanatory but also helps visually impaired shoppers be part of the same customer journey.

An intuitive approach to automated video generation with an AI voice-over is described below.

A simple video can be generated by organizing the provided images into a sequence, with the description pointers rendered beside them, using open-source libraries like OpenCV. The code for the same can be found here.

For adding the AI voice-over, a number of end-to-end neural Text-to-Speech (TTS) engines are available. Tacotron 2 is one of them and has achieved state-of-the-art performance. It works great on short chunks of text but struggles to model long ones. Long texts can be handled by adopting a multi-head attention mechanism to replace the RNN (Recurrent Neural Network) structures within Tacotron 2.

A neural speech-synthesis transformer can be trained on the LJ Speech dataset, which consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books; a transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and total approximately 24 hours. The output of the LJ Speech models can then be fed to pre-trained vocoders such as WaveRNN and MelGAN, which act as audio processors capturing the characteristic elements of the audio features.

Python code snippet to generate AI voiceover

import sys
import numpy as np
import torch
import IPython.display as ipd

# Synthesize text with the trained TTS transformer model
sentence = "Hello World! It's Rahul Bajaj"
out_normal = model.predict(sentence)

# Push the predicted mel spectrogram to the MelGAN vocoder
sys.path.append(MelGAN_path)
vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
vocoder.eval()
mel = torch.tensor(out_normal['mel'].numpy().T[np.newaxis, :, :])

if torch.cuda.is_available():
    vocoder = vocoder.cuda()
    mel = mel.cuda()

with torch.no_grad():
    audio = vocoder.inference(mel)

# Play back the generated audio
ipd.display(ipd.Audio(audio.cpu().numpy(), rate=22050))

Python snippet to add audio in the background

import moviepy.editor as mp

def add_music_to_video(path):
    video = mp.VideoFileClip(path + "/transformed_video.mp4")
    audio = mp.AudioFileClip(path + "/presentation.mp3")
    # Trim the audio track to match the video's duration
    audio = audio.subclip(0, video.duration)
    print(audio.duration, video.duration)
    final = video.set_audio(audio)
    final.write_videofile(path + "/Generated_video.mp4",
                          codec='libx264', audio_codec='aac')
    print("Music is added successfully")

The above set of instructions should give you an up-and-running prototype video. The real challenge is establishing the semantic relationship between each input image and the corresponding description pointer.

For the above task, I propose a Unified Multi-modal Fusion Architecture.

Unified Multi-modal Fusion Architecture

The model enables correlative information retrieval to automate the selection of images for a particular description text. The sequence in which each image appears can be decided either by the order of the description points or by the confidence value of each image's relevance to the information conveyed.
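To make that retrieval step concrete, here is a toy sketch (pure NumPy; the embeddings are assumed to come from a shared multi-modal space such as the fusion model's output, and every name here is illustrative) that pairs each description pointer, in order, with its most relevant unused image by cosine similarity:

```python
import numpy as np

def order_images_by_description(image_embs, text_embs):
    """Greedily assign each description pointer its best unused image.
    image_embs: (n_images, d) array, text_embs: (n_texts, d) array,
    assumed to live in a shared multi-modal embedding space."""
    # Normalize rows so dot products become cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = txt @ img.T                      # (n_texts, n_images)
    order, used = [], set()
    for t in range(sim.shape[0]):          # follow description order
        ranked = np.argsort(-sim[t])       # images ranked by relevance
        pick = next(i for i in ranked if i not in used)
        used.add(pick)
        order.append((t, int(pick), float(sim[t, pick])))
    return order  # (description_idx, image_idx, confidence)

# Toy example with orthogonal embeddings, so each text has one clear match:
image_embs = np.eye(3)
text_embs = np.eye(3)[[2, 0, 1]]
pairing = order_images_by_description(image_embs, text_embs)
```

The confidence values returned here are exactly the relevance scores mentioned above, so the same output supports either ordering strategy: keep description order, or re-sort by confidence.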

Summing-up

In short, everyone wants to take a proactive approach rather than a reactive one. AI attracts the attention of institutions because its predictive power gives them an edge in any situation, though some are just trying to find a better version of a Tarot card reader 😜.

So, you have finally reached the end of AI in E-commerce: Automated Video Generation. It's time to check out the learning we've added to our cart. I hope this article helps you understand the impact and provides an intuitive implementation.

Cheerio! Live Long and Prosper…
