Deepfakes, Snapchat and the Blues Brothers
Creative AI tools for virtual Halloween outfits during the pandemic
It’s nearly Halloween. We’re not expecting the usual hordes of miniature trick-or-treaters this year, whatever stage of lockdown we’re at by the 31st. But it did get me thinking: in these strange times of video socialising, a virtual outfit feels more fitting than dressing up in real life. And rather than the usual monsters and vampires, which feel too dystopian, I’ll be recreating the distinctive outfits of Jake and Elwood Blues (John Belushi and Dan Aykroyd) from The Blues Brothers, the iconic 1980 film whose cheerful, escapist and ridiculous narrative thinly connects a who’s who of blues and soul musicians: Aretha Franklin, James Brown, John Lee Hooker, Ray Charles, Cab Calloway, even Chaka Khan in a gospel choir, as well as the real musicians forming the band.
This is a great excuse to see how easy it is to create a new lens for Snapchat. Snap employs the most cutting-edge AI in the name of adding some silliness and fun to communication, and under the garish teenage branding it hides the ambition to be a serious player in the future “metaverse”. I’d been looking for an excuse to try out its Lens Studio for some time. The latest AI techniques are on offer: hair colouring, ageing, gender swapping, pet spectacles, morphing famous landmark buildings. Surely it’d be fairly easy to rustle up a black hat, sunglasses, a suit and tie and some sideburns.
In Bret Victor’s genius 2012 talk Inventing on Principle, he says that “creators need an immediate connection to what they create” — to see the effects of decisions immediately, with no delay, nothing hidden. Today we’re used to “direct manipulation” (coined by Ben Shneiderman in 1982) in our computer interfaces — like dragging and resizing windows with a mouse or a finger — but more complex activities like programming software are usually done via typing text commands.
The first set of examples in his talk demonstrates a live preview of game or graphics code, making code changes immediately apparent. He transformed the normally slow cycle of typing, changing and re-running code into a responsive interface with always-running visualisations. Lens Studio is great at this. Your webcam shows a live preview of your creation as you move your head around, changing in real time as you move objects, adjust parameters, or add and remove scripting. It is a faithful realisation of what Bret was asking for, albeit within an insanely complex, poorly documented yet supremely powerful tool.
If Lens Studio is a good example of that immediate connection, then Blender is a really poor one. It’s a free and open-source 3D creation tool, but clearly built for modelling pros. So although parameters can be changed with sliders and previewed live, just like in Bret’s talk, the interface is awash with teeny tiny buttons, controls, mysterious modes and settings. I shouldn’t complain, as it wasn’t designed for novices, but just look at the complexity:
Back to the outfit. Here are some things I learned along the way. You can find 3D models for anything you need: a fedora hat ($2, thanks to kscane); Ray-Ban Wayfarer-style sunglasses (free, thanks to JuangG3D). If all you’re doing is attaching things to the head, Lens Studio kind of makes sense (thanks to Apoc). Sideburns are just some hairy texture mapped onto the skull with a template.
The suit and tie were hard, but I decided an amateurish version would suffice. Using a cut-out photo on a 2D plane didn’t look good, as your chest is curved, so I decided to project the photo onto a curved plane. Here’s where I had to dive into the weeds. Cutting one cylinder out of another with the beautifully simple interface of Tinkercad was quick and easy, making a nice 3D shape. Making the cut-out photo project nicely onto all the resulting triangles meant adjusting the “UV” mapping (U and V being simply the names given to the texture’s own axes, since X, Y and Z are taken), which is where I had to fire up Blender. The final result is not amazing but much better than I expected, and for those of you with the Snapchat app or Snap Camera on your laptops you can now download the Blues Brothers lens (for Snap Camera, just paste the link into the lens search bar) and have fun with it.
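For the curious, the essence of UV mapping is just a function from 3D vertex positions to 2D texture coordinates. A minimal sketch for a cylindrical surface like the chest piece (a hypothetical helper, nothing like Blender’s actual internals) might look like this:

```python
import math

def cylindrical_uv(vertices, height):
    """Assign (u, v) texture coordinates to 3D points lying on an
    upright cylinder: u follows the angle around the vertical axis,
    v follows the height. This "unrolls" the curved surface flat,
    so a 2D photo wraps around it without stretching."""
    uvs = []
    for x, y, z in vertices:
        angle = math.atan2(z, x)               # position around the axis
        u = (angle + math.pi) / (2 * math.pi)  # normalise to [0, 1]
        v = y / height                         # normalise height to [0, 1]
        uvs.append((u, v))
    return uvs

# A vertex halfway up, on the side of a unit-height cylinder:
print(cylindrical_uv([(1.0, 0.5, 0.0)], height=1.0))  # → [(0.5, 0.5)]
```

Blender’s UV editor lets you tweak exactly this kind of vertex-to-texture assignment by hand, which is what rescued the suit photo here.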
Bret Victor’s second example has him performing and recording an animation by directly manipulating the objects in the scene: making a leaf flutter and fall, and a rabbit run away, by just moving them how he wants them to move in the final piece. This goes beyond the idea of “no code” into what’s sometimes called “programming by demonstration”. What’s the equivalent for our outfits? Can we do all this without the 3D modelling and insanely complex interfaces? Here we must turn to the murky world of deepfakes.
Techniques from AI and computer graphics that generate or manipulate faces were given a boost, like many aspects of AI, by the development of deep learning techniques from around 2012, and of generative adversarial networks (GANs) in 2014. GANs employ two duelling neural networks: one generates new examples similar to its training set, and the other critiques them, leading to increasingly accurate representations. Face2Face, from Stanford, Max Planck and Erlangen-Nuremberg researchers, was an early example of realistic real-time face substitution. There is no 3D model at all; it works purely with 2D images. An unfortunate use of this set of technologies from 2017 was to substitute faces into pornographic videos, although these “deepfakes” have since been banned across platforms. A different kind of demonstration that received widespread attention was Jordan Peele’s 2018 video, his words animating a realistic Barack Obama (voice as well as appearance), using technology from the University of Washington:
The uses and misuses of this technology have been well documented: we’ve seen election ads translated into languages the candidate doesn’t speak, tools for faster video creation from a script, deepfake audio for spear-phishing attacks, fake faces within social media disinformation campaigns, warnings of an “infocalypse” and more recently significant efforts to detect deepfakes.
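To make the adversarial idea concrete, here is a deliberately tiny, hypothetical sketch: the “real data” are just numbers near 4, the generator is a single parameter, and the discriminator is a one-input logistic classifier. Nothing here resembles a production GAN, but the alternating update loop has exactly the same shape:

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train_toy_gan(steps=5000, lr=0.02, seed=0):
    """A one-dimensional 'GAN': real samples cluster around 4.0, the
    generator is a single parameter g, and the discriminator is a
    logistic classifier D(x) = sigmoid(w*x + b). The two are updated
    in alternation, each trying to outdo the other."""
    rng = random.Random(seed)
    g, w, b = 0.0, 0.1, 0.0
    for _ in range(steps):
        real = 4.0 + rng.gauss(0, 0.1)
        fake = g + rng.gauss(0, 0.1)

        # Discriminator step: raise D(real), lower D(fake)
        # (gradient of the loss -log D(real) - log(1 - D(fake))).
        d_real = sigmoid(w * real + b)
        d_fake = sigmoid(w * fake + b)
        w += lr * ((1 - d_real) * real - d_fake * fake)
        b += lr * ((1 - d_real) - d_fake)

        # Generator step: nudge g so that D(fake) rises
        # (gradient of the non-saturating loss -log D(fake)).
        d_fake = sigmoid(w * fake + b)
        g += lr * (1 - d_fake) * w

    return g

print(train_toy_gan())  # ends up hovering near 4.0, the real data's mean
```

At realistic scale both players are deep networks and the data are images rather than numbers, but the duel is this same loop: the critic sharpens, the forger improves, and the forgeries converge on the real thing.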
A limitation of this kind of deep learning is the need for vast quantities of training data and many hours of processing, making it inaccessible for many. But children can often learn a new word or concept from a single example, so why do AI systems need so many? Work has progressed on a new set of machine learning techniques that can learn from very few examples: in “one-shot learning”, a system may try to recognise a face after seeing only a single instance of it. It is not surprising that Snap is investing in this approach, as it brings us closer to the idea of live performance. Sergey Tulyakov, the head of Snap’s creative vision team, along with colleagues from the University of Trento in Italy, published an animation technique in 2019 that can drive a new video from just a single 2D example of a face, building on their earlier “Monkey-Net” work. This means we can now do a real-time version of the Obama video above, with a single image of Obama’s face as a starting point and no training time required to learn it. It does need a little GPU horsepower, as you might expect. Luckily, Russian AI software developer Ali Aliev has packaged this into an easy-to-use, open-source set of instructions, and indeed also an iOS app. So here we are: a Blues Brothers “outfit” using state-of-the-art live deepfake AI technology, democratised to the point that, with 15 minutes of technology setup, you could run this too for your next videoconference (it can appear as a new “camera” in tools like Zoom):
It is worth reflecting on how far this has come in the last couple of years. The video of Obama was generated by a system trained on 17 hours of footage of his speeches; around 2 million frames. The video above, albeit at vastly reduced quality, runs from a single image. Clearly there isn’t enough information in one image of a face to know how that person would look blinking, smiling, moving their head or raising their eyebrows. The network has to derive that from its knowledge of other faces, and apply it as best it can to the newly supplied face. It is surprising it works at all, let alone in a way that anyone can now experiment with.
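The mechanics can be caricatured in a few lines: these one-shot models track keypoints in the driving video and warp the single source image accordingly. The real systems learn many keypoints with local transformations around each; this hypothetical toy uses one keypoint and a plain pixel shift:

```python
def translate_image(img, dx, dy, fill=0):
    """Shift a 2D grid of pixels by (dx, dy), padding with `fill` --
    the crudest possible 'warp' of a source image."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out

def animate(source_img, source_kp, driving_kps):
    """One output frame per driving frame: move the single source
    image by the driving keypoint's displacement from its start."""
    x0, y0 = source_kp
    return [translate_image(source_img, x - x0, y - y0)
            for (x, y) in driving_kps]

face = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]  # a 3x3 stand-in for the one photo we have

# Driving "video": the keypoint starts at (1, 1), then moves right.
frames = animate(face, source_kp=(1, 1), driving_kps=[(1, 1), (2, 1)])
print(frames[1])  # → [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
```

Everything the toy cannot do (turn the head, open the mouth, blink) is exactly what the trained network must hallucinate from its knowledge of other faces.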
The example above ran on an old Mac connected to some cloud GPU (using the free Google Colab service) but soon we should expect it to run entirely inside apps like Snapchat, which already includes fun AI techniques like style transfer, and has the ability to add your own ML models.
To come full circle back to Bret Victor’s principle of creators needing an immediate connection to what they create, a newer piece of work, also from Sergey Tulyakov at Snap, this time with colleagues at CTU in Prague, won Best in Show at the prestigious SIGGRAPH computer graphics conference’s Real-Time Live! showcase this year. It is worth watching the video: it shows an artist’s sketches of people on paper being instantly translated into moving video, an inspiring example of human-AI collaboration to make better creative tools.
In Bret Victor’s Dynamicland space in Oakland the code itself is visual, shared and embedded physically in the environment. Apart from remembering how great the Blues Brothers music was, the conclusion from this exploration is that we are indeed moving closer to a world of creative computing where we can use visual tools and performance more than complex applications and coding.