The Future is Synthetic
In recent years ‘synthetic media’ has emerged as a catch-all term used to describe video, image, text and voice that has been fully or partially generated by computers. The ability of AI-driven systems to generate audiovisual content is, in our minds, one of the most exciting developments enabled by recent progress in deep learning. We are about to see a major paradigm shift in media creation and consumption that will likely change the equation for entire industries.
Synthetic media will significantly accelerate creative expression and narrow the gap between idea and content. It will bring with it new methods of communication and storytelling, enable unprecedented human-computer interfaces and challenge our perception of where the digital realm begins and ends.
We founded Synthesia in 2017 to take the lead in bringing these new technologies to the world. You might have seen our recent work with David Beckham speaking 9 different languages, exemplifying how AI will empower humans to communicate in new, better ways.
But synthetic media also brings with it questions around how we consume and contextualise media. The societal impact of this new category of technologies has been a hot topic in the press over the last two years. The ability to generate Hollywood-grade, or better, visual effects without the budget, time or skill traditionally required opens up the potential for both good and bad.
Our mission is to build synthetic media technologies that maximise human creativity and minimise harmful use of AI. This post outlines our vision for the future of synthetic media and has been born from conversations with researchers, entertainment producers, news organisations, government, celebrities, big tech, advertisers, politicians, NGOs and beyond over the last three years.
From the beginning of time the way in which we create and share our stories has been in constant flux: we’ve gone from cave paintings and the printing press to the internet, cameras, Photoshop and Snapchat filters.
All these technologies for creative expression and communication have had a significant impact on society and human relations. Mostly for good, but also for bad. Creative expression is a powerful tool to spread and communicate important ideas and causes, but it is an equally powerful tool to spread misinformation and polarisation. This has been true throughout history for each iteration of media technologies; criminals enjoy the benefits of telephones and the internet as much as any law-abiding office worker.
There are two sides to creative expression: creation and distribution. They are both equally important. An amazing film has no impact on the world if it isn’t seen by anyone.
Over the last two decades the internet and new media have democratised content distribution. Blogs and social media have made us all 24/7 broadcasters for free. In today’s world the challenge is no longer to be published, but to be discovered in the deluge of content available online.
Content creation has also advanced, but arguably at a slightly slower pace. Camera phones made us all professional photographers. Music synthesisers enabled a whole generation of bedroom producers to create chart-topping hits on a laptop. Unity and Unreal enabled indie game developers to compete with the behemoths. Snapchat filters gave us mass-scale visual effects for fun and Bitmojis to represent ourselves in the digital realm.
Consider Photoshop. It changed the equation of the entire media industry by introducing a much more scalable, digital and non-linear creation process for producing an image. We can remix images together, add/remove objects, change the appearance of people and experiment with the visual look and feel through a slider-based interface.
Because the creation process has been democratised, the internet now abounds with high-quality imagery and photography from all corners of the world. Just look at Tumblr or Instagram. The marginal cost of producing images is approaching zero, which fuels creativity and brings forth talent from all over the world.
Professional video production has not seen the same advancements. Generally speaking we still produce film, TV and advertising the same way we did 15 years ago: actors, cameras, visual effects, post-production. The difference between what a Hollywood studio and a YouTube creator can do is still enormous.
Why is that? Unlike other forms of media such as image or sound we can’t easily edit or synthesise new video without extremely expensive visual effects.
Video is still a linear medium bound to an inherently analog and physical creation process. These limitations mean that video is still primarily used for mass communication: personalising, translating and making interactive video content is still prohibitively expensive. And the results, particularly for translation, often severely degrade the creative quality of the content.
But we are right now at the beginning of a major paradigm shift in content creation. Advances in deep learning are enabling magic: neural networks can increasingly reproduce media content at such a high fidelity that it is almost impossible to tell apart from content created by humans. This goes for text, image, voice and video.
It also turns out that neural networks are really good at one of the traditionally most difficult tasks in digital content creation: reproducing human likeness. Creating believable digital humans, both visually and in voice, is extremely hard to do as we’re hardwired to spot even the slightest errors. This is what is usually referred to as the uncanny valley, and it is why most digital characters in Hollywood films still feel ‘weird’.
Advances in AI are now making the process of creating human voice and video not just much faster but also capable of much higher fidelity.
These technologies will enable people to bring their ideas to life in a whole new way. Companies will be able to produce 10x the assets they do today at a tenth of the cost. Mass communication will increasingly become a thing of the past and Hollywood will face global competition as the price for visual storytelling plummets.
Digital assistants will actually look and sound real, virtual celebrities will live on forever and games will feel much more lifelike than they do today.
Our thesis is that over the next decade synthetic media will bring with it fundamental shifts in three areas: content creation, IP ownership and security & verification.
We are excited to take part in shaping this new, synthetic future.
1 — Content Creation & Communication
The cost and skill barriers to creating visual content will evaporate. This will change how we communicate with each other, both on a personal level and in a media context.
As humans we are always striving to communicate through increasingly high-context media that is closer to a real encounter. The popularity of Instagram, Snapchat, FaceTime and similar platforms is a good indication of how we’re progressing towards visual communication as the default.
Synthetic media will take grainy smartphone video to truly professional quality and beyond. The language barrier will disappear. High-quality visual effects, photoreal avatars, chatbots and interactive videos will be the new default. Meme creators on Instagram will be able to create the kind of content you would see on TV today. Slowly we will use cameras less and move to a more digital creation process, like photos today.
In 10 years the Marvel films of today will be created by film students. Just as with other creative disciplines this will allow people from all over the world to bring their ideas to life without the traditional gatekeepers and intermediaries in Hollywood. Good stories will emerge from everywhere, but because the tools will be freely available, creative execution and storytelling will be more important than ever. Visual effects will no longer save a mediocre movie.
We believe the next generation of tools will be much more focussed on art direction as opposed to detailed artistry and pixel manipulation. At Synthesia we are excited to build video synthesis tools that accelerate this creative expression and ultimately democratise content creation.
2 — IP Ownership
When anyone can synthesise anything, what does that mean for IP ownership? Traditionally actors are paid for their time and physical presence. But in the not so distant future you might be able to pay a small fee to have your kids’ favourite celebrity teach them programming or have Morgan Freeman read out your next audiobook.
That might sound strange, but if you look at how celebrities monetise their likeness today it is not that far-fetched: Kim Kardashian reportedly made $40m on her video game (which was a skin of another game) and many celebrities today have their own brand of digital products.
Synthetic media will allow celebrities to scale up their content creation and produce personalised and unique content for their fans. We are already working with celebrities to help them reach a larger, global audience by translating their video content into other languages seamlessly.
It will also be possible, more so than today, to create virtual beings that don’t exist. Lil Miquela, the world’s most popular virtual influencer, today has a team of visual effects artists that manually create the content. Soon the ability to create these avatars will be democratised and we’ll see a surge in this new type of IP and storytelling. Disassociating fictional characters from real actors will be an interesting shift.
Is it Harry Potter or Daniel Radcliffe that people all around the world love? And what would it mean if there was no Daniel Radcliffe, only a virtual Harry Potter?
As we move into this new world of synthetic media it is important that everyone has control over their likeness and can decide when and where they are being synthesised.
A core part of the Synthesia mission is to ensure a digital consent / record is in place, not just for actors, but for all of us. Being in control of your real-world individual likeness (and in the future all IP of the characters you create) will be important to enforce through technology.
3 — Security / Verification
Synthetic media technologies also bring with them challenges around truth, identity and media provenance. How do we protect our voice and visual likeness from being abused by people with malicious intent if anyone can create anything? We have already seen the issues that arise when everyone can broadcast anything through social media platforms that often amplify emotional and provocative content rather than the truth.
As a society we will have to both educate the public and build technologies that can contextualise the media we consume. Synthesia is also working to push forward industry standards and implement ethics as a core part of the company.
Soon we will need to be as critical towards videos as we are towards photos today and not assume that what we see is necessarily real. This education will take time. We believe the best method to educate is exposure: once people start watching celebrities speak foreign languages, interacting with virtual avatars and creating their own synthetic media, they will increasingly become aware of these technologies.
But we also need to develop tools that can automatically contextualise and verify the provenance of the media we consume. This has to be a collaborative effort between many different parties. Solutions that work in isolation will not be good enough.
Technically we think it’s important to distinguish between the two main areas of security: detection of fake videos and verification of authentic content.
Detection is a simple idea: train AI systems to recognise synthetic media, e.g. deepfakes. Synthesia co-founder Matthias Niessner recently published FaceForensics++, a state-of-the-art system for deepfake detection, developed in his lab at TUM. We believe this will potentially be a good short-term solution, but it will be an arms race.
More importantly, we believe that in 5 years almost all content will be synthetic, just like almost every image today has been through Photoshop. By far the majority of that content will be positive, like it is today.
We think the much more important question for society in 3 years will be whether the content is consensual or not. This could be established through a protocol that allows content producers to fingerprint their content and consumers to verify authenticity automatically. Ideally this would be some sort of open protocol that would be adopted widely on the internet, like SSL has been for financial transactions. It is by no means a perfect solution but it would be a significant step forward in terms of verifying video, particularly from official sources.
The technical challenges of building verification tools are still in the R&D domain. Historically the issue with watermarking and fingerprinting content is that it only works within a defined ecosystem and that it breaks down as videos are compressed, trimmed and embedded into other videos (e.g. a YouTube blogger editing in a video from the BBC).
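To make the fingerprint-and-verify idea and its fragility concrete, here is a minimal sketch in Python. It is purely illustrative and is not Synthesia's method: the function names, the demo key and the byte strings are all hypothetical, and it uses a shared-secret HMAC from the standard library where a real open protocol would more likely use public-key signatures.

```python
import hashlib
import hmac

# Hypothetical shared secret, for illustration only. An open protocol
# would more plausibly rely on public-key signatures.
PUBLISHER_KEY = b"demo-key-not-for-production"


def fingerprint(video_bytes: bytes) -> str:
    """Exact-match fingerprint: a SHA-256 digest of the raw bytes."""
    return hashlib.sha256(video_bytes).hexdigest()


def sign(video_bytes: bytes, key: bytes = PUBLISHER_KEY) -> str:
    """Producer side: bind the fingerprint to the publisher's key."""
    digest = fingerprint(video_bytes).encode()
    return hmac.new(key, digest, hashlib.sha256).hexdigest()


def verify(video_bytes: bytes, tag: str, key: bytes = PUBLISHER_KEY) -> bool:
    """Consumer side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(video_bytes, key), tag)


# Stand-in for an encoded video file.
original = b"\x00\x01\x02 pretend these bytes are video frames"
tag = sign(original)

# Stand-in for trimming or recompression: any change to the bytes.
trimmed = original[:-1]

print(verify(original, tag))  # True
print(verify(trimmed, tag))   # False
```

The second check fails even though a viewer would consider the trimmed clip the same video, which is the breakdown described above: a byte-exact fingerprint does not survive compression, trimming or embedding.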
In order to solve these problems we believe that the starting point is the ability to synthesise videos in very high quality. We’re actively working on technology that utilises AI to recognise and fingerprint video in a way that works ‘in the wild’. In future blogposts I will expand on our thinking and work in this area.
At Synthesia we are excited about this new future we are moving into and we’re aware of the responsibility we have as a company. It is clear to us that artificial intelligence and similarly powerful technologies cannot be built with ethics as an afterthought. Ethics needs to be front and centre, an integral part of the company: reflected in both company policy and in the technology we are building.
In future blogposts I will expand on our thinking in these three areas. To follow our progress visit our website.