The story of Respeecher, a startup that got into Techstars and is now working with a top-5 Hollywood film studio
Just imagine: in an advert for English tea, a figure of Winston Churchill speaks in Churchill’s voice. Not a voice that merely resembles Churchill’s, but his real one. The prime minister drinks tea, chats and puts on his famous bowler hat. In his voice there is no sign of splicing, odd timbre, accent or slurring. Everything is crisp and as realistic as possible. Instead of Churchill, it could be any living person or historical figure whose voice survives in recordings. Keanu Reeves, Margaret Thatcher, Victor Tsoi: anyone.
The Ukrainian startup Respeecher has learnt to synthesize voice almost perfectly. The team already has a contract with one of the top five Hollywood film studios. Alexander Serdiuk, CEO of Respeecher, told us about his project, which could make a lot of noise in film and video production and beyond.
What is Respeecher?
It is a service that creates new audio tracks based on the recorded voice of any person. If you have more than an hour of recordings of the target voice (the one you want to hear in the final result), you can make it “say” whatever you want.
The technology is based on proprietary deep learning methods for generating audio tracks. The model works on the “speech-to-speech” principle: you upload two audio tracks that contain recordings of the same phrases spoken by different voices. These are the target voice (the speaker we reproduce) and the source voice (the model will transform phrases spoken in this voice into the target voice). By analyzing recordings of the same content, the neural network learns the difference between the two voices and, as a result, can produce the desired voice from the other one.
All the emotional content (the pace of speech, peculiarities of pronunciation, meaningful intonation, accent) is taken from the target voice. This vocal apparatus is “transplanted” onto a new voice. The neural network can then synthesize the desired voice from the source voice with the help of the target voice.
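To make “analyzing recordings of the same content” more concrete: two people never say the same phrase at exactly the same speed, so the two recordings first have to be lined up in time. The Python sketch below uses dynamic time warping (DTW), a classic alignment technique from parallel-data voice conversion. Respeecher’s own method is proprietary and is not described in this interview, so DTW, the function name `dtw_align` and the one-number-per-frame features are all assumptions made purely for illustration.

```python
from typing import List, Tuple


def dtw_align(source: List[float], target: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two feature sequences; return total cost and matched frame pairs."""
    if not source or not target:
        raise ValueError("both sequences must be non-empty")
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = cheapest way to align the first i source and first j target frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(source[i - 1] - target[j - 1])
            cost[i][j] = distance + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Walk back from the end, always taking the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical per-frame features (for example, pitch) for one phrase.
source_frames = [1.0, 2.0, 3.0]        # source speaker says it quickly
target_frames = [1.0, 2.0, 2.0, 3.0]   # target speaker holds the middle sound longer
total_cost, pairs = dtw_align(source_frames, target_frames)
print(total_cost, pairs)
```

Each pair in `pairs` says which source frame corresponds to which target frame. A model trained on such aligned pairs can learn how one voice maps onto the other, which is the general idea behind training on two recordings of the same phrases.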
Where can this technology be used?
With the help of Respeecher it is possible to:
1. Create content with the voices of celebrities or historical figures (films, audiobooks, podcasts, radio programs).
Respeecher is working on one such project with the Massachusetts Institute of Technology. As part of it, together with the startup Canny AI, we are recreating Richard Nixon for the International Documentary Film Festival in Amsterdam. They are recreating the visuals; we are working on the voice. The figure of Nixon will deliver a speech about the American Moon landing. The speech really exists: it was written at the time, but he never delivered it in public.
2. Dub and post-synchronize films and series while preserving the required voices.
This is relevant when actors or dubbing actors physically cannot cope with the amount of work.
3. Post-synchronize video games.
Video games need a lot of voice recording, sometimes up to 30–40 hours. Famous actors often cannot spend a week in a studio. Respeecher lets another actor speak with the desired voice of a living person or historical figure.
4. Change the voice of call-centre operators, so that they can speak in the desired voice without an accent, or use different voices for different categories of customers.
5. Restore the voice of a person with partial loss of the ability to speak.
What technical challenges has the team met?
There are three main technical tasks which Respeecher’s team has to solve:
1. Master real-time processing. The technology cannot yet synthesize voice in streaming mode, because the solution we use now takes a long time to synthesize speech. Real-time processing is a clear and predictable engineering problem, and we plan to solve it within 6 months.
2. Teach the neural network to handle a wider range of emotions. Right now, it easily synthesizes calm, ordinary speech but can make mistakes if the source voice screams, sings or produces unnatural noises. Our goal is to generate the whole spectrum of possible emotions without errors.
3. Make the neural network work with less input data, so that synthesis requires much less than the current 1+ hours of the target voice and the source voice. This is a question of usability: recording a few hours of speech in a studio can take up to a few days, which is inconvenient for both the customer and the actor.
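The real-time requirement from the first task is often expressed as a real-time factor (RTF): processing time divided by the duration of the audio produced. Streaming is only possible when the RTF is below 1. The Python sketch below shows the arithmetic; the function names and all the numbers are hypothetical and are not measurements of Respeecher’s engine.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def can_stream(rtf: float) -> bool:
    """Audio must be produced faster than it is played back."""
    return rtf < 1.0


# Hypothetical offline engine: 120 s of computation for 10 s of speech.
offline = real_time_factor(processing_seconds=120.0, audio_seconds=10.0)
# Hypothetical streaming goal: 0.5 s of computation for each 1 s of speech.
goal = real_time_factor(processing_seconds=0.5, audio_seconds=1.0)

print(offline, can_stream(offline))  # 12.0 False
print(goal, can_stream(goal))        # 0.5 True
```

In practice a live system also needs headroom below 1, because network delay and buffering add to the processing time.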
How can you protect yourself from fakes, and why doesn’t Facebook delete deepfake videos?
Not long ago, as part of the Spectre project, a fake video appeared online in which a figure of Mark Zuckerberg speaks in Mark Zuckerberg’s voice. He states that billions of people are under his control and that he knows all their secrets. Facebook representatives said that they would not delete this or other fake videos.
I am sure that Facebook and many other far-sighted companies understand: sooner or later, the technology will be openly available. The task of companies like Facebook is not to shield the public from this tool, but to educate the public. By not deleting this video, Facebook lets people see what the technology can do and learn that it is unwise to believe everything you see on a screen, even if it looks like the plain truth.
Newspapers can print lies; television can lie and manipulate. The same goes here: a deepfake video is just another content format. The ways of delivering information differ, but all of them come down to the fact that people are inclined to lie, leave things unsaid or exaggerate. All of us need a healthy level of scepticism about what we hear and see.
We want to build our own artificial voice detector that can distinguish real speech from synthesized speech. Right now, we are thinking about how to integrate a watermarking system into our audio tracks, so that we know for certain that they were generated and that we were the ones who generated them.
In this area, we are talking to companies that build voice recognition and authentication systems. Our datasets can help them train their models to recognize synthesized speech better.
At Respeecher, we pay a lot of attention to security — our technology is available only to us. We receive the required recordings from the customers and give them the final recording — the artificial speech is synthesized on our side.
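The interview does not say how the planned watermark would work. As a rough illustration of the general idea only, the Python sketch below uses a textbook spread-spectrum approach: a quiet pseudo-random signal generated from a secret key is added to the audio, and only someone who knows the key can later detect it by correlation. The function names, the key, the strength value and the test tone are all assumptions; a production watermark would also have to survive compression and editing, which this toy does not attempt.

```python
import math
import random
from typing import List


def keyed_sequence(key: int, length: int) -> List[float]:
    """Pseudo-random +1/-1 sequence that only the key holder can regenerate."""
    rng = random.Random(key)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(length)]


def embed_watermark(samples: List[float], key: int, strength: float = 0.01) -> List[float]:
    """Add the keyed sequence to the audio at low amplitude."""
    mark = keyed_sequence(key, len(samples))
    return [s + strength * m for s, m in zip(samples, mark)]


def watermark_score(samples: List[float], key: int) -> float:
    """Correlate audio with the keyed sequence: near `strength` if marked, near 0 if not."""
    mark = keyed_sequence(key, len(samples))
    return sum(s * m for s, m in zip(samples, mark)) / len(samples)


def is_watermarked(samples: List[float], key: int, strength: float = 0.01) -> bool:
    """Decide by comparing the score with half of the embedding strength."""
    return watermark_score(samples, key) > strength / 2


# Ten seconds of a 220 Hz tone at 16 kHz stands in for synthesized speech.
tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16000) for n in range(160000)]
marked = embed_watermark(tone, key=2019)

print(is_watermarked(marked, key=2019))  # checked with the right key
print(is_watermarked(tone, key=2019))    # original audio, never marked
print(is_watermarked(marked, key=1234))  # marked audio, wrong key
```

With a recording this long, the first check should report the watermark and the other two should not, because the correlation with an unrelated sequence averages out close to zero.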
Is it necessary to get permission to use the voice from its owner or legal representatives?
Our position is unequivocal — yes, it is necessary. From all customers, we require permission to use the voice from its owner, relatives or representatives.
Is it possible to freely use the voice of a historical figure, for example, Che Guevara? I have no answer yet: this is one of the questions that we are trying to resolve together with lawyers. Legislation in this area is vague and varies greatly when it comes to personal rights, but we take great care to ensure that everything is done correctly and honestly. In fact, the market in this segment is only emerging, and we can help shape it. On the one hand, it’s cool that we are at the front of the pack. On the other hand, it’s not that cool: it delays the market entry of some projects, because we are the first who have to work out the legal nuances of such work.
We have not had much opportunity to talk to A-list celebrities and learn their opinion about the synthesis of their voices. However, there are signs that they are interested in licensing their voices for this kind of technology. Famous people will finally be able to satisfy the demand for their voices from the creators of cool content.
Imagine that you are Morgan Freeman. You are 82 years old, and people want to record everything in your voice, from simple advertising to cool projects: films, books, games, animation. Everyone wants Freeman’s voice, and he is forced to choose carefully, because he physically cannot spend more than N hours a week in the studio, and some projects may simply not appeal to him. Respeecher can fix this. We remove the physical limits on how much work a voice owner can do: his voice begins to “live” without him. Of course, the owner earns money from this.
We have signed a contract with a large Hollywood studio, one of Hollywood’s top five. This is a very cool company, but we cannot talk about this cooperation yet. Together with them, we are working on a legendary film that will be released fairly soon. For this project, we are reviving the voice of an actor who died several years ago. Not long ago, we received data from the film studio and sent them the first results, and the director of the film was impressed. We have already been sent the audio tracks that should be heard in the film, and now we are synthesizing the actor’s voice for them.
We also have a few other interesting customers. For instance, we have a project with a large British broadcaster, for whom we are reviving the voice of a famous historical figure. It will be heard on a radio show as if he had returned to our time especially for the occasion.
We are working with “live” voices, too. With one large outsourcing company, we will soon record personalized greetings for new employees in the voice of the CEO. They have offices all around the world and 10 new employees every day, so the CEO cannot possibly greet each of them personally.
Many interesting projects, including animation, are “warming up” in our acquisition funnel. Video game creators need a lot of post-synchronization, and it is hard for them to use famous voices: the owners of these voices are not ready to spend a lot of time in recording studios. This is where our technology comes in handy, synthesizing the voices of living people or historical figures.
In general, in the coming months we need to sign as many contracts as possible with companies in the top segment of the market. While the technology still requires a lot of our direct involvement, a project may take a month of full-time work by several people. Therefore, we are interested in having major players as our clients: with them, we will be able to solve really important market problems and get a decent reward for this work.
When the technology becomes more autonomous, we plan to move into wider segments of the content production market. By that time, we will already have real-time results from the engine and the first integrations with call centres.
As for investments, we already have two angel investments and 120 thousand dollars from Techstars. We are still at the stage of development where it is impossible to value a startup accurately, but in any case, our valuation in the next investment round will be market-based. Now we are preparing for the next round and building connections with investors.
Right now our monetization model is project-based contracts. Each contract covers the use of our solution within one project and for one pair of voices. We are improving the technology in order to automate the processes and hand some of them over to the customer, for their convenience. In the future, when the technology becomes more autonomous, it will be something like a SaaS model for studios.
Another interesting model that we are currently exploring is insurance. A movie studio or an actor can insure the actor’s physical ability to record the right phrases at the right time. If an insured event occurs, for example, the actor is hoarse when post-synchronization should take place, the studio will be able to use our technology.
For call centres, the “per user per month” model is probably suitable. For example, an operator works in Asia and has a noticeable accent. We can earn a few dollars per hour on each such call-centre employee, while the operators will be able to speak to customers in America or Europe without an accent. This is an interesting market, and the value that we can bring to it is very high.
How are Respeecher’s solutions better than competitors’?
We do not compete with startups that work on the visual part; their work complements our technology. In the video and animation synthesis industry, there are a number of awesome projects, such as Reflect, Canny AI and Synthesia, and each of them is good at something. Some are better at making a “face transplant”, some at using a 3D model, some at lip sync (animating the mouth to match the articulation of the actor’s voice; note by Startup Jedi).
We make the voice, and within that field we occupy a rather narrow niche: speech to speech, meaning that we synthesize speech from speech. The key qualitative difference between “speech to speech” and “text to speech” synthesis is that we let the content creator control the emotional side of speech, the way something is said. When speech is generated from text alone, the system can guess how to say certain phrases, but it does not know for sure. Respeecher gives you the opportunity to get the right emotional tone of speech, with all the semantic emphases and intonations.
Besides, text-to-speech engines will most likely not work in real time. Even if they do, they will not be applicable to the markets we are targeting, because they first have to extract text from a person’s speech (making mistakes at this stage) and then synthesize the text into a voice. This will take significantly more time than is acceptable for a telephone conversation or other real-time work. “Text to speech” will also be unable to handle unfamiliar pronunciations of names, geographical names that are not in dictionaries, or simply slurred words.
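The latency argument above comes down to simple addition: in a cascade, the delays of the speech-to-text and text-to-speech stages stack on top of each other. The Python sketch below adds up per-stage delays and compares them with a conversational budget. The 150 ms budget is the one-way delay figure commonly cited from ITU-T Recommendation G.114; the stage names and all stage latencies are hypothetical values chosen only to show the arithmetic, not measurements of any real system.

```python
from typing import Dict

# One-way delay commonly cited from ITU-T G.114 as comfortable for conversation.
CONVERSATION_BUDGET_MS = 150.0


def total_latency_ms(stages: Dict[str, float]) -> float:
    """Stages run one after another, so their delays add up."""
    return sum(stages.values())


def fits_budget(stages: Dict[str, float], budget_ms: float = CONVERSATION_BUDGET_MS) -> bool:
    """True if the whole pipeline stays within the delay budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical latencies in milliseconds.
cascade = {"speech_to_text": 300.0, "text_to_speech": 200.0, "network": 40.0}
direct = {"speech_to_speech": 90.0, "network": 40.0}

print(total_latency_ms(cascade), fits_budget(cascade))  # 540.0 False
print(total_latency_ms(direct), fits_budget(direct))    # 130.0 True
```

Whatever the real numbers turn out to be, a pipeline with fewer sequential stages has fewer delays to add up, which is the point being made about telephone conversations.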
How the project got into Techstars, which is qualitatively different from Y Combinator
Right now we are in the three-month Techstars acceleration program in Philadelphia. Our way here was interesting and difficult. My acquaintance with Techstars began with a meeting with several accelerator graduates in Kyiv. Then we happened to meet the managing partner of the London program, Eamonn Carey, on Twitter. A few weeks later he visited Ukraine, and we met and talked.
By that time, we had applied to Y Combinator for the second time and to two Techstars programs, in Singapore and Canada. They did not answer for a long time, and in the end our application was picked up by the Techstars program in London. We made it to the final interview and met a lot of interesting people along the way; one of them even invested in us as a business angel. However, we were not admitted to the London program: they had a very strong batch, with many companies more mature than ours.
In 2011 the Techstars accelerator launched the Techstars Network, an international community of organisations that provide startup acceleration programs the way Techstars does. Now there are 47 Techstars programs around the globe. Most of them run on the basis of their own accelerators. They can pass applications to one another if the batch at one “branch” is already full but the project deserves attention.
This is how the program executive in London passed our application to Techstars in Philadelphia. A week later we had a remote interview with the accelerator team in Philadelphia, and they took us.
We were also interviewed at Y Combinator, so we could compare the two approaches. While at Techstars you go through a series of in-depth interviews with different people, Y Combinator holds one ten-minute interview. After it, we were turned down on the grounds that Y Combinator does not see in our startup an opportunity to build a billion-dollar business.
Well, that is Y Combinator’s opinion; they have a completely different approach to working with startups. While Techstars takes 10–12 companies into one program, Y Combinator takes 160 (this was the number of companies in the batch we were not admitted to). They spend less time on each startup and focus more on giving startups the opportunity to attract financing using the Y Combinator brand. Techstars takes a lot of your time, and you get a ton of regular feedback in return. You need to be prepared for the fact that this feedback can often be quite tough, but at the same time very useful.
Another Techstars bonus is that all the startups in the program get access to a unique network: more than 300 thousand startup founders, investors, mentors and experts all around the world. I can write to a person who went through the Techstars program seven years ago and sold his company for a lot of money, and in 99% of cases I will get a quick response and help.
What is the acceleration process at Techstars?
The program is divided into several stages — each of them helps you to achieve the desired goals. The first week is an “Orientation week” during which you get to know how things work here.
After that comes a stage called Mentor Madness. For a couple of weeks, you meet or call 10+ mentors a day, with a 2-minute interval between meetings. Before each such day, you prepare thoroughly: you study the people you will be talking to and draw up specific questions. As a rule, mentors already know what you do, because they have volunteered beforehand to help startups in this particular batch.
In this way each project finds a number of mentors with whom it then works closely. Among them you need to choose the lead mentors, who will spend a little more time with you (at least one call per week and constant access by email). With the others, it is advisable simply to keep in touch. The mentors have very different backgrounds, so you can find those who can fill the important gaps in your skills. Mentors are a very valuable asset of the accelerator.
The next stage is building the acquisition funnel. All of Techstars’ resources are used to get you the contacts of the customers you need. This is extremely useful for us, because it is difficult to reach Hollywood studios from Ukraine on your own. The road is much simpler and shorter if you go through Techstars.
The next stage is fundraising, and this is where we are now. During “investor preview” week, representatives of small venture funds and business angels come to the accelerator. They may be interested in investing, but the main goal for both sides is quality feedback on our fundraising strategy, on our pitch and on the way we build relationships with investors in this round.
At the same time, we are actively preparing for the demo day. A lot of people will attend it, and right before it you have meetings with investors who are ready to put money into your project. Your task here is to close the round as quickly as possible, preferably immediately before the demo day or immediately after it. This is an opportunity for investors to enter projects that have not yet appeared on the open market. It is one of the small hacks that all accelerators with demo days use: investors understand that many other people will learn about the project’s success on the demo day, and they hurry to board the train first.
P.S. Respeecher is looking for talented deep learning engineers to join the team. Contact us at email@example.com
Author: Alex Litvin.
Translated by: Dmytro Basok.
Translation editor: Angelina Dmitruk.