Transcript of Tomorrow Will Be Heard (S2E3), the podcast that deciphers the new uses of audio in our daily lives
Research and technology now make it possible to clone a voice and we’ll tell you all about it in the new episode of this podcast.
We’re checking in to the Time Hotel and Thierry Ardisson is there to welcome us and talk about this new idea for a show co-produced by Ardimages and Troisième Œil Productions: « Time Hotel aspires to revive vanished legends and let them narrate their life stories. So how do we revive them? With artificial intelligence. This means that, thanks to deep learning, we can reconstruct the face of Jean Gabin, Dalida, Coluche, Johnny Hallyday. And then we put these faces on actors who have embodied the character. That is to say, they must have the behavior of the person they will embody. Next, we put on the digital mask and it’s magic, it works. But the problem was the voice. Hence our idea to go to IRCAM, which I had known about since I was a teenager, thanks to Pierre Boulez and then Jean-Michel Jarre. I didn’t know if IRCAM could do that, but in any case I knew that was where you could get good sound. Quality sound. »
Thierry Ardisson met the staff of Ircam Amplify, who proposed a unique vocal technology to him, Voice Cloning.
Frederic Amadu, CTO of Ircam Amplify: « Voice cloning is a new technology from Ircam which enables the transposition of the voice of one person onto the voice of another person. And so, for the Time Hotel episode, one of the characters is Dalida, whom Thierry Ardisson wanted to revive to do an interview. But in fact, it was an actress playing the role of Dalida, and we managed to retrieve the voice of this actress and, thanks to the software, we could transform her voice into Dalida’s voice. So for this to work, we need to understand the way Dalida spoke. We recovered some recordings of Dalida’s voice from years ago. We fed them into the machine and an artificial intelligence system did an analysis of all the ways Dalida would pronounce the words. The system was also given a part of the actress’ voice so that it could also learn how the actress speaks, her voice tone, her speech patterns. And the machine learning system is able to find how to go from one to the other. »
A major innovation that relies on IRCAM’s technology and know-how.
Demonstration: « So if we listen to Dalida’s voice, it sounds like this: “I am a performer, that is to say, I supply the dream. So I don’t really need to talk about problems which are common to everyone.” This is the portion of Dalida’s voice that was fed into the system so that it could understand the timbre of Dalida’s voice. And this is the voice of the actress: “And then in 1960, I brought my family, but you, in ’56, that was much more violent.” And then, by applying the filter we created which transforms the voice of the actress into the voice of Dalida, we end up with this version: “My father was called Pietro and my mother Giuseppina and they were from only two families from a Calabrian village, Serrastretta, who had settled in Cairo.” »
This innovation totally convinced the production staff.
Christophe Pinguet, Producer at Troisième Œil Productions, explains: « The sound was, I would say, almost the biggest difficulty. When Thierry arrived, he had already developed his project. The images were pretty much sorted with Mac Guff, who had worked a lot, and so the visual effects were extremely impressive. We still had to find a solution for the sound, either with imitators, which would be a bit classic, or a technique as innovative as the one used for the visuals with artificial intelligence. So that was a real job. I have been producing documentaries or shows for a long time – the worst nightmare, actually, is the sound. And so, when we embarked on this project with IRCAM to work on the development of software to find real solutions, that fascinated me. I’ve been doing this job for a long time and this is the first time that I have been this fascinated by the technical side, with regard to the image as well as the sound. »
Frederic Amadu: « So the software that was developed, its advantage over what’s out there, is that it can keep the emotion, the sensitivity that an actor will give. What we take from Dalida is the timbre of her voice. We apply the timbre to a set of voices. If the actress speaks fast, we will have a result of Dalida speaking fast. If she speaks slowly, we’ll hear her speak slowly. If she’s grieving, we’ll hear her, we’ll hear the grief, and if she is joyful, we will hear the explosion of joy. So it’s the actress who has to play the emotion, she carries the emotion, the feeling, and the software adds the timbre of Dalida’s voice so that we can hear Dalida say these words. »
Technology opens up new territories of expression for the creative industries.
Thierry Ardisson: « When I saw the first deepfakes, I said to myself, “This is amazing!” Deepfakes were being used in some harmful ways, that is to say, to annoy people, to put them in inappropriate situations, let’s say, for supposedly comical purposes. And so I said, “Is there some way we can put this tool to good use?” So that’s what we did. Because actually deepfake has a very bad reputation. But deepfake is a tool, and that’s why I turned to artificial intelligence to spectacularize culture. That’s the slogan of the show: “Using artificial intelligence to spectacularize culture”. »
As far as sounds are concerned, the research conducted by IRCAM will allow us to go even further.
Nicolas Pingnelain, Sales Director of Ircam Amplify: « What is very new in Thierry Ardisson’s concept is the running time. Until now, we had voice cloning in a rather sporadic way. There was The Book of Boba Fett with Skywalker talking. Now we’re talking about 60 minutes of running time, which must be very good, because it’s a little bit like the voice assistants today. They are monotonous and robotic and they can be tolerated over a few seconds, but not for a whole conversation. We get tired very quickly. And that was the pitfall we absolutely had to avoid in terms of what we delivered. I think we can say that we did something that can be listened to in a very fun and enjoyable way over the entire duration of the program. 90 minutes of programming for about 60 minutes of Dalida’s recreated voice, it’s completely unique. “School, it was the nuns. The nuns were very strict, you couldn’t show a single inch of skin.” It was a commission that made us speed up a direction of research that already existed. And today, if we compare it to where we were six months ago, our results are just extraordinary. »
Synthetic voice or Voice Cloning, a major innovation provided by Ircam Amplify with the Sound Analysis and Synthesis team of the STMS laboratory, which includes IRCAM, CNRS, Sorbonne University and the Ministry of Culture.