When the voice of our connected devices becomes emotional
Special IRCAM Amplify forum 21 edition – chapter 4/5
Speakers
– Luc Julia, Chief Scientific Officer – Renault Group
– Jérôme Monceaux, CEO – Spoon.ai
AI will never replace humans, but it will be better than us in many specific areas. We will talk with AI and it will answer us.
Luc Julia, Chief Scientific Officer – Renault Group (co-creator of Siri)
We are defined by our voices. SpongeBob is no longer SpongeBob if we change his voice. If we work well enough on our character, people will get used to his voice. There is no need to create a human voice for that.
Jérôme Monceaux, CEO – Spoon.ai
Here comes the sound:
How can AI understand emotions and thoughts through sound?
We live in a world where humans and connected devices interact in countless ways, and more and more of those interactions go through voice and sound interfaces. Today, we already talk with our smartphones, cars, speakers and computers. Tomorrow, sound and voice will be at the center of our relationship with every technology. The IoT is growing very fast, led by voice assistants such as Alexa, Google Assistant or Siri. Speech synthesis (TTS), for example, is more realistic and efficient today than ever. Many large companies such as Samsung, Renault and Spoon are already working on expression and integration. The question is: how can we make robots and connected devices understand tone of voice and emotions, and enable them to find lifelike, appropriate answers?
When prosody rhymes with IoT
Changing our tone of voice to convey our emotions or intentions is something we have been able to do since the very beginning of humanity. As Frédéric Amadu, CTO at IRCAM Amplify, says: “In the IoT, everything is made by the voice, it’s a return to ancestral human traditions. With the machines, we started with buttons, then movement, and now the voice.” For him, the key to that voice-based relationship is prosody: “To play with frequencies to share empathy, sadness, anger… To lower the voice or slow down its rhythm… That is prosody.”
“Everything starts with prosodic variations”, says Jérôme Monceaux, CEO at Spoon. “We want AI to be able to slow down its own voice and play with its tone to be more efficient and pedagogic. That is our major goal!” Giving depth and contour to speech synthesis will be game-changing. “People won’t listen very well without the prosodic variations that humans can produce.”
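To make “slowing down the voice and playing with its tone” a little more concrete, here is a minimal Python sketch that wraps a sentence in the standard SSML `prosody` element, which many speech synthesizers accept. The emotion names, the `PROSODY_BY_EMOTION` mapping and the `to_ssml` helper are hypothetical and chosen only for illustration; this is not a description of how Spoon or IRCAM Amplify actually work.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from an intended emotion to SSML prosody settings.
# "rate" and "pitch" are standard attributes of the SSML <prosody> element;
# the emotion labels and the chosen values are illustrative assumptions.
PROSODY_BY_EMOTION = {
    "neutral": {"rate": "medium", "pitch": "medium"},
    "calm": {"rate": "slow", "pitch": "low"},
    "excited": {"rate": "fast", "pitch": "high"},
}


def to_ssml(text, emotion="neutral"):
    """Wrap text in an SSML <prosody> element for the given emotion.

    Unknown emotions fall back to the neutral settings.
    """
    settings = PROSODY_BY_EMOTION.get(emotion, PROSODY_BY_EMOTION["neutral"])
    return (
        "<speak>"
        f'<prosody rate="{settings["rate"]}" pitch="{settings["pitch"]}">'
        f"{escape(text)}"  # escape &, < and > so the markup stays well-formed
        "</prosody></speak>"
    )


if __name__ == "__main__":
    # A slower, lower voice for a reassuring message.
    print(to_ssml("Take your time, I am listening.", emotion="calm"))
```

The resulting string can be handed to any text-to-speech engine that supports SSML. Real prosody control is much finer grained than three presets, but the principle is the same: the words stay identical while rate and pitch carry the intention.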
More creativity, less humanity
Giving the sensation of emotions, yes. Trying to create AI that is like us, no. “We have to create new speech synthesis and innovate”, Jérôme Monceaux says. “Our character and its voice need to be caricatured. We are inventing a new species with its own characteristics which will need to be discovered.” But the specialist wants to push his idea much further: “We can speak without a voice or words, we can look, touch, point… It’s very multimodal.” Speaking without words… What is that all about? “I’m talking about sound signatures. People don’t need a human-like answer, or voices. A simple sound is efficient enough.”
That AI and robots will soon be able to understand variations in emotion is one thing. But will Asimov’s ideas become real in the future? “AI and robots will never replace humans”, says Luc Julia, co-creator of Siri, now at Renault Group. “But I think that a lot of small AIs will be better than us in very specific topics.” To get there, the most difficult problem is how AI can understand a word: not just recognize it, but really understand its meaning. “I’m sure AI will understand us very soon, and adapt its answers, based on its perception of our emotions”, he says. “We will be able to do spectacular things in very specific domains. Machines will make you believe they understand you for real, even if it’s not true.”
IRCAM Amplify technologies:
Prosody is our playing field
At IRCAM Amplify, we have specific technologies and expertise for IoT. We are very pleased to build on the work and studies of more than 100 scientists from IRCAM. This know-how about sound and voice dates back to the beginning of IRCAM, forty years ago. “Today, we are not trying to detect or transcribe words, but to put emotions in a synthesized voice”, says Frédéric Amadu, CTO of IRCAM Amplify.
A first attempt was made with an experience created by IRCAM Amplify.
“Instead of focusing only on words, we are focusing on intentions and emotions.” Our goal in robotics and IoT is to put our strong prosodic skills to work and adapt them to our clients’ and partners’ needs. “We are not doing speech synthesis, our clients and partners do. It’s a question of collaboration with all the actors in that field.”
Contact us to learn more about these technologies.
This article is from the newsletter The Whispers of Sound. Subscribe to learn more about the power of sound.