Unsettlingly Realistic AI Voice Demo Evokes Astonishment and Unease Online

An example argument with Sesame’s CSM created by Gavin Purcell.
Gavin Purcell, a host of the AI for Humans podcast, shared a video on Reddit in which a human plays an embezzler arguing with an AI that plays their boss. The exchange is realistic enough that it is hard to tell which speaker is the human and which is the AI. Based on our own tests, the system, Sesame’s Conversational Speech Model (CSM), has impressive capabilities.
The Technology Behind Sesame’s CSM
Sesame’s CSM creates its realistic speech with two AI models that work together: a backbone and a decoder. Both are based on Meta’s Llama architecture and handle text and audio data together. Sesame has developed three model sizes, the largest with 8.3 billion parameters: an 8 billion parameter backbone plus a 300 million parameter decoder. Sesame trained these models on around 1 million hours of mostly English audio recordings.
A New Approach to Speech Generation
Unlike many older text-to-speech systems, Sesame’s CSM does not work in two steps. Traditional systems first create high-level speech representations and then fill in the fine-grained audio details in a second step. Sesame’s model instead handles both tasks in a single system that processes text and audio together, which leads to more natural-sounding speech. OpenAI’s voice model uses a similar combined approach.
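To make the single-system idea more concrete, here is a minimal Python sketch of how text and audio could be placed in one sequence for one model to process. Everything in it is an assumption for illustration: the `Token` type, the `interleave_turns` function, the turn-by-turn ordering, and the integer IDs are hypothetical and are not taken from Sesame’s code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """One element of the combined sequence (hypothetical layout)."""
    modality: str  # "text" or "audio"
    value: int     # placeholder ID; real tokenizers are not modeled here


def interleave_turns(turns: List[Tuple[List[int], List[int]]]) -> List[Token]:
    """Merge per-turn text IDs and audio IDs into one ordered sequence.

    Each turn is a (text_ids, audio_ids) pair. A two-step system would
    handle these in separate passes; here they share a single sequence.
    """
    sequence: List[Token] = []
    for text_ids, audio_ids in turns:
        sequence.extend(Token("text", t) for t in text_ids)
        sequence.extend(Token("audio", a) for a in audio_ids)
    return sequence


# Illustrative input: two conversational turns with made-up IDs.
example_turns = [([1, 2], [100, 101, 102]), ([3], [103])]
combined = interleave_turns(example_turns)
print([token.modality for token in combined])
```

The point of the sketch is only the data layout: a single model that sees one mixed sequence can take the words and the sound of earlier turns into account at the same time.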
Testing Performance: How CSM Stacks Up Against Human Voices
In tests where human evaluators listened to speech samples from both the CSM model and real human speakers, the results were close. When evaluators had no context for the conversations, they showed no clear preference between the AI-generated speech and the human recordings. This suggests that, for isolated samples, Sesame’s CSM can produce speech that is nearly indistinguishable from human voices.
However, when evaluators had context for the conversations, they preferred real human voices over the AI-generated ones. This shows that while the technology is advancing quickly, it still has room to improve at producing speech that fits naturally into a conversation.
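A preference test of this kind can be summarized as the share of judgments that went to each side. The Python sketch below shows that arithmetic under stated assumptions: the `preference_rates` function and the labels are hypothetical, and the sample judgments are invented placeholders, not Sesame’s results.

```python
from collections import Counter
from typing import Dict, List


def preference_rates(judgments: List[str]) -> Dict[str, float]:
    """Return the fraction of judgments that favored "ai" and "human".

    Each judgment is the label of the sample an evaluator preferred.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    total = len(judgments)
    return {label: counts[label] / total for label in ("ai", "human")}


# Invented placeholder data, for illustration only.
without_context = ["ai", "human", "ai", "human"]
with_context = ["human", "human", "human", "ai"]

print(preference_rates(without_context))  # {'ai': 0.5, 'human': 0.5}
print(preference_rates(with_context))     # {'ai': 0.25, 'human': 0.75}
```

A split near one half corresponds to "no clear preference", while a split that leans toward human recordings corresponds to the gap evaluators noticed once they had context.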
Challenges and Future Improvements
Brendan Iribe, co-founder of Sesame, has openly discussed the current limitations of the CSM. In a comment on Hacker News, he said the AI can be overly eager and does not always get the tone, pacing, and style right for genuine conversation. He also pointed to problems with interruptions and with how smoothly conversations flow. He remarked, “Right now, we’re in a tough spot, but we’re hopeful we can improve.”
Potential Applications of Sesame’s CSM
The capabilities of CSM hold exciting possibilities across various industries. Here are some potential uses:
- Customer Service: Businesses can use realistic AI voices to handle customer inquiries, providing quick and efficient service.
- Content Creation: Creators can leverage CSM for audiobooks or narrations, reaching audiences who prefer listening over reading.
- Games and Virtual Reality: Game developers could enhance player experiences by integrating conversational AI that responds naturally to player actions.
- Personal Assistants: More human-like AI voices can make personal assistants feel more relatable and engaging.
As this technology continues to develop, it could reshape how we interact with AI, making it feel more like a conversation with a person than with a machine.