Why real-time voice AI is harder than it sounds
Real-time voice recognition has become so common that many of us now take it for granted. But that convenience is the product of years of deep learning research and products that yielded more frustration than results.
It turns out that real-time voice transcription is one of the hardest engineering problems in modern artificial intelligence, for reasons that have more to do with the foibles of human speech and our low tolerance for delay than with the underlying technology.
Voice is where many AI systems first break down, especially as companies rush to deploy agents in customer-facing environments, said Scott Stephenson, co-founder and chief executive officer of Deepgram Inc., developer of a scalable platform for automatic speech recognition and text-to-speech capabilities delivered via an application programming interface.
Human tolerance has its limits
“It has to do with real time,” he said. “If people are working with a product that isn’t expected to work in real time, they’ll allow more failures or silent failures.”
A misfiring chatbot can be retried. A voice assistant that pauses, misunderstands or responds awkwardly annoys the user. Those latency constraints mean “you have to get everything that you need to get done in 500 milliseconds or less,” Stephenson said.
Unlike text, which is standardized, speech is variable. The same word can sound dramatically different depending on accent, age, language, microphone quality, background noise or even where the speaker is standing. Stephenson called this one of the biggest problems in building robust speech systems.
Transcription tools have been around for years, but most worked well only with clean audio. Earlier rule-based speech systems were built from layered models, so errors made at one stage compounded in the next.
“Each of the models was maybe 80% or 85% accurate,” Stephenson said. “When you stack five of those together, you get down to 50% accuracy.”
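The arithmetic behind that compounding can be sketched in a few lines of Python. This is an illustration, not a description of any particular system: it assumes each stage's errors are independent, so end-to-end accuracy is simply the product of the per-stage accuracies. The function name `pipeline_accuracy` is hypothetical.

```python
import math


def pipeline_accuracy(stage_accuracies):
    """Return end-to-end accuracy for a chain of models.

    Assumption (not from the article): errors at each stage are
    independent, so the accuracies multiply.
    """
    for acc in stage_accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError("each accuracy must be between 0 and 1")
    return math.prod(stage_accuracies)


# Five stacked stages at the low and high end of the quoted range.
low = pipeline_accuracy([0.80] * 5)
high = pipeline_accuracy([0.85] * 5)
print(f"five stages at 80%: {low:.1%}")
print(f"five stages at 85%: {high:.1%}")
```

Under that independence assumption, five stages at 80% to 85% each land at roughly 33% to 44% overall, at or below the round figure Stephenson cited.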
Deep learning breakthrough
The breakthrough was end-to-end deep learning, in which models are trained directly on massive datasets and infer the rules themselves.
But even strong models are only part of the equation. Enterprise voice systems must be deployed like infrastructure, and the needs of business buyers are fundamentally different from those of consumers. “It has to have low latency, it has to have high throughput, it has to be reliable, it has to be debuggable, it has to be adaptable and get better over time,” Stephenson said.
Deployment options matter too. Many enterprises want voice recognition to run in their own environments for regulatory or privacy reasons. Deepgram delivers its technology using an API-first approach, but Stephenson said the differentiator is not the interface but the ability to deliver consistent performance at scale.
Measuring quality in voice recognition is more complex than many executives assume, he said. The primary metric for speech-to-text is word error rate: the number of words substituted, dropped or wrongly inserted, divided by the number of words actually spoken. “If your word error rate is 25% or less, you can get value,” he said. But perfection is unrealistic: “There really isn’t a zero percent word error rate,” even with humans.
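For readers who want to see how word error rate is typically computed, here is a minimal sketch using the standard word-level edit distance: substitutions, deletions and insertions divided by the number of words in the reference transcript. The function name and the sample sentences are illustrative, and the text normalization (lowercasing and splitting on whitespace) is a simplifying assumption. Production scoring tools normalize punctuation and numbers more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one dropped word ("the").
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.1%}")
```

Because insertions count as errors, the rate can exceed 100% when a system outputs many more words than were spoken.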
Voice generation is even harder to score objectively. Stephenson said it relies heavily on human preference testing with “tens or hundreds of people” across different scenarios.
The infrastructure burden is growing as voice agents increasingly rely on large language models and tool use behind the scenes. Latency is a physics problem at global scale. Real-time voice systems require regional endpoints because “the Earth is large enough that the speed of light matters,” Stephenson said. That’s why Deepgram is expanding its endpoint network to Europe this year, with Asia on deck.
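A back-of-the-envelope calculation shows why distance eats into a real-time budget. The sketch below estimates the minimum network round trip over a given distance, assuming signals in optical fiber travel at about two-thirds the speed of light in a vacuum. The 10,000 km distance and the function name are hypothetical, and the 500-millisecond budget is the figure Stephenson gave earlier. Real routes add routing, queuing and processing delays on top of this floor.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FRACTION = 2 / 3             # assumed propagation speed in fiber


def min_round_trip_ms(distance_km: float,
                      speed_fraction: float = FIBER_FRACTION) -> float:
    """Lower bound on round-trip time, ignoring all processing delays."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    speed_km_s = SPEED_OF_LIGHT_KM_S * speed_fraction
    return 2 * distance_km / speed_km_s * 1000


budget_ms = 500            # response budget cited in the article
distance_km = 10_000       # hypothetical user-to-endpoint distance
rtt = min_round_trip_ms(distance_km)
print(f"minimum round trip: {rtt:.0f} ms "
      f"({rtt / budget_ms:.0%} of a {budget_ms} ms budget)")
```

On those assumptions, a single round trip to an endpoint 10,000 km away consumes about a fifth of the budget before any model has run, which is the case for placing endpoints near users.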
Because of its inherent complexity, voice AI shouldn’t be viewed as an all-or-nothing proposition. Stephenson advised testing in a few scenarios where the vocabulary is limited and expanding from there. “Don’t try to boil the ocean,” he said.
Voice recognition may be the most natural interface humans have, but making it work reliably in real time requires disciplined engineering, global infrastructure and models trained to survive the chaos of the way people speak.
Image: SiliconANGLE/Meta AI