RJ has been in the ‘Voice AI’ industry for decades. The session is not about his company, Consig, but rather his opinions on the voice AI industry: where it’s been, where it is, and where it’s going. This is a dynamic time in the communications industry; as the ironic English curse goes, “may you live in interesting times.” That’s for sure. Check out Karel’s Voxist session on just how fast AI technology is developing.
RJ had a non-traditional childhood. He was raised in a hippy community in Santa Cruz, CA, was homeschooled, and has never spent a day in a classroom. At the age of 12 he discovered computers in the Santa Cruz public library, and they have been his passion ever since.
Back in 1993 he was able to talk to a Mac using PlainTalk. RJ rightly points out the ’90s were a terrible time for voice control. Nothing ever worked well. That is even true 20 years later with Siri and Alexa.
Back in 1997 RJ got his first job, working on Charles Schwab’s speech recognition system using Nuance and IBM. It was early days and it didn’t work. This first-generation system was an example of directed dialog. Remember those days, when a key press was recognized better than your voice? It was all flowchart-based, so really just a slower and more annoying IVR (Interactive Voice Response).
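To make the “flowchart” point concrete, here is a minimal sketch of a directed dialog in Python. The states, prompts, and the `next_state` helper are all hypothetical and only illustrate the pattern; they are not taken from the Schwab system or any real IVR platform.

```python
# Hypothetical directed-dialog flow: every state has a fixed prompt and a
# fixed set of accepted inputs (a spoken keyword or a key press).
FLOW = {
    "main": {
        "prompt": "Say 'quotes' or press 1. Say 'trade' or press 2.",
        "choices": {"quotes": "quotes", "1": "quotes", "trade": "trade", "2": "trade"},
    },
    "quotes": {
        "prompt": "Say 'main menu' or press 0 to go back.",
        "choices": {"main menu": "main", "0": "main"},
    },
    "trade": {
        "prompt": "Say 'main menu' or press 0 to go back.",
        "choices": {"main menu": "main", "0": "main"},
    },
}


def next_state(state: str, user_input: str) -> str:
    """Follow the flowchart; anything unrecognized re-prompts the same state."""
    token = user_input.strip().lower()
    return FLOW[state]["choices"].get(token, state)
```

Anything outside the expected keywords, such as “I’d like to check my balance,” simply leaves the caller in the same state to hear the prompt again, which is why these systems felt like a slower IVR.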
These first-generation systems led to the creation of the VoXML standards, a new phase of voice control, and the creation of Voxeo, from which Tropo emerged, a competitor to Twilio that was bought by Cisco.
Nevertheless, voice control over the web sucked. But Voxeo’s competitor TellMe (bought by Microsoft) focused on customers with 7-digit budgets, like American Express. With those budgets, voice interfaces with careful user interface design could work.
One of the projects RJ worked on was a Beavis and Butthead service for MTV. It had 200 stock prompts for when it did not understand what was said. That didn’t matter, as users assumed it was part of the service. Today one of the tricks is using filler words or phrases to cover the processing delays of speech-to-text based systems. I’m skipping over the intent-based models so we can focus on the language model phase.
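The filler trick can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `slow_pipeline` stands in for the real speech-to-text and response generation, and the threshold, filler phrase, and `speak` callback are hypothetical.

```python
import asyncio

FILLER_AFTER_SECONDS = 0.05  # hypothetical threshold; a real system would tune this
FILLER_PHRASE = "Hmm, let me check that."


async def slow_pipeline(utterance: str) -> str:
    """Stand-in for speech-to-text plus response generation."""
    await asyncio.sleep(0.2)  # simulated processing delay
    return f"Answer to: {utterance}"


async def respond(utterance: str, speak) -> None:
    """Speak a filler phrase if the real answer is not ready quickly enough."""
    task = asyncio.create_task(slow_pipeline(utterance))
    done, _pending = await asyncio.wait({task}, timeout=FILLER_AFTER_SECONDS)
    if not done:
        speak(FILLER_PHRASE)  # cover the gap while processing continues
    speak(await task)


def demo() -> list:
    """Collect what would be spoken, in order."""
    spoken = []
    asyncio.run(respond("what time do you open", spoken.append))
    return spoken
```

Calling `demo()` returns the filler phrase followed by the answer, because the simulated pipeline takes longer than the threshold.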
The language model phase is what brought RJ back to the industry, as language models put voice AI within reach of small and medium-sized businesses with budgets of 3 or 4 digits, not 7.
We’re still in the early days of voice AI UI, for example still relying on chat. OpenAI canvas is beginning the expansion into collaboration. RJ drew the analogy to the early days of the iPhone, where desktop principles were applied to the mobile UI. It took a couple of years to break that legacy thinking. And that’s where we are today.
I then asked RJ about speech to speech, and he highlighted that we’re just entering a new phase of full duplex. Voice control to date has always been request-response. Barge-in and conversational turn-taking are all being worked out at the moment. As the interface moves to conversations and natural language, there are many uncertainties. See Rob Pickering’s recommendations on speech to speech models.
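As a rough illustration of barge-in, the sketch below stops prompt playback as soon as the caller starts speaking. Everything here is a simulation under my own assumptions: `play_prompt` stands in for text-to-speech playback, and a simple timer stands in for voice activity detection. It is not how any particular speech to speech model handles it.

```python
import asyncio


async def play_prompt(words: list, played: list) -> None:
    """Stand-in for TTS playback: 'plays' one word at a time."""
    for word in words:
        played.append(word)
        await asyncio.sleep(0.05)


async def converse(words: list, user_speaks_after: float) -> list:
    """Play a prompt, but cut it off if the user starts talking (barge-in)."""
    played = []
    playback = asyncio.create_task(play_prompt(words, played))
    # Hypothetical stand-in for voice activity detection firing after a delay.
    user_speech = asyncio.create_task(asyncio.sleep(user_speaks_after))
    done, _pending = await asyncio.wait(
        {playback, user_speech}, return_when=asyncio.FIRST_COMPLETED
    )
    if user_speech in done and not playback.done():
        playback.cancel()  # barge-in: stop talking and start listening
    else:
        user_speech.cancel()  # prompt finished without interruption
    await asyncio.gather(playback, user_speech, return_exceptions=True)
    return played
```

In a request-response design the prompt would always play to the end; here the list of played words is cut short whenever the simulated caller interrupts.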
Remember that currently on OpenAI, voice and text are separate modes. They do not yet work together, and video will be added in time. It’s early days. Also, in speech mode there are no guardrails, as the text rails are not applicable. New rails will need to be added.
RJ sees a more important challenge to be addressed: multi-user mode, with conversations across multiple people and agents. I see this being discussed in applications like Industry 4.0 and vCons.
As the speed of innovation calms down and the focus moves to scale, latency, etc., that will be the time to revisit standards, as the patterns become more broadly adopted. Focus then moves more onto cost, but the role standardization plays is not yet clear.
Holly Depies joins the conversation and there is an interesting discussion on the role AI can play as a Virtual Compliance Officer. Compliance becomes a during-the-fact action, not an after-the-fact one.
Consig is focused on building voice AI agents that SMBs can use. They are building out some initial use cases. And as every Voice AI presentation has said, it’s important to remain flexible as the technology is moving fast.