Using speech-to-speech models in conversational chatbots. Rob Pickering.

Rob is a TADSummit and TADHack regular. He provides, in my opinion, a critical litmus test on whether a new communications technology has potential, the issues to consider, and whether it's ready. At the end of this post I include the pitch Rob gave for his hack RTCEmergency at the first TADHack in 2014. Time flies when you're having fun! His slides for this presentation are available here.

Last year Rob presented "LLMs on the telephone: useful tool, or hallucinating danger to humanity?" Video. Slides.

He also presented at TADSummit 2014 in Istanbul, then in 2018 on "The Emerging Dichotomy: Centralized versus Decentralized Communications. What it Means to Your Business," and in 2019 on "Will machines ever converse authentically?"

Rob is covering speech-to-speech agents, which avoid TTS (Text to Speech) and S2T (Speech to Text), enabling the LLM to converse directly without bringing text into the model. There's interest in more natural-sounding agents and the potential acceleration in performance, but also worries about lock-in.

Rob's been tracking conversational AI since 2019, see below. He experiments with the different models, and year on year produces excellent demonstrations of the leaps and bounds happening in voice AI.

With the text-based models there were a number of issues, such as turn prediction. In text there's a return key, so it's quite easy. With voice, is it a pause for effect or a turn in the conversation? Latency was also an issue, given the cascaded pipeline of S2T, LLM, and TTS. Amongst many other issues, hence the interest in speech to speech.
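The latency point can be illustrated with a rough sketch. In a cascaded pipeline each stage must finish (or at least emit its first output) before the next can start, so the delays add up before the caller hears anything; a speech-to-speech model has a single time-to-first-audio. The per-stage figures below are hypothetical, not measurements from Rob's talk.

```python
# Illustrative (hypothetical) per-stage latencies for a cascaded
# S2T -> LLM -> TTS pipeline versus a single speech-to-speech model.

def cascaded_latency_ms(s2t_ms: int, llm_first_token_ms: int, tts_first_audio_ms: int) -> int:
    # Stages run sequentially: each needs the previous stage's output.
    return s2t_ms + llm_first_token_ms + tts_first_audio_ms

# Assumed example figures only.
cascaded = cascaded_latency_ms(s2t_ms=300, llm_first_token_ms=500, tts_first_audio_ms=200)
speech_to_speech = 300  # one model emits audio tokens directly

print(cascaded)          # 1000 ms before first audio
print(speech_to_speech)  # 300 ms
```

Streaming each stage can claw some of this back, but the cascade still sets a floor on responsiveness, which is the gap speech tokenization closes.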

Rob then shows a demo given in June 2024 at EMF; they were still using the text model and faking responses to hide latency. It had reached "good enough" status.

Then Rob demos the Ultravox model; the response is almost immediate, a big leap forward through speech tokenization. It was first demoed in September at a conference in Poland. Ultravox is open source, so the industry is not locked into OpenAI alone.

Then in October OpenAI released its speech-to-speech model (GPT-4o Realtime), after some delay, given it was announced in the Spring update. However, the pricing remains an issue: 6c per minute for audio in, 24c per minute for audio out. We'll hear from Lyle of Vida on his pricing experiences; currently the charging ends up even higher than published! This is likely to limit usage and is probably nowhere near the final pricing.
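A back-of-envelope calculation at the published rates quoted above shows why this pricing worries people. The talk ratio here is an assumption of mine (the agent speaking half the call), and as Lyle's experience suggests, real-world billing can come out higher still.

```python
# Rough call-cost estimate at the published GPT-4o Realtime audio rates
# quoted in the post: $0.06/min audio in, $0.24/min audio out.
AUDIO_IN_PER_MIN = 0.06
AUDIO_OUT_PER_MIN = 0.24

def call_cost_usd(minutes: float, agent_talk_ratio: float = 0.5) -> float:
    """agent_talk_ratio: assumed fraction of the call where the agent speaks."""
    audio_in = minutes * AUDIO_IN_PER_MIN                      # model listens the whole call
    audio_out = minutes * agent_talk_ratio * AUDIO_OUT_PER_MIN # model speaks part of it
    return audio_in + audio_out

# A 10-minute call where the agent talks half the time:
print(round(call_cost_usd(10), 2))  # 1.8 -> 18c per minute
```

At roughly 18c per minute, that is an order of magnitude above typical wholesale telephony termination rates, which is why it's hard to see this as finalized pricing.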

Both Ultravox and the GPT-4o model performed well.

There is a fundamental business model issue with speech to speech. The AI provider goes from being an entirely stateless provisioner of GPU FLOPS anywhere on the planet to someone that has per client state and Mbit/s of full time streaming comms per client. Do they really want to be in that business?
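The "Mbit/s of full-time streaming comms per client" point can be made concrete with a rough estimate. The audio format here is my assumption (16 kHz, 16-bit mono PCM, uncompressed, full duplex), not anything stated in the talk.

```python
# Rough per-client bandwidth if the AI provider must hold a full-duplex
# audio stream per caller. Format assumptions: 16 kHz, 16-bit mono PCM,
# uncompressed, in each direction.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit
DIRECTIONS = 2         # caller -> model and model -> caller

bits_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 8 * DIRECTIONS
print(bits_per_second / 1_000_000)  # 0.512 Mbit/s per client, before codec savings
```

Codecs such as Opus reduce this substantially, but the structural point stands: unlike stateless token inference, every concurrent caller pins a long-lived, latency-sensitive stream to the provider's infrastructure.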

Rob has updated his testing framework given all the incremental developments this year. Just go to https://github.com/aplisay/llm-agent to download the latest version.

In conclusion:

  • Highly likely that multimedia approaches will win out and deliver a much better experience long term.
  • This will always be more expensive (though perhaps not by much) and offer less control than a discrete component pipeline approach.
  • There is a chance of market capture when the AI provider has to build so much of the solution so maintaining a healthy ecosystem is going to be important.
  • Rob will continue to build agnostic architectures.
