That's nice, but the main problem with current voice turn-taking is different. It's that these systems don't know when it is their turn to speak.
When one human speaks to another, the listener interprets what is being said and anticipates when the speaker is finished. Voice agents don't work that way at all.
The speech-to-text front end just seems to have a hardcoded "pause" detector, e.g. 2 seconds: if 2 seconds of silence are detected, the "end of message" token is sent and the LLM starts talking. Even if you were just collecting your thoughts and weren't finished at all.
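A minimal sketch of that fixed-threshold behaviour (frame length, timeout, and function name are all illustrative, not any vendor's actual API): per-frame voice-activity flags come in, and the moment accumulated silence crosses the hardcoded threshold, the turn is cut off, no matter what was said.

```python
FRAME_MS = 30               # one VAD flag every 30 ms (illustrative)
SILENCE_TIMEOUT_MS = 2000   # the hardcoded "2 seconds"

def detect_end_of_message(vad_frames):
    """Return the frame index where end-of-message fires, or None.

    vad_frames is a sequence of booleans, True = speech detected.
    The semantic content never enters the decision at all.
    """
    silent = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            silent = 0                      # any speech resets the clock
        else:
            silent += 1
            if silent * FRAME_MS >= SILENCE_TIMEOUT_MS:
                return i                    # turn cut off here
    return None
```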
So the semantic content of what you are saying is completely ignored for turn-taking and no analysis takes place which would determine whether the user is likely to have said everything they wanted to say.
Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over". Which was of course common in half-duplex radio where only one person could transmit. LLMs are half-duplex too: they can't listen and talk at the same time.
> Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over".
That doesn’t sound very conversational at all. Instead one could train the network to recognise the appropriate turn-taking points.
The simple way to do that is to make the model output a "listen a bit more" token when it is not yet its turn to talk. You can use real-life recorded conversations to build the initial training set, and then add more data from cases where clashes happen (where the AI and the speaker talk over each other).
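One way to derive that initial training set from recorded conversations, as a sketch (the turn format and function name are my assumptions): slice each speaker turn into chunks, labelling a chunk WAIT if the speaker kept talking afterwards and RESPOND if it ends at a genuine turn boundary.

```python
WAIT, RESPOND = 0, 1

def label_chunks(turns, chunk_ms=500):
    """Turn (start_ms, end_ms, speaker) tuples into (time, label) examples.

    Every checkpoint inside a turn is a WAIT example ("listen a bit
    more"); the turn's actual end is the lone RESPOND example, which is
    why the positive class is so rare in this kind of dataset.
    """
    examples = []
    for start, end, speaker in turns:
        t = start + chunk_ms
        while t < end:
            examples.append((t, WAIT))       # speaker kept talking
            t += chunk_ms
        examples.append((end, RESPOND))      # genuine turn boundary
    return examples
```

The heavy imbalance toward WAIT is visible immediately: a 1.2-second turn yields two WAIT examples and a single RESPOND.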
More complicated would be a system where the model is periodically fed the audio so far, and it predicts what the speaker is likely going to say and, based on that, when it is appropriate to respond and with what. A smaller, faster, local model can then verify whether what was actually said matches the prediction, and if so output the pre-generated response. If there is a mismatch, it re-engages the more expensive model to come up with a new prediction.
If you engineer this right you can reuse the state vector from save points and save a bit of compute that way.
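The speculate-and-verify loop could look roughly like this (both model interfaces are pure assumptions here, passed in as plain callables so nothing concrete is implied about either model):

```python
def respond_speculatively(predict_turn, matches, audio_so_far, new_audio):
    """Sketch of the two-model loop described above.

    predict_turn stands in for the large model: given audio, it returns
    (predicted_continuation, draft_reply). matches stands in for the
    small local verifier: did the new audio match the prediction?
    """
    prediction, draft_reply = predict_turn(audio_so_far)   # speculate early
    if matches(new_audio, prediction):
        return draft_reply          # cheap path: the guess held
    # Mismatch: re-engage the expensive model on the full audio.
    # In a real system this is where a cached state vector from the
    # last save point would be reused instead of recomputing from zero.
    _, new_reply = predict_turn(audio_so_far + new_audio)
    return new_reply
```

The structure is essentially speculative decoding applied to conversation turns: the expensive call is amortised over the cases where the cheap verifier confirms the guess.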
Asking the user to say “over” at the end of their turn is the most heavy-handed solution. Recognising the flow of a conversation is just pattern recognition. That is what machine learning is good at.
> Recognising the flow of a conversation is just pattern recognition. That is what machine learning is good at.
And surprisingly hard to do well in practice. My guess is that the problem is that there is very little information in your training dataset (because only the transition from "talking" to "done talking" matters), but the actual knowledge required to perform well is large (up to and including full speech recognition, in theory). So even with over a terabyte of training data, your choices are a small model that performs badly or a large(r) model that overfits severely.
It's possible there was something I was overlooking when I tried it, though. I couldn't think of a good way to confirm my guess experimentally.
The "listen a bit more" token sounds interesting, but I'm not sure whether it would actually work better than the current solution which just waits for a sufficiently long pause. Maybe both could be combined.
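One simple way to combine them, as a sketch (every threshold here is an illustrative guess): require both a short pause and a confident "done talking" score from the learned classifier, with a long hard timeout as a safety net so the agent never stalls indefinitely.

```python
def should_respond(silence_ms, p_done, min_pause_ms=300,
                   p_done_threshold=0.8, hard_timeout_ms=3000):
    """Combine a pause detector with a learned end-of-turn probability.

    p_done is the classifier's probability that the user has finished.
    The classifier can only end a turn during an actual pause, and a
    very long silence ends the turn regardless of the classifier.
    """
    if silence_ms >= hard_timeout_ms:
        return True   # safety net: never wait forever
    return silence_ms >= min_pause_ms and p_done >= p_done_threshold
```

This also degrades gracefully: if the classifier is unreliable, the hard timeout makes the system behave no worse than today's pause-only approach.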