That's nice, but the main problem with current voice turn-taking is different. It's that these systems don't know when it is their turn to speak.
When one human speaks to another, the listener interprets what is being said and anticipates when the speaker is finished. Voice agents don't work that way at all.
The speech-to-text front end just seems to have a hardcoded "pause" detector, e.g. 2 seconds: if 2 seconds of silence are detected, the "end of message" token is sent and the LLM starts talking. Even if you were just collecting your thoughts and weren't finished at all.
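A minimal sketch of that fixed-threshold behaviour (frame length, timeout, and function name are all illustrative, not any vendor's actual API): per-frame voice-activity flags come in, and the moment accumulated silence crosses the hardcoded threshold, the turn is cut off, no matter what was said.

```python
FRAME_MS = 30               # one VAD flag every 30 ms (illustrative)
SILENCE_TIMEOUT_MS = 2000   # the hardcoded "2 seconds"

def detect_end_of_message(vad_frames):
    """Return the frame index where end-of-message fires, or None.

    vad_frames is a sequence of booleans, True = speech detected.
    The semantic content never enters the decision at all.
    """
    silent = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            silent = 0                      # any speech resets the clock
        else:
            silent += 1
            if silent * FRAME_MS >= SILENCE_TIMEOUT_MS:
                return i                    # turn cut off here
    return None
```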
So the semantic content of what you are saying is completely ignored for turn-taking and no analysis takes place which would determine whether the user is likely to have said everything they wanted to say.
Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over". Which was of course common in half-duplex radio where only one person could transmit. LLMs are half-duplex too: they can't listen and talk at the same time.
> Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over".
That doesn’t sound very conversational at all. Instead one could train the network to recognise the appropriate turn-taking points.
The simple way to do that is to make the model output a "listen a bit more" token when it is not yet its turn to talk. You can use real-life recorded conversations to build the initial training set, and then add more data from cases where clashes happen (where the AI and the speaker talk over each other).
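One way to derive that initial training set from recorded conversations, as a sketch (the turn format and function name are my assumptions): slice each speaker turn into chunks, labelling a chunk WAIT if the speaker kept talking afterwards and RESPOND if it ends at a genuine turn boundary.

```python
WAIT, RESPOND = 0, 1

def label_chunks(turns, chunk_ms=500):
    """Turn (start_ms, end_ms, speaker) tuples into (time, label) examples.

    Every checkpoint inside a turn is a WAIT example ("listen a bit
    more"); the turn's actual end is the lone RESPOND example, which is
    why the positive class is so rare in this kind of dataset.
    """
    examples = []
    for start, end, speaker in turns:
        t = start + chunk_ms
        while t < end:
            examples.append((t, WAIT))       # speaker kept talking
            t += chunk_ms
        examples.append((end, RESPOND))      # genuine turn boundary
    return examples
```

The heavy imbalance toward WAIT is visible immediately: a 1.2-second turn yields two WAIT examples and a single RESPOND.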
More complicated would be a system where the model is periodically fed the audio so far, and it predicts what the speaker is likely going to say and, based on that, when it is appropriate to respond and with what. A smaller, faster, local model can then verify whether what was actually said matches the prediction, and if so output the pre-generated response. If there is a mismatch, it re-engages the more expensive model to come up with a new prediction.
If you engineer this right you can reuse the state vector from save points and save a bit of compute that way.
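The speculate-and-verify loop could look roughly like this (both model interfaces are pure assumptions here, passed in as plain callables so nothing concrete is implied about either model):

```python
def respond_speculatively(predict_turn, matches, audio_so_far, new_audio):
    """Sketch of the two-model loop described above.

    predict_turn stands in for the large model: given audio, it returns
    (predicted_continuation, draft_reply). matches stands in for the
    small local verifier: did the new audio match the prediction?
    """
    prediction, draft_reply = predict_turn(audio_so_far)   # speculate early
    if matches(new_audio, prediction):
        return draft_reply          # cheap path: the guess held
    # Mismatch: re-engage the expensive model on the full audio.
    # In a real system this is where a cached state vector from the
    # last save point would be reused instead of recomputing from zero.
    _, new_reply = predict_turn(audio_so_far + new_audio)
    return new_reply
```

The structure is essentially speculative decoding applied to conversation turns: the expensive call is amortised over the cases where the cheap verifier confirms the guess.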
Asking the user to say “over” at the end of their turn is the most heavy-handed solution. Recognising the flow of a conversation is just pattern recognition. That is what machine learning is good at.
> Recognising the flow of a conversation is just pattern recognition. That is what machine learning is good at.
And surprisingly hard to do well in practice. My guess is that the problem is that there is very little information in your training dataset (because only the transition from "talking" to "done talking" matters), but the actual knowledge required to perform well is large (up to and including full speech recognition, in theory). So even with over a terabyte of training data, your choices are a small model that performs badly or a large(r) model that overfits severely.
It's possible there was something I was overlooking when I tried it, though. I couldn't think of a good way to confirm my guess experimentally.
The "listen a bit more" token sounds interesting, but I'm not sure whether it would actually work better than the current solution which just waits for a sufficiently long pause. Maybe both could be combined.
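One simple way to combine them, as a sketch (every threshold here is an illustrative guess): require both a short pause and a confident "done talking" score from the learned classifier, with a long hard timeout as a safety net so the agent never stalls indefinitely.

```python
def should_respond(silence_ms, p_done, min_pause_ms=300,
                   p_done_threshold=0.8, hard_timeout_ms=3000):
    """Combine a pause detector with a learned end-of-turn probability.

    p_done is the classifier's probability that the user has finished.
    The classifier can only end a turn during an actual pause, and a
    very long silence ends the turn regardless of the classifier.
    """
    if silence_ms >= hard_timeout_ms:
        return True   # safety net: never wait forever
    return silence_ms >= min_pause_ms and p_done >= p_done_threshold
```

This also degrades gracefully: if the classifier is unreliable, the hard timeout makes the system behave no worse than today's pause-only approach.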