Voice interaction can be fickle. It can vary a lot depending on our mood, the request we’re making, or the device we’re interfacing with.
One of the biggest killers of a good voice interaction is latency. I wrote about how we equate speed of response with intelligence. However, depending on how a voice interaction system is set up, any increase in network latency can be multiplied as much as 6x in the response time.
With the Echo or an Alexa Voice Service-enabled device, the device streams audio to Alexa, and Alexa then streams the audio response back to the device. There are two trips: one from the device to the cloud and one back. It's likely that the Google Home uses the same setup.
The result is that if putting the Echo behind a brick wall causes an additional 100 ms in ping time, it might increase the response time by 200 ms.* (*note this is really simplified math and doesn’t account for many other variables).
In the case of the Ubi, we had two round trips. The first was to the Android Speech Recognizer for speech-to-text (STT). The next was up to the Ubi Portal, where we ran NLU and other rules and then pushed the response back to the Ubi as text. We'd then use a local text-to-speech engine to speak the result. In the same scenario, the 100 ms increase in latency could cause a 400 ms delay.
The problem is amplified by devices that also use cloud-based text-to-speech. The result is then three round trips, so a 600 ms delay (very noticeable). This might push a response to 2–3 seconds.
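The back-of-the-envelope math above can be sketched in a few lines. This follows the article's simplified model (each round trip pays the latency increase twice, once per leg); the architecture labels and numbers are illustrative, not measured.

```python
# Toy model of the simplified latency math: each cloud round trip adds
# two network legs, so an increase in one-way latency is paid twice per trip.

def added_delay_ms(extra_one_way_ms: float, round_trips: int) -> float:
    """Extra response delay caused by an increase in one-way network latency."""
    return extra_one_way_ms * 2 * round_trips

# Illustrative architectures with their cloud round-trip counts.
architectures = {
    "Echo/Alexa (STT + NLU + TTS in one cloud call)": 1,
    "Ubi (cloud STT, then cloud NLU, local TTS)": 2,
    "Cloud STT + cloud NLU + cloud TTS": 3,
}

for name, trips in architectures.items():
    print(f"{name}: +{added_delay_ms(100, trips):.0f} ms")
```

A 100 ms bump in one-way latency thus costs 200, 400, or 600 ms depending on how many trips the pipeline makes.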
The problem is also compounded by conversational devices that require multiple turns per interaction. With a long delay on each turn, the conversation becomes unbearable and breaks down.
Ideally, the more processing that can be batched either locally or in the cloud to reduce round trips, the better. Otherwise, there are a few psychological tricks that can be used to make users more patient with responses.
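One such trick can be sketched as follows: acknowledge the user immediately (an earcon or a filler phrase) while the slow cloud request runs in the background, so the silence never feels dead. This is a minimal illustration, not how any of these devices actually implement it; `play_earcon` and `query_cloud` are hypothetical stand-ins.

```python
import threading
import time

def play_earcon() -> None:
    # Hypothetical: play a short acknowledgment sound right away.
    print("*ding*  (immediate acknowledgment)")

def query_cloud(utterance: str) -> str:
    # Hypothetical: simulate the slow cloud round trips.
    time.sleep(0.6)
    return f"Response to: {utterance}"

def handle_utterance(utterance: str) -> str:
    result = {}
    worker = threading.Thread(
        target=lambda: result.update(reply=query_cloud(utterance))
    )
    worker.start()   # kick off the slow request first...
    play_earcon()    # ...then fill the silence immediately
    worker.join()    # wait for the real answer
    return result["reply"]

print(handle_utterance("what's the weather?"))
```

The perceived wait starts after the earcon, not after the user stops speaking, which buys a few hundred milliseconds of patience at no architectural cost.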