For voice interaction, tap and talk and push-to-talk have very different implementations, challenges, and opportunities. The Echo, for example, employs tap and talk when someone presses the command button. In GUIs, microphone icons usually work as tap and talk as well.
Push-to-talk usually means that a user needs to push and hold down the button while speaking. Think about an intercom system (e.g. “Dr. Green, your 3 PM is here”) or a walkie-talkie. The benefit of push-to-talk is that the system knows you’ve stopped speaking when the button is released. There’s no trailing audio after the speech, so STT systems are likely to get clean audio. It’s also intuitive for the user: they know exactly when they need to speak.
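To make the push-to-talk idea concrete, here is a minimal sketch in Python. The class name `PushToTalkRecorder` and its methods are hypothetical and not taken from any particular device SDK; the sketch assumes the hardware layer calls `button_down`, `button_up`, and `on_audio_frame` for you.

```python
from typing import List


class PushToTalkRecorder:
    """Keeps only the audio captured while the button is held down.

    Hypothetical illustration: the method names are assumptions.
    """

    def __init__(self) -> None:
        self._held = False
        self._frames: List[bytes] = []

    def button_down(self) -> None:
        # Start a fresh utterance on every press.
        self._held = True
        self._frames = []

    def on_audio_frame(self, frame: bytes) -> None:
        # Frames that arrive while the button is up are dropped.
        if self._held:
            self._frames.append(frame)

    def button_up(self) -> bytes:
        # Releasing the button is the endpoint; no silence detection is needed.
        self._held = False
        utterance = b"".join(self._frames)
        self._frames = []
        return utterance
```

Because the release event marks the end of speech, the audio handed to STT contains nothing captured before the press or after the release.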
Tap and talk is a little different in that the device begins listening after the tap and then detects an endpoint in the speech. For this to be a good experience for the user, you need some type of acknowledgement that the device is listening and some type of acknowledgement that the endpoint was detected. The benefit is that the user doesn’t need to physically strain to hold a button while speaking, and the speech can stream during the recording so that the response comes faster.
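Endpoint detection can be done in many ways. The sketch below uses one of the simplest, a frame-energy threshold followed by a run of quiet frames. It is an illustration only: the names `EndpointConfig`, `rms`, and `find_endpoint` are hypothetical, and the threshold and frame counts are assumed values, not tuned recommendations.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class EndpointConfig:
    # Illustrative assumptions, not tuned values.
    energy_threshold: float = 0.02     # frame RMS at or above this counts as speech
    trailing_silence_frames: int = 25  # e.g. 25 frames of 20 ms = 500 ms of quiet


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def find_endpoint(frames: Iterable[List[float]], cfg: EndpointConfig) -> Optional[int]:
    """Return the index of the frame where the endpoint is declared, or None."""
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= cfg.energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Only count silence once the user has started talking.
            silent_run += 1
            if silent_run >= cfg.trailing_silence_frames:
                return i
    return None
```

The frame index returned here is the moment at which the device would give its acknowledgement that the endpoint was detected. A production system would more likely use a trained voice activity detector than a raw energy threshold.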
The drawback is that the timing needs to be set up properly for tap and talk to work well. Sometimes users wait too long after they tap, or they pause so long while speaking that the endpoint is detected before they have finished. Also, the acknowledgement that the tap was detected, if audio-based, can hurt endpoint detection or speech-to-text accuracy.
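The timing issues above map onto three separate knobs: how long to ignore audio around the acknowledgement tone, how long to wait for the user to start speaking, and how long a pause to tolerate before declaring the endpoint. The following is a rough sketch under stated assumptions: `run_tap_session` and `Outcome` are hypothetical names, the per-frame speech flags are assumed to come from some voice activity detector, and every millisecond value is a placeholder.

```python
from enum import Enum
from typing import Iterable, Tuple


class Outcome(Enum):
    NO_SPEECH_TIMEOUT = "no_speech_timeout"  # user waited too long after the tap
    ENDPOINT = "endpoint"                    # speech followed by a long enough pause
    STREAM_ENDED = "stream_ended"            # audio ran out before a decision


def run_tap_session(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    beep_guard_ms: int = 200,         # assumed length of the acknowledgement tone
    no_speech_timeout_ms: int = 3000,
    end_silence_ms: int = 700,
) -> Tuple[Outcome, int]:
    """Return the session outcome and the elapsed time in ms when it was decided."""
    elapsed = 0
    silence = 0
    heard_speech = False
    for is_speech in speech_flags:
        elapsed += frame_ms
        if elapsed <= beep_guard_ms:
            continue  # skip frames that may contain the acknowledgement tone
        if is_speech:
            heard_speech = True
            silence = 0
            continue
        silence += frame_ms
        if heard_speech and silence >= end_silence_ms:
            return Outcome.ENDPOINT, elapsed
        if not heard_speech and silence >= no_speech_timeout_ms:
            return Outcome.NO_SPEECH_TIMEOUT, elapsed
    return Outcome.STREAM_ENDED, elapsed
```

Raising `end_silence_ms` makes the system more forgiving of mid-sentence pauses but slows down the response, so the value is a trade-off rather than something with a single right answer.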
As a general rule, push-to-talk is probably better on portable devices that allow for a pistol grip, so the thumb or index finger can push the button while speaking. For stationary devices in quieter environments, tap and talk is probably better.