
Meshing with AI

How do we get along better with the devices we speak to?

One of the terms that came up often last week at the Conversational Interactions conference was prosody: the ability to add inflection and rhythm to a voice.

Newer text-to-speech (TTS) services are starting to support it. For Chinese-language TTS, it’s a requirement, which is why even earlier Chinese TTS services seem more realistic. For English TTS services, prosody was neglected for some time, and the voices were flat and often robotic sounding.

We’re finally starting to hear more realistic TTS voices, including those based on WaveNet.

At least for voice-only interaction, there are a couple of cheats that we can use to make our devices more relatable:

  1. Match accent. We tend to think that only other people have accents and that we don’t. Matching the user’s accent is the biggest win.
  2. Match cadence. Adjusting the TTS word rate to that of the speaker also makes the device more relatable: the user is less likely to feel that the device is speaking too quickly or too slowly.
  3. Match emphasis. This can be done by slowing the TTS word rate for particular words or by adding pauses.
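As a rough illustration of the cadence and emphasis tweaks, here is a minimal Python sketch that builds SSML, the W3C markup many TTS engines accept. The function names, the 160 words-per-minute baseline, and the 80% to 120% clamp are assumptions made for this example, not values from any particular service. It also assumes the engine reads a `prosody` rate percentage as a share of its default speed, so check your engine’s documentation before relying on it.

```python
from xml.sax.saxutils import escape

# Assumed default speaking rate of the TTS voice; measure it for your own voice.
BASELINE_WPM = 160.0


def user_words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Estimate the speaker's word rate from a transcript and its duration."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(transcript.split()) * 60.0 / duration_seconds


def matched_rate_percent(user_wpm: float, baseline_wpm: float = BASELINE_WPM,
                         low: int = 80, high: int = 120) -> int:
    """Scale the TTS rate toward the user's rate, clamped to a safe range."""
    raw = round(100.0 * user_wpm / baseline_wpm)
    return max(low, min(high, raw))


def build_ssml(text: str, rate_percent: int, emphasized=(),
               emphasis_rate_percent: int = 85, pause_ms: int = 200) -> str:
    """Wrap text in SSML, pausing before and slowing each emphasized word."""
    emphasized_lower = {word.lower() for word in emphasized}
    parts = []
    for word in text.split():
        bare = word.strip(".,!?;:").lower()
        token = escape(word)  # keep &, <, > from breaking the markup
        if bare in emphasized_lower:
            parts.append(
                f'<break time="{pause_ms}ms"/>'
                f'<prosody rate="{emphasis_rate_percent}%">{token}</prosody>'
            )
        else:
            parts.append(token)
    body = " ".join(parts)
    return f'<speak><prosody rate="{rate_percent}%">{body}</prosody></speak>'
```

For example, a user who says four words in two seconds is speaking at 120 words per minute, which this sketch maps to the 80% floor. Passing that rate and `emphasized=["left"]` to `build_ssml("Turn left now", ...)` yields markup that slows the whole sentence and adds a short pause before "left".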

Until prosody and emotion are readily available in TTS, these fixes can mimic them. Another possibility is to use color to express different emotions, such as red, orange, or yellow for anger, blue for calm, and purple for liking or agreement.
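The color idea can be sketched as a simple lookup. In this hypothetical Python example, the emotion labels, the RGB values, the white fallback, and the choice to treat yellow, orange, and red as rising levels of anger are all assumptions added for illustration.

```python
# Hypothetical RGB values for a status light or on-screen indicator.
EMOTION_COLORS = {
    "calm": (0, 0, 255),         # blue
    "agreement": (128, 0, 128),  # purple
}

# Assumption: anger escalates from yellow through orange to red.
ANGER_SCALE = [(255, 255, 0), (255, 165, 0), (255, 0, 0)]

NEUTRAL = (255, 255, 255)  # assumed fallback for unrecognized emotions


def color_for(emotion: str, intensity: float = 1.0) -> tuple:
    """Return an RGB color for an emotion; intensity only affects anger."""
    if emotion == "angry":
        clamped = max(0.0, min(1.0, intensity))
        index = min(int(clamped * len(ANGER_SCALE)), len(ANGER_SCALE) - 1)
        return ANGER_SCALE[index]
    return EMOTION_COLORS.get(emotion, NEUTRAL)
```

Keeping the mapping in one table makes it easy to adjust the palette later without touching the code that drives the light or display.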

These small tweaks can lead to deeper engagement with users.
