If I could come up with a wish list for AI services that would be available through AWS (or Google App Engine or similar), it’d be these…
- Language Detection. Which language is the user speaking?
- STT (speech-to-text). Why isn’t one available as part of Lex? Beats me…
- Gender and Age Detection. To adjust the content based on this…
- Speech rate. How many words spoken per minute. Could be used to adjust the TTS rate.
- Speech volume. Is the person a loud talker?
- Prosody. Where is the person putting emphasis on words? Could be used to add matching prosody to a response.
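To make the speech rate idea concrete, here is a rough sketch of how a bot might mirror a user's pace once an STT service has returned a transcript and the utterance duration. Everything here is an assumption for illustration: the function names are hypothetical, the 150 words-per-minute baseline is a guess at a typical conversational rate, and the output relies on the standard SSML `<prosody rate="…">` element with a percentage value, which TTS engines may interpret slightly differently.

```python
from xml.sax.saxutils import escape

# Assumed baseline: a typical conversational speaking rate (hypothetical value).
TYPICAL_WPM = 150.0


def words_per_minute(transcript: str, duration_seconds: float) -> float:
    """Estimate speech rate from an STT transcript and the utterance length."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(transcript.split()) * 60.0 / duration_seconds


def ssml_rate_percent(user_wpm: float, baseline_wpm: float = TYPICAL_WPM,
                      low: int = 80, high: int = 150) -> int:
    """Scale the TTS rate to the user's pace, clamped so it stays listenable."""
    percent = round(100.0 * user_wpm / baseline_wpm)
    return max(low, min(high, percent))


def build_ssml(reply: str, user_wpm: float) -> str:
    """Wrap the reply in an SSML prosody tag that mirrors the user's rate."""
    rate = ssml_rate_percent(user_wpm)
    return f'<speak><prosody rate="{rate}%">{escape(reply)}</prosody></speak>'
```

Calling `build_ssml("Sure, here you go.", words_per_minute(transcript, duration))` would give a fast talker a brisker reply and a slow talker a more relaxed one. The clamp is a design choice to keep the bot from sounding rushed or sluggish when the estimate is off.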
The medium difficulty:
- Sentiment. Is the language used happy or sad?
- Mood. Is the person happy or sad, based on vocal tone?
- Honesty. Rather, Voice Stress Analysis. How likely is the person to be telling the truth? e.g. “I’m feeling fine.”
- Accent. Channeling Professor Higgins, can we guess where the person is from?
The super difficult:
- Health. Is the person coming down with a cold or health condition? Is the speaker tired? Low blood sugar?
- Singing Lyrics. One way to wreck STT today is to sing a command. What if we could still understand when someone is singing?
- Hidden Meaning. What are some alternatives to what the person means? “You’re off your case, Chief”
With these, developers could start putting together voice-interactive bots that pass the Turing test.