How do digital voice assistants (e.g. Alexa, Siri) work?

Digital voice assistants offer the convenience of voice-enabled features, including playing music, checking the news, ordering a pizza and more. But how do these devices actually work?

October 17, 2017
• by
Vivek Sharma

Digital voice assistants (VI) like Amazon Alexa, Apple Siri, Google Now and Microsoft Cortana are the new frontier of the man-machine interface. Alexa is the brain that enables Amazon Echo to answer commands ranging from “what’s the weather today?” to “switch on the light” to “order me a pizza”. According to eMarketer, over 35 million Americans use a Digital VI at least once a month, and a quarter of all digital searches are done through voice. This blog post aims to demystify how Digital VIs work in 3 core steps – speech to text, text to intent and intent to action.

The first step, speech to text, converts a voice command into the same text input your computer or smartphone would get from typing. Good speech-to-text software like Apple Dictation, Google Docs voice typing and Dragon NaturallySpeaking adjusts for ambient noise and variation in voice tone/pitch/accent to provide accurate transcription in multiple languages. ScienceLine explains how the software works:

“The software breaks your speech down into tiny, recognizable parts called phonemes — there are only 44 of them in the English language. It’s the order, combination and context of these phonemes that allows the sophisticated audio analysis software to figure out what exactly you’re saying … For words that are pronounced the same way, such as eight and ate, the software analyzes the context and syntax of the sentence to figure out the best text match for the word you spoke. In its database, the software then matches the analyzed words with the text that best matches the words you spoke.”
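The phoneme-matching and context steps described above can be sketched in a toy example. This is not a real recognizer: the phoneme symbols, pronunciation dictionary and context scores below are all illustrative stand-ins, hand-tuned just to show how context can break a tie between homophones like “eight” and “ate”.

```python
# Hypothetical pronunciation dictionary: phoneme sequence -> homophones.
PRONUNCIATIONS = {
    ("EY", "T"): ["eight", "ate"],
}

# Hypothetical context model: score of a word given the previous word.
# A real system would learn these from large text corpora.
CONTEXT_SCORES = {
    ("I", "ate"): 0.9,
    ("I", "eight"): 0.1,
    ("number", "eight"): 0.95,
    ("number", "ate"): 0.05,
}

def decode(phonemes, previous_word):
    """Pick the most likely word for a phoneme sequence given context."""
    candidates = PRONUNCIATIONS.get(tuple(phonemes), [])
    return max(candidates,
               key=lambda w: CONTEXT_SCORES.get((previous_word, w), 0.0))

print(decode(["EY", "T"], "I"))       # context favors "ate"
print(decode(["EY", "T"], "number"))  # context favors "eight"
```

The same phoneme sequence decodes to different words depending on what came before it – exactly the context-and-syntax analysis the quote describes, in miniature.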

The second step, text to intent, interprets what exactly the user means. For example, if you say “tell me about Paris” in a conversational context, what should the Digital VI interpret as your real intent? Are you asking for the latest news about Paris, flight options to Paris, the current weather in Paris, or news stories about Paris Hilton? Web search engines solve this challenge by ranking answers to the ‘query’ in decreasing order of inferred intent. For a Digital VI the bar is higher: it has to extract intent from conversational input, and then respond with one best answer. A good example is IBM DeepQA, which beat Ken Jennings, winner of 74 straight Jeopardy games, using ‘natural language processing’:

“First up, DeepQA works out what the question is asking, then works out some possible answers based on the information it has to hand, creating a thread for each. Every thread uses hundreds of algorithms to study the evidence, looking at factors including what the information says, what type of information it is, its reliability, and how likely it is to be relevant, then creating an individual weighting based on what Watson has previously learned about how likely they are to be right. It then generated a ranked list of answers, with evidence for each of its options. The information that DeepQA would eventually be able to query for Jeopardy was 200 million pages of information, from a variety of sources.”
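The evidence-weighting idea in that description can be sketched as follows. This is only an illustration in the spirit of DeepQA, not its actual algorithm: the feature names, weights and scores below are invented, and a real system would use hundreds of learned features rather than three hand-picked ones.

```python
# Invented evidence features and weights for illustration only.
WEIGHTS = {"source_reliability": 0.5, "type_match": 0.3, "relevance": 0.2}

def rank_answers(candidates):
    """Sort (answer, evidence) pairs by weighted evidence score, best first."""
    def score(evidence):
        return sum(WEIGHTS[f] * evidence.get(f, 0.0) for f in WEIGHTS)
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

# Two candidate interpretations of "tell me about Paris", with made-up evidence.
candidates = [
    ("Paris, France", {"source_reliability": 0.9, "type_match": 1.0, "relevance": 0.8}),
    ("Paris Hilton",  {"source_reliability": 0.9, "type_match": 0.2, "relevance": 0.3}),
]

best, _ = rank_answers(candidates)[0]
print(best)  # "Paris, France"
```

Each candidate accumulates a weighted score from its evidence, and the assistant commits to the single top-ranked answer – the “one best answer” bar the paragraph describes.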

The final step, intent to action, aims to fulfil the user’s need. Most Digital VIs are evolving from answering simple questions (e.g. about the weather) to doing things, as they get integrated into cars, refrigerators, thermostats, light bulbs and door locks. The BBC recently launched “The Inspection Chamber”, an audio drama that uses a Digital VI to co-write fiction with listeners:

“The listener hears from three separate characters: Dave, the female robot, and two scientists, Kay and Joseph. The two scientists, who may be aliens, have to correctly identify a new life form, the listener, before they can go home … The listener undergoes a scientific examination, answering questions like, “Do you feel special?” and “Are you in a happy mood or a gloomy mood?” … The story runs about 20 minutes [and] there are three variations on the ending depending on listeners’ answers.”
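At its simplest, the intent-to-action step is a dispatch table: once the intent and its parameters (often called slots) have been extracted, the assistant routes them to a handler that fulfils the request. The intent names and handlers below are illustrative, a minimal sketch rather than any vendor’s actual API.

```python
# Hypothetical handlers for a few intents.
def get_weather(city):
    return f"Fetching weather for {city}..."

def set_light(state):
    return f"Turning the light {state}."

def order_pizza(size):
    return f"Ordering a {size} pizza."

# Dispatch table: intent name -> handler.
HANDLERS = {
    "weather": get_weather,
    "light": set_light,
    "pizza": order_pizza,
}

def act(intent, **slots):
    """Route an extracted intent and its slots to the matching handler."""
    handler = HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I can't do that yet."
    return handler(**slots)

print(act("weather", city="Paris"))
print(act("light", state="on"))
```

Adding a new capability – a car, a thermostat, a door lock – then amounts to registering a new handler in the table.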

These 3 core capabilities of Digital VI not only get better with more data, but are also available as APIs from multiple providers. Businesses can leverage that modularity, picking and choosing the best options to build an integrated solution for their customers.