Whenever someone uses their voice to control gadgets, cars, or homes, they are communicating with a voice interface. But how does it actually work, and how do you build one if you're a UX/UI designer? Read on to find out.
Not so long ago, being able to speak with a device or machine seemed like the wildest of dreams. Thanks to voice interfaces, today it is a reality. The vast world of voice interfaces rests on a much broader concept: verbal communication between human and machine.
Today we interact with voice interfaces constantly. We use them, for example, when we are driving and need to search for something on the internet or call someone (via a voice assistant). Our voice becomes a mouse, a finger, a keyboard that lets us dictate what the device should do.
A device capable of interacting with the user through their voice is called a voice interface. For a voice interface to work, it must implement two processes.
The first is voice recognition, also known as Automatic Speech Recognition (ASR): the machine's understanding of the user's voice. This process is effortless for humans but requires a lot of advanced technology in devices. The computer should be able to:
– Comprehend both isolated words and continuous speech: it should handle a single word at a time as well as an entire sentence spoken naturally.
– Distinguish voices: it should recognize speech from any speaker, not only from the person who trained it.
– Understand context: it should handle open-ended questions, not only requests phrased in a precise and unambiguous way.
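To make isolated-word recognition concrete, here is a deliberately simplified sketch: each word is reduced to a short feature vector (real systems extract MFCCs from audio) and an incoming utterance is matched to the nearest stored template. The words and feature values are hypothetical, chosen only for illustration.

```python
import math

# Toy templates: each word is represented by a short "feature vector".
# Real ASR systems use acoustic features like MFCCs extracted from audio;
# these numbers are purely illustrative.
TEMPLATES = {
    "lights": [0.9, 0.1, 0.4],
    "music":  [0.2, 0.8, 0.5],
    "stop":   [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features):
    """Return the template word closest to the observed features."""
    return min(TEMPLATES, key=lambda w: euclidean(TEMPLATES[w], features))

print(recognize([0.85, 0.15, 0.42]))  # nearest template is "lights"
```

Template matching like this only works for a small, fixed vocabulary and a single speaker; the statistical and neural models discussed later in the post are what make speaker-independent, continuous recognition possible.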
The second process is voice synthesis: the machine must respond to the user by generating speech.
Also known as TTS (text-to-speech), it is the conversion of written text into an artificial voice produced by a computer: a voice signal that reproduces the text verbally.
The techniques to artificially generate the voice are:
– Articulatory synthesis: systems that reproduce the functioning of the human phonatory system.
– Formant synthesis: mathematical filters manage the acoustic parameters of the artificial signal.
– Concatenative synthesis (by fragments): messages are composed from acoustic pieces taken from recordings of a natural voice. Once extracted, the fragments are stored in a database, then selected and recombined to produce a sound that corresponds to the written text.
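The concatenative approach can be sketched in a few lines: look each unit of text up in a fragment database and join the stored pieces. The mapping below is hypothetical and uses placeholder strings instead of real audio fragments.

```python
# Toy "acoustic database": maps text units to stored fragments.
# Real concatenative systems store recorded diphones or sub-word units;
# here each fragment is just a placeholder label.
FRAGMENT_DB = {
    "h": "[h]", "e": "[eh]", "l": "[l]", "o": "[ow]", " ": "[pause]",
}

def synthesize(text):
    """Select a stored fragment for each unit and concatenate them."""
    fragments = [FRAGMENT_DB.get(ch, "[?]") for ch in text.lower()]
    return "".join(fragments)

print(synthesize("hello"))  # [h][eh][l][l][ow]
```

In a real system the hard part is not the concatenation itself but selecting fragments whose pitch and timing join smoothly, which is why unit-selection databases contain many variants of each sound.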
Voice Assistants of Today
A survey conducted by TechPinions in 2016 revealed that 13% of 1,300 respondents owned an Amazon Echo smart speaker and used it in the home. It also emerged that the Amazon Echo is mainly used for playing music, but also for controlling lights and home timers. Among the most used voice interfaces are Siri and Google's voice search, which serve mainly as voice search tools in the car (51%) and at home (39%).
Among the most used voice assistants, let’s not forget Alexa and Cortana.
The future of such devices is, most probably, in their close connection to the smart home and smart city concepts that will change the way we live for the better.
Smart homes and smart cities are places where artificial intelligence and reliable connectivity allow voice interfaces to become search tools for the vast archives of big data available today.
How to Design a Voice Interaction
When a business needs to design a voice interface, it usually turns to a professional UX/UI design agency. The number of specialists in voice interfaces is relatively low, so a professional who has worked with graphical interfaces their whole career now has to figure out the design laws of a medium very different from the one they have experimented with and tested for years.
- To design this type of interaction, we start with a flowchart that designers should test through voice, or through written communication that the machine will transcribe.
- When creating a dialogue flow, the designer must respond to the user's needs by understanding and foreseeing, as far as possible, the many ways a user can phrase a request to the device.
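A dialogue flow of the kind described above is often implemented as a state machine: each state lists the intents it can handle and the reply each one produces. The states, intents, and replies below are invented for illustration; a production flow would have many more branches and fallback prompts.

```python
# A minimal dialogue flow sketched as a state machine.
# Keys are states; each maps a recognized intent to
# (next_state, system_reply). All names here are hypothetical.
FLOW = {
    "start": {
        "play_music": ("ask_genre", "What genre would you like?"),
        "lights_on":  ("done", "Turning the lights on."),
    },
    "ask_genre": {
        "jazz": ("done", "Playing jazz."),
        "rock": ("done", "Playing rock."),
    },
}
FALLBACK = "Sorry, I didn't catch that."

def step(state, intent):
    """Advance the dialogue: return (next_state, system_reply)."""
    return FLOW.get(state, {}).get(intent, (state, FALLBACK))

state, reply = step("start", "play_music")
print(reply)  # What genre would you like?
state, reply = step(state, "jazz")
print(reply)  # Playing jazz.
```

Note how an unrecognized intent keeps the user in the same state and triggers a fallback prompt; designing those fallback paths is where most of the work in a real voice flow goes.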
It is important to remember that no matter how well an interface understands a particular request, it is always difficult for it to grasp the context of the conversation. The limits of the artificial intelligence must therefore be mapped out step by step.
Google Voice Recognition
Let’s take Google’s voice recognition as an example. Google has offered voice services since 2009, initially using the Gaussian Mixture Model (GMM), an acoustic model combined with other techniques.
In 2012, it moved to other architectures such as LSTM RNNs (Long Short-Term Memory Recurrent Neural Networks). These models are trained discriminatively, learning to differentiate between phonetic units rather than modeling each one independently.
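To give a flavor of the Gaussian-style acoustic modeling mentioned above, here is a heavily simplified sketch: each phoneme is modeled by a single one-dimensional Gaussian over a hypothetical acoustic feature, and classification picks the phoneme with the highest log-likelihood. Real GMM systems use mixtures of multivariate Gaussians over MFCC vectors, with hidden Markov models on top; the phoneme labels and numbers here are illustrative only.

```python
import math

# Toy acoustic model: one univariate Gaussian per phoneme over a
# single (hypothetical, formant-like) feature. Purely illustrative.
PHONEME_MODELS = {          # phoneme: (mean, standard deviation)
    "aa": (700.0, 60.0),
    "iy": (300.0, 40.0),
    "uw": (350.0, 50.0),
}

def log_likelihood(x, mean, std):
    """Log density of x under a univariate Gaussian."""
    return (-0.5 * math.log(2 * math.pi * std ** 2)
            - (x - mean) ** 2 / (2 * std ** 2))

def classify(feature):
    """Pick the phoneme whose Gaussian best explains the feature."""
    return max(PHONEME_MODELS,
               key=lambda p: log_likelihood(feature, *PHONEME_MODELS[p]))

print(classify(680.0))  # best explained by "aa"
```

The discriminative LSTM training Google adopted later differs precisely in that it optimizes the boundaries between such classes directly, instead of fitting each class's distribution on its own.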
In 2016, Google introduced a new machine translation system based on neural networks, replacing the old system based on statistical data for English and Mandarin Chinese.
Google has also incorporated recognition services into its products and markets them through the Google Cloud Speech API. The company has come a long way to bring us the voice-controlled services we all use today.
Future of Voice Interfaces
Voice interfaces are convenient and hands-free: they can and should work when it is not possible or comfortable to control an interface manually. Specialists are still exploring this area, and the largest companies in the world, like Google and Amazon, are experimenting with ways to make these solutions better.