
ANALYSIS

Rethinking the User Interface for Consumer Voice Tech

Voice can provide a simple, compelling user experience, but the path to adding voice controls to any product, service, or application is complex. As dominant tech players continue to develop voice-enabled interfaces and assistants, product designers, developers, and manufacturers will be forced to rethink the user experience and user interface.

With the incredible expansion of smart speaker adoption and consumers’ tendency to purchase smart home devices as point solutions rather than as a system, many homes in the future will have a distributed intelligence platform with voice control acting as the primary user interface.

In early 2019, 36 percent of US broadband households owned at least one smart speaker with a voice assistant.

[Figure: Smart Speakers with Voice Assistant Ownership, 2016-2019]

Voice assistant technology relies on two main components: hardware, which communicates and captures commands, and software, which thinks and processes responses. While hardware and software decisions are important, considerations of other factors — such as local versus cloud processing and power consumption — can also significantly impact the success of a voice-first application or device.

Hardware Design

Designing for voice requires manufacturers to evaluate their end product and make decisions regarding the context of use, the environment in which the device will be used, and the consumer interaction model. These decisions impact hardware choices.

For instance, an assessment of the device’s environment in terms of spatial awareness, potential noise levels in the room, and the user’s proximity to the device when speaking may lead to the implementation of more or fewer microphones.

To enable voice recognition, a device must be Internet-connected and include a microphone and a speaker. Other components include analog-to-digital converters (ADC), digital signal processors (DSP), and digital-to-analog converters (DAC).

During the input stage, when a user speaks to a device, the microphone captures the phrase and sends it to an ADC, which converts the voice input into digital audio data. Microphones may be analog or digital. Analog microphones must be paired with an analog-to-digital converter, while digital microphones have one built in.
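The conversion step can be illustrated with a minimal Python sketch. It assumes the analog amplitude has already been normalized to the range -1.0 to 1.0 and that the converter produces signed 16-bit codes; the function name `quantize` and both assumptions are illustrative and do not describe any particular ADC.

```python
def quantize(sample: float, bits: int = 16) -> int:
    """Map a normalized analog amplitude in [-1.0, 1.0] to a signed integer code.

    Hypothetical helper for illustration; real ADCs also handle sampling
    clocks, anti-alias filtering, and device-specific scaling.
    """
    max_code = 2 ** (bits - 1) - 1           # 32767 for 16-bit audio
    clipped = max(-1.0, min(1.0, sample))    # clip anything beyond full scale
    return round(clipped * max_code)


def digitize(samples: list[float], bits: int = 16) -> list[int]:
    """Quantize a sequence of already-sampled amplitudes."""
    return [quantize(s, bits) for s in samples]


# Full-scale, silent, and over-range inputs.
print(digitize([1.0, 0.0, -1.0, 2.5]))
```

An analog microphone would feed a stage like this externally, while a digital microphone performs the equivalent conversion internally and hands the device integer samples directly.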

The design of a microphone array depends on the device’s environment. For devices that require the user to speak close to them, one to two microphones are ideal. Far-field communication may require an array of four to seven microphones.

After the input stage comes the processing stage. The digital signal processor feeds the data to the network module and natural language processing engine. During this stage, algorithms are applied to the captured voice data.

Beamforming, dynamic range compression, and adaptive spectral noise reduction are examples of algorithms that help improve the quality of the captured voice data. Upon completion of the processing stage, the data is sent to the digital-to-analog converter and amplifier for output to the user.
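Two of these algorithms can be sketched in a few lines of Python. The first function is a basic delay-and-sum beamformer, the simplest form of beamforming, which assumes the per-microphone delays are already known as whole numbers of samples. The second is a simple hard-knee compressor. The function names, the threshold, and the ratio are illustrative assumptions, not values from any shipping product.

```python
import math


def delay_and_sum(channels: list[list[float]], delays: list[int]) -> list[float]:
    """Align each microphone channel by its known sample delay, then average.

    Sound from the target direction adds coherently after alignment, while
    sound from other directions tends to partially cancel.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]


def compress(sample: float, threshold: float = 0.5, ratio: float = 4.0) -> float:
    """Hard-knee compression: reduce the portion of the level above threshold."""
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample
    reduced = threshold + (magnitude - threshold) / ratio
    return math.copysign(reduced, sample)


# Two microphones hear the same signal; the second hears it one sample later.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic_a = signal
mic_b = [0.0] + signal[:-1]
aligned = delay_and_sum([mic_a, mic_b], delays=[0, 1])
print(aligned)
```

In practice the delays are derived from the array geometry and the estimated direction of the speaker, which is one reason far-field devices carry more microphones.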

Software Requirements

The building blocks for creating the software infrastructure for voice-first technology include natural language processing, which includes automatic speech recognition (ASR) and natural language understanding (NLU); wake word algorithms to initiate the voice response process; and a cloud platform to process the data.

The wake word serves as the gateway between the user and the voice assistant. The wake word engine is an algorithm that activates a device’s voice interface by monitoring audio signals to detect a specific word of interest.

Once the predetermined trigger word or phrase is detected, the voice query is sent to the cloud for processing. The wake word engine itself generally runs locally on the device, which reduces latency in voice query response and safeguards privacy.
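The gating role of the wake word can be sketched as a small state machine. In this Python sketch, audio frames are represented by plain strings, and the detector and end-of-query checks are passed in as hypothetical callables; a real engine would run an acoustic model over raw audio instead.

```python
from typing import Callable, Iterable


def gate_on_wake_word(
    frames: Iterable[str],
    is_wake_word: Callable[[str], bool],
    is_end_of_query: Callable[[str], bool],
) -> list[list[str]]:
    """Return only the frames spoken between a wake word and end of query.

    Everything heard while idle is discarded on the device, so only the
    gated queries would ever be forwarded for further processing.
    A query still open when the stream ends is dropped in this sketch.
    """
    listening = False
    current: list[str] = []
    queries: list[list[str]] = []
    for frame in frames:
        if not listening:
            if is_wake_word(frame):
                listening = True
                current = []
        elif is_end_of_query(frame):
            queries.append(current)
            listening = False
        else:
            current.append(frame)
    return queries


stream = ["chatter", "hey_device", "turn", "on", "<silence>", "private", "talk"]
forwarded = gate_on_wake_word(
    stream,
    is_wake_word=lambda f: f == "hey_device",
    is_end_of_query=lambda f: f == "<silence>",
)
print(forwarded)  # [['turn', 'on']]
```

Note that the frames before the wake word and after the silence never leave the function, which mirrors the privacy argument for running detection on the device.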

Natural Language Processing (NLP) is a form of artificial intelligence that enables human-machine interaction using natural dialogue through text, voice, or both. Chatbots generally refer to text-based dialog systems, whereas voicebots refer to voice-first assistants like Alexa or Google Assistant.

In a simplified NLP architecture, automatic speech recognition (ASR) identifies the words that are spoken and converts them to text (speech-to-text).
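The hand-off from speech recognition to language understanding can be sketched as two small functions. This Python sketch is purely illustrative: `transcribe` is a placeholder that decodes bytes as text rather than recognizing speech, and `understand` uses a toy first-word rule with an assumed action list rather than a trained NLU model.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of actions for the example; a real assistant supports far more.
KNOWN_ACTIONS = {"play", "stop", "call"}


@dataclass
class Intent:
    action: str
    target: str


def transcribe(audio: bytes) -> str:
    """Placeholder ASR step: stands in for a real speech-to-text engine."""
    return audio.decode("utf-8")


def understand(transcript: str) -> Optional[Intent]:
    """Toy NLU step: treat the first word as the action, the rest as the target."""
    words = transcript.lower().split()
    if not words or words[0] not in KNOWN_ACTIONS:
        return None
    return Intent(action=words[0], target=" ".join(words[1:]))


intent = understand(transcribe(b"Play jazz music"))
print(intent)
```

The point of the split is that the ASR stage only produces text, while the NLU stage decides what that text means.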

Local vs. Cloud Processing

Companies seeking to design for voice-first technology must decide how their voice assistant will process voice queries — whether in the cloud or locally on a device. Response speed, connection to the Internet, and security all factor into the decision.

DSP Group, a voice chipmaker, has found that it is feasible to implement a limited number of simple commands on fairly low-end processors or DSP chips. The sweet spot for the number of simple commands processed locally falls at five to 10.

These commands include tasks such as turning a device on and off and raising and lowering the volume. Once the number of commands grows beyond 10 to 15, the need for more memory and processing power increases substantially, as does the risk of higher false detection rates.

This is the point at which processing shifts to the cloud. More complex commands are sent to the cloud because they need more power and flexibility, while a limited subset of commands can be interpreted locally.
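The split between local and cloud handling can be sketched as a simple router. In this Python sketch the local command list is an assumed example kept within the five-to-10 range described above, and `route` is a hypothetical helper that works on already-recognized text.

```python
# Assumed on-device vocabulary for the example, kept deliberately small.
LOCAL_COMMANDS = frozenset(
    {"turn on", "turn off", "volume up", "volume down", "mute"}
)


def route(command: str) -> str:
    """Return "local" for the small on-device vocabulary, "cloud" otherwise."""
    normalized = " ".join(command.lower().split())  # fold case and extra spaces
    return "local" if normalized in LOCAL_COMMANDS else "cloud"


for spoken in ["Turn  ON", "volume down", "what's the weather tomorrow"]:
    print(spoken, "->", route(spoken))
```

A design along these lines lets basic controls keep working quickly, and without a connection, while open-ended requests go to the cloud.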

Privacy concerns about always-on listening devices are a key barrier to the adoption of voice-first devices. Furthermore, consumers have little trust in how device manufacturers access and manage their personal data.

Power Inputs/Consumption

Manufacturers must consider the power consumption of processors running algorithms for natural language processing. Devices that lack a dedicated power source benefit from low-energy solutions. A power-aware design for always-on listening features associated with voice-enabled devices is key to power optimization.

Current smart speakers have dedicated AC power due to the energy consumption of always-listening technology. Companies may opt for battery power over AC power for a number of reasons, such as the freedom to place the device anywhere in a room.

Aesthetics also may be a factor in removing power cords from devices, particularly for devices that were historically battery-powered before implementing voice recognition technology.

Voice-enabled TV remotes are battery-powered devices that require consumers to change the batteries every three to four months. Some companies, such as Comcast, have opted for the push-to-talk feature instead of hands-free voice remotes to lengthen battery life.

Power consumption can be approached in various ways. Power use depends on the wake word technology used, the number of voice commands integrated, and the algorithms run on a device, and each of these can be tuned to reduce it.
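A rough way to reason about these trade-offs is a duty-cycle estimate of battery life. The Python sketch below is a back-of-the-envelope model only: it ignores battery self-discharge and voltage effects, and every figure in the usage example is hypothetical rather than measured from any product.

```python
def average_current_ma(active_ma: float, sleep_ma: float, duty_cycle: float) -> float:
    """Weighted average draw for a device active a given fraction of the time."""
    return duty_cycle * active_ma + (1.0 - duty_cycle) * sleep_ma


def battery_life_hours(
    capacity_mah: float, active_ma: float, sleep_ma: float, duty_cycle: float
) -> float:
    """Estimated runtime: battery capacity divided by average current draw."""
    return capacity_mah / average_current_ma(active_ma, sleep_ma, duty_cycle)


# Hypothetical figures: a 1000 mAh battery, 10 mA while listening, 0.1 mA asleep.
always_listening = battery_life_hours(1000, 10, 0.1, duty_cycle=1.0)
push_to_talk = battery_life_hours(1000, 10, 0.1, duty_cycle=0.01)
print(f"always listening: {always_listening:.0f} h")
print(f"push to talk:     {push_to_talk:.0f} h")
```

Under these assumed numbers, cutting the time spent actively listening dominates the result, which is consistent with the push-to-talk choice described for battery-powered remotes.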

As the consumer electronics industry continues to explore voice interfaces in smaller devices and form factors, demand for ultra-efficient and low-power solutions will increase.

As smart home device ownership increases, with owners often having multiple devices, voice as a centralized user interface for the home will grow in importance. Interoperability serves as a driving factor. Voice will become a key interface to alleviate smart home complexity and fragmentation.

Dina Abdelrazik, senior analyst, leads Parks Associates' residential security research and is a member of the connected home team.
