In speech interactive systems, the user can issue voice commands to the system in order to alter its state.
We can categorize eight levels of speech communication, organized into three layers:
- Top layer:
  - Discourse: considers how single utterances combine into meaningful communication;
  - Pragmatic: considers the meaning of the words in context;
  - Semantic: considers the meaning of the words independently of context;
- Middle layer:
  - Syntactic: considers how single words can be combined into sentences;
  - Lexical: considers speech on the level of individual words;
- Bottom layer:
  - Phonemic: considers speech on the level of the single sounds (phonemes) that are the basis of each spoken language;
  - Articulatory: considers speech from the point of view of human speech production;
  - Acoustic: considers speech as a sound signal.
Pros and Cons of speech interaction
Pros
The main benefit of speech recognition in HCI is the naturalness of the interaction, together with the expressiveness of spoken utterances: with speech we may perform operations that would be much harder to do using other modalities.
An example is the selection of similar objects (e.g. by color or shape) from a large set of objects that are scattered around the display area, or that cannot be displayed at once: with (spoken) language, homogeneous objects can be selected with a single utterance (e.g. “all the red ones”), or, in a multimodal setting, by pointing at a cluster and saying “that cluster of points”.
It’s also a very good type of interaction for motor-impaired users, along with Eye Tracking.
Cons
Even in modern systems, we still cannot use all human-to-human communication methods in speech interfaces; overlapping speech, for example, still causes problems in human-computer interaction.
Very modern systems like GPT-4o are addressing this problem by allowing the user to interrupt the virtual agent's speech to issue a new command. This is better than previous systems, where the agent couldn't be interrupted at all, but still not perfect.
In less modern systems, the user had to know how to speak to the system; this is less of a problem nowadays with LLMs, which are able to interpret almost any kind of utterance.
For non-native speakers it may be easier to remember commands than to speak naturally to the system.
Speech is a slow, sequential and temporal medium:
- Slow: forming and actually speaking a vocal sentence requires time;
- Sequential: the listener receives the information in the order in which the speaker decides to present it. Everything that has to be presented with speech has to be linearized first (a table can be expressed with speech row-by-row or column-by-column; nested structures like file trees or dictionaries are more confusing to express as speech);
- Temporal: unlike written text, which can be read even after it has been written, spoken messages must be perceived synchronously with their production, meaning that if the listener didn’t catch a word, the speaker has to repeat it.
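The linearization problem above can be made concrete with a minimal sketch (the function and example table are illustrative, not from the lecture): a table has to be flattened into one spoken sentence per row before a speech interface can present it.

```python
def linearize_table(headers, rows):
    """Flatten a table into one spoken sentence per row,
    since speech can only present information sequentially."""
    sentences = []
    for row in rows:
        # Pair each cell with its column header, e.g. "color red"
        parts = [f"{h} {v}" for h, v in zip(headers, row)]
        sentences.append(", ".join(parts) + ".")
    return sentences

# A 2x2 table must be read out cell by cell, row by row:
spoken = linearize_table(
    ["name", "color"],
    [["ball", "red"], ["cube", "blue"]],
)
# spoken[0] == "name ball, color red."
```

A column-by-column linearization would simply iterate the other way; either way, the two-dimensional structure is lost and the listener has to rebuild it in memory.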
When used in non-private environments, speech cannot be used as a medium for private information. Speech interaction in public may also cause discomfort, and can sometimes be seen as rude.
Speech is error prone and requires explicit error correction during the interaction, because it is transmitted over a noisy channel.
Language is often ambiguous, since sentences may have more than one meaning. In HCI, the ambiguity of the interaction must be considered in the design phase. The system should provide Grounding and should disambiguate the sentence by taking the context into consideration or by asking further questions to the user.
Memory and Prosody
According to Shneiderman, in order to improve speech recognition applications, designers must understand acoustic memory and prosody.
Acoustic Memory
In pure voice user interfaces there is no visual information available, which means that the user needs to memorize all the meaningful information, such as the actual dialogue state and the information the system has provided.
Short-term and working memory in this context are called acoustic or verbal memory.
The problem is that the part of the human brain that holds chunks of information and solves problems is also the one that supports speaking and listening (i.e. they use the same cognitive resources), meaning that working on hard problems is best done without speaking or listening to someone.
Physical activity, on the other hand, is handled by another part of the brain, so problem solving is compatible with routine physical activities such as walking and driving.
Note
In an experiment performed by Shneiderman in 1993, users were told to use voice commands to modify the style of the text and manipulate the document while typing the words with the keyboard. This enabled a 12 to 30% speed-up, since users kept their hands on the keyboard and avoided mouse selections. Another task was memorizing mathematical symbols, speaking a “page down” command, and retyping the symbols from memory. This was very hard for many subjects, since speaking the commands appeared to interfere with their retention. Mouse users, on the other hand, had fewer difficulties.
Prosody
Prosody is defined as the rhythm, stress and intonation of speech, which can give more information about the speaker by reflecting their emotional state; the presence of irony or sarcasm; and the type of utterance, whether it’s a statement, a question or a command.
Dictation
While being useful for people who are slow keyboard typists, or for impaired people who cannot use a keyboard, dictation is not the best way to input text: while keyboarding lets the user continue to think about how to refine their words while their fingers output an earlier version, dictation causes more interference between outputting the initial thought and elaborating on it (unless the user is reading from a ready text).
In a system, dictation should therefore be provided as an alternative modality of text input.
Speech Applications
Let’s now dive into the different categories of speech applications.
Conventional Applications
The first speech applications were telephone-based interactive voice response systems, which used speech output and telephone keys for interaction to replace human operators. Afterward, the telephone keys were replaced with speech input, where the user can give voice commands out of a list of possible commands.
- Desktop applications can embed dictation as a speech modality.
- Virtual assistants present in modern devices are also a type of voice user interface. Modern ones are a type of conversational application, but in the early days they were mostly command based. GPS-based navigation usually embeds speech output and input so that the user can receive feedback and issue commands without being distracted while driving.
Multilingual Applications
Translation systems are one example of multilingual speech application, where the user dictates the sentence in one language and the system uses Speech Synthesis to generate the audio of the translated sentence.
Multimodal Applications
Multimodal applications can use voice input as a supporting channel. An example is the Put-That-There system, where gesture and speech are combined. Multimodality brings robustness to the speech channel and helps the system avoid ambiguities.
Pervasive Applications
Pervasive applications are applications that can be used without visual displays and without familiar interaction devices such as mouse and keyboard (i.e. without special equipment). An example of a pervasive application is a speech interface in a car.
Conversational Applications
Virtual assistants nowadays provide a conversational interaction, where the system is able to get the context from the previous inputs in order to allow follow-up questions and commands.
Still, even modern systems don’t support non-verbal cues for turn taking (gestures and eye movements) as well as acoustic cues (like pitch variations). Turn-taking is a difficult aspect to implement into a conversational application.
Usually a conversational interface uses pauses to understand when the user has completed the current sentence, but those can be ambiguous (see Utterance for more).
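A simplified way to picture this pause-based end-of-utterance detection is a counter of consecutive low-energy audio frames; the thresholds below are illustrative assumptions, not values from the lecture:

```python
SILENCE_ENERGY = 0.01   # assumed energy threshold below which a frame counts as silent
END_OF_UTTERANCE = 30   # assumed number of consecutive silent frames that ends a turn

def utterance_ended(frame_energies):
    """Return True once enough consecutive low-energy frames have
    been observed, i.e. the user has paused long enough."""
    silent_run = 0
    for energy in frame_energies:
        if energy < SILENCE_ENERGY:
            silent_run += 1
            if silent_run >= END_OF_UTTERANCE:
                return True
        else:
            silent_run = 0  # speech resumed: reset the pause counter
    return False
```

The ambiguity mentioned above is visible here: a thinking pause in mid-sentence produces exactly the same silent run as a finished utterance, so a pure energy threshold cannot distinguish the two.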
Grounding is more useful than ever in conversational applications; otherwise the user may be disoriented about what the system is doing.
A conversational interface can use different initiative dialogue strategies:
- System initiative dialogue strategy: the system takes the initiative, starting the dialogue or performing certain actions based on the current state. This is also a particular case of incidental interaction (the user implicitly starts the interaction with the system). An example of this type of interaction is a system that asks the user “what volume should the music be?“. This type of interaction is easier for the system, since most of the time the dialogue flows are limited and predictable.
- User initiative dialogue strategy: the user is the one who starts the dialogue (“Hey Siri, …”). This is the preferred way for experienced users, since they will directly issue the command they need.
- Mixed initiative dialogue strategy: sometimes the system, sometimes the user initiates the dialogue. Usually this is implemented as a user-initiative strategy, with a system-initiative strategy only for error handling.
It can also use different dialogue control models:
- Finite state machines: also referred to as state-based dialogues; the dialogue is modeled using finite-state automata, which are suitable for well-structured and compact tasks. The complexity of the model increases rapidly with a high number of states and many transitions between them. The automaton represents the whole dialogue structure, and its paths represent all the possible dialogues the system can produce.
- Form-based: the purpose of a form-based dialogue is to fill in the necessary information as key-value pairs. Usually this uses a mixed-initiative strategy, where the user starts by providing some of the information and the system asks further questions to obtain the missing pieces. The dialogue itself may not be fixed, but the information required usually is.
- Event-based: in an event-based dialogue the system determines what to do on the basis of the interpreted user utterances and the system state. The dialogue here is very flexible and the user has more control.
- Plan-based: in a plan-based dialogue, the system focuses on the interpretation of the utterance and on the intention it conveys. The information is then matched to a plan which models the dialogue. In this approach both user and computer utterances are seen as communicative acts which are chained together to achieve the goal of the dialogue.
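The state-based model can be sketched as a transition table where the recognized command selects the next state; the states, commands and music-player domain below are illustrative assumptions:

```python
# Hypothetical state-based dialogue for a voice-controlled music player:
# each state maps the recognized commands to the next state.
TRANSITIONS = {
    "idle":    {"play": "playing"},
    "playing": {"pause": "paused", "stop": "idle"},
    "paused":  {"play": "playing", "stop": "idle"},
}

def step(state, command):
    """Advance the dialogue; an unrecognized command keeps the
    current state, where the system would re-prompt the user."""
    return TRANSITIONS[state].get(command, state)

state = "idle"
for cmd in ["play", "pause", "play", "stop"]:
    state = step(state, cmd)
# state == "idle"
```

The table makes the scalability problem visible: every new command or state multiplies the entries, which is why this model suits only compact, well-structured tasks.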
Non-speech audio
Non-speech audio feedback is an efficient way to convey information in speech applications, especially in auditory-only interfaces where the only output channel is audio.
Examples of non-speech audio feedback are:
- Auditory icons: auditory feedback whose meaning is derived from the original acoustic properties of a real-world sound. Examples are sirens, alarms, or sounds related to the closing and opening of things.
- Earcons: similar to auditory icons, but they don’t come from a real-world sound. They are abstract sounds whose meaning the user has to learn. Once learned, they can be easily recognized.
- Music: music can be used as feedback similarly to the previous elements.
tags: multimodal-interaction