Why do we need multimodal interaction?

With multimodal interaction we can have redundancy: the same message is sent through two or more channels to increase the probability that the information is received. An example is a phone that both vibrates and rings on an incoming call.
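
As a rough illustration of why redundancy helps, suppose each channel is missed independently of the others with some probability. Independence and the miss rates below are assumptions made for this sketch, not measurements. Under that assumption the message is lost only if every channel is missed:

```python
from math import prod

def capture_probability(miss_probabilities):
    """Probability that at least one channel delivers the message,
    assuming the channels are missed independently (illustrative assumption)."""
    return 1.0 - prod(miss_probabilities)

# Hypothetical miss rates for an incoming call: ringtone 30%, vibration 20%.
sound_only = capture_probability([0.30])
sound_and_vibration = capture_probability([0.30, 0.20])
print(sound_only, sound_and_vibration)
```

With these made-up numbers, adding the second channel raises the chance of noticing the call, because both channels would have to be missed at once.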

Tip

Sometimes the translation from one channel to another is difficult, such as from spoken language to sign language, since a single sign carries a lot of information.

When designing a multimodal interaction system, we need to consider:

The user: if they lack some senses, we cannot use modalities that require them;
The application: which functionalities do we want to implement?
The environment: is it safe, both for the user and for the user’s privacy, to use those modalities where the application will be used?
The equipment: what hardware do we need to capture the signal for each modality?
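
The user, environment and equipment checks above can be sketched as a simple filter over candidate modalities. This is a minimal sketch: the `Modality` fields, the candidate list and the `usable_modalities` helper are hypothetical names introduced for illustration, and the application's functional requirements are not modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Modality:
    name: str
    required_sense: str      # sense the user needs, e.g. "hearing"
    required_hardware: str   # device needed to capture or emit the signal
    private: bool            # True if bystanders cannot perceive it

def usable_modalities(candidates, user_senses, available_hardware, public_place):
    """Keep only the modalities that pass the user, equipment and environment checks."""
    selected = []
    for m in candidates:
        if m.required_sense not in user_senses:
            continue  # user: the required sense is unavailable
        if m.required_hardware not in available_hardware:
            continue  # equipment: no device for this modality
        if public_place and not m.private:
            continue  # environment: would expose the interaction to others
        selected.append(m.name)
    return selected

# Hypothetical candidates and context.
CANDIDATES = [
    Modality("speech output", "hearing", "speaker", private=False),
    Modality("vibration", "touch", "vibration motor", private=True),
    Modality("on-screen text", "sight", "display", private=True),
]

result = usable_modalities(
    CANDIDATES,
    user_senses={"touch", "sight"},
    available_hardware={"speaker", "display", "vibration motor"},
    public_place=True,
)
print(result)
```

Here speech output is dropped because the hypothetical user cannot rely on hearing, while vibration and on-screen text remain available.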

Pros and Cons of multimodal systems

Pros

The main pro of a multimodal interaction system is usability, since these systems are usually more natural and easier to use for people who are not comfortable with traditional forms of computer interaction;
Redundancy of information is another pro, since carrying a piece of information over multiple channels reduces the probability that the user loses it;
The interaction is more robust, since the weakness of one modality is offset by the strengths of another;
It can also accommodate a wide range of users, tasks and environments for which a single modality may not be sufficient.

Cons

The two main problems of multimodal interaction are synchronization and integration.

Synchronization: it’s difficult to synchronize the different tracks of interaction when different modalities are used (e.g. in the put-that-there system, the gesture that indicates the object to put has to be synchronized with the voice that says “that”, indicating the object, and “there”, indicating the desired location).
Integration: it’s difficult to integrate several modalities so that the interaction feels as natural as possible to the person. The hard part is understanding what the “natural” way of doing things is. The task is even more difficult if the application supports collaborative work.

Note that synchronization is different from simultaneity, which means conveying information over different channels at the same time. In sequential communication, on the other hand, the channels are separated in time and follow one another in order.
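
To make the synchronization problem concrete, the put-that-there example can be sketched as pairing each deictic word with the pointing gesture closest to it in time. This is only an illustrative sketch, not how the original system was implemented: the `Word` and `Pointing` types, the `resolve_deictics` helper, the 0.5 second window and the sample timestamps are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    time: float  # seconds since the start of the utterance

@dataclass
class Pointing:
    target: str
    time: float  # seconds on the same clock as the speech track

def resolve_deictics(words, gestures, max_gap=0.5):
    """Pair each deictic word with the pointing gesture closest in time.

    Returns a dict mapping the deictic word to the pointed target, or to
    None when no gesture falls within max_gap seconds of the word.
    """
    resolved = {}
    for word in words:
        if word.text not in ("that", "there"):
            continue
        nearest = min(gestures, key=lambda g: abs(g.time - word.time), default=None)
        if nearest is not None and abs(nearest.time - word.time) <= max_gap:
            resolved[word.text] = nearest.target
        else:
            resolved[word.text] = None
    return resolved

# Hypothetical speech and gesture tracks.
speech = [Word("put", 0.0), Word("that", 0.4), Word("there", 1.3)]
pointing = [Pointing("blue square", 0.5), Pointing("top-left corner", 1.2)]
command = resolve_deictics(speech, pointing)
print(command)
```

The sketch assumes both tracks share one clock. In practice, recognition latency differs between modalities, which is part of what makes synchronization hard.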

Relations between modalities

todo maybe delete this, since it’s very similar to Time relationships and Cooperation between modalities

We can have different types of relations between the modalities:

In a complementarity relation, each channel is essential for the communication, since each modality carries a piece of the information and the system needs to combine them all in order to interpret the user’s intentions. Because of this, it is the most difficult kind of system to implement.
Another type of relation is addition, in which there is a primary modality and a secondary modality that enriches the information carried by the primary one. These secondary channels are called backchannels. An example is a person who gesticulates while talking: the gesticulation is a backchannel. Another example is prosody, that is, the accent, tone and pitch of the voice when a person speaks.
Redundancy means carrying the same information across multiple channels, so that a user who misses the information on one channel can still “grab” it from the others.
Elaboration is a more exotic relation, in which part of the same information is expressed with different modalities.
Alternative means that the same information can be expressed on different channels, just like in redundancy, but using one channel at a time. An example is the smartphone keyboard, which accepts either mic input or touch input, but only one at a time.
Stand-in and substitution replace a modality with another one that better expresses the message.
Conflict is a relation that has to be avoided: it occurs when two channels are used simultaneously and compete for the same resources. An example is a video with subtitles: while reading the subtitles I cannot look at what’s happening in the video.
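
The conflict relation can be checked mechanically if each output channel declares the perceptual resource it occupies. The following is a minimal sketch under that assumption. The `RESOURCE` table and the `find_conflicts` helper are hypothetical, and a real system would need a finer-grained model of attention than one resource per channel.

```python
from itertools import combinations

# Hypothetical mapping from output channel to the perceptual resource it occupies.
RESOURCE = {
    "video": "sight",
    "subtitles": "sight",
    "narration": "hearing",
    "vibration": "touch",
}

def find_conflicts(simultaneous_channels):
    """Return the pairs of simultaneous channels that compete for the same resource."""
    return [
        (a, b)
        for a, b in combinations(simultaneous_channels, 2)
        if RESOURCE[a] == RESOURCE[b]
    ]

print(find_conflicts(["video", "subtitles"]))   # both occupy sight
print(find_conflicts(["video", "narration"]))   # different resources
```

Under this model, video with subtitles is flagged as a conflict, while video with spoken narration is not.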

tags: multimodal-interaction
