User Interface Principles For Multimodal Interaction

T. V. Raman

IBM Research

Abstract

Successful user interface paradigms owe their consistency and resulting intuitive behavior to well-understood design patterns and metaphors that can be articulated independently of any single application. As an example, concepts such as drag and drop and WIMP (Windows, Icons, Menus and Pointer) span all modern graphical user interfaces. These common metaphors enable users to apply what they learn in using a given application to perform a large variety of tasks, and are founded on a set of user interface design principles.

With the coming of age of speech interaction, it is now time to integrate spoken interaction as a first-class modality for creating multimodal user interaction. Such multimodal interaction is especially relevant in the context of mobile devices where traditional user interface peripherals may not be available or appropriate. Rich multimodal interfaces that integrate new user interaction modalities will need to be based on a set of user interface metaphors that leverage the presence of these new forms of interaction. These metaphors will manifest themselves within application and user interface frameworks where they play the role of integrating user input arriving from different modalities, and in synthesizing output to multiple media. This position paper outlines a set of user interface design principles as a precursor to arriving at such user interface metaphors for multimodal interaction.


Table of Contents

Multiple Modalities Need To Be Synchronized
Multimodal Interaction Should Degrade Gracefully
Multiple Modalities Should Share A Common Interaction State
Multimodal Interfaces Should Be Predictable
Multimodal Interfaces Should Adapt To User's Environment
A. Biographical Information

Multiple Modalities Need To Be Synchronized

Spoken interaction is highly temporal, whereas visual interaction is spatial. When combining these modes of interaction in a multimodal interface, synchronization is a key feature that determines overall usability of the interface. Synchronization is key to the following multimodal acts:

Point And Talk

The user points at a location on a map while speaking a question.

Redundant Confirmation

The user interface supplements visual output with a spoken confirmation; for example, a travel reservation system might visually highlight the user's selection while speaking an utterance of the form:

Leaving from San Francisco

Unless synchronized, such supplementary use of modalities can significantly increase the cognitive load experienced by the end-user and prove a source of confusion.

Parallel Communication

The user interaction leverages the availability of multiple streams of output to increase the band-width of communication. For example, a travel reservation system might visually present a list of available flights and speak a prompt of the form:

There are 7 flights that match your request, and the flight at 8:30am appears to be the most convenient.

To be effective, this form of complementary use of multiple modalities needs to be well-synchronized with respect to the underlying interaction state.
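
As an illustration of the synchronization that Point And Talk requires, the following minimal sketch pairs a spoken question with the pointing gesture made closest to it in time. The names `PointEvent`, `Utterance` and `fuse`, and the half-second `slack` window, are assumptions introduced here for illustration; they are not part of any particular framework.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PointEvent:
    time: float  # seconds since the start of the session
    x: float
    y: float


@dataclass(frozen=True)
class Utterance:
    start: float  # seconds since the start of the session
    end: float
    text: str


def fuse(utterance: Utterance, points: List[PointEvent],
         slack: float = 0.5) -> Optional[PointEvent]:
    """Return the pointing event that best matches the utterance in time.

    Only gestures made during the utterance, or within `slack` seconds of
    its start or end, are considered. Among those, the one closest to the
    middle of the utterance wins. Returns None if no gesture qualifies.
    """
    candidates = [
        p for p in points
        if utterance.start - slack <= p.time <= utterance.end + slack
    ]
    if not candidates:
        return None
    midpoint = (utterance.start + utterance.end) / 2
    return min(candidates, key=lambda p: abs(p.time - midpoint))


if __name__ == "__main__":
    gestures = [PointEvent(1.0, 10, 20), PointEvent(5.2, 30, 40)]
    question = Utterance(5.0, 6.0, "What is here?")
    print(fuse(question, gestures))
```

A real system would likely weigh more than timestamps (for example, recognition confidence), but even this simple rule shows why both input streams need a common clock.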

Multimodal Interaction Should Degrade Gracefully

Human interaction degrades gracefully; for example, a face-to-face conversation degrades gracefully in that it still remains effective when one of the participants in the conversation is functionally blind, e.g., when talking over a telephone. This form of graceful degradation is due to the high level of redundancy in human communication. As man-machine interfaces come to include multimodal interaction, we need to ensure that these interfaces degrade gracefully in a manner akin to human conversation. Such graceful degradation is important since the user's needs and abilities can change over time, e.g., a user with a multimodal device moving between a noisy environment where spoken interaction fails and an eyes-free environment where visual interaction is unavailable.

Supplementary Modalities

The use of multiple modalities to supplement one another leads to user interfaces that degrade gracefully.

Complementary Modalities

Portions of the interface that use multiple modalities to complement one another are natural points where the interface will fail to degrade gracefully. When complementary modalities are used, the underlying system needs to be aware of the modalities that are currently available and ensure that all essential items of information are conveyed to the user. This is a key accessibility requirement: it ensures that the user interface is usable by individuals with different needs and abilities.

Changing Capabilities

Capabilities can change rapidly in the case of mobile users. Such changes include available band-width between the mobile device and the network, as well as changes in the band-width of communication between device and user. To be useful, multimodal interaction that is deployed to mobile devices needs to adapt gracefully to such changes.
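
One way to picture the requirement on complementary modalities is a small output planner that reroutes essential information when its preferred modality is unavailable. This is only a sketch under stated assumptions: the `Item` type, the `plan_output` function and the modality names "visual" and "speech" are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Item:
    text: str
    essential: bool  # must reach the user in some modality
    preferred: str   # modality the designer intended, e.g. "visual"


def plan_output(items: Iterable[Item],
                available: Set[str]) -> Dict[str, List[str]]:
    """Assign each item to an available modality.

    Items go to their preferred modality when it is available. Essential
    items whose preferred modality is missing are rerouted to a fallback
    modality; non-essential ones are dropped.
    """
    if not available:
        raise ValueError("at least one output modality is required")
    plan: Dict[str, List[str]] = {m: [] for m in available}
    fallback = sorted(available)[0]  # deterministic choice for the sketch
    for item in items:
        if item.preferred in available:
            plan[item.preferred].append(item.text)
        elif item.essential:
            plan[fallback].append(item.text)
    return plan


if __name__ == "__main__":
    items = [
        Item("7 flights match your request", True, "visual"),
        Item("The 8:30am flight appears most convenient", True, "speech"),
        Item("Decorative airline logo", False, "visual"),
    ]
    # Eyes-free environment: only speech is available.
    print(plan_output(items, {"speech"}))
```

With both modalities available the items stay where the designer placed them; with speech alone, the essential visual item is spoken and the decorative one is omitted.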

Multiple Modalities Should Share A Common Interaction State

Successful task completion during a conversation requires that the participants share a common mental model of the conversation, and this is true in the case of man-machine interaction as well. When using multiple modalities in a user interface, it is important that the various modes of interaction affect and share a common interaction state that is used to update the presentation in the various available output media. Such a common interaction state is also essential for rapid completion of the conversation, since the various multimodal interactors can examine this shared interaction state in determining the next step in the dialog. A shared interaction state is important for the following multimodal acts:

Switching Modalities

The user switches between interaction modalities owing to a number of external factors such as the available band-width between the user, the device and the network. For such transitions to be seamless, the data collected by each interaction modality, as well as the information conveyed via the available output media, need to be driven by a shared interaction state.

History

The shared interaction state can track the history of user interaction, and this history can be useful in determining the most appropriate path through the dialog to achieve rapid task completion.

Multi-Device Interaction

A user with a personalized mobile device might wish to use a large visual display upon entering a conference room. To achieve a synchronized multimodal experience, the user's mobile device and the conference room display will need to share some interaction state.

Distributed Multimodality

It might be advantageous to offload complex speech processing tasks to network servers when using thin clients such as cell phones. As an example, a cell phone might be capable of local speech processing sufficient to enable the user to dial a small number of hotlist entries by speaking a name. If the name is not found in the hotlist, it may need to be looked up in a larger phone book, e.g., a company directory, and the speech processing required might be best offloaded to a network server. Sharing a common interaction state between the visual and spoken components of the cell-phone is essential for synchronized multimodal interaction in such distributed deployments.
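
The idea of a shared interaction state can be sketched as a single object that every modality writes to and every output medium listens to. The class name `InteractionState`, its methods, and the travel-form field names below are assumptions made for illustration, not a description of an existing framework.

```python
from typing import Callable, Dict, List, Tuple

# A listener receives the name and new value of the field that changed.
Listener = Callable[[str, str], None]


class InteractionState:
    """Single source of truth shared by all input and output modalities."""

    def __init__(self) -> None:
        self.fields: Dict[str, str] = {}
        # History of (modality, field, value) in the order received.
        self.history: List[Tuple[str, str, str]] = []
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        """Register an output medium to be notified of every change."""
        self._listeners.append(listener)

    def update(self, field: str, value: str, modality: str) -> None:
        """Record input from any modality and refresh every output."""
        self.fields[field] = value
        self.history.append((modality, field, value))
        for listener in self._listeners:
            listener(field, value)

    def missing(self, required: List[str]) -> List[str]:
        """Fields still to be collected, whichever modality is used next."""
        return [name for name in required if name not in self.fields]


if __name__ == "__main__":
    state = InteractionState()
    state.subscribe(lambda f, v: print(f"[screen] highlight {f} = {v}"))
    state.subscribe(lambda f, v: print(f"[speech] {f}: {v}"))
    state.update("origin", "San Francisco", modality="speech")
    state.update("destination", "Boston", modality="pen")
    print(state.missing(["origin", "destination", "date"]))
```

Because both the visual and the spoken presentation are driven by the same updates, the user can switch from speech to pen mid-task without either view falling out of step, and the recorded history remains available to guide the rest of the dialog.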

Multimodal Interfaces Should Be Predictable

Multimodal interaction provides the user with a multiplicity of choices and often enables a given task to be performed in a number of different ways. But to be effective, the user interface needs to empower the user to intuitively arrive at these different means of completing a given task. Symmetric use of modalities where appropriate can significantly enhance the usability of applications along this dimension; for example, an interface that can accept input via speech or pen might visually highlight an input area while speaking an appropriately designed prompt. Where a specific modality is unavailable for a given task, e.g., “signatures may only be entered via pen input”, appropriate prompt design can help make the user implicitly aware of this restriction. Predictable multimodal user interfaces are important for:

Eliciting Correct Input

Appropriately designed prompts are important for eliciting the desired user input. This in turn can lead to rapid task completion and avoid user frustration when using noisy input channels such as speech.

What Can I Do?

Rich user interfaces can often leave the user impressed with the available features, but baffled as to what can be done next. Spoken interaction, combined with good visual user interface design, can be leveraged to overcome this "lost in space" problem. Rich multimodal interfaces can use the shared interaction state and dialog history to create user interface wizards that guide the user through complex tasks.

Multimodal Interfaces Should Adapt To User's Environment

Finally, multimodal interfaces need to adapt to the user's environment to ensure that the optimal means of completing a given task are made available to the user at any given time. In this context, optimality is determined by:

The user's needs and abilities
The abilities of the connecting device
Available band-width between device and network
Available band-width between device and user
Constraints placed by the user's environment, e.g., need for hands-free, eyes-free operation
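
The factors listed above could feed a simple rule that decides which modalities to offer at a given moment. The sketch below is one hypothetical way to do so; the `Context` fields and the rules in `usable_modalities` are assumptions chosen for illustration (for instance, it assumes the device can always play audio), and a deployed system would weigh the same factors in a more nuanced way.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Context:
    noisy: bool           # environment too loud for spoken interaction
    eyes_busy: bool       # user cannot look at the device
    hands_busy: bool      # user cannot touch the device
    has_display: bool     # device capability
    has_microphone: bool  # device capability


def usable_modalities(ctx: Context) -> Tuple[Set[str], Set[str]]:
    """Return (input modalities, output modalities) to offer right now."""
    inputs: Set[str] = set()
    outputs: Set[str] = set()
    if ctx.has_microphone and not ctx.noisy:
        inputs.add("speech")
    if ctx.has_display and not ctx.hands_busy and not ctx.eyes_busy:
        inputs.add("pen")
    if ctx.has_display and not ctx.eyes_busy:
        outputs.add("visual")
    if not ctx.noisy:
        outputs.add("speech")  # assumes audio output is always present
    return inputs, outputs


if __name__ == "__main__":
    driving = Context(noisy=False, eyes_busy=True, hands_busy=True,
                      has_display=True, has_microphone=True)
    print(usable_modalities(driving))
```

Re-evaluating such a rule whenever the context changes lets the interface follow the user from, say, a quiet car to a noisy street without restarting the task.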

A. Biographical Information

T. V. Raman has worked in the areas of auditory user interfaces and structured documents since 1991. His graduate work on Audio System For Technical Readings (AsTeR) was awarded the ACM Doctoral Dissertation Award in 1994. Raman works in IBM Research on conversational and multimodal interaction and currently represents IBM in numerous W3C working groups including the multimodal interaction, XForms and voice browser working groups.