Abstract
The coming of age of speech technologies means that it is now time to integrate speech interaction as a first-class citizen into human-computer user interfaces. First-class spoken interaction requires more than reading a visual presentation aloud or having the user speak what would otherwise have been typed on the keyboard; realizing the benefits of rich spoken interaction requires designing conversational and multimodal user interfaces that exploit the unique advantages of each available modality. This tutorial covers aspects of conversational and multimodal user interaction, including emerging Web standards that support such interaction, deployment frameworks that facilitate its creation and delivery, and the resulting challenges for researchers working on speech and natural language technologies.
Open standards encourage interoperability. Below, we enumerate the relevant standards for authoring and deploying multimodal interaction.
| W3C XHTML |
| W3C XForms |
| W3C XML Events |
| W3C VoiceXML |
| XHTML+Voice |
Multimodal interaction can be deployed locally on client devices or distributed across the network. Below, we enumerate some of the deployment scenarios:
| GUI and voice processing on local device, e.g., PDAs. |
| GUI with limited speech processing on local device, e.g., smart phones. |
| GUI on client with voice processing across the network, e.g., cell phones. |
Each of these deployment scenarios requires a different level of synchronization across the available modalities.
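The three scenarios above can be modeled as data to make the synchronization question concrete. The following is only a minimal sketch: the names `SpeechLocation`, `Deployment`, and `sync_crosses_network` are hypothetical, and it assumes the GUI always runs on the client, so that modality synchronization leaves the device only when speech processing does.

```python
from dataclasses import dataclass
from enum import Enum


class SpeechLocation(Enum):
    """Where speech processing runs (labels are illustrative)."""
    LOCAL = "local"            # full speech processing on the device
    LOCAL_LIMITED = "limited"  # limited speech processing on the device
    NETWORK = "network"        # speech processing on a remote server


@dataclass(frozen=True)
class Deployment:
    name: str
    example_device: str
    speech: SpeechLocation

    def sync_crosses_network(self) -> bool:
        # Assumption: the GUI is always on the client, so GUI/voice
        # synchronization crosses the network only when speech is remote.
        return self.speech is SpeechLocation.NETWORK


# The three scenarios listed in the text, in the same order.
SCENARIOS = [
    Deployment("local GUI and voice", "PDA", SpeechLocation.LOCAL),
    Deployment("local GUI, limited speech", "smart phone",
               SpeechLocation.LOCAL_LIMITED),
    Deployment("client GUI, network voice", "cell phone",
               SpeechLocation.NETWORK),
]


def remote_sync_scenarios():
    """Names of scenarios whose synchronization must cross the network."""
    return [d.name for d in SCENARIOS if d.sync_crosses_network()]
```

Under this simplification, only the third scenario needs synchronization across the network. A real deployment with limited on-device speech processing might also hand some recognition tasks to a server.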
Conversational interaction is relevant both for voice-only interfaces and for multimodal interaction. It uses techniques from Natural Language Processing (NLP) to create rich dialog models that enable rapid task completion. When combined with multimodal interaction, it faces the added challenge of integrating multiple inputs when computing user intent.
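One simple way to picture the integration of multiple inputs is as slot filling over a shared frame. The sketch below is an illustration under stated assumptions, not a method prescribed by the tutorial: the `InputEvent` type, the modality labels, and the "most recent input wins" policy are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    modality: str     # e.g. "voice" or "gui" (illustrative labels)
    slot: str         # the task field this input fills, e.g. "destination"
    value: str
    timestamp: float  # seconds since the start of the interaction


def fuse_inputs(events):
    """Merge events from all modalities into one slot -> value frame.

    Policy (an assumption for this sketch): the most recent input for a
    slot wins, regardless of which modality produced it.
    """
    frame = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        frame[event.slot] = event.value
    return frame


def missing_slots(frame, required):
    """Return the required slots the dialog still has to ask about."""
    return [slot for slot in required if slot not in frame]


# The user speaks a destination, picks a date in the GUI, then corrects
# the destination in the GUI.
events = [
    InputEvent("voice", "destination", "Boston", 1.0),
    InputEvent("gui", "date", "2024-05-01", 2.5),
    InputEvent("gui", "destination", "Austin", 3.0),
]
frame = fuse_inputs(events)
```

Here the later GUI correction overrides the spoken destination, and `missing_slots(frame, ["destination", "date", "time"])` tells the dialog manager which fields still need a prompt. A production system would also have to weigh recognition confidence and resolve cross-modal references, which this sketch leaves out.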