Conversational And Multimodal Interaction

David Nahamoo
T. V. Raman
IBM Research


Outline

  • Key drivers for new user interfaces.
  • Creating Conversational Interaction.
  • Creating Multimodal Interaction.
  • Using Web technologies.
  • Challenges for Speech technologies.

1.1


What Is UI?

2


What Is UI?

[Figure: Application and User exchange input (I) and output (O). Labels: Request, Response, Intention, Attention.]

3


What Is UI?

Computer applications
  • Obtain user input.
  • Compute on the information.
  • Display the result.

UI = I/O

4


Man-machine Dialog

Building Blocks:

Conversational Gestures

5


Conversational Gestures

Gesture                         Graphical User Interface
                                Edit widgets, Message widgets
Answering Yes Or No             Toggles
Select Elements From Set        Radio groups, List boxes
Traversing Complex Structures   next, previous, child, parent

6


Abstractions Reflect Technology

GUI: Consequence Of Available Peripherals
  • Exploits graphical display and pointing device.
  • Reflects impracticality of using natural language at the time.

GUI = Primitive Sign Language

7


Research Question

Define next-generation UI building blocks.
  • GUI: WIMP.
  • SUI: Directed Dialogs.
  • MUI: Synchronized Interaction.
  • CUI: Free-form Dialog.
  • EUI: Context-aware Interaction.

8.1


Different Types Of Interaction

Task: Withdraw Money at an automated kiosk.

UI    Interaction Style
GUI   Point and click
SUI   Directed dialog
MUI   Tap and talk
CUI   Free-form dialog
EUI   Context-aware interaction

9


Next Generation User Interfaces

Renew Man-machine Communication
  • Evolve to leverage 30 years of progress:
    • Leverage benefits of Moore’s law.
    • Speech technologies are now mainstream.
    • Language technologies help break out of the GUI sandbox.

Reduce the impedance mismatch between user and machine.

10


Types Of Interaction

Conversational And Multimodal Interaction.
  • CUI and MUI are mutually independent.
  • Speech interfaces use conversational interaction.
  • Multimodal interaction can leverage CUI.

Together they determine the richness of the end-user experience.

11


Why Conversational And Multimodal Interaction?

12


Human Language Technologies

Enrich end-user experience.
  • Base technologies are now mature.
  • Increase bandwidth of man-machine communication.
  • Enable ubiquitous information access.

Key for Pervasive Web access.

13


Next Generation Web

  • Ubiquitous Web access.
  • Web is an application deployment platform.
  • Deliver end-user services to diverse clients.

Multiple modalities enrich user experience.

14


Ubiquitous Web Access

Web access opens new frontiers.
  • Success determined by end-user experience.
  • Customize to the user's needs and abilities.

Conversational interfaces enrich user interaction.

15


Semantic Web

Richer user experience requires semantics.
  • From a see-only Web to the Semantic Web.
  • Separate content, style and interaction.
  • Enhance Web infrastructure to enable:
    • Rich end-user experience.
    • New Web transaction capabilities.

16


Research Topics

  • XML technologies for semantic encodings.
  • Late binding of user interaction for flexibility.
  • User interface paradigms for rich interaction.
  • Distributed technologies for rapid deployment.

Build on industry standards.

17


Creating Conversational Interaction

18


Conversational Interaction

  • Create natural man-machine dialogs.
  • Conversation driven by a common model.
  • Model updated to reflect current state.
  • Leads to rapid task completion.

Effective conversation leads to rapid task completion.

19


Conversational Interaction

  • Common interaction state.
  • Interaction state tracks dialog history.
  • Dialog management mediated by multiple components.

Application context drives conversation flow.

20


Conversational Interaction

Conversational Dialog Management
  • Use NL technologies to interpret utterance.
  • Contextual prompt generation.
  • Resolve ambiguities based on history and context.

NL: the basis for conversational interaction.

21


Deploying Conversational Interaction

  • Call-center automation.
  • Voice access to information portals.
  • Conversational access to Web Services.

22


Creating Multimodal Interaction

23


Multimodal Interaction

Integrate multiple means of user interaction.
  • Visual interaction.
  • Spoken interaction.
  • Pen interaction.

Enrich man-machine communication.

24


Anatomy Of User Interaction

User interaction composed of:
  • Common representation of interaction state.
  • User interface controls.
  • Binding UI controls to common state.
  • UI reflects underlying state at all times.
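
The four elements above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class names `InteractionState` and `Control`, and the `destination` field, are hypothetical. Every control is bound to one shared state object, so input through any modality is reflected in all of them.

```python
from typing import Callable, Dict, List


class InteractionState:
    """Common representation of interaction state (hypothetical sketch)."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        self._listeners: List[Callable[[str, str], None]] = []

    def bind(self, listener: Callable[[str, str], None]) -> None:
        # Register a control that wants to be told about every change.
        self._listeners.append(listener)

    def set(self, name: str, value: str) -> None:
        # Update the state, then notify every bound control.
        self._values[name] = value
        for listener in self._listeners:
            listener(name, value)

    def get(self, name: str) -> str:
        return self._values.get(name, "")


class Control:
    """A UI control in one modality, bound to the shared state."""

    def __init__(self, modality: str, state: InteractionState) -> None:
        self.modality = modality
        self.state = state
        self.shown: Dict[str, str] = {}  # what this control currently presents
        state.bind(self._refresh)

    def _refresh(self, name: str, value: str) -> None:
        self.shown[name] = value

    def user_input(self, name: str, value: str) -> None:
        # Input never updates the control directly; it goes through the state.
        self.state.set(name, value)


state = InteractionState()
visual = Control("visual", state)
voice = Control("voice", state)
voice.user_input("destination", "Geneva")  # spoken input
print(visual.shown)                        # the visual control shows it too
```

Routing all input through the shared state, rather than letting controls talk to each other, is what keeps the number of bindings linear in the number of modalities.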

25


Interaction Management

Mediates among modalities.
  • Manages interaction state.
  • Responsible for synchronizing modalities.
  • Determines interaction flow.
  • Manages history of user interaction.
  • Integrates across user utterances.

26


Deploying Multimodal Interaction

Different user scenarios.
  • Run all processing locally on the client.
  • Use remote speech processing.
  • Maintain interaction state remotely.

Combinations based on available resources.

27


Shared Goals

Conversational and Multimodal:
  • Enable rapid task completion.
  • Enable robust user interaction.
  • Leverage available technology resources.

28


Building Blocks:
Web Technologies

29


Role Of Web Technologies

Ubiquitous access requires interoperability.
  • Open and standardized infrastructure,
  • Interoperable content based on standards,
  • Distributed operation via Web Services.

Standards define the Web platform.

30


Web Architecture

[Figure: Web architecture. Stages: Author, Deploy, Consume; artifacts: Content, App, UI; labels: XML, XML, XML, XForms, XHTML.]

31


Relevant Standards

Conversational And Multimodal Web Standards
  • XHTML: Holds Web content.
  • CSS: Styles Web content.
  • DOM: Document Object Model.
  • XForms: XML powered Web forms.
  • VoiceXML: Declarative speech dialogs.

Building blocks for innovation.

32


XHTML

Universal Container for Web Content.
  • Used to author Web pages.
  • Defines basic Web vocabulary.
  • Defines XHTML DOM for eventing.
  • Framework for attaching new modalities.

33


Delivering Web Interaction

  • XML delivers presentations with style.
    • XHTML Basic, CSS, SVG, ...
    • SSML, Aural CSS, ...

  • XML Events author DOM2 event listeners.
    • Declarative handlers for standard behavior.
    • VoiceXML handlers for spoken interaction.
    • ECMAScript handlers for visual interaction.

34


Aural CSS

XHTML author can:
  • Attach aural style to document content,
  • Use such content in prompts.

Enables audio formatting for rich presentation.

35


ACSS Style


P.romeo {
  voice-family: male;
  volume: loud;
  pause-before: 20ms;  /* short pause before Romeo speaks */
}
P.juliet {
  voice-family: female;
  volume: soft;
}

36


Talking Web Page


<body ev:event="load"
  ev:handler="#sayHello">
  <p id="hJuliet" class="juliet">
    Romeo, Romeo, where art thou?
  </p>
  <p id="hRomeo" class="romeo">
    I am here. </p>
</body>

Document load invokes handler.

37


DOM2 Events

DOM2 defines interoperable eventing.
  • Exposed via XML Events.
  • Enables authors to attach behavior.
  • Flexible, extensible authoring syntax.

Defines Web interaction model.

38


DOM2 Event Propagation

[Figure: DOM tree. html contains head and body; body contains an a element labeled "click here".]

39


DOM2 Event Propagation

Event flow when "click here" is activated:
  • Capture: Event travels from root to target,
  • Target: Event arrives at the target,
  • Bubble: Event bubbles back to the root.
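
The three phases can be traced with a small simulation. This is a sketch under stated assumptions: the `Node` class and `propagation_path` function are hypothetical helpers, the tree mirrors the html/head/body/a example, and the document node above `html` is left out for brevity.

```python
from typing import List, Optional


class Node:
    """A minimal tree node standing in for a DOM element."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent


def propagation_path(target: Node) -> List[str]:
    """List the nodes an event visits, in order, for the given target."""
    ancestors = []
    node = target.parent
    while node is not None:
        ancestors.append(node)  # ordered from the target's parent up to the root
        node = node.parent
    trace = [f"capture:{n.name}" for n in reversed(ancestors)]  # root -> target
    trace.append(f"target:{target.name}")
    trace += [f"bubble:{n.name}" for n in ancestors]            # target -> root
    return trace


html = Node("html")
head = Node("head", html)
body = Node("body", html)
link = Node("a", body)

trace = propagation_path(link)
print(trace)
# ['capture:html', 'capture:body', 'target:a', 'bubble:body', 'bubble:html']
```

Note that `head` never sees the event: only the ancestors of the target are on the path.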

40.1


VoiceXML 2.0

Bring power of Web development to voice.
  • Abstracts details of underlying technology.
  • Authors focus on man-machine dialogs.
  • Automates minutiæ of basic spoken dialogs.
  • Reduces cost of deployment.

41


Anatomy Of A VoiceXML Page

  • VoiceXML form specifies voice dialogs.
  • Dialogs consist of one or more fields.
  • Fields consist of grammars and prompts.

Declarative authoring of spoken dialogs.
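
The form, field, grammar and prompt structure can be mimicked in ordinary code. This is a simplified sketch and not the VoiceXML form interpretation algorithm: the `Field` class, the `run_form` loop, the example prompts and the use of a plain set of phrases as a "grammar" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Set


@dataclass
class Field:
    name: str
    prompt: str                 # what the system says to collect this field
    grammar: Set[str]           # utterances the field accepts (toy grammar)
    value: Optional[str] = None


def run_form(fields: List[Field], utterances: Iterator[str]) -> List[str]:
    """Visit each unfilled field in order, re-prompting until the grammar matches.

    Returns everything the system said. Raises StopIteration if the
    scripted user input runs out before the form is complete.
    """
    transcript = []
    for field in fields:
        while field.value is None:
            transcript.append(field.prompt)
            heard = next(utterances).strip().lower()
            if heard in field.grammar:
                field.value = heard
            else:
                transcript.append("Sorry, I did not understand.")
    return transcript


form = [
    Field("city", "Which city?", {"geneva", "paris"}),
    Field("travel_class", "Economy or business?", {"economy", "business"}),
]
transcript = run_form(form, iter(["Geneva", "first", "business"]))
result = {f.name: f.value for f in form}
print(result)
```

The author only declares fields, prompts and grammars; the loop that drives the dialog is generic, which is the sense in which the authoring is declarative.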

42


X+V

X+V: Voice-enables the Web.
  • Handlers invoked by the DOM event loop.
  • Visual handlers written in ECMAScript.
  • Voice handlers written in VoiceXML.

Add Voice Interaction To XHTML.

43


Voice Handler


<v:form id="sayHello">
  <v:block>
    <v:prompt xv:src="#hRomeo"/>
    <v:prompt xv:src="#hJuliet"/>
  </v:block>
</v:form>
    

44


Interaction Metaphors

Spoken input with visual confirmation.
  • Event focus triggers mixed-initiative dialog.
  • Dialog collects multiple fields.
  • Synchronization provides visual confirmation.

45


Interaction Metaphors

Talk and type
  • Mixed-initiative fallback to directed dialog.
  • Also true of multimodal interaction.
  • Attach directed dialogs to individual fields.
  • User can escape from mixed-initiative dialog.

X+V: only one dialog is active at a time.
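
The fallback from mixed-initiative to directed dialog can be sketched as follows. This is an illustration under assumptions, not X+V behavior: the city list, the `from`/`to` keyword patterns, the slot names and the prompts are hypothetical. One utterance may fill several fields at once; any field still empty gets its own directed prompt.

```python
import re
from typing import Dict, Optional

CITIES = ("geneva", "paris", "boston")  # toy vocabulary

# Directed prompts attached to individual fields, used as the fallback.
DIRECTED_PROMPTS = {
    "origin": "Where are you leaving from?",
    "destination": "Where are you going?",
}


def _city_after(keyword: str, text: str) -> Optional[str]:
    """Return the first known city that directly follows the keyword."""
    for word in re.findall(rf"\b{keyword} (\w+)", text):
        if word in CITIES:
            return word
    return None


def parse_trip(utterance: str) -> Dict[str, str]:
    """Mixed initiative: fill as many slots as the utterance provides."""
    text = utterance.lower()
    slots: Dict[str, str] = {}
    origin = _city_after("from", text)
    destination = _city_after("to", text)
    if origin:
        slots["origin"] = origin
    if destination:
        slots["destination"] = destination
    return slots


def next_prompt(slots: Dict[str, str]) -> Optional[str]:
    """Directed fallback: prompt for the first field still missing."""
    for name, prompt in DIRECTED_PROMPTS.items():
        if name not in slots:
            return prompt
    return None


full = parse_trip("I want to fly from Boston to Geneva")
partial = parse_trip("I need to get to Paris")
print(full, next_prompt(full))
print(partial, next_prompt(partial))
```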

46


Interaction Metaphors

Talk or type
  • Mixed-initiative dialog active during pen input.
  • Enables user to talk or type.
  • Simply drop voice handlers on fields.

47


Standards In Development

Conversational And Multimodal Standards
  • XHTML2: next generation XHTML.
  • EMMA: Extensible Multimodal Annotations.
  • W3C MMI: integrate W3C technologies.
  • VoiceXML 3.0: next-generation dialog markup.

Multimodal framework ties the pieces together.

48


XForms — XML Powered Web Forms

  • Separates data being collected from the presentation.
  • Enables the binding of different user interfaces.
  • Common model captures interaction state.
  • Enables automatic synchronization across modalities.

Enables multimodal and multi-channel access.

49


Illustrative Examples

50


Travel Application

Goal: Book a trip.
  • Make arrangements for travel to EuroSpeech 2003,
  • Including airline and hotel reservations,
  • Optionally, shop for the best available fares.

Applications comprise one or more sub-tasks.

51


Classifying Speech Tasks

Classifying tasks by number of turns required
  • Single-turn dialog
    • Capturing user intent.

  • Multi-turn dialogs
    • Complex, multi-step transactions.
    • Combination of directed and free-form dialogs.

52


Single-turn Dialogs

Technology       Input           Output
Grammar          Constrained     Finite
NL Classifiers   Unconstrained   Finite
NL Parsers       Unconstrained   Structured

Limited Syntax, Intent Classes, Free-form
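
The difference between the first two rows can be shown with a toy example. Both functions are hypothetical stand-ins: a regular expression plays the role of a speech grammar, and keyword counting plays the role of a statistical NL classifier. The intent names and keyword sets are made up for illustration.

```python
import re
from typing import Dict, Optional, Set

# Grammar: constrained input. Only utterances of this exact shape are accepted.
YES_NO_GRAMMAR = re.compile(r"^(yes|no)( please| thanks)?$")


def grammar_match(utterance: str) -> Optional[str]:
    """Return 'yes' or 'no' if the utterance is in the grammar, else None."""
    m = YES_NO_GRAMMAR.match(utterance.strip().lower())
    return m.group(1) if m else None


# Classifier: unconstrained input, mapped onto a finite set of intent classes.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "billing": {"bill", "charge", "payment", "invoice"},
    "tracking": {"package", "track", "delivery", "shipment"},
}


def classify(utterance: str) -> Optional[str]:
    """Pick the intent whose keywords overlap the utterance most, if any."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {intent: len(words & kw) for intent, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(grammar_match("Yes please"))
print(grammar_match("well, I guess so"))
print(classify("Hi, where is my package? The delivery is late."))
```

The grammar rejects anything outside its language, while the classifier accepts any wording but can only answer with one of its predefined classes.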

53


Applications

Single-turn Dialogs
  • Call routing.
  • Package tracking.
  • Name dialing.

54


Multi-turn Dialogs

Task requires multiple turns.
  • Directed dialogs for constrained interaction.
  • Mixed-initiative dialogs for free-form interaction.

Mix and match to taste.

55


Applications

Multi-turn Dialogs
  • Financial transactions.
  • Buying a plane ticket.
  • Making a hotel reservation.

Automation requires multiple turns.

56


First Generation Voice Portals

  • Control interaction through directed dialogs.
  • Constrain user interaction via grammars.
  • Extract intent to route among applications.

Build complexity by aggregation.

57


Simple Multimodal Interaction

  • User interacts via multiple modalities.
  • User actions reflected in all modes.

Web forms can be speech-enabled today.

58


Mixed-initiative Dialog

Advanced VoiceXML application.
  • Enable user-driven dialogs.
  • Enable rapid task completion.

Can be used stand-alone or with MMI.

59


Conversational Interaction

Directed, Mixed-initiative, Free-form
  • Use NL semantics.
  • Leverage contextual knowledge.
  • Track interaction history and user preferences.
  • Achieve rich man-machine dialog.

At the cutting edge: VoiceXML 3.0.

60


Composite User Input

Composing across time and modalities.
  • Composing across time: Conversational interaction.
  • Composing across modalities: Multimodal interaction.

Richer interaction for rapid task completion.
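
Composing across modalities can be illustrated with a spoken request that points at something tapped on screen. This is a sketch under assumptions: the `Speech` and `Tap` records, the `fuse` function, the word "here" as the only deictic term and the two-second time window are all hypothetical choices, not part of any standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Speech:
    time: float   # seconds since the start of the session
    text: str


@dataclass
class Tap:
    time: float
    target: str   # what the pen or finger landed on


def fuse(speech: Speech, taps: List[Tap], window: float = 2.0) -> str:
    """Replace 'here' in the utterance with the tap closest in time.

    Taps further than `window` seconds from the utterance are ignored,
    in which case the utterance is returned unchanged.
    """
    words = speech.text.split()
    if "here" not in words:
        return speech.text
    nearby = [t for t in taps if abs(t.time - speech.time) <= window]
    if not nearby:
        return speech.text
    closest = min(nearby, key=lambda t: abs(t.time - speech.time))
    return " ".join(closest.target if w == "here" else w for w in words)


request = Speech(10.0, "book a hotel near here")
taps = [Tap(3.0, "Paris"), Tap(10.5, "Geneva")]
print(fuse(request, taps))
```

Composing across time works the same way in spirit: the missing piece comes from an earlier turn in the dialog history instead of from a second modality.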

61


Technology Challenges

62


Speech Technologies

  • Integrate ASR and TTS to create rich dialogs.
  • Improve ASR performance in specific contexts.
  • Improve TTS quality to improve the conversation.
  • Richer dialog management.

63


Superhuman
Speech Technologies

64


Natural Language Processing

  • Evolve NL techniques that enable rapid development.
  • Create NL frameworks that are extensible.
  • Integrate NL with underlying technologies.
    • Use NL to better process user input.
    • Use NL Generation to improve system output.

65


Next Generation Dialogs

  • Develop next-generation dialog frameworks.
  • Enable pluggable dialog strategies.
  • Enable different interaction metaphors:
    • Butler: Jeeves.
    • Wizards (and muggles?).
    • Personal assistant.

Introduce new paradigms for user interaction.

66


Deploying Solutions

  • Enable re-use of user interface components.
  • Enable distributed Web-based deployment.
  • Scalable, affordable deployment.

Technology Challenge

Synchronized interaction on distributed Web.

67


Conclusion

  • Conversational interaction is the next UI frontier.
  • Multimodal interaction is key for pervasive access.

Together they define the next set of technology challenges.

68

