Conversational And Multimodal Interaction
David Nahamoo
T. V. Raman
IBM Research

Outline
- Key drivers for new user interfaces.
- Creating Conversational Interaction.
- Creating Multimodal Interaction.
- Using Web technologies.
- Challenges for Speech technologies.

What Is UI?
[Figure: Application and User exchanging input and output (I, O); labels: Request, Response, Intention, Attention.]

What Is UI?
Computer applications
- Obtain user input.
- Compute on the information.
- Display the result.
UI I/O

Man-machine Dialog
Building Blocks: Conversational Gestures

Conversational Gestures
Graphical User Interface
- Edit widgets, Message widgets
Answering Yes Or No
- Toggles
Select Elements From Set
- Radio groups, List boxes
Traversing Complex Structures
- next, previous, child, parent

Abstractions Reflect Technology
GUI: Consequence Of Available Peripherals
- Exploits graphical display and pointing device.
- Reflects impracticality of using natural language at the time.
GUI ↔ Primitive Sign Language

Research Question
Define next-generation UI building blocks.
- GUI: WIMP.
- SUI: Directed Dialogs.
- MUI: Synchronized Interaction.
- CUI: Free-form Dialog.
- EUI: Context-aware Interaction.

Different Types Of Interaction
Task: Withdraw money at an automated kiosk.

| UI | Interaction Style |
|---|---|
| GUI | Point and click |
| SUI | Directed dialog |
| MUI | Tap and talk |
| CUI | Free-form dialog |
| EUI | Context-aware interaction |

Next Generation User Interfaces
Renew Man-machine Communication
- Evolve to leverage 30 years of progress:
  - Leverage benefits of Moore’s law.
  - Speech technologies are now mainstream.
  - Language technologies help break out of the GUI sandbox.
Reduce impedance mismatch.

Types Of Interaction
Conversational And Multimodal Interaction.
- CUI and MUI are mutually independent.
- Speech interfaces use conversational interaction.
- Multimodal interaction can leverage CUI.
Determine richness of end-user experience.

Why Conversational And Multimodal Interaction?

Human Language Technologies
Enrich end-user experience.
- Base technologies are now mature.
- Increase bandwidth of man-machine communication.
- Enable ubiquitous information access.
Key for Pervasive Web access.

Next Generation Web
- Ubiquitous Web access.
- Web is an application deployment platform.
- Deliver end-user services to diverse clients.
Multiple modalities enrich user experience.

Ubiquitous Web Access
Web access opens new frontiers.
- Success determined by end-user experience.
- Customize to users' needs and abilities.
Conversational interfaces enrich user interaction.

Semantic Web
Richer user experience requires semantics.
- From See-only Web to Semantic Web.
- Separate content, style and interaction.
- Enhance Web infrastructure to enable
  - Rich end-user experience.
  - New Web transaction capabilities.

Research Topics
- XML technologies for semantic encodings.
- Late binding of user interaction for flexibility.
- User interface paradigms for rich interaction.
- Distributed technologies for rapid deployment.
Build on industry standards.

Creating Conversational Interaction

Conversational Interaction
- Create natural man-machine dialogs.
- Conversation driven by a common model.
- Model updated to reflect current state.
- Leads to rapid task completion.
Effective conversation → rapid task completion.

Conversational Interaction
- Common interaction state.
- Interaction state tracks dialog history.
- Dialog management mediated by multiple components.
Application context drives conversation flow.

Conversational Interaction
Conversational Dialog Management
- Use NL technologies to interpret utterance.
- Contextual prompt generation.
- Resolve ambiguities based on history and context.
NL: basis for conversational interaction.

Deploying Conversational Interaction
- Call-center automation.
- Voice access to information portals.
- Conversational access to Web Services.

Creating Multimodal Interaction

Multimodal Interaction
Integrate multiple means of user interaction.
- Visual interaction.
- Spoken interaction.
- Pen interaction.
- …
Enrich man-machine communication.

Anatomy Of User Interaction
User interaction composed of:
- Common representation of interaction state.
- User interface controls.
- Binding UI controls to common state.
- UI reflects underlying state at all times.

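The anatomy above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class `InteractionState`, its `bind` and `set` methods, and the sample value are all hypothetical names chosen for this example. It shows one common state with two bound controls, one visual and one spoken, both of which reflect every change.

```python
from typing import Callable


class InteractionState:
    """Common interaction state shared by every modality (hypothetical)."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}
        self._listeners: list[Callable[[str, str], None]] = []

    def bind(self, listener: Callable[[str, str], None]) -> None:
        """Bind a UI control so it is told about every state change."""
        self._listeners.append(listener)

    def set(self, key: str, value: str) -> None:
        """Update the state, then notify all bound controls."""
        self._values[key] = value
        for listener in self._listeners:
            listener(key, value)


state = InteractionState()
visual_view: dict[str, str] = {}   # what a text field would display
spoken_log: list[str] = []         # what a voice prompt would say

state.bind(lambda k, v: visual_view.update({k: v}))
state.bind(lambda k, v: spoken_log.append(f"{k} is {v}"))

# Input arriving from any one modality updates the shared state once.
state.set("destination", "Geneva")
```

Because controls only read from and write to the shared state, adding a third modality in this sketch means adding one more `bind` call rather than wiring controls to each other.
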
Interaction Management
Mediates among modalities.
- Manages interaction state.
- Responsible for synchronizing modalities.
- Determine interaction flow.
- Manage history of user interaction.
- Integrate across user utterances.

Deploying Multimodal Interaction
Different user scenarios.
- Run all processing locally on the client.
- Use remote speech processing.
- Maintain interaction state remotely.
Combinations based on available resources.

Shared Goals
Conversational and Multimodal:
- Enable rapid task completion.
- Enable robust user interaction.
- Leverage available technology resources.

Building Blocks: Web Technologies

Role Of Web Technologies
Ubiquitous access requires interoperability.
- Open and standardized infrastructure,
- Interoperable content based on standards,
- Distributed operation via Web Services.
Standards define the Web platform.

Web Architecture
[Figure: Web architecture diagram; labels: XML, XForms, ⋯, XHTML.]

Relevant Standards
Conversational And Multimodal Web Standards
- XHTML: Holds Web content.
- CSS: Styles Web content.
- DOM: Document Object Model.
- XForms: XML powered Web forms.
- VoiceXML: Declarative speech dialogs.
Building blocks for innovation.

XHTML
Universal Container for Web Content.
- Used to author Web pages.
- Defines basic Web vocabulary.
- Defines XHTML DOM for eventing.
- Framework for attaching new modalities.

Delivering Web Interaction
- XML delivers presentations with style.
  - XHTML Basic, CSS, SVG, ….
  - SSML, Aural CSS, ….
- XML Events author DOM2 event listeners.
  - Declarative handlers for standard behavior.
  - VoiceXML handlers for spoken interaction.
  - ECMAScript handlers for visual interaction.

Aural CSS
XHTML author can:
- Attach aural style to document content,
- Use such content in prompts.
Enables audio formatting for rich presentation.

ACSS Style
P.romeo {voice-family: male;
         volume: loud;
         pause-before: 20ms;}
P.juliet {voice-family: female;
          volume: soft;}

Talking Web Page
<body ev:event="load"
      ev:handler="#sayHello">
  <p id="hJuliet" class="juliet">
    Romeo, Romeo, where art thou?
  </p>
  <p id="hRomeo" class="romeo">
    I am here. </p>
</body>
Document load invokes handler.

DOM2 Events
DOM2 defines interoperable eventing.
- Exposed via XML Events.
- Enables authors to attach behavior.
- Flexible, extensible authoring syntax.
Defines Web interaction model.

DOM2 Event Propagation
[Figure: document tree with nodes html, head, body, and a (the "click here" link).]

DOM2 Event Propagation
Event flow when "click here" is activated.
- Capture: Event travels from root to target,
- Target: Event arrives at the target,
- Bubble: Event bubbles back to the root.

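The three phases can be traced with a small Python simulation. This is a sketch under stated assumptions: `Node` and `event_path` are hypothetical stand-ins written for this example, not the DOM API, and the tree (html, body, a) is assumed from the slide's diagram.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A hypothetical stand-in for a DOM element (not the real DOM API)."""
    name: str
    parent: Optional["Node"] = None


def event_path(target: Node) -> list[tuple[str, str]]:
    """Return (phase, node name) pairs in the order an event visits them."""
    # Collect ancestors, then reverse so the list runs from the root down.
    ancestors = []
    node = target.parent
    while node is not None:
        ancestors.append(node)
        node = node.parent
    ancestors.reverse()

    path = [("capture", n.name) for n in ancestors]             # root to target
    path.append(("target", target.name))                        # at the target
    path += [("bubble", n.name) for n in reversed(ancestors)]   # back to root
    return path


html = Node("html")
body = Node("body", parent=html)
link = Node("a", parent=body)

for phase, name in event_path(link):
    print(f"{phase:8s}{name}")
```

A listener registered on `body` in this model would see the click twice, once on the way down and once on the way up, which is why DOM2 listeners declare the phase they want.
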
VoiceXML 2.0
Bring power of Web development to voice.
- Abstracts details of underlying technology.
- Authors focus on man-machine dialogs.
- Automates minutiæ of basic spoken dialogs.
- Reduces cost of deployment.

Anatomy Of A VoiceXML Page
- VoiceXML <form> specifies voice dialogs.
- Dialogs consist of one or more fields.
- Fields consist of grammars and prompts.
Declarative authoring of spoken dialogs.

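The form, field, grammar, and prompt structure can be modeled in Python to show how a declarative description drives a dialog. This is a simplified sketch, not the VoiceXML form interpretation algorithm: `Field`, `run_form`, the reprompt wording, and the sample grammars are all assumptions made for illustration, and user speech is replaced by a scripted list of utterances.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    prompt: str
    grammar: set[str]   # utterances this field accepts


def run_form(fields: list[Field], utterances: list[str]) -> dict[str, str]:
    """Visit fields in order; reprompt until an utterance matches the grammar."""
    replies = iter(utterances)   # scripted stand-in for recognized speech
    filled: dict[str, str] = {}
    for f in fields:
        while f.name not in filled:
            print(f.prompt)
            said = next(replies).strip().lower()
            if said in f.grammar:
                filled[f.name] = said
            else:
                print("Sorry, I did not understand.")
    return filled


form = [
    Field("city", "Which city?", {"geneva", "boston"}),
    Field("seat", "Window or aisle?", {"window", "aisle"}),
]
result = run_form(form, ["Geneva", "middle", "aisle"])
```

The author supplies only the `form` data; the loop that prompts, listens, and reprompts is generic, which is the sense in which the dialog is authored declaratively.
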
X+V
X+V: Voice-enables the Web.
- Handlers invoked by the DOM event loop.
- Visual handlers written in ECMAScript.
- Voice handlers written in VoiceXML.
Add Voice Interaction To XHTML.

Voice Handler
<v:form id="sayHello">
  <v:block>
    <v:prompt xv:src="#hRomeo"/>
    <v:prompt xv:src="#hJuliet"/>
  </v:block>
</v:form>

Interaction Metaphors
Spoken input with visual confirmation.
- Event focus triggers mixed-initiative dialog.
- Dialog collects multiple fields.
- Synchronization provides visual confirmation.

Interaction Metaphors
Talk and type
- Mixed-initiative fallback to directed dialog.
- Also true of multimodal interaction.
- Attach directed dialogs to individual fields.
- User can escape from mixed-initiative dialog.
X+V: only one dialog active at a time.

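The fallback from mixed-initiative to directed dialog can be sketched as follows. This is an illustrative Python model, not X+V: the functions `mixed_initiative` and `book`, the `CITIES` vocabulary, and the "from X to Y" phrasing are hypothetical, and the directed answers are passed in rather than spoken.

```python
import re
from typing import Optional

CITIES = {"geneva", "boston", "paris"}   # illustrative vocabulary


def _slot(text: str, preposition: str) -> Optional[str]:
    """Return the first known city that follows the given preposition."""
    for word in re.findall(rf"\b{preposition} (\w+)", text):
        if word in CITIES:
            return word
    return None


def mixed_initiative(utterance: str) -> dict[str, str]:
    """Fill as many fields as one free-form utterance supplies."""
    text = utterance.lower()
    slots = {"origin": _slot(text, "from"), "destination": _slot(text, "to")}
    return {name: city for name, city in slots.items() if city is not None}


def book(first_utterance: str, directed_answers: dict[str, str]) -> dict[str, str]:
    """Try mixed initiative first; ask one directed question per missing field."""
    filled = mixed_initiative(first_utterance)
    for name in ("origin", "destination"):
        if name not in filled:
            # Directed dialog attached to the individual field.
            filled[name] = directed_answers[name].lower()
    return filled
```

For example, `book("I want to fly to Geneva", {"origin": "Boston"})` fills the destination from the opening utterance and falls back to a directed question only for the origin.
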
Interaction Metaphors
Talk or type
- Mixed-initiative dialog active during pen input.
- Enables user to talk or type.
- Simply drop voice handlers on fields.

Standards In Development
Conversational And Multimodal Standards
- XHTML2: next generation XHTML.
- EMMA: Extensible Multimodal Annotations.
- W3C MMI: integrate W3C technologies.
- VoiceXML 3.0: next-generation dialog markup.
Multimodal framework ties the pieces together.

XForms: XML Powered Web Forms
- Separates data being collected from the presentation.
- Enables the binding of different user interfaces.
- Common model captures interaction state.
- Enables automatic synchronization across modalities.
Enables multimodal and multi-channel access.

Travel Application
Goal: Book a trip.
- Make arrangements for travel to EuroSpeech 2003,
- Including airline and hotel reservations,
- Optionally, shop for the best available fares.
Applications comprise one or more sub-tasks.

Classifying Speech Tasks
Classifying tasks by number of turns required
- Single-turn dialog
- Multi-turn dialogs
  - Complex, multi-step transactions.
  - Combination of directed and free-form dialogs.

Single-turn Dialogs

| Technology | Input | Output |
|---|---|---|
| Grammar | Constrained | Finite |
| NL Classifiers | Unconstrained | Finite |
| NL Parsers | Unconstrained | Structured |

Limited Syntax … Intent Classes … Free-form

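The first two rows of the table can be contrasted in a short Python sketch. Everything here is an assumption made for illustration: the command set, the intent names, and especially the keyword-overlap scoring, which stands in for a trained statistical classifier. NL parsers, which produce structured output, are not modeled.

```python
import re
from typing import Optional

# Grammar: constrained input, finite output (hypothetical command set).
GRAMMAR = {"check balance": "BALANCE", "track package": "TRACKING"}

# Classifier: unconstrained input, finite output. Toy keyword scoring is
# used here in place of a trained statistical model.
KEYWORDS = {
    "BALANCE": {"balance", "account", "money"},
    "TRACKING": {"package", "parcel", "shipment", "track"},
}


def grammar_match(utterance: str) -> Optional[str]:
    """Accept only utterances the grammar lists exactly; otherwise None."""
    return GRAMMAR.get(utterance.strip().lower())


def classify(utterance: str) -> Optional[str]:
    """Pick the intent class whose keywords overlap the utterance most."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {intent: len(words & kw) for intent, kw in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The utterance "where is my parcel" is rejected by `grammar_match` because it is outside the grammar, while `classify` still maps it to one of the finite intent classes.
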
Applications
Single-turn Dialogs
- Call routing.
- Package tracking.
- Name dialing.

Multi-turn Dialogs
Task requires multiple turns.
- Directed dialogs for constrained interaction.
- Mixed-initiative dialogs for free-form interaction.
Mix and match to taste.

Applications
Multi-turn Dialogs
- Financial transactions.
- Buying a plane ticket.
- Making a hotel reservation.
Automation requires multiple turns.

First Generation Voice Portals
- Control interaction through directed dialogs.
- Constrain user interaction via grammars.
- Extract intent to route among applications.
Build complexity by aggregation.

Simple Multimodal Interaction
- User interacts via multiple modalities.
- User actions reflected in all modes.
Web forms can be speech-enabled today.

Mixed-initiative Dialog
Advanced VoiceXML application.
- Enables user-driven dialogs.
- Enables rapid task completion.
Can be used stand-alone or with MMI.

Conversational Interaction
Directed … Mixed-initiative … Free-form
- Use NL semantics.
- Leverage contextual knowledge.
- Track interaction history and user preferences.
- Achieve rich man-machine dialog.
At the cutting edge: VoiceXML 3.0.

Composite User Input
Composing across time and modalities.
- Composing across time: Conversational interaction.
- Composing across modalities: Multimodal interaction.
Richer interaction for rapid task completion.

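Composing across modalities can be illustrated with a tap-and-talk example in Python. This is a sketch under stated assumptions: the `Tap` and `Utterance` types, the `compose` function, the two-second window, and the rule of resolving the word "here" to the nearest tap are all hypothetical choices, not a method described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tap:
    time: float   # seconds since the interaction started
    item: str     # what the user pointed at


@dataclass
class Utterance:
    time: float
    text: str


def compose(utterance: Utterance, taps: list[Tap], window: float = 2.0) -> Optional[str]:
    """Replace the deictic word "here" with the tap closest in time."""
    words = utterance.text.split()
    if "here" not in words:
        return utterance.text   # nothing to resolve
    near = [t for t in taps if abs(t.time - utterance.time) <= window]
    if not near:
        return None   # unresolved; a real system would ask the user
    closest = min(near, key=lambda t: abs(t.time - utterance.time))
    return " ".join(closest.item if w == "here" else w for w in words)


taps = [Tap(3.0, "Paris"), Tap(9.5, "Geneva")]
composed = compose(Utterance(10.0, "find hotels near here"), taps)
```

Composing across time would extend the same idea to earlier turns, for instance by resolving "there" against a destination mentioned previously in the dialog history.
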
Speech Technologies
- Integrate ASR and TTS to create rich dialogs.
- Improve ASR performance in specific contexts.
- Improve TTS quality to improve the conversation.
- Richer dialog management.

Superhuman Speech Technologies

Natural Language Processing
- Evolve NL techniques that enable rapid development.
- Create NL frameworks that are extensible.
- Integrate NL with underlying technologies.
  - Use NL to better process user input.
  - Use NL Generation to improve system output.

Next Generation Dialogs
- Develop next-generation dialog frameworks.
- Enable pluggable dialog strategies.
- Enable different interaction metaphors:
  - Butler: Jeeves.
  - Wizards (and muggles?).
  - Personal assistant.
Introduce new paradigms for user interaction.

Deploying Solutions
- Enable re-use of user interface components.
- Enable distributed Web-based deployment.
- Scalable, affordable deployment.
Technology Challenge
Synchronized interaction on distributed Web.

Conclusion
- Conversational interaction is the next UI frontier.
- Multimodal interaction is key for pervasive access.
Together they define the next set of technology challenges.