A generic formal model for fission of modalities in output multi-modal interactive systems

Output multi-modal human-machine interfaces semantically combine output medias and modalities in order to increase the interaction capabilities of machines. The aim of this paper is to present a formal model supporting formal specifications of output multi-modal interactive systems. As a consequence, the expected usability and robustness properties can be expressed and checked. This paper proposes a generic model which makes it possible to specify output multi-modal interfaces following the CASE design space.


INTRODUCTION
The emergence of new interaction devices (touch-screens, haptic devices, etc.) and the use of machines in several situations, with diverse modes (visual, auditory, etc.) and by different users, have made possible several new man-machine interaction capabilities. Setting up such interactions, namely multi-modal interactions, has increased the user's interaction capabilities. Indeed, parallel, concomitant, synergistic, etc. interactions became possible. Moreover, these new interaction capabilities opened several application areas: for disabled people, in critical systems, in game development, etc.
The use of multi-modal interfaces increases the interactive capabilities of machines, but the sequential and parallel combination of modalities increases the complexity of the information representation. Therefore, it leads to more complex development and validation processes. Moreover, the introduction of interactive computing systems in critical systems requires a high safety level and calls for rigorous design approaches which guarantee the satisfaction of the expressed specifications.
Currently, the traditional approaches used for the specification of the functional core of a system are also used for the development of interfaces. These traditional approaches are not well adapted to such interactive systems. Their main drawback is their inability to handle the needs and requirements of interactive systems.

Output multi-modal interaction: an interaction in which the information produced by the functional core is split and returned to the user via different output modalities according to the interaction context. For example, when the user asks the system about the trains going from city A to city B, the system answers by speech synthesis, "Your query can be satisfied by the four following trains", and at the same moment displays the detailed train list. In this example, the text and speech synthesis modalities are used. The text modality is obtained by combining the visual mode with the media screen, and the speech synthesis modality is obtained by combining the auditory mode with the media speaker.

OUTPUT MULTI-MODAL INTERACTION: A BRIEF STATE OF THE ART
Output multi-modal systems have been developed in many areas. Examples of such systems are: COMET [5], designed for the automatic generation of maintenance diagnoses of portable military radios, which combines visual modalities (text and graphics); Magic [6], designed for the generation of cardiovascular post-operative briefings using visual and auditory modalities; and Smartkom [7], which supports the administration of diverse services (address book, booking hotels, ordering appliances) and is a symmetrical multi-modal system using the same modalities (speech, gesture, facial expression) for input and output interactions.
Multi-modality in general, and output multi-modality in particular, may be defined according to different design processes. Several works have been devoted to the characterization of multi-modal interfaces in order to assess and compare them. The best-known design space is CASE [8]. It characterizes temporality, statement scheduling and their involvement in the interactive task. The CASE design space suggests classifying multi-modality according to two axes: the composition of medias (sequential or parallel) and the relations between information (combined or independent). This classification leads to four multi-modal interaction types: concurrent (parallel and independent), alternate (sequential and combined), synergistic (parallel and combined) and exclusive (sequential and independent). Moreover, other works have focused on modelling the design process of an output multi-modal interaction. For example, the SRM (Standard Reference Model) [10] builds the output interaction starting from the target goal. It involves five layers.
• Control layer to select the next goal to achieve (output interactive task).
• Content layer to, first, refine the goal into more specialized subgoals and, second, for each elementary goal, select the adequate (modality, media) pair and the presentation content.
• Design layer to set the morphological presentation attributes (for example, the font size used) and the spatial-temporal attributes (timing and layout on the interface).
• Realization layer, responsible for the generation of the effective presentation.
• Presentation display layer whose role is to distribute the different components of the presentation to the appropriate media and coordinate the various components to construct the global presentation.
The WWHT model (What, Which, How, Then) [11] (see Figure 1) builds the output multi-modal interface by answering the four following questions:
• What: what is the provided information?
• Which: what is the chosen multi-modal presentation?
• How: how is this presentation instantiated?
• Then: how does this presentation evolve?
Answering these questions leads to the multi-modal output presentation architectural design process shown in Figure 1. It includes four steps.
• Semantic fission: a semantic decomposition of the information produced by the functional core into elementary information units that may be processed for output presentation purposes. It answers the question: What?
• Presentation allocation: for each elementary information unit, it selects the appropriate multi-modal presentation associated with the current state of the interaction context and consolidates the selected multi-modal presentations into a global one. It answers the question: Which?
• Instantiation: determines, for the presentation modalities, the lexico-syntactic content and the morphological attributes according to the interaction context. It answers the question: How?
• Evolution: defines the evolution of the multi-modal presentation according to interaction context changes. Depending on the change, this evolution can reset the presentation design either to the allocation phase or to the instantiation phase. It answers the question: Then?
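As an illustration, the four WWHT steps can be chained into a single pipeline. The following Python sketch is our own illustration, not the paper's implementation; all function and parameter names (semantic_fission, allocate, context, etc.) are hypothetical.

```python
# Skeleton of the four WWHT steps (illustrative names, toy data).

def semantic_fission(information):       # What: split into elementary units
    return information.split(' + ')      # toy decomposition

def allocate(units, context):            # Which: pick a (modality, media) per unit
    return [(u, context['preferred'][u]) for u in units]

def instantiate(presentation, context):  # How: fix concrete attributes
    return [(u, pair, {'font': context['font']}) for u, pair in presentation]

def evolve(instance, context_change):    # Then: rerun allocation or instantiation
    return 'reallocate' if context_change == 'media lost' else 'reinstantiate'

context = {'preferred': {'summary': ('speech', 'speaker'),
                         'list': ('text', 'screen')},
           'font': 12}
units = semantic_fission('summary + list')
alloc = allocate(units, context)
inst = instantiate(alloc, context)
print(alloc)   # [('summary', ('speech', 'speaker')), ('list', ('text', 'screen'))]
```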
From this brief overview of the state of the art of output multi-modal interface design, we can state that little work has been devoted to defining development cycles for such interfaces. Moreover, we can assert that no formal modelling of the overviewed design spaces is currently available. In practice, only informal design processes are put into practice. The previously overviewed models offer neither a formal framework ensuring a sound design according to rigorous specifications nor the possibility to verify functional or usability properties. Our work proposes a generic and formal model for output multi-modal interaction design and validation based on the WWHT model.

THE FORMAL MODEL PROPOSED
The formal model we suggest formally specifies a multi-modal output interactive system. It models the first two phases of the multi-modal presentation generation process (Figure 1) based on the WWHT model (semantic fission and allocation) in a formal framework, and enriches them with additional useful operators.
This model formalizes the successive representations of output information throughout the refinement process, induced by the four steps of the WWHT design model, of the information generated by the functional core. This refinement leads to the multi-modal presentation. Therefore, our global formal model is composed of two formal models: the fission model and the allocation model. Each model is defined by its syntax and its static and dynamic semantics.

The Fission model
The fission model expresses the semantic fission, or decomposition, of the information generated by the functional core into elementary information units delivered to the user. The description of the semantic fission model includes the description of its syntax and of its static and dynamic semantics. The objective of fission is to describe the correct composition of basic information units (static semantics) and their temporal occurrence (dynamic semantics); a fission description parameterized for the CASE design space is then introduced. Notice that other design spaces could have been handled in the same way.

Syntax
Let I be the set of continuous information to fission, and UIE the set of elementary information units. The description of the fissioned elements is given by the following BNF rule:

I ::= UIE | (op_temp, op_sem)(I, I) | It(n, I) where n ∈ N

where:
• op_temp is a binary temporal operator belonging to TEMP = {An, Sq, Ct, Cd, Pl, Ch, In}.
• op_sem is a binary semantic operator belonging to SEM = {Cc, Cp, Cr, Pr, Tr}.
• It is a binary temporal operator expressing iteration.
The temporal and semantic binary operators are defined on traces of events that express the production of the information i_i ∈ I resulting from the fission description. Their signatures are: op_temp : I × I → I, op_sem : I × I → I and It : N × I → I.
To define the meaning of the introduced temporal operators, let i_i, i_j be two information elements of I. Then:
• i_i An i_j for i_i anachronic to i_j, i.e. i_j occurs after an interval of time following the end of i_i.
• i_i Sq i_j for i_i sequential to i_j, i.e. i_j occurs immediately when i_i ends.
• i_i Ct i_j for i_i concomitant to i_j, i.e. i_j occurs after the beginning of i_i and ends after i_i ends.
• i_i Cd i_j for i_i coincident with i_j, i.e. i_j occurs after the beginning of i_i and ends before i_i ends.
• i_i Pl i_j for i_i parallel to i_j, i.e. i_i and i_j begin and end at the same moment.
• i_i Ch i_j for choice between i_i and i_j, i.e. a deterministic choice between i_i and i_j.
• i_i In i_j for independent order between i_i and i_j, i.e. the temporal relation between i_i and i_j is unknown.
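Assuming, as the static semantics does, that every information element reduces to a (start, end) interval through the temporal relation T, the temporal operators can be sketched as interval predicates. The following Python fragment is illustrative only; the interval values are invented.

```python
# Illustrative sketch: each information element is reduced to its
# (start, end) interval, and each TEMP operator becomes a predicate
# on two such intervals.

def anachronic(i, j):      # An: j starts strictly after i has ended
    return j[0] > i[1]

def sequential(i, j):      # Sq: j starts exactly when i ends
    return j[0] == i[1]

def concomitant(i, j):     # Ct: j starts after i begins and ends after i ends
    return j[0] > i[0] and j[1] > i[1]

def coincident(i, j):      # Cd: j starts after i begins and ends before i ends
    return j[0] > i[0] and j[1] < i[1]

def parallel(i, j):        # Pl: i and j begin and end at the same moments
    return i[0] == j[0] and i[1] == j[1]

speech = (0, 4)            # hypothetical start/end instants
text   = (0, 4)
print(parallel(speech, text))       # True
print(anachronic(speech, (6, 9)))   # True
```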
From the previous definitions, we notice that the duration of events is not null. The definition of the semantic operators relies on the semantics of the information elements manipulated by the interface and produced by the functional core. Since our approach is generic, we assume that an interpretation function int, interpreting the information elements over a semantic domain D equipped with the relevant operators, is available. It is a parameter of the generic approach we propose; the definition of this function, together with the semantic domain, is given when the output information of the interface is specified.
Consider then the interpretation function int, which associates to an information its semantic interpretation characterizing the multi-modality. Its signature is int : I → D. D is defined according to the studied system and the interface user or designer. This domain is not specified in the paper, as it depends on the functional core. However, the interpretation function is necessary to define the semantic operators of the SEM set.
• i_i Cc i_j for i_i concurrent to i_j: int(i_i) and int(i_j) are independent.
• i_i Cp i_j for i_i complementary to i_j: int(i_i) and int(i_j) are complementary without any redundancy.
• i_i Cr i_j for i_i complementary and redundant to i_j: int(i_i) and int(i_j) are complementary and a part of their interpretations is redundant.
• i_i Pr i_j for i_i partially redundant to i_j: int(i_i) is completely included in int(i_j), or int(i_j) is completely included in int(i_i).
• i_i Tr i_j for i_i totally redundant to i_j: int(i_i) and int(i_j) are equivalent.
Notice that independence, complementarity, inclusion, equivalence, etc. are operators defined on the domain D. Given an information i in I and an integer n, It(n, i) expresses that the information i occurs n times.
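Since D and int are left as parameters of the model, one possible instantiation, which is our assumption and not prescribed by the paper, takes D as a powerset of semantic atoms and int(i) as the set of atoms that i conveys. The SEM operators then become set predicates:

```python
# One possible instantiation (an assumption, not fixed by the model):
# interpretations are sets of semantic atoms.

def concurrent(a, b):              # Cc: independent interpretations
    return a.isdisjoint(b)

def complementary(a, b, whole):    # Cp: cover the whole with no overlap
    return a | b == whole and a.isdisjoint(b)

def compl_redundant(a, b, whole):  # Cr: cover the whole with a partial overlap
    return a | b == whole and not a.isdisjoint(b) and a != b

def partially_redundant(a, b):     # Pr: one interpretation included in the other
    return a <= b or b <= a

def totally_redundant(a, b):       # Tr: equivalent interpretations
    return a == b

count   = {'train_count'}
details = {'train_count', 'departure', 'arrival'}
print(partially_redundant(count, details))   # True
```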

Static Semantics
The first part of the static semantics has been given by the introduction of the int interpretation function and the definition of the semantic operators of the SEM set. The second part is related to the definition of the static properties concerning the duration of the information event occurrences. More precisely, it defines the duration of restitution of an information i. It is expressed, for each elementary information unit, by introducing time boundaries using the temporal relation T.
Consider Time = {t_j} the set of discrete time events, and the functions start : I → Time and end : I → Time giving, respectively, the beginning and ending instants of an information. T is the temporal relation that combines an information with its start and end event occurrences: T(i) = (start(i), end(i)).

Dynamic Semantics
The dynamic semantics of the fission model addresses the temporal interleaving of the i_i event occurrences. It defines the temporal operators by means of the temporal relation T.
∀ op_temp ∈ {An, Sq, Ct, Cd, Pl, Ch, In}, ∀ i_i, i_j ∈ I with T(i_i) = (start(i_i), end(i_i)) and T(i_j) = (start(i_j), end(i_j)):
∃ i_k ∈ I with i_k = i_i op_temp i_j and T(i_k) = (start(i_i), end(i_j)) iff op_temp = An and end(i_i) < start(i_j).
Finally, the binary temporal operator It is defined. Given an information i_i in I and a natural integer n greater than or equal to 1: It(n, i_i) = (. . . ((i_i Sq i_i) Sq i_i) . . . Sq i_i), n times.
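The unrolling of It into repeated Sq compositions can be sketched directly. In this hypothetical Python fragment, a composed expression is represented as a nested tuple ('Sq', left, right):

```python
# Sketch: It(n, i) unrolled into n - 1 applications of the Sq operator.

def it(n, i):
    assert n >= 1
    expr = i
    for _ in range(n - 1):
        expr = ('Sq', expr, i)
    return expr

print(it(3, 'i'))   # ('Sq', ('Sq', 'i', 'i'), 'i')
```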

Modelling of CASE space
The CASE formal model is obtained from the formal model previously described by defining the syntax of the four multi-modality types of the CASE space. We start by describing the temporal and semantic operators allowed for each of the two values of the two space axes (use of media and relations between information). Then, the syntax grammar of the four types resulting from crossing the two axes (concurrent, alternate, synergistic and exclusive) is provided.
1. Use of media. This axis refers to time scheduling. Thus, it concerns the use of temporal operators.
• Sequential: it constrains the temporal operators by excluding the use of parallelism; the allowed operators are anachronic, sequential, choice and iteration.
• Parallel: it constrains the temporal operators to the parallel ones: concomitant, coincident and parallel.
2. Link between information. This axis refers to the semantic relationship between information, and thereby it determines the semantic operators.
• Combined: it restricts the semantic operators to the use of the complementary, complementary and redundant, partially redundant and totally redundant operators.
• Independent: it restricts the semantic operators to the use of the concurrent operator.
Thus, the four types of multi-modality resulting from the previous constraints are defined as follows:
• Concurrent type: I ::= UIE | (op_temp, op_sem)(I, I) where op_temp ∈ {Ct, Cd, Pl} and op_sem ∈ {Cc}
• Alternate type: I ::= UIE | (op_temp, op_sem)(I, I) | It(n, I) where n ∈ N, op_temp ∈ {An, Sq, Ch} and op_sem ∈ {Cp, Cr, Pr, Tr}
• Synergistic type: I ::= UIE | (op_temp, op_sem)(I, I) where op_temp ∈ {Ct, Cd, Pl} and op_sem ∈ {Cp, Cr, Pr, Tr}
• Exclusive type: I ::= UIE | (op_temp, op_sem)(I, I) | It(n, I) where n ∈ N, op_temp ∈ {An, Sq, Ch} and op_sem ∈ {Cc}
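These syntactic constraints can be checked mechanically. The following sketch, with invented names and a simplified binary expression tree (iteration nodes are omitted for brevity), classifies a fission expression by the sets of operators it uses:

```python
# Illustrative sketch: classify a fission expression against the CASE axes.
# An expression is a leaf (e.g. 'uie1') or a tuple (op_temp, op_sem, left, right).

SEQUENTIAL  = {'An', 'Sq', 'Ch'}
PARALLEL    = {'Ct', 'Cd', 'Pl'}
COMBINED    = {'Cp', 'Cr', 'Pr', 'Tr'}
INDEPENDENT = {'Cc'}

def operators(expr):
    # Collect the temporal and semantic operators used in the expression.
    if not isinstance(expr, tuple):
        return set(), set()
    op_t, op_s, left, right = expr
    lt, ls = operators(left)
    rt, rs = operators(right)
    return {op_t} | lt | rt, {op_s} | ls | rs

def case_type(expr):
    temp, sem = operators(expr)
    if temp <= PARALLEL and sem <= INDEPENDENT:
        return 'concurrent'
    if temp <= SEQUENTIAL and sem <= COMBINED:
        return 'alternate'
    if temp <= PARALLEL and sem <= COMBINED:
        return 'synergistic'
    if temp <= SEQUENTIAL and sem <= INDEPENDENT:
        return 'exclusive'
    return 'mixed'

print(case_type(('Pl', 'Cr', 'uie1', 'uie2')))   # synergistic
```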

Allocation model
Once the information has gone through the fission process, the allocation process takes place. The proposed allocation model is based on the second phase of the WWHT refinement process. It includes the operators devoted to allocation and proposed in the WWHT model, complementary and redundant, together with two other operators we propose: choice and iteration.
The formal allocation model we introduce formalizes the allocation for each elementary information unit resulting from the fission process. Each unit corresponds to a (modality, media) pair, combined with the complementary, redundant, choice and iteration operators in order to apply the usability choices for the output multi-modal interface. These usability choices are provided by the interface designer. Again, we follow the same structure as for fission to define the allocation model, by describing its syntax and semantics.

Syntax
The allocation of output information is described according to two BNF rules: the first defines the multi-modal presentation pm for an information i, and the second defines the elementary multi-modal presentation pme related to an elementary information unit uie. Here, the refinement process introduces the concrete presentation of the information issued from fission.
Consider PM the set of multi-modal presentations and PME the set of elementary multi-modal presentations. We define the multi-modal presentation pm corresponding to the information i obtained after fission as the combination of the different elementary multi-modal presentations pme corresponding to the elementary information units uie, also issued from the fission process, which compose i. This correspondence is formalized as a morphism that maps information (i) and information units (uie), together with the semantic (op_sem) and temporal (op_temp) operators, to multi-modal presentations of the sets PM and PME and to the corresponding semantic (op′_sem) and temporal (op′_temp) operators acting on multi-modal presentations. The multi-modal presentation syntax PM is defined as follows:

PM ::= PME | (op′_temp, op′_sem)(PM, PM) | It′(n, PM) where n ∈ N

where:
• op′_temp is a binary temporal operator belonging to TEMP′ = {An′, Sq′, Ct′, Cd′, Pl′, Ch′, In′}.
• op′_sem is a binary semantic operator belonging to SEM′ = {Cc′, Cp′, Cr′, Pr′, Tr′}.
• It′ is a binary temporal operator expressing iteration.
The binary temporal and semantic operators are defined on event traces expressing the production of the multi-modal presentations pm_i ∈ PM. Their signatures are: op′_temp : PM × PM → PM, op′_sem : PM × PM → PM and It′ : N × PM → PM. Given pm_i, pm_j two multi-modal presentations in PM, the different operators are described as follows:
• pm_i An′ pm_j for pm_i anachronic to pm_j, i.e. pm_j occurs after an interval of time following the end of pm_i.
• pm_i Sq′ pm_j for pm_i sequential to pm_j, i.e. pm_j occurs immediately after pm_i ends.
• pm_i Ct′ pm_j for pm_i concomitant to pm_j, i.e. pm_j occurs after the beginning of pm_i and ends after pm_i ends.
• pm_i Cd′ pm_j for pm_i coincident with pm_j, i.e. pm_j occurs after the beginning of pm_i and ends before pm_i ends.
• pm_i Pl′ pm_j for pm_i parallel to pm_j, i.e. pm_i and pm_j begin and end at the same moment.
• pm_i Ch′ pm_j for choice between pm_i and pm_j, i.e. a deterministic choice between pm_i and pm_j.
• pm_i In′ pm_j for independent order between pm_i and pm_j, i.e. the temporal relation between pm_i and pm_j is unknown.
Analogously to the interpretation function int introduced for fission, the interpretation function int′ : PM → D′ is defined for multi-modal presentations. D′ is the interpretation domain of multi-modal presentations; it is also defined according to the interface context or designer. The semantic operators are:
• pm_i Cc′ pm_j for pm_i concurrent to pm_j, i.e. int′(pm_i) and int′(pm_j) are independent.
• pm_i Cp′ pm_j for pm_i complementary to pm_j, i.e. int′(pm_i) and int′(pm_j) are complementary without any redundancy.
• pm_i Cr′ pm_j for pm_i complementary and redundant to pm_j, i.e. int′(pm_i) and int′(pm_j) are complementary and a part of their interpretations is redundant.
• pm_i Pr′ pm_j for pm_i partially redundant to pm_j, i.e. int′(pm_i) is completely included in int′(pm_j), or int′(pm_j) is completely included in int′(pm_i).
• pm_i Tr′ pm_j for pm_i totally redundant to pm_j, i.e. int′(pm_i) and int′(pm_j) are equivalent.
Other specific sets defining modalities and medias are required to define the allocation.
Consider MOD the set of output modalities, MED the set of output medias, and the relation rest indicating whether a modality can be returned by a media or not: rest : MOD × MED → {true, false}. ∀ mod_i ∈ MOD, ∀ med_j ∈ MED, mod_i rest med_j expresses that mod_i can be returned by med_j.
Consider ITEM the set of (modality, media) pairs such that the modality can be returned by the media: ITEM = {(mod_i, med_j) | mod_i ∈ MOD ∧ med_j ∈ MED ∧ mod_i rest med_j}. Finally, consider the function affect : UIE → ITEM, which associates to each elementary information unit uie a pair (modality, media) of the ITEM set. Then, the syntax for PME is described by a BNF grammar built from affect and the compl, redon, choice and Iter operators introduced below; Iter expresses n iterations of an elementary multi-modal presentation pme.
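A minimal concrete reading of rest, ITEM and affect, with modalities, medias and information-unit names chosen by us purely for illustration, could look like this:

```python
# Illustrative instantiation of rest, ITEM and affect (invented values).

MOD = {'text', 'speech'}          # output modalities
MED = {'screen', 'speaker'}       # output medias

# rest relation: which modality can be returned by which media
REST = {('text', 'screen'), ('speech', 'speaker')}

def rest(mod, med):
    return (mod, med) in REST

# ITEM: every (modality, media) pair allowed by the rest relation
ITEM = {(mod, med) for mod in MOD for med in MED if rest(mod, med)}

# affect: each elementary information unit gets one (modality, media) pair
affect = {
    'uie_answer_text':  ('text', 'screen'),
    'uie_answer_voice': ('speech', 'speaker'),
}

# Every allocated pair must belong to ITEM.
assert all(item in ITEM for item in affect.values())
```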
Once the multi-modal presentations are defined (through the previous BNF rules), the function repres : PME → DR is introduced to associate to each elementary multi-modal presentation a representation in a representation domain DR. Here, DR is the representational interpretation domain associated with multi-modal presentations.
Hence, the compl, redon and choice operators express:
• compl: representational complementarity between two elementary multi-modal presentations to return an elementary information unit uie; compl : PME × PME → PME.
• redon: representational redundancy between two elementary multi-modal presentations to return an elementary information unit uie; redon : PME × PME → PME.
• choice: representational choice between two elementary multi-modal presentations to return an elementary information unit uie; choice : PME × PME → PME.

Static semantics
The static semantics of the allocation model expresses, on the one hand, the duration of multi-modal presentations pm and of elementary multi-modal presentations pme, by the definition of their temporal boundaries through the temporal relations T′ and T″ respectively, and on the other hand, a set of properties defined on the syntactic elements of the model. These expressions make it possible to describe the usability properties of the interface.
Consider the start′ and end′ functions defined on PM, where start′(pm_i) is the beginning instant of pm_i and end′(pm_i) is its ending instant, and the temporal relation T′ such that T′(pm_i) = (start′(pm_i), end′(pm_i)). The sequential composition then satisfies: ∀ pme_i, pme_j ∈ PME, ∃ pme_k ∈ PME such that pme_k = pme_i Sq′ pme_j and T′(pme_k) = (start′(pme_i), end′(pme_j)). The choice operator choice : PME × PME → PME is defined as follows: ∀ pme_i, pme_j ∈ PME, ∃ pme_k ∈ PME such that pme_k = choice(pme_i, pme_j) ⇒ pme_k = pme_i ∨ pme_k = pme_j.
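The boundary rule for sequential composition, and the behaviour of choice, can be sketched on (start, end) pairs. This is an illustrative fragment, not part of the model:

```python
# Sketch: temporal boundaries of composed presentations, following
# T'(pme_i Sq' pme_j) = (start'(pme_i), end'(pme_j)); choice keeps the
# boundaries of whichever branch is selected.

def t_seq(ti, tj):
    # ti, tj are (start, end) pairs; tj must start when ti ends
    assert ti[1] == tj[0]
    return (ti[0], tj[1])

def t_choice(ti, tj, pick_first):
    return ti if pick_first else tj

print(t_seq((0, 3), (3, 7)))   # (0, 7)
```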

Properties expression and verification
Once the output interface is formally modelled using the model described above, it is possible to verify some usability and robustness properties. To do so, these properties must be expressed in the same formal language used to model the interface, so that the two descriptions can be compared. Inclusion and/or simulation relations between these descriptions permit to assert (or not) the satisfaction of these properties.
We propose to verify the robustness property consisting in establishing the absence of collisions. A collision corresponds to the parallel use of a non-shareable media to present two modalities. Thus, the generation of a collision implies, first, a parallel combination of two elementary multi-modal presentations (pm_i and pm_j) using one of the three parallel temporal operators (concomitant, coincident and parallel) and, second, the use of the same non-shareable media in these two elementary multi-modal presentations.
Formally, a collision is produced if the multi-modal presentation is described by an expression combining pm_i and pm_j with op′_temp ∈ {Ct′, Cd′, Pl′}, where pm_i = affect(uie_i) · pm′_i and pm_j = affect(uie_j) · pm′_j, with affect(uie_i) = (mod_i, med), affect(uie_j) = (mod_j, med) and ¬share(med), where pm_i, pm_j, pm′_i, pm′_j ∈ PM; uie_i, uie_j ∈ UIE; mod_i, mod_j ∈ MOD; med ∈ MED. We are currently studying the possible concretization of this verification in verification tools such as the Promela/Spin model checker or the event B proof method.
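Under the assumption that each media carries a shareability flag (the share predicate of the model), the collision condition can be sketched as a simple check on two allocated items. All names and values below are illustrative:

```python
# Sketch of the collision check: two elementary presentations combined by a
# parallel temporal operator collide when their (modality, media) pairs use
# the same non-shareable media.

PARALLEL_OPS = {"Ct'", "Cd'", "Pl'"}
SHAREABLE = {'screen': True, 'speaker': False}   # assumed media properties

def collision(op_temp, item_i, item_j):
    (_, med_i), (_, med_j) = item_i, item_j
    return (op_temp in PARALLEL_OPS
            and med_i == med_j
            and not SHAREABLE[med_i])

# Two auditory modalities sent in parallel to the single, non-shareable speaker:
print(collision("Pl'", ('speech', 'speaker'), ('earcon', 'speaker')))   # True
print(collision("Pl'", ('text', 'screen'), ('picture', 'screen')))      # False
```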

CASE STUDY
To illustrate the formal model described above, we use an output interaction scenario based on the SmartKom system [7]. Smartkom is a system managing various applications related to communication services (phone, fax, email, etc.), access to computing devices, and positioning, navigation and road information services. The SmartKom system is a symmetrical multi-modal system. Output multi-modality is supported by the Smartakus conversational agent, which has a collection of postures and facial expressions.
The modelled output multi-modal interaction scenario consists of a dialogue between the user and Smartakus. In reaction to the user's request about the Heidelberg city map, the conversational agent Smartakus answers the request by means of speech synthesis, "Here you can see the map of the city", and displays the map of Heidelberg at the same time.

Modelling the output interaction
The interface which responds to the user request is modelled according to the scenario described above. We consider I as the set of output information, containing the information i = "Here you can see the map of the city" combined with the map of the city of Heidelberg, and UIE as the set of elementary information units, consisting of uie_1 = "Here you can see the map of the city" and uie_2 = the map of the city of Heidelberg. Finally, we also consider the sets MOD = {speech, facial expression, picture} of output modalities, MED = {screen, speaker} of medias used in Smartkom, and the corresponding ITEM set of allowed (modality, media) pairs.

CONCLUSION
The complexity of developing output multi-modal interfaces is accentuated by the introduction of synchronization primitives and semantic references between information, because of the multi-modal nature of the output interface. The introduction of refinement permitted the decomposition of the development of the output multi-modal interaction. Moreover, once the formal model is established, the approach supports the a priori expression of usability properties related to the described interface. In addition, the proposed approach allows the designer to progressively and iteratively define the output multi-modal interface by introducing different characteristics throughout the modelling and refinement process. The proposed approach offers a generic design model and a parameterized design model according to the CASE space.
This work can be pursued in different directions; we mention two of them. Regarding the interface specification, and in order to make the proposed model as generic as possible, it seems important to define other instantiations of the proposed model for other design spaces or other description approaches of multi-modal interactive systems. The second direction relates to the verification of the expressed properties. We expect to define generic properties, like the collision defined in this paper or shareability, qualifying the quality of the modelled output multi-modal interfaces. As a second step, we plan to propose safe transformations of the specified output multi-modal system, together with its properties, into a target formal technique that supports the verification of the desired properties, while preserving the semantics of the modelled system. We are currently working on the definition of such transformations for the Promela/Spin model checker and the event B proof-based formal method.
3rd International Workshop on Verification and Evaluation of Computer and Communication Systems