Paper ID: 2211.13804

On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction

João Ranhel, Cacilda Vilela de Lima

In this study, conversations between humans and avatars are linguistically, organizationally, and structurally analyzed, focusing on what is necessary for creating face-to-face multimodal interfaces for machines. We videorecorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all the occurrences of multimodal actions and events. Statistical inferences were applied to data, allowing us to comprehend not only how often multimodal actions occur but also how multimodal events are distributed between the speaker (emitter) and the listener (recipient). We also observed the distribution of multimodal occurrences for each modality. The data show evidence that double-loop feedback is established during a face-to-face conversation. This led us to propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the ones used for describing human-machine multimodal interactions. Face-to-face interfaces require an additional control layer to the multimodal fusion layer. This layer has to organize the flow of conversation, integrate the social context into the interaction, as well as make plans concerning 'what' and 'how' to progress on the interaction. This higher level is best understood if we incorporate insights from CA and ToM into the interface system.

Submitted: Nov 24, 2022