August 24, 2016

Intention (WP3)


One of the pillars of human-human communication is intention. The ability to predict and understand what others will do gives us clear advantages compared to other species [1]. The same holds for a robot involved in human-robot interaction: “reading the mind” of the human makes it possible for the robot to be proactive and to act on incomplete or incorrect information. It also makes it possible for a social robot to assist a human without explicit commands. Robots capable of this will give rise to “a fundamentally new kind of collaboration between humans and robots” [2]. Intention recognition for robots in general has been thoroughly addressed in our earlier research, both at the sensorimotor level [3] and at higher levels [4], with several interaction modalities. Speech is often a preferred mode of interaction for applications with robots in health and eldercare [5]. In an assumed scenario, an older adult wants to eat, gets up from the chair and moves towards the kitchen. The person may use an explicit imperative command such as “Make me a sandwich”, or an implicit declarative sentence such as “I am hungry”. Both sentences could represent the same underlying intention of asking for help with the sandwich, but they require very different inference mechanisms. The robot could also use video image analysis and draw the same conclusion by observing the human moving towards the kitchen. The work in WP3 will examine how such varying character and quality of interaction can be adjusted for and combined in order to infer human intention in advanced and novel ways. The ESRs will first conduct data collection and a user study at ARPAL, to investigate how older adults’ preference for implicit versus explicit interaction depends on factors such as the robot’s capabilities, both real and as perceived by the human. They will then develop techniques for recognition of intention from speech and vision.
ESR6 will incorporate inferred intention at a higher level in the dialogue manager, with the aim of dealing with certain age-related dialogue phenomena such as changes of topic and repetitions. A potential application for the research in WP3 is robots that support ADL, such as PAL-R’s TIAGo or FHG’s Care-O-bot. Equipping this kind of robot with a multimodal system for advanced intention recognition has the potential to improve interaction, user satisfaction, and task performance.

Tasks and Deliverables

T3.1 Development of algorithms for inference of intention from natural language analysis combined with planning (ESR4)
T3.2 Development of algorithms for inference of intention from visual sensors (ESR5)
T3.3 Combined methods for inference of intention from natural language and vision (ESR4/ESR5/ESR6)
T3.4 Dialogue management with varying Interaction Quality (ESR6)

D3.1 Algorithms for inference of intention and action from speech (ESR4) M23
D3.2 Algorithms for inference of intention from vision (ESR5) M23
D3.3 Report on user study of older adults’ violations of social dialogue norms (ESR6) M23
D3.4 Report on inference of intention from vision fused with speech (ESR5/ESR4) M32
D3.5 Algorithms for dialogue management (ESR6) M32
D3.6 Report on user-tests of algorithms for inference of intention from fused language/vision (ESR4/ESR5/ESR6) M40

Involved ESRs

ESR4 (UMU) Implicit intention recognition by integrating speech and task planning

ESR4 will develop algorithms to infer human intention from sentences of varying linguistic types, representing vastly different Interaction Quality. For imperative sentences, we have earlier[6] shown how semantic roles can be mapped to intention using machine learning techniques, and will extend this work by using dependency parsing for analysis of compound noun phrases. Grounding to physical objects will be done by extending earlier work[7] in which priming in semantic networks was shown to reduce perceptual ambiguity in intention recognition. For declarative sentences, a task planner will be incorporated. Planners have previously been used in robotics together with imperative sentences that define new planning goals[8],[9]. In our approach, the planner will be equipped with static goals (such as keeping the user satiated), and uttered declarative sentences will add facts to the current state. The planner will then generate a plan that can move the robot from the current state to the goal state. In this way, the planner is used as a reasoning tool for implicit intention recognition, rather than as a problem-solving task planner. For the example above, the declarative statement “I am hungry” would evoke a plan of making a sandwich, in order to maintain the goal of keeping the human satiated. ESR4 will use data collected by ESR5 and ESR6 at ARPAL. During a secondment to ESR5@FHG, visual-sensor-based intention recognition will be integrated. During an industrial secondment to PAL-R, results will be implemented and evaluated on the TIAGo robotic platforms (this is a tentative plan that may be adjusted to best fit actual research).
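
The idea of using a planner as a reasoning tool can be illustrated with a minimal sketch. It assumes a STRIPS-like state of symbolic facts and a breadth-first forward search; the fact names, the three actions, and the mapping from utterance to facts are hypothetical and stand in for the language analysis and planning components that ESR4 will actually develop.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: frozenset
    add: frozenset
    delete: frozenset


# Hypothetical toy domain (all fact and action names are assumptions).
ACTIONS = [
    Action("go_to_kitchen", frozenset(),
           frozenset({"robot_in_kitchen"}), frozenset()),
    Action("make_sandwich", frozenset({"robot_in_kitchen"}),
           frozenset({"sandwich_ready"}), frozenset()),
    Action("serve_sandwich", frozenset({"sandwich_ready", "user_hungry"}),
           frozenset({"user_satiated"}),
           frozenset({"user_hungry", "sandwich_ready"})),
]

# Static goal that the robot tries to maintain at all times.
STATIC_GOAL = frozenset({"user_satiated"})

# Hypothetical result of language analysis: facts added and removed
# by a declarative sentence.
UTTERANCE_EFFECTS = {
    "i am hungry": (frozenset({"user_hungry"}), frozenset({"user_satiated"})),
}


def apply_utterance(state, utterance):
    """Update the current state with the facts stated in a declarative sentence."""
    add, delete = UTTERANCE_EFFECTS.get(utterance.strip().lower(),
                                        (frozenset(), frozenset()))
    return (frozenset(state) - delete) | add


def plan(state, goal=STATIC_GOAL, actions=ACTIONS):
    """Breadth-first forward search; returns a list of action names or None."""
    start = frozenset(state)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, steps = queue.popleft()
        if goal <= current:
            return steps
        for action in actions:
            if action.preconditions <= current:
                successor = (current - action.delete) | action.add
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, steps + [action.name]))
    return None


def infer_intention(state, utterance):
    """The plan that restores the static goal is read as the implicit intention."""
    return plan(apply_utterance(state, utterance))
```

In this sketch, calling `infer_intention({"user_satiated"}, "I am hungry")` yields a plan that ends with serving a sandwich, which is interpreted as the implicit request behind the utterance. As long as the static goal already holds, the plan is empty and the robot has nothing to infer.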

ESR5 (FHG) Implicit intention recognition using visual sensors

ESR5 will develop techniques by which the robot infers the intention and activity of a human. The basis will be localisation of human body parts and objects using an RGB-D sensor. We will build on previous work in the ACCOMPANY project, in which human activity recognition was based on a graphical model that recognises activity sequences from RGB-D videos[10]. The model used latent variables to exploit sub-level semantics of human activities, and it outperformed the state-of-the-art approach[11]. The object recognition system combines different sensor modalities in a fast, scale-invariant local feature descriptor for recognising textured objects. An extension for detection uses a global, combined 2D/3D feature descriptor[12]. Additional work dealt with the development of fast global 3D shape descriptors[13], as well as preliminary work on human-comprehensible texture and 3D shape descriptors[14].

ESR5 will combine these two technologies in order to classify typical actions of a human and the objects involved in them. In the example of preparing a sandwich, the robot would identify the location of the user (kitchen) and the relevant objects (piece of bread, knife) he/she is interacting with, and classify the activity using spatio-temporal relations and previously learned activity classes (preparing food on the table). It could then offer suitable assistance (e.g. fetch specific ingredients from another room or list recipes on the screen) to support the user. The user will then not have to command the robot for detailed assistance. The system is in this way capable of recognising and acting upon varying Interaction Quality caused by a user losing interest in verbally commanding the robot. ESR5 will implement suitable software components to solve this task using the Care-O-bot 4 assistive robot. Initially, ESR5 will, together with ESR6, conduct a user study during a secondment to ESR14@UWE. During a secondment to ESR1@HAM, algorithms for emotion recognition will be integrated to support intention recognition. During an industrial secondment to ABB, results will be implemented on their YuMi robot (this is a tentative plan that may be adjusted to best fit actual research).
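
The step from recognised location and objects to an offer of assistance can be sketched as follows. This is a deliberately simplified stand-in: it replaces the learned spatio-temporal model with a plain overlap score against hand-written activity classes, and all class names, object labels, and assistance texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityClass:
    name: str
    location: str
    objects: frozenset
    assistance: str


# Hypothetical activity classes; a real system would learn these from data.
ACTIVITY_CLASSES = [
    ActivityClass("preparing food", "kitchen",
                  frozenset({"bread", "knife"}),
                  "fetch ingredients or list recipes"),
    ActivityClass("reading", "living room",
                  frozenset({"book"}),
                  "adjust the lighting"),
]


def classify_activity(location, observed_objects):
    """Return the best-matching activity class, or None if nothing matches.

    The score is the fraction of a class's typical objects that were observed
    at the class's typical location (an assumed, simplistic measure).
    """
    observed = set(observed_objects)
    best, best_score = None, 0.0
    for activity in ACTIVITY_CLASSES:
        if activity.location != location:
            continue
        score = len(activity.objects & observed) / len(activity.objects)
        if score > best_score:
            best, best_score = activity, score
    return best
```

For a user in the kitchen handling bread and a knife, `classify_activity("kitchen", ["bread", "knife"])` returns the "preparing food" class, whose `assistance` field the robot could then offer without being asked.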

ESR6 (UMU) Intention driven dialogue management

Dialogue management is responsible for deciding on the next appropriate robot step (including verbal utterances) based on user intention, current state, dialogue history, context, and purpose of the dialogue. Dialogue management approaches are traditionally divided into knowledge-based approaches (i.e. hand-crafted finite-state and planning approaches) and data-driven approaches[15]. Recent hybrid approaches to dialogue management combine the benefits of both traditional approaches and avoid their disadvantages[16]. Phenomena such as sudden changes of topic, need for clarification, ambiguity, turn taking, misunderstandings, and non-understandings influence the character and quality of human-robot interaction based on dialogue. A social robot must be able to identify and adapt to such varying Interaction Quality in an effective and efficient way. In the context of eldercare, we encounter specific problems. For instance, accepted norms for dialogues, such as Grice’s conversational maxims, are not always followed due to various age-related inabilities[17] (e.g. unmotivated topic changes).

ESR6 will investigate to what extent dialogue phenomena are caused by, or correlated with, violations of Grice’s maxims. The end-goal is to develop hybrid dialogue management approaches that 1) detect breaches of accepted norms for dialogue (e.g. Grice’s maxims) and 2) adapt dialogue management to the varying Interaction Quality. Implicit intention (recognised from language and vision) and emotion will be used as input to the dialogue manager to support decision making. Our hybrid dialogue management approach will use novel finite-state based methods similar to our previously developed automata that include memory[18] and novel graph-transformation approaches for dialogue management defined as logical interfaces, thus extending our earlier work[19],[20].
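
As a rough illustration of a finite-state dialogue manager with memory, the sketch below treats an unexpected change of topic as a (crudely approximated) breach of the maxim of relevance, stores the interrupted topic on a stack, and resumes it later. The keyword-based topic detector, the topic names, and the returned action labels are all assumptions made for this example and are not the automata or graph-transformation methods referred to above.

```python
# Hypothetical keyword lists standing in for real language understanding.
TOPIC_KEYWORDS = {
    "food": {"hungry", "sandwich", "eat"},
    "medication": {"pill", "pills", "medicine"},
}


def detect_topic(utterance):
    """Return the first topic whose keywords occur in the utterance, else None."""
    cleaned = utterance.lower()
    for mark in "?.,!":
        cleaned = cleaned.replace(mark, " ")
    words = set(cleaned.split())
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            return topic
    return None


class DialogueManager:
    def __init__(self):
        self.current_topic = None
        self.suspended = []  # memory: stack of interrupted topics

    def step(self, utterance):
        """Choose the next dialogue action for a user utterance."""
        topic = detect_topic(utterance)
        if topic is None:
            return "clarify"  # non-understanding: ask the user to rephrase
        if self.current_topic is None:
            self.current_topic = topic
            return f"start:{topic}"
        if topic != self.current_topic:
            # Sudden topic change: remember the old topic instead of dropping it.
            self.suspended.append(self.current_topic)
            self.current_topic = topic
            return f"switch:{topic}"
        return f"continue:{topic}"

    def close_topic(self):
        """Finish the current topic and resume the most recently suspended one."""
        self.current_topic = self.suspended.pop() if self.suspended else None
        if self.current_topic is None:
            return "idle"
        return f"resume:{self.current_topic}"
```

With this sketch, a user who says “I am hungry” and then abruptly asks “Did I take my pills?” triggers a switch to the medication topic; once that topic is closed, the manager returns to the suspended food topic rather than losing it.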

ESR6 will first, together with ESR5, conduct a user study during a secondment to ESR14@UWE, in order to investigate the occurrence of dialogue phenomena. Dialogues between staff and older adults will be recorded and analysed. The use of implicit intention as input to the dialogue manager will be investigated in close collaboration with ESR4. Intention from vision will be considered during a secondment to ESR5@FHG. The ability to deal with varying Interaction Quality will be evaluated in a series of dialogue simulations. A prototype dialogue management system will be developed and, during a secondment to PAL-R, implemented on the TIAGo robot (this is a tentative plan that may be adjusted to best fit actual research).


[1] Tomasello, M., Carpenter, M., Call, J., Behne, T., & Moll, H. (2005). Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28 (05).

[2] A. G. Hofmann and B. C. Williams (2007) Intent Recognition for Human-Robot Interaction. In Interaction Challenges for Intelligent Assistants, Papers from the 2007 AAAI Spring Symposium, Stanford University, CA, USA, pages 60-61. AAAI Press.

[3] T. Hellström, B. Fonooni. Applying a priming mechanism for intention recognition in shared control, IEEE International Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), Orlando, FL, USA, March 2015.

[4] Jevtić, A., Doisy, G., Parmet, Y. and Edan, Y. Comparison of Interaction Modalities for Mobile Indoor Robot Guidance: Direct Physical Interaction, Person Following, and Pointing Control. In IEEE Transactions on Human-Machine Systems, 45(6):653-663, Dec. 2015.

[5] Teixeira, A., A critical analysis of speech-based interaction in healthcare robots: making a case for increased use of speech in medical and assistive robots, in Speech and Automata in Health Care, Amy Neustein (ed.), 2014.

[6] A. Sutherland, S. Bensch, and T. Hellström. Inferring robot actions from verbal commands using shallow semantic parsing. In Hamid Arabnia, editor, Proceedings of the 17th International Conference on Artificial Intelligence ICAI’15, July 2015.

[7] B. Fonooni, T. Hellström, and L. Janlert. Priming as a means to reduce ambiguity in learning from demonstration. Int. J. of Social Robotics, 2015.

[8] Keller, T., Eyerich, P., and Nebel, B., Task Planning for an Autonomous Service Robot. Proc. of the 33rd ann. German conf. on Advances in Art. Int., 2010.

[9] T.M. Howard, S. Tellex, and N. Roy, A Natural Language Planner Interface for Mobile Manipulators, in Proceedings of the 2014 International Conference on Robotics and Automation, 2014.

[10] N. Hu, R. Bormann, T. Zwölfer, and B. Kröse. Multi-user identification and efficient user approaching by fusing robot and ambient sensors. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 5299–5306, May 2014.

[11] H. S. Koppula, R. Gupta, and A. Saxena. Learning Human Activities and Object Affordances from RGB-D Videos. International Journal of Robotics Research, 32(8):951–970, 2013.

[12] J. Fischer, R. Bormann, G. Arbeiter, and A. Verl. A feature descriptor for textureless object representation using 2d and 3d cues from rgb-d data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2104–2109, 2013.

[13] R. Bormann, J. Fischer, G. Arbeiter, and A. Verl. Efficient object categorization with the surface-approximation polynomials descriptor. In C. Stachniss, K. Schill, and D. Uttal, editors, Spatial Cognition VIII, volume 7463 of Lecture Notes in Computer Science, pages 34–53, Springer, 2012.

[14] R. Bormann, C. Eckard, J. Hampp, and M. Hägele. Fast and Accurate Normal Estimation by Efficient 3d Edge Detection. submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.

[15] Lee, C., Jung, S., Kim, K., Lee, D., and Lee, G.G. (2010). Recent Approaches to Dialog Management for Spoken Dialog Systems. Journal of Computing Science and Engineering, 2010. 4(1): 1-22.

[16] Lison, P. (2015) A hybrid approach to dialogue management based on probabilistic rules. Computer Speech & Language 34(1): 232-255.

[17] Stroinska, M. (2015). Keep the conversation going! Gricean maxims in the old age. Proceedings of 3rd International Conference on Alzheimer’s Disease & Dementia, 2015 Toronto, Canada.

[18] S. Bensch, M. Holzer, M. Kutrib, A. Malcher. (2012) Input-Driven Stack Automata, Theoretical Computer Science 7th Int. Conf., 2012, Proc., 28-42, 2012.

[19] Bensch, S. and Hellström, T. (2014). Towards proactive robot behavior based on incremental language analysis. Proc. 2014 Workshop on Multimodal, Multi-Party, Real-World Human-Robot Interaction, Istanbul, Turkey, 2014.

[20] Bensch, S., Drewes, F. and Hellström T. (2015) Grammatical Inference of Graph Transformation Rules. Proc. 7th Workshop on Non-Classical Models of Automata and Applications (NCMA 2015).