- WP1 - Project management, dissemination and exploitation
Project Management is addressed by the first task of WP1 (Project Management) which spans the whole duration of the project. Project management aims at: (a) coordinating the joint efforts of the consortium during the execution of the project, (b) ensuring the smooth progress of the work plan and the fulfilment of the consortium’s contractual obligations, (c) providing the necessary liaisons between the consortium, the EU and other consortia, and (d) ensuring the quality of the work and the deliverables being produced.
Responsibility for all management-related activities will be borne by the INDIGO steering committee. The INDIGO steering committee will consist of the INDIGO project manager (chair) and nine members (the project heads of each partner organization).
Dissemination and exploitation
This task aims at the most effective promotion of the project’s research and products by the consortium participants and, potentially, by other interested organizations.
All consortium partners are committed to disseminating the results of the project. More specifically:
- A fully functional project web-site will be available by project month 2. The web-site will offer detailed information about the project and its goals, as well as continuously updated access to documentation, scientific reports and project-related papers. In addition to serving as a dissemination vehicle for the project results and achievements, the project web-site will also provide a means of communication between project partners, offering access to working documents, management resources and restricted information within the project.
- The technological partners in this project will promote INDIGO developments by publishing scientific results to technical journals, conferences, workshops, etc. and to events organized by EC under relevant research networks such as euCognition, Euron and the European Robotics Platform.
- All consortium partners are likely to benefit from their participation in the project. More specifically, depending on the results of the project: (a) the participating end-user (FHW) will integrate the developed technology into its site, (b) the commercial partners will exploit the individually developed modules as well as the overall application (via appropriate commercial agreements), identifying, in the meantime, alternative application areas where the system could be used; the developed technology will contribute to focusing future research and products on the market of service robotics, (c) all partners will utilize the technology and the experience gained in other research and development projects.
- WP2 - System specifications, integration, evaluation, demonstration
WP2 spans the whole duration of the project and serves as a “wrapper” workpackage that includes the activities and tasks that accompany the development phase of the project. More specifically, WP2 consists of 5 tasks: Operational requirements (Task 2.1), System technical specifications (Task 2.2), Integration (Task 2.3), Laboratory trials and validation (Task 2.4), and Demonstration and fielded evaluation (Task 2.5).
Tasks 2.1 and 2.2 comprise the specifications phase of INDIGO’s spiral development model, each of them conducted in two cycles. Task 2.3 is the integration task (second phase), and Tasks 2.4 and 2.5 comprise the trial/validation phase of the first and the second development cycle, respectively. More specifically:
Task 2.1 Operational requirements
The task is dedicated to the specification, analysis and documentation of the operational requirements of the INDIGO system according to the end user's requirements. The end-user partner (FHW) will play a crucial role during this task and, therefore, will lead the work in this task.
Task 2.2 System technical specifications
The task will take into account the results of the previous task to specify the system's technical specifications. Given the variety of modules collaborating to achieve the desired INDIGO functionality, it is important to come up with a system design early in the project’s lifecycle. The resulting specifications will include both hardware and software parameters, as well as integration plans regarding the overall architecture of the system.
The task will also address issues related to the validation phase later in this workpackage. The intention of the consortium is to develop validation scenarios prior to any module development.
Task 2.3 Integration
During this task, the integration of the various modules that comprise the INDIGO system will take place according to the integration plan specified earlier in this workpackage.
The integration task will start early, as soon as the first results of the research workpackages are available, so that a first working prototype becomes available (even with limited functionality) for laboratory evaluation early in the project lifecycle. The integration task will proceed with what is generally known as continuous or incremental system integration. That is, the integration will continue until almost the end of the project, giving feedback to the research workpackages and integrating new modules and functionalities as they become available.
Task 2.4 Laboratory trials and validation
Extensive verification and validation tests, carried out in a laboratory environment, will provide the necessary feedback about efficiency and reliability aspects of the integrated system and will bring potential problems into focus. This feedback will be used to perform system tuning.
Task 2.5 Demonstration and fielded evaluation
To demonstrate the project results, the INDIGO robot will be deployed in the premises of the end-user partner FHW, where the robot will operate fully autonomously for long periods of time, providing information to guests, explaining exhibits, showing them around, giving tours, etc.
The specific deployment site is accessible to a significant number of visitors, ranging from children to adults. It includes the Virtual Reality theatre “THOLOS”, its foyer, which is used for exhibitions, and an adjacent theatre. The building hosts cultural and scientific exhibitions, both physical and electronic. The robot will operate across the entire building, engage in dialogues and provide all kinds of information to the visitors (e.g., information on exhibits, on the architectural infrastructure, on items on sale, etc.). Moreover, FHW hosts, over the year, a significant number of exhibitions covering the sectors of culture and science. INDIGO will be of great importance in providing a completely innovative way of bringing information to the visitors of these exhibitions.
The consortium will gain feedback from real visitors in real conditions, which will be used for the final fine-tuning of system parameters. Moreover, this feedback will be utilized to explicitly refine/revise the technical WPs, in cases where the findings suggest that further work is needed in certain WPs (or tasks). For this purpose, a considerable overlap is planned between the durations of this task, the involved research WPs and the integration task. The performance of the system will be evaluated under real conditions during the demonstration, according to the validation scenarios defined in the validation plan of Task 2.2.
- WP3 - Mobile platform with facial expressions hardware
This workpackage addresses the deployment of INDIGO’s robotic platform. The main effort will be to combine the hardware and navigation software modules into a functional, yet simple and robust, mobile robotic system and to carry out all additional development required to support the functionality envisioned in INDIGO.
WP3 consists of 3 tasks:
Task 3.1 Robotic platform hardware
This task addresses the deployment of INDIGO’s robotic platform which includes the mobile base, the sensory setup, the communication links, as well as the on-board and off-board computing resources. In this context, questions of reliability, safety and efficiency are a major concern. The hardware of the system will be developed by NEOGPS and will be custom-made in order to meet the actual needs of the application and more specifically the operational, technical and aesthetic specifications set by Tasks 2.1 and 2.2.
This task will first address the mechanical and electrical/electronic modules of the mobile platform. We consider using unicycle-type mobile platforms, with two independently-driven wheels, able to navigate in an indoors environment. The platforms will carry the necessary battery pack for autonomous operation, computing hardware and sensory apparatus. In addition, they will carry the necessary I/O hardware, such as an LCD touch screen, microphones, speakers, etc.
Vision, sonar and laser sensors will be used. The technical partners of the consortium have employed these sensory modalities extensively in their work on robot navigation. Cameras will be mounted on a pan-and-tilt head, in order to be able to move independently of the robotic platform (e.g., to gaze in a direction other than the one in which the robot is moving). Depending on the final design of the robot, if a separate pan-tilt unit is considered inappropriate for aesthetic reasons, cameras mounted within the robotic head (see next section) will be used instead.
Reactive navigation, initial processing of sensory information, the control of the mobile platform (commands to the motors, interrupts generated by sensory events, etc.), user interaction, as well as the communication with the off-board workstation, will be performed onboard by a powerful PC. An additional off-board workstation will perform the heavy-duty processing of sensory data (e.g. that of visual data). The communication links between the on-board computer of the robotic platform, the off-board workstation and the Internet will also be established during this task and will be based on wireless Ethernet and ADSL technologies, respectively.
Task 3.2 Facial expressions hardware
During this task, HANSON will design and implement INDIGO’s robotic face, according to the specifications that will be defined by Tasks 2.1 and 2.2. HANSON will utilize its expertise in order to make INDIGO’s robotic head unique in mimicking human expressions.
INDIGO’s head will possess eyes that actuate in a natural manner, lips that move in sync with speech and a neck also capable of moving naturally.
The robotic head will also include two cameras, installed behind the robotic eyes of the head, and will be fully integrable with the robotic base and the other modules of the system, meeting all technical requirements set by Task 2.2 (e.g., power consumption, communication protocol, etc.).
The control software required for the operation of INDIGO’s head will be developed in close cooperation with the other INDIGO partners so that it will take into account information produced by other modules of the system and especially the FAP (Facial Animation Parameters) information produced by WP7. It is expected that through this cooperation and the consequent advances in technology, protocols and standards, INDIGO robots will step much further than current implementations and become unique in mimicking human expressions.
Task 3.3 Autonomous navigation
This task will develop the software modules required for effective, safe and reliable navigation of INDIGO's robots in the application site, achieving robot behaviours that resemble those of humans. For this purpose, we will draw upon the expertise of the technical partners in sensor-based navigation, as well as on results and products they developed during previous EU-funded projects, to develop the navigation modules needed to allow the system to arrive at its target positions timely and safely, to avoid obstacles, to move around objects of interest, to localize itself in the environment and to update the model of the environment if changes are detected.
Besides customizing and extending existing techniques to meet the specific requirements of the INDIGO project, innovative research will also be carried out in order to achieve (re)action behaviours that resemble those of humans (moving at human speed when guiding people, addressing a person when reacting to a question, collectively addressing a group when offering a guided tour, etc.).
The following reactive navigation modules implement the above competences:
Collision Avoidance: The collision avoidance module will flexibly react to unforeseen changes of the environment, such as people who move in front of the robot, based on input from the sonar and laser sensors. Members of the consortium have developed techniques allowing mobile robots to reliably navigate in populated environments, which employ a model of the robot system and trade off the progress towards a goal location against the free space around the robot. These techniques will be augmented by integrating and forecasting the motions of people in the environment (as detected by the corresponding component of Task 6.1).
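The trade-off between progress towards the goal and free space around the robot can be illustrated with a minimal velocity-scoring sketch in the spirit of dynamic-window-style collision avoidance. The function names, weights and candidate-velocity grid below are illustrative assumptions, not the actual INDIGO modules:

```python
import math

def score_velocity(v, w, robot_pose, goal, obstacles,
                   dt=0.5, alpha=0.8, beta=0.2, gamma=0.1):
    """Score one (linear, angular) velocity pair by simulating a short
    motion and trading off progress toward the goal against clearance."""
    x, y, theta = robot_pose
    # Forward-simulate the motion for one time step.
    theta_new = theta + w * dt
    x_new = x + v * math.cos(theta_new) * dt
    y_new = y + v * math.sin(theta_new) * dt
    # Progress: negative distance to the goal (closer is better).
    progress = -math.hypot(goal[0] - x_new, goal[1] - y_new)
    # Clearance: distance to the nearest obstacle at the predicted pose.
    clearance = min(math.hypot(ox - x_new, oy - y_new)
                    for ox, oy in obstacles)
    return alpha * progress + beta * clearance + gamma * v

def pick_velocity(robot_pose, goal, obstacles, v_max=0.8, w_max=1.0):
    """Exhaustively evaluate a small grid of admissible velocities."""
    candidates = [(v_max * i / 4, w_max * (j - 4) / 4)
                  for i in range(5) for j in range(9)]
    return max(candidates,
               key=lambda c: score_velocity(c[0], c[1],
                                            robot_pose, goal, obstacles))
```

In a real controller the candidate set would be restricted to velocities reachable within the robot's acceleration limits, and trajectories leading to collision would be excluded outright rather than merely penalised.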
Map Updating Module: The map building system of the robotic avatar is used for updating the model of the environment and broadcasting the changes to all modules of the navigation system. It includes special sensor interpretation components which transform the sensory input into the environment representation used by the model of the environment. Furthermore, it suggests necessary changes of the model stored in the information base.
Localization Module: Since a robot requires knowledge about its current position in order to efficiently move to target positions, localization is a fundamental task. The goal of this component is the adaptation of the localization systems developed by partners of the consortium to the specific requirements of the environment. This may include the integration of new sensory modalities (most notably vision), if this becomes necessary, since the size and the construction (glass walls, steps, etc.) of the INDIGO application site will present a significant challenge to current localization schemes based on laser or sonar sensors, which will not be able to uniquely determine the location of the robot in such an environment. However, if vision sensors are used for localization through e.g. artificial landmarks, issues of robustness of the localization scheme in the presence of varying lighting conditions, occlusions, etc. will have to be carefully addressed.
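As an illustration of the kind of probabilistic localization scheme referred to above, the following is a minimal Monte Carlo localization (particle filter) cycle on a one-dimensional corridor with range measurements to known landmarks. The noise parameters and the sensor model are simplifying assumptions for the sketch:

```python
import math
import random

def mcl_step(particles, motion, observation, landmark_positions,
             motion_noise=0.1, sensor_noise=0.5):
    """One predict/update/resample cycle of Monte Carlo localization
    on a 1-D corridor with range observations to the nearest landmark."""
    # Predict: apply the odometry motion with additive noise.
    moved = [p + motion + random.gauss(0, motion_noise) for p in particles]

    # Update: weight each particle by how well the expected range to the
    # nearest landmark matches the measured one (unnormalised Gaussian).
    def weight(p):
        expected = min(abs(l - p) for l in landmark_positions)
        err = expected - observation
        return math.exp(-0.5 * (err / sensor_noise) ** 2)

    weights = [weight(p) for p in moved]
    total = sum(weights) or 1.0
    # Resample proportionally to the weights.
    return random.choices(moved, weights=[w / total for w in weights],
                          k=len(particles))
```

Repeating this cycle while the robot moves makes the particle cloud contract around the true position; in two dimensions the same structure applies, with poses (x, y, theta) and a richer sensor model.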
Trajectory Tracking Module: This module is used to make the robot follow a trajectory, which can be either known a-priori or not. The trajectory to be followed is defined based on range data from the cameras, the sonars and the laser range finder.
In addition to these reactive navigation modules, the INDIGO system should be able to plan its trajectories from its current location to arbitrary target positions. Here a standard technique, developed by members of the consortium, will be employed, which has the advantage that paths can be computed very efficiently, in an on-line fashion and can quickly be adapted to situations in which the collision avoidance module chooses a detour due to the presence of unforeseen obstacles. The planning module of the INDIGO system will additionally be able to integrate detected dynamical changes into the representation of the environment. This will allow the system to quickly react to situations in which entire passages are blocked and from which it would not be able to escape otherwise.
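The efficient, on-line planning idea described above can be sketched as a wavefront (dynamic-programming) planner on an occupancy grid: costs-to-goal are propagated outward from the goal cell, a path is extracted by steepest descent, and a blocked passage simply makes the start unreachable, which would trigger replanning on the updated map. The grid representation and function names are illustrative:

```python
from collections import deque

def wavefront(grid, goal):
    """Compute the cost-to-goal for every free cell of an occupancy grid
    by breadth-first expansion from the goal (a wavefront planner).
    grid[r][c] is True where a cell is blocked."""
    rows, cols = len(grid), len(grid[0])
    cost = {goal: 0}
    frontier = deque([goal])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in cost):
                cost[(nr, nc)] = cost[(r, c)] + 1
                frontier.append((nr, nc))
    return cost

def extract_path(cost, start):
    """Follow the steepest descent of the cost field from start to goal."""
    if start not in cost:
        return None  # start unreachable, e.g. an entire passage is blocked
    path = [start]
    while cost[path[-1]] != 0:
        r, c = path[-1]
        path.append(min(((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)),
                        key=lambda n: cost.get(n, float("inf"))))
    return path
```

Because the cost field covers every free cell, a detour forced by the collision avoidance module needs no replanning: descent from the robot's new cell still reaches the goal.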
- WP4 - Human-robot interaction management
WP4 will be broken down into two tasks. More specifically:
Task 4.1 Multimodal human-robot dialogue management
The multi-modal dialogue controller will be the heart of INDIGO's human-robot interaction technology. It will invoke the speech recognizer and the natural language interpretation module to recognize speech and extract semantic representations from the user's spoken utterances, and it will anchor these representations in the context of the physical surroundings and the discourse history, taking also into account non-verbal information contributed by the gesture and facial feature recognizers, as well as the modules that track the positions of the users. After considering both the visitors' requests and the knowledge and educational goals of the robotic guide, the dialogue controller may instruct the language generation, speech synthesis, and facial/gesture expression generation modules to present appropriate information about one or more nearby exhibits (by both verbal and non-verbal means, and with appropriate emotional signals), it may instruct those modules to prompt the user for additional information, and/or it may instruct the robotic platform to adopt a particular "posture" (e.g., while talking) or to move on to another location, possibly inviting the user to follow it.
The main underpinning for the work on multimodal human-robot dialogue will be Edinburgh University's DIPPER architecture, which was developed for the IST MagiCster project. DIPPER is not a dialogue system itself, but it supports the construction of dialogue systems by offering interfaces to components such as speech recognizers, speech synthesizers, dialogue managers, and tools for semantic interpretation. These components can run on a variety of operating systems on any number of linked computers, providing a high degree of flexibility and efficiency.
The theoretical background to the kind of dialogue management supported by DIPPER is the TRINDI project's concept of Information State and Update. The central idea is that a dialogue participant's knowledge at any point in an interaction can be represented as an Information State. An utterance updates a state to produce a new one, and the Information State also forms some of the background to language generation, which in INDIGO will eventually include markup for non-verbal communication (gestures, postures, facial expressions).
The TRINDI approach of Information State and Update has proved very influential in the field of dialogue modelling, as it avoids a number of problems with earlier models. In particular, it encourages a very clear separation of various aspects of dialogue modelling, such as the user's knowledge, the representation of dialogue acts and their effects, utterance planning, interpretation, and generation. In many older systems, various subsets of these tended to be represented in a single program, which caused problems from the point of view of both the reusability of the research and the clarity of the implementations.
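A toy sketch may help fix the Information State and Update idea: the information state is a structure recording the dialogue history and what the system has already told the user, and each update rule pairs a precondition on the state with an effect that yields the new state and the system's next dialogue move. The rule set and move representation below are illustrative, not DIPPER's or TRINDI's actual formalism:

```python
# Each update rule checks a precondition on the information state and,
# if it fires, returns the updated state plus the next dialogue move.

def apply_rules(state, user_move):
    """Apply the first matching update rule to an information state."""
    for precondition, effect in UPDATE_RULES:
        if precondition(state, user_move):
            return effect(state, user_move)
    return state, ("clarify", None)       # fallback: ask for clarification

def _is_greeting(state, move):
    return move[0] == "greet"

def _greet_back(state, move):
    return dict(state, history=state["history"] + [move]), ("greet", None)

def _is_question(state, move):
    return move[0] == "ask"

def _answer(state, move):
    topic = move[1]
    new_state = dict(state,
                     history=state["history"] + [move],
                     told=state["told"] | {topic})
    fact = state["knowledge"].get(topic, "I do not know about that.")
    return new_state, ("inform", fact)

UPDATE_RULES = [(_is_greeting, _greet_back), (_is_question, _answer)]

# A fresh information state for one visitor.
initial_state = {"history": [], "told": set(),
                 "knowledge": {"amphora": "This amphora dates from 530 BC."}}
```

The clear separation between the state, the rules and the surrounding components is exactly what makes this style of dialogue modelling reusable across systems.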
The theoretical challenges in INDIGO in the area of dialogue management are significant. Apart from issues related to the anchoring of spatial expressions and the merging of information coming from different input modalities, novel theoretical work is needed to support the choice of modality when generating communicative acts. (When should the robot use speech, and/or gestures, postures, facial expressions?) Furthermore, there are dialogue management issues related to INDIGO's attempt to couple natural language generation with richer user input. A text generator typically identifies in its knowledge pool (e.g., a museum's ontology/database) a large number of potential facts to be included in the output, almost always too much material to be used in a spoken cooperative dialogue. The simple solution of outputting fewer facts is unsatisfactory, since it means that some of the communicative and educational goals of the generation system would be ignored. A solution we plan to explore is to adopt the policy of saying less on each turn, but, at the same time, to attempt to guide the user towards subsequent dialogue acts, in response to which the system will be able to express the information it needs in order to achieve its current communicative and educational goals. We therefore plan to experiment with different ways of including 'hooks' in the output, which are designed to lead users to particular follow-up dialogue acts (e.g., enquiring about particular exhibits the robot wants to talk about). These hooks may be in the linguistic content, perhaps in the form of a comparison with an object which is not known to the user. They may also be speech-specific, including the use of particular intonation patterns, and they may also be entirely visual, perhaps in the form of gestures foregrounding particular objects. INDIGO's dialogue management architecture will allow us to experiment with all of these modalities.
Task 4.2 Robotic personality and user modelling
Humans are really good at communicating feelings, thoughts, and goals, but we all differ in our communication styles, and in the way we express ourselves. These differences of expression have complex roots: our culture may tell us how to behave in a given situation; our personality, emotions, gender, and age will further shape our behaviour; we also adapt our behaviour to our social role and to the context. All these factors intervene in shaping our communicative style, our face and body postures, gestures and movement. In addition, we are all able to recognise friends from their way of talking, their movements, their gesticulations.
We aim in INDIGO to define personalised robots, that is, robots that may be distinguished from each other not only by their physical aspect but also by their knowledge, habits and manner of looking, speaking and moving. Individuality may be seen in all channels of communication, from the choice of words, to the voicing and paralinguistic features, to the choice of facial expression, head and body movement. We will start by defining the dimensions of the robots' personalities.
Having established the dimensions of personality, we will define:
- the knowledge that is available to each robot (its expertise), and their educational goals;
- the repertoire of nonverbal behaviours (facial expressions, intonational patterns, gestures, body postures) each robot has at its disposal, and the tendency it has to use each;
- the manner in which the verbal and non-verbal behaviours are performed (abrupt, soft, clumsy);
- the emotional status of each robot, and how easily it is affected by external factors (e.g., visitors expressing interest or dislike).
- a formalism to represent each robot's personality along these dimensions;
- mechanisms that will allow the dialogue controller of each robot to take into account the personality of the robot it controls (e.g., generating gestures more often, or less frequently; asking the generator to produce text with appropriate emotional markup when the robot "feels" enthusiastic, etc.);
- a model to capture gradual changes in the robot's knowledge and emotional state (e.g., keeping track of what it had told a visitor, becoming more enthusiastic when visitors show interest, etc.).
Personalisation in INDIGO does not only involve the robot’s behaviour, but also adaptation to the user’s knowledge, habits and interaction history. INDIGO will incorporate a user authorization process, during which new visitors will identify themselves. Visitors will then be prompted to make a number of choices concerning personalization issues, which constitute the user profile. More specifically, users will be able to choose one of the predefined user stereotypes (e.g., kid, grown-up, expert), which will be set up by the application administrator. By selecting a stereotype, users will have the application tailored to their particular needs. Every stereotype will have a set of default values associated with its features (e.g., interest in particular types of information), which will aim to approximate the preferences of most users in the stereotype. INDIGO will keep track of user choices (e.g., expressions of interest in particular exhibits), detect to what degree they deviate from the stereotype definitions, and update the user model accordingly.
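The formalism and gradual-change model listed above could, for instance, take the following shape: fixed personality traits, a mutable enthusiasm level that drifts towards the observed visitor interest, and behaviour tendencies (such as gesture frequency) derived from both. All dimensions, rates and formulas here are illustrative assumptions, not the design to be fixed in this task:

```python
from dataclasses import dataclass, field

@dataclass
class RobotPersonality:
    """Illustrative personality/affect model: fixed trait dimensions plus
    an enthusiasm level that drifts gradually with visitor reactions."""
    extroversion: float = 0.5      # tendency to gesture and elaborate
    expressiveness: float = 0.5    # intensity of facial expressions
    enthusiasm: float = 0.5        # mutable emotional state in [0, 1]
    told: set = field(default_factory=set)   # facts already given

    def observe_reaction(self, interest, rate=0.2):
        """Nudge enthusiasm toward the observed visitor interest level."""
        self.enthusiasm += rate * (interest - self.enthusiasm)
        self.enthusiasm = min(1.0, max(0.0, self.enthusiasm))

    def gesture_probability(self):
        """More extroverted and more enthusiastic robots gesture more."""
        return min(1.0, self.extroversion * (0.5 + self.enthusiasm))
```

The dialogue controller would consult such a structure when choosing modalities, e.g. sampling a gesture with probability `gesture_probability()` before each turn.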
INDIGO will exploit PServer, a general-purpose personalization server developed at NCSR, to realise its user modelling strategy. Instead of re-inventing the wheel for every new situation that calls for personalized services, an application-independent approach has been taken, which separates the user modelling modules from the rest of the application at both the logical and the physical level. At the logical level, PServer features a flexible, domain-independent data model that is based on three principal entities: (a) users, which are represented by identifiers, (b) stereotypes, which are predefined user groups with specific attributes, and (c) features, which are application characteristics that may or may not attract user preference. In effect, features correspond to those application concepts and components which comprise the model of the user, and are organised in a hierarchy. Applications initialise PServer by defining their own features. For every user, each leaf of the feature hierarchy can hold a value denoting the user's reaction or preference towards the particular feature, and the values of all leaves constitute the profile of the user. New users can be added at any time by the application, and they can be associated with application features. Moreover, stereotypes allow the classification of users into predefined groups. At the physical level, PServer may reside on a different machine from the application, or on the same one. PServer is implemented as a Web server that listens to a dedicated port; all requests have the form of HTTP messages, and responses are encoded in XML. To facilitate applications, a client-side library of classes is available that can be incorporated into the application to handle all low-level communication details. PServer has already been used successfully in a number of European and national projects, and serves as a basis for incorporating and experimenting with new features.
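Since PServer is accessed through HTTP requests with XML-encoded responses, a client interaction can be sketched as follows. The command name and the XML layout are hypothetical placeholders for illustration, not PServer's documented API:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# NOTE: "getusrftr" and the <result>/<feature> layout below are
# illustrative placeholders, not PServer's actual protocol.

def build_request(host, port, command, **params):
    """Compose an HTTP request URL for a personalization command."""
    return "http://%s:%d/%s?%s" % (host, port, command, urlencode(params))

def parse_profile(xml_text):
    """Turn an XML profile response into a {feature: value} dict."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): float(f.get("value"))
            for f in root.iter("feature")}
```

A client library of this kind is what lets the application treat the user model as a remote key-value service, independent of where PServer is deployed.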
- WP5 - Natural language interaction
WP5 will be broken down into two tasks. More specifically:
Task 5.1 Speech recognition and synthesis
On the text-to-speech side, the main aim of this task will be to interpret the intonation and emotion markup that will be provided by the natural language generation module, in order to give the robot's voice a quasi-human, emotion-sensitive intonation. To express emotional connotations in conjunction with speech, we will carry out research and development on the prosodic parameterisation of high-quality limited-domain speech synthesis. This is an iterative research process, where the knowledge gained in one iteration will feed into the next. Every iteration requires a new (or complementary) speech corpus covering the target domain (to be defined in line with the requirements of the interaction scenarios), in which a selection of prosodic features is systematically varied so as to span the available expressivity range. Within the unit selection TTS system that we plan to use, a selection algorithm with a well-defined and optimal target cost function including the relevant prosodic features needs to be prepared, and a set of signal processing methods must also be developed, which will be applied to the selected speech units in order to complement the prosodic control exercised by the target cost function. During the first year, a small speech corpus varying systematically in a few well-selected prosodic and voice quality features (e.g., pitch level and vocal effort) will be recorded and labelled. The current TTS unit selection algorithm from ACAPELA will be enhanced with an additional target cost function taking into account the intended prosodic and voice quality features as recorded in the speech corpus. Existing signal processing algorithms such as PSOLA or others will be used for post-processing the resulting speech signal. The expected result of this will be a high-quality, slightly parametrisable speech synthesiser for a very limited domain. Then the first corpus relevant to the target domain (defined in WP2) will be recorded.
This target domain will need to be specified very precisely in WP2, in co-operation with all partners. Based on that specification and on the lessons learned from the first iteration, new recordings with a few well-selected prosodic and voice quality features will be carried out, and the resulting data will be labelled. Gradual improvements can be expected both in the selection algorithm (target costs) and in the signal processing methods. This will ultimately feed into a high-quality, parametrisable speech synthesiser for the robot.
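The role of the target cost function in unit selection can be illustrated with a minimal sketch: each candidate unit from the corpus is scored by a weighted sum of mismatches between its prosodic features and the target specification, and the cheapest unit is selected. The feature names and weights are illustrative assumptions, and the join costs between consecutive units, also part of unit selection, are omitted:

```python
def target_cost(target, unit, weights=None):
    """Weighted sum of mismatches between the prosodic specification of a
    target and the features of a candidate unit from the speech corpus."""
    weights = weights or {"pitch": 1.0, "vocal_effort": 2.0, "duration": 0.5}
    return sum(w * abs(target.get(f, 0.0) - unit.get(f, 0.0))
               for f, w in weights.items())

def select_unit(target, candidates):
    """Pick the candidate unit with the lowest target cost."""
    return min(candidates, key=lambda u: target_cost(target, u))
```

Extending the target cost with the intended emotional features is exactly what allows the expressivity range recorded in the corpus to be exploited at synthesis time.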
On the speech recognition side, ACAPELA's speech recognition technology will be adapted in order to reliably recognize relatively simple user utterances in potentially noisy, densely populated environments. This will not be a large-vocabulary general-purpose speech recognition system, but a system with a smaller vocabulary, aimed at system-initiative task-specific dialogues. One particular area which will be investigated in this context is the use of bi-directional syntactic capabilities to promote alignment of the humans with the robot. INDIGO's grammar system, OpenCCG, can be used for both interpretation and generation, and this provides a unique opportunity to investigate a novel technique for improving speech recognition and language interpretation robustness. Psycholinguistics has shown that humans align with conversational partners, using the same words and even the same syntactic structures, and has further shown that people tend to align very strongly with non-humans. By ensuring that the robot provides examples of the kind of language it can understand through its language generation capability, INDIGO may be able to handle a wider range of input than would otherwise be possible.
To demonstrate multilingual support, speech recognition and synthesis will support two languages: English and Greek; Greek support will be demonstrated only in the final prototype.
Task 5.2 Natural language interpretation and generation
The natural language interpretation components will be responsible for extracting semantic representations from the textual forms of the user's requests, as produced at the end of speech recognition. Similarly, the natural language generation components will be responsible for generating appropriate textual renderings of the semantics the dialogue manager wishes to communicate to the user (mostly descriptions of exhibits), with additional markup that will guide the speech synthesizers, the facial expression generation hardware etc., as discussed above.
The starting point for the generation of natural language descriptions of exhibits in INDIGO will be the technology of IST M-PIRO, which represents the state-of-the-art in this particular field of language generation. The M-PIRO language generation software is currently being reimplemented at UEDIN to improve scalability and efficiency, and the resulting application will be available for use in INDIGO. Starting from language-neutral symbolic information in ontologies and databases, M-PIRO's technology produces texts in multiple natural languages, describing objects, currently exhibits in a virtual museum. The texts are automatically marked up with intonation-related information, for use by speech synthesis systems, and they are tailored to the user on various levels, including the actual content, the syntactic form, and the lexical choice.
M-PIRO's content selection module uses an ontology, possibly coupled to a database, as well as user models (e.g. user types, what each individual user has been told, etc.) to determine "what to say". A text planning module then assembles the selected facts into coherent text structures. This process can involve the inclusion of comparisons with exhibits which have already been seen, and may also employ aggregation, again tailored according to user type. Thus, for example, it is possible to ensure that long, multiply-aggregated sentences are not presented to children. A realisation module then uses language-specific knowledge to produce output in the form of XML files, which contain intonation-related markup, such as phrasal boundaries, and distinctions between information that is assumed to be new or known to the hearer. In INDIGO, the markup will also include information on the emotional status of the robot, which will affect both intonation and facial expressions, and possibly also markup that will instruct the robotic platform to make particular "gestures" or adopt particular "postures" while talking (e.g., facing or pointing to particular exhibits). To achieve this, INDIGO's language generator will have to be sensitive to each robot's personality, as discussed above, as well as to information on the robot's position and physical surroundings. The markup will also have to make clear exactly when each posture, gesture, etc. is to be made as the robot speaks, and this in turn requires theoretical work on how non-verbal and verbal content interact. Extensions to space-insensitive theories on the generation of natural language referring expressions will also be needed, to produce appropriate spatial (e.g., deictic) expressions.
As already mentioned, a major limitation of the current state-of-the-art in natural language generation is that users can indicate their interests only indirectly, typically by pointing and clicking. In M-PIRO, selecting an exhibit provides the focus of the information the user wishes to receive, albeit with no indication as to exactly what information about the exhibit the user is interested in. The content selector then retrieves from the database all the information that is relevant to the selected focus, and relies on user modelling to select information that is deemed more appropriate and interesting for the user. In contrast, INDIGO will allow the users interacting with the robotic guides to formulate requests in any of the supported natural languages. The interpretation of the requests will produce indicators of the information that the users wish to receive. These indicators will undergo further processing by the dialogue controller, this time taking into account the preceding discourse, the physical surroundings, and other non-verbal indicators of the user (facial features, gestures, etc.), as discussed above; this requires further work to integrate the language generator with the dialogue controller.
The users' requests in INDIGO will refer to a much richer content pool, compared to most current dialogue systems. In a museum exhibition, possible requests include questions about the history of an exhibit, the painting techniques that were used to decorate it, its use, etc. Full semantic interpretation of such a variety of possible requests is beyond the current capabilities of natural language processing technology. We believe, however, that by exploiting the context of the dialogue and domain-specific resources it is feasible to extract reasonable semantic indicators from the users' utterances, which will be enough to lead to the presentation of appropriate content, possibly after clarification sub-dialogues. More specifically, we plan to explore a shallow interpretation approach inspired by research on question-answering (QA) for document collections.
Current QA systems typically determine the type of each request (e.g., asking for a person, location, or time) by using classification techniques, and rely on the recognition of entity names (e.g., person names, names of historical periods) and phrases that suggest interest in particular attributes of entities (e.g., the creator of an entity, the creation date of an entity) in the request, in order to figure out what the user wishes to be told. We believe that it is possible to formulate a similar strategy for INDIGO, which will exploit resources that will be available for generation purposes, and information from the dialogue controller. For example, the generation resources typically include an ontology of the domain, in the form of a hierarchy of entity types, the names of the entities in the supported languages, noun synonyms that can be used to refer to each entity type, and microplans (in effect, specifications of phrases) that can express each property of the ontology. Resources of this kind can be used to identify the logical facts in the ontology a user's request relates to. Furthermore, information about the current state of the dialogue (as in TRINDI's Information States) can be exploited to help determine the most likely semantics of the user's requests. The interpretation of the user's utterances can be facilitated further by implicitly guiding the user towards requests that the robot can comprehend: as discussed above, when generating descriptions of exhibits, it is possible to include "hooks", intended to act as hints of what each robotic guide knows, and, therefore, of what types of requests it can handle. Further interpretation robustness may result from the fact that we will be using the same OpenCCG grammars in both language generation and (after appropriate grammar transformations) speech recognition, and, as discussed above, there is evidence that humans tend to align to the kinds of language other speakers, particularly non-human ones, use.
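As a toy illustration of this shallow strategy, the sketch below maps a request onto (entity, attributes) indicators using the kind of name lists and attribute cues that the generation resources would supply. All lexical resources here are invented placeholders.

```python
# Invented stand-ins for the entity names, synonyms and microplan cues
# that the real generation resources would provide.
ENTITY_NAMES = {"amphora": "exhibit-12", "kouros": "exhibit-7"}
ATTRIBUTE_CUES = {
    "creator": {"made", "created", "painter", "sculptor", "who"},
    "creation-period": {"when", "period", "date", "old"},
    "use": {"used", "purpose", "for"},
}

def interpret(request):
    """Map a user's request to (entity, attributes) indicators that the
    content selector could look up in the ontology."""
    tokens = set(request.lower().replace("?", " ").split())
    entity = next((eid for name, eid in ENTITY_NAMES.items()
                   if name in tokens), None)
    attributes = [a for a, cues in ATTRIBUTE_CUES.items() if tokens & cues]
    return entity, attributes

print(interpret("When was the amphora made?"))
```

Note that a request such as this matches more than one attribute cue ("when" and "made"); in exactly such cases the dialogue controller could fall back on the clarification sub-dialogues mentioned above.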
As with speech recognition and synthesis, language interpretation and generation will support two languages, English and Greek, which were already supported in M-PIRO.
- WP6 - Visual interaction
WP6 will be broken down into three tasks. More specifically:
Task 6.1 Tracking of humans
For this purpose, a state-of-the-art approach for tracking multiple skin-coloured objects will be used that has already been developed by members of the consortium in the context of the EU-IST ACTIPRET project. The approach encompasses a collection of techniques that allow the modelling, detection and temporal association of skin-coloured objects across image sequences. In INDIGO, a non-parametric model of skin colour will be employed, and skin-coloured objects will be detected with a Bayesian classifier that will be bootstrapped with a small set of training data and refined through an on-line iterative training procedure. By adapting the skin-colour probabilities on-line, the classifier will be able to cope with considerable illumination and skin-tone changes. Tracking over time will be achieved by a technique that can handle multiple objects simultaneously, even when the tracked objects have trajectories of arbitrary complexity, occlude each other in the field of view of a possibly moving camera, and vary in number over time.
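The Bayesian scheme with on-line adaptation can be caricatured as follows. The histogram representation over quantized colour bins, the skin prior, and the adaptation rate are simplifying assumptions for illustration; the real system operates on pixel data from video frames.

```python
from collections import defaultdict

class SkinClassifier:
    """Non-parametric (histogram-based) Bayesian skin classifier with
    on-line adaptation: P(skin | c) = P(c | skin) * P(skin) / P(c)."""

    def __init__(self, prior_skin=0.3, adapt_rate=0.1):
        self.p_c_skin = defaultdict(float)   # P(colour bin | skin)
        self.p_c = defaultdict(float)        # P(colour bin)
        self.prior = prior_skin              # assumed prior P(skin)
        self.rate = adapt_rate               # on-line blending weight

    def train(self, colours, labels):
        """Bootstrap the histograms from a small labelled set."""
        skin = [c for c, lab in zip(colours, labels) if lab]
        for c in skin:
            self.p_c_skin[c] += 1.0 / len(skin)
        for c in colours:
            self.p_c[c] += 1.0 / len(colours)

    def prob_skin(self, colour):
        if self.p_c[colour] == 0.0:
            return 0.0
        return min(1.0, self.p_c_skin[colour] * self.prior / self.p_c[colour])

    def adapt(self, colour, confident_skin):
        """Blend new evidence into the histograms (on-line refinement),
        which lets the classifier follow illumination changes."""
        self.p_c[colour] = (1 - self.rate) * self.p_c[colour] + self.rate
        if confident_skin:
            self.p_c_skin[colour] = ((1 - self.rate) * self.p_c_skin[colour]
                                     + self.rate)
```

In use, `adapt` would be fed pixels that the tracker is already confident about, so that the colour model drifts with the lighting rather than staying fixed at its bootstrapped state.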
- During this task, appropriate algorithms will be created in order to identify and track humans in the vicinity of the robot and, additionally, to identify and track the face and the hands of a single human that has the attention of the robot. For this purpose information will be fused from both range and visual sensors.
- To realize the tracking of humans, we will apply an extended version of the multi-hypothesis tracking paradigm. Conventional approaches typically assume that there is an individual feature for each of the objects being tracked by the system. In the context of the INDIGO system, however, the robot has to deal with potential occlusions, as well as with the fact that people often form groups, which results in a single group feature that does not provide appropriate information for updating the filter. The goal of the INDIGO project is therefore to develop a multi-resolution multi-hypothesis tracker that contains appropriate transition models allowing people to join and leave groups. With these models we expect a significantly more robust tracker that can reliably track multiple people in the vicinity of the robot, even in situations in which they form groups.
- Additionally, the robot will also be able to recognize and distinguish between the faces and the hands of the people around it. It will also be able to “lock” onto a single person (in the case of a dialogue), and robustly track their face and hands even if the person is moving (the robot may have to adjust the direction of its head and/or body so that it always faces the person it has a dialogue with).
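The flavour of multi-hypothesis data association can be sketched on one-dimensional positions as below. The Gaussian association likelihood, the fixed new-track penalty and the pruning width are invented for illustration; the group handling discussed above would add merge and split moves as further association choices per hypothesis.

```python
import math

def hypothesis_step(hypotheses, observation, sigma=0.5, max_keep=3):
    """One update of a toy multi-hypothesis tracker. Each hypothesis is
    (track_positions, log_weight); the new observation may extend any
    existing track or start a new one, and only the best few hypotheses
    survive pruning."""
    new_hyps = []
    for tracks, logw in hypotheses:
        # associate the observation with each existing track in turn
        for i, pos in enumerate(tracks):
            log_like = -((observation - pos) ** 2) / (2 * sigma ** 2)
            updated = tracks[:i] + [observation] + tracks[i + 1:]
            new_hyps.append((updated, logw + log_like))
        # or hypothesise a brand-new track (fixed penalty, an assumption)
        new_hyps.append((tracks + [observation], logw + math.log(0.05)))
    new_hyps.sort(key=lambda h: h[1], reverse=True)
    return new_hyps[:max_keep]

# two tracks at 0.0 and 5.0; an observation at 0.2 should be associated
# with the first track in the best-scoring hypothesis
hyps = hypothesis_step([([0.0, 5.0], 0.0)], 0.2)
```

Keeping several ranked hypotheses, rather than committing to a single association per frame, is what lets the tracker recover when an occlusion or group formation makes the greedy choice wrong.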
Task 6.2 Human hand-gesture recognition
Humans intuitively use hand gestures when talking to other humans. During this task, appropriate software will be developed so that INDIGO robots will be able to recognize a set of hand gestures made by the person they have a dialogue with. The exact set of these gestures will be defined in Tasks 2.1 and 2.2. Examples may include the pointing gesture (the user points to an exhibit and asks the robot for information, or points down and asks the robot to “come here”), the “stop” gesture (the user blocks the way of the robot and waves their hands), and the “I want your attention” gesture (the user waves their hands).
In order to achieve the goals of this task, we will utilize the output of the skin-coloured regions tracker described in Task 6.1. In order to detect the fingers of the tracked hands, a methodology developed by a partner of the consortium will be utilized, which is based on the curvature at each point of the contour of the detected colour blobs. By employing both cameras of INDIGO’s stereo camera system and two distinct instances of the skin-colour tracker, each operating on a separate video stream, the proposed approach is able to recover the 3D position of the hands and fingers of the interacting person.
In order to interpret the locations of the hands and fingers and recognize the actual gestures, INDIGO will apply probabilistic techniques based on Rao-Blackwellized particle filters. To cope with the ambiguities arising from the uncertainty in the sensory input, INDIGO will apply the particle filter to maintain multiple hypotheses about the potential gestures of a person over time. The individual particles of the filter will represent not only potential gestures but also potential starting and end points. To cope with sequences of movements and complex gestures, we will apply a hierarchical approach: based on the estimate of the high-level gesture corresponding to the intention of the person, we will apply specific motion models for the low-level, short-term gestures.
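A deterministic toy of the analytically tractable part of such a filter is sketched below: the posterior over a small discrete set of gesture hypotheses is updated in closed form from hand-speed observations, while in the full system particles would additionally carry hypothesised starting and end points. The per-gesture motion models (expected hand speeds) are invented placeholders.

```python
import math

# Invented motion models: expected hand speed per gesture hypothesis.
EXPECTED_SPEED = {"point": 0.1, "stop": 0.0, "wave": 0.8}

def update(posterior, observed_speed, noise=0.2):
    """One Bayes update of P(gesture | observations so far), using a
    Gaussian likelihood around each gesture's expected hand speed."""
    new = {}
    for gesture, p in posterior.items():
        err = observed_speed - EXPECTED_SPEED[gesture]
        new[gesture] = p * math.exp(-err * err / (2 * noise * noise))
    total = sum(new.values())
    return {g: p / total for g, p in new.items()}

belief = {g: 1 / 3 for g in EXPECTED_SPEED}   # uniform initial belief
for speed in (0.75, 0.85, 0.8):               # motion that looks like waving
    belief = update(belief, speed)
best = max(belief, key=belief.get)
```

Maintaining the full posterior, rather than a single best guess, is what allows the system to defer commitment while the observed motion is still ambiguous between gestures.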
Task 6.3 Recognition of facial expressions and features
This task is devoted to the development of software that will enable INDIGO robots to recognize some simple facial expressions and/or features. Analysis of facial features and expressions requires a number of steps: detecting or tracking the face; locating characteristic facial regions such as the eyes, mouth and nose; extracting and following their movement; and, finally, interpreting this motion as facial expressions.
For detecting and tracking the speaker’s face, we will utilize the output of the algorithm described in Task 6.1. As a second step, a probabilistic approach will be used to track facial regions between frames. For this purpose, besides local illumination and colour information, constraints imposed by the deformation of the contour of the detected facial region will be utilized as well.
State-of-the-art emotion recognition systems either use still images, trying to determine facial gestures using anatomical information about the face, or track characteristic points in these regions over time. INDIGO’s probabilistic approach will utilize state-of-the-art algorithms to track facial regions over time. Special emphasis will be given to the fact that the final system must be able to provide results in real time. Given that the result of the recognition system will not be critical for the operation of the overall system, it might be necessary to trade off the computational requirements of the system against performance in terms of recognition rates and the number of recognizable features.
As a by-product of facial expression recognition, INDIGO robots will also be able to recognize (within some confidence interval) whether a person is talking or not. Additionally, by utilizing appearance-based methods on viewpoint-adjusted face regions, INDIGO robots will also be able to recognize some characteristic facial features, such as whether a user has a moustache or a beard, or whether he/she wears glasses. This information, depending on specific application scenarios that will be defined in Task 2.1, might be used in order to initiate dialogues customized to the particular user.
- WP7 - Virtual Emotions
The main goal of this work package is to link information coming from the Dialogue Manager to the robotic face. During this WP, expressive and emotional speech animation of the robotic face will be realized, which requires the combination of several different modalities, such as lip-synchronized speech, emotions, facial expressions (e.g. raising eyebrows), gaze, and idle motions of the face such as blinking. A parameterized facial model will be used, based on MPEG-4 Facial Definition and Animation Parameters. Facial Definition Parameters (FDPs) are feature points, such as the corner of a lip, that are used to characterize the face or, in other words, define what a face is. Facial Animation Parameters (FAPs) are used to define an animation to produce faces with speech, expression and emotions. Each FAP value corresponds to the displacement of one feature point on the face in one direction, in terms of FAPUs (Facial Animation Parameter Units). FAPUs are calculated for the given model as fractions of key facial distances, such as the distance between the two lip corners or between the eyes. Animations are specified in terms of these parameters, so that an animation can be applied to another facial mesh that contains the same feature point information. Both the robotic face and the virtual face will be animated according to these feature points. A mapping between FAPs and controls on the robotic face is necessary in order to realize the animations.
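A minimal sketch of this parameterization, assuming the standard MPEG-4 convention that a FAPU is a key facial distance divided by 1024 and that a FAP displaces one feature point by its value expressed in FAPUs; the feature point, the model distances and the FAP value used below are illustrative.

```python
def fapu_from_model(mouth_width, eye_separation):
    """FAPUs are fractions of key facial distances of the given model;
    MPEG-4 divides each distance by 1024 (MW = mouth width FAPU,
    ES = eye separation FAPU)."""
    return {"MW": mouth_width / 1024.0, "ES": eye_separation / 1024.0}

def apply_fap(point, axis, fap_value, fapu):
    """Displace one feature point along one axis by fap_value FAPUs,
    which is exactly what one FAP encodes."""
    moved = dict(point)
    moved[axis] += fap_value * fapu
    return moved

fapu = fapu_from_model(mouth_width=60.0, eye_separation=65.0)
lip_corner = {"x": 30.0, "y": -20.0, "z": 0.0}
# stretch the lip corner sideways by 200 MW units (a smile component)
smiling = apply_fap(lip_corner, "x", 200, fapu["MW"])
```

Because the displacement is expressed in model-relative FAPUs rather than absolute units, the same FAP stream animates any mesh (or the robotic face) that defines the same feature points, as described above.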
WP7 comprises four tasks.
Task 7.1. Robot emotional state
A crucial part of the robotic face will be its expressivity. In order to achieve a high level of believability of the robotic face, emotions and personality parameters are required to establish a sense of individuality. In order to achieve this, we will develop an engine that maintains an emotional state and that updates it dynamically according to the personality. This personality/emotion engine is linked with the Dialogue Manager from WP5, so that the dialogue system can control the emotional state, as well as use it to generate an appropriate response. The emotional state and personality will also be used to control the expressions shown on the robotic face.
An individual will be represented as a combination of three factors: personality, mood and emotion. Personality is a static property that influences the way people perceive their environment and distinguishes one person from another. Personality models from psychology, such as the OCEAN model, will be applied in the personality engine. Emotions are momentary changes that result in a change in the facial expression, such as being happy or sad. There is a variety of models for emotions from both psychology and neurophysiology that represent emotions on either a discrete (Ekman, OCC appraisal model) or a continuous domain (activation-evaluation), and an appropriate model will be applied in the emotion engine. Another layer between static personality and momentary emotions is mood, reflecting the state of a person over a relatively longer period of time compared to emotions. Mood is important for a more realistic emotion representation: e.g. a person that is in a bad mood can still smile, and continuous positive impulses in the emotional state can shift him/her to a positive mood. In the emotion engine all these factors will play a role in updating the other factors. For example, an emotional impulse coming from the dialogue system (via tagged text) will update the current mood of a person with consideration of personality as well as emotion and mood history. The emotional state will then be updated according to the new mood, new mood history and new emotion history. Decay of emotions and return to a neutral state over time will also be considered as a part of the system. Emotional labels produced by the engine will update the current facial expression on the robot during conversation. An emotion authoring tool will also be developed to control emotion and personality parameters and their effects on each other.
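The update scheme can be sketched as below. The single valence dimension, the weighting constants, and the use of an OCEAN-style neuroticism trait to scale impulses are all simplifying assumptions made for illustration.

```python
class EmotionEngine:
    """Toy personality/mood/emotion engine: impulses from the dialogue
    system move the momentary emotion (scaled by personality), the mood
    integrates emotions slowly, and emotions decay back to neutral."""

    def __init__(self, neuroticism=0.5, mood_inertia=0.9, decay=0.8):
        self.neuroticism = neuroticism   # OCEAN-style trait in [0, 1]
        self.mood_inertia = mood_inertia # how slowly mood follows emotion
        self.decay = decay               # per-tick decay toward neutral
        self.mood = 0.0                  # long-term valence in [-1, 1]
        self.emotion = 0.0               # momentary valence in [-1, 1]

    def impulse(self, valence):
        """Tagged-text impulse: personality scales the momentary
        reaction, and the mood integrates the new emotion slowly."""
        reaction = valence * (0.5 + self.neuroticism)
        self.emotion = max(-1.0, min(1.0, self.emotion + reaction))
        self.mood = (self.mood_inertia * self.mood
                     + (1 - self.mood_inertia) * self.emotion)

    def tick(self):
        """Decay of emotions toward the neutral state between impulses."""
        self.emotion *= self.decay

    def label(self):
        """Emotional label that drives the current facial expression."""
        if self.emotion > 0.3:
            return "happy"
        if self.emotion < -0.3:
            return "sad"
        return "neutral"
```

The separation mirrors the three layers described above: `neuroticism` is static, `mood` moves slowly, and `emotion` reacts immediately but decays.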
Task 7.2. Facial authoring tools for emotive virtual to real interaction
A facial authoring tool will be developed to define the feature points on a given facial mesh and allow FAPs to be changed to create morph-targets, which are static states of the face such as a smile, a raised eyebrow, or the visual counterpart of a phoneme such as /a/ (a viseme). Morph-targets defined in terms of parameters will be stored in an expression-viseme database for future use in Task 7.3. Authoring of the face can also be done through higher-level parameters, such as opening the mouth, which are combinations of several FAPs and let the user create static expressions in an easier and quicker way. These higher-level parameters can be obtained through statistical analysis of motion-captured facial animation data, finding the principal components of the face movements (Principal Component Analysis).
Task 7.3. Facial modalities to Facial Animation Parameters
Multiple input streams must be managed in order to produce the animation. The animation will be constructed from tagged text coming from the Dialogue Manager developed in WP5. The tags in the text will denote where certain expressions (such as eyebrow raising) should take place with regard to the spoken text. Secondly, a speech animation will be constructed from the text and the viseme/timing information from the text-to-speech engine developed in WP5. Finally, face idle motions (such as eye blinking) and the expression derived from the emotional state will form a part of the final animation. All these animation streams will be blended into a single animation stream in MPEG-4 format, which is used to animate both the virtual character and the robotic face.
Lip-synchronized speech can be produced automatically by extracting phonemes from written text with the help of a text-to-speech tool. Phoneme timing information coming from a TTS system is used for mapping phonemes to visemes, which are predefined and stored in a viseme database as defined in Task 7.2. The building blocks of speech animation coming from the viseme database are interpolated according to the timing information in the phoneme stream coming from the TTS system. In order to obtain realism in speech animation, simple interpolation between phonemes is not enough, since each phonetic segment is influenced by its neighbouring segments, an effect called coarticulation. Computer-synthesized speech from text is good for providing accurate synchronization between speech and lip movements. However, it lacks properties such as natural rhythm, articulation and intonation, which are present in natural speech. In order to create animation for natural speech, phoneme timing information produced by ACAPELA’s speech recognition tool can be used. Emotions and other facial expressions will be blended with the speech animation at specific time points. A change in the emotional state is reflected on the face by choosing the appropriate emotional state from the expression-viseme database. The emotion engine is updated with the emotional impulses coming from the dialogue system in a tagged format. Immediate facial expressions are played directly on the face; their timing is defined by the appropriate annotations in the dialogue text. Idle motions of the face are also produced with an idle motion engine, and eye blinks can be modelled with random algorithms. A gaze model can be developed that is updated according to conversational regulation purposes such as providing feedback, taking turns and paying attention. This can be achieved using tagged text, just like the other immediate facial expressions.
Gaze is also affected by emotional state and personality: a shy person will avoid looking directly into someone’s eyes, whereas a person who likes another will maintain eye contact for a long time. All these animations are blended into a single animation in MPEG-4 format.
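The interpolation-plus-coarticulation step can be illustrated with a toy one-parameter mouth model: each segment's target shape is pulled toward its neighbours before being interpolated over time. The viseme values, the influence weight and the sampling rate are invented placeholders.

```python
# Invented one-parameter visemes: degree of mouth opening per phoneme.
VISEMES = {"a": 0.9, "m": 0.0, "o": 0.7}

def mouth_track(phonemes, influence=0.25, steps=4):
    """Sample a mouth-opening curve over a phoneme sequence. Each
    segment's target is a blend of its own viseme and its neighbours'
    (a crude coarticulation rule), and consecutive targets are joined
    by linear interpolation."""
    track = []
    for i, ph in enumerate(phonemes):
        prev_v = VISEMES[phonemes[i - 1]] if i > 0 else VISEMES[ph]
        next_v = VISEMES[phonemes[i + 1]] if i + 1 < len(phonemes) else VISEMES[ph]
        target = ((1 - 2 * influence) * VISEMES[ph]
                  + influence * prev_v + influence * next_v)
        start = track[-1] if track else target
        for s in range(1, steps + 1):
            track.append(start + (target - start) * s / steps)
    return track

# "mam": the /a/ never reaches its full opening because the surrounding
# /m/ closures pull it down, which is the coarticulation effect
curve = mouth_track(["m", "a", "m"])
```

In the full system the per-segment durations would come from the TTS (or, for natural speech, from the alignment) timing information rather than a fixed number of samples.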
Task 7.4. Multimodal Facial Animation engine for virtual and robotic characters
In the last step, the final parametric animation file containing a mixture of all modalities on the face will be used to animate the robotic face. This will be achieved through a mapping between FDPs and control points on the robotic face. A FAP/Robot converter will be developed that takes into account the mechanical constraints of the robotic face, such as its degrees of freedom. The same facial animation will also be used to drive a virtual face on a computer screen, so that comparison between the virtual and the robotic face will be possible.
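The conversion can be sketched as a clamped lookup from FAPs onto the robot's actuators; the FAP numbers, servo names, ranges and gain below are invented placeholders, and a real converter would be driven by the measured limits of the robotic face.

```python
# Hypothetical mapping: FAP number -> (servo name, min angle, max angle).
SERVO_MAP = {
    3:  ("jaw",       -30.0, 30.0),   # jaw-opening parameter (illustrative)
    31: ("left_brow", -10.0, 15.0),   # eyebrow-raising parameter (illustrative)
}

def faps_to_servo_commands(faps, gain=0.05):
    """Convert a FAP frame {fap_number: value} to clamped servo angles.
    FAPs with no mechanical counterpart on the robotic face are
    silently dropped - the robot simply cannot express them."""
    commands = {}
    for fap, value in faps.items():
        if fap not in SERVO_MAP:
            continue
        servo, lo, hi = SERVO_MAP[fap]
        # clamp to the mechanical range of this degree of freedom
        commands[servo] = max(lo, min(hi, value * gain))
    return commands

cmds = faps_to_servo_commands({3: 400, 31: 900, 52: 100})
```

Feeding the same FAP frames to the virtual face renderer and to this converter is what makes the side-by-side comparison of the virtual and robotic faces possible.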
An instrumental part of the virtual face interaction is the integrated maintenance of eye contact between the virtual face, driven by the real-time gaze model analysed in Task 7.3, and the real participant, so that the level of Presence is ensured during the interaction. The conversion of virtual-character Facial Animation Parameters to robotic facial expressions will be performed jointly, in collaboration with the WP3 and WP2 partners.