Assistive Kitchen environment based on Mixed Reality

The assignment for the 2017/18 course of Robotic Perception and Action called for the realization of an IAT able to support people with mild cognitive impairments in everyday household activities. To this end, a virtual assistant was set up to support food preparation. The assistant interacts with the user through contextualized animations that show the procedures to be carried out to achieve the desired result in the kitchen environment. Two main tasks run concurrently: 1) sensing the environment, i.e. locating the user and the objects and recognizing the user’s actions to estimate the degree of progress; 2) animating the assistant in AR through contextualized animations. To achieve this, the work was subdivided into tasks with the aim to:

  • create a simple and intuitive interface that explains to the user the elementary tasks step by step;
  • detect the progress of operations and present the next one only when the previous ones have been completed;
  • control the system that detects the user’s location and the positions of the objects.

The technology exploited is a ToF camera-projector system that allows the kitchen surface to be used as a touch screen while verifying the user’s actions. This kind of approach is relatively new: only in recent years have ToF cameras become widespread and, as a consequence, has the success rate of action recognition increased.

Sensor setup



Workpackages description

As an initial challenge, the preparation of a cup of tea was chosen, using a sink, an induction plate, a pot, a cup and a teaspoon.
The IAT developed is composed of different modules. The Finite State Machine module receives the data coming from the other modules and triggers the interaction cues. In particular, it receives data from the Sensor Fusion module about the recognition and localization of objects. The Sensor Fusion module combines data from the Objects Localization with ToF Camera and Objects Localization with AR-Toolkit modules to provide an optimal estimate of each object’s location together with its uncertainty. The Skeleton Acquisition & Action Recognition module provides the Finite State Machine with a label identifying the gesture performed by the user. This label, together with the object locations estimated by the Sensor Fusion module, allows the Animation module to produce a proper, contextualized supportive animation. The supportive actions are managed by different states, and the machine switches from one state to the next once the user has actually performed the animated suggestion.
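The state-switching logic described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the step names, animation cues and the `TeaAssistant` class are hypothetical, and the sequence is shortened to three steps for readability. The only elements taken from the description are the idea that a step advances when the recognized action label (e.g. Grasp, Tilt, Mix) and the involved object match what the current state expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str             # hypothetical state name
    cue: str              # hypothetical animation cue shown to the user
    expected_action: str  # action label that completes the step
    target_object: str    # object the action must involve


# Hypothetical, shortened sequence (the real machine has many more states).
STEPS = [
    Step("TAKE_POT", "show hand reaching the pot", "Grasp", "pot"),
    Step("POUR_WATER", "show pot tilting over the cup", "Tilt", "pot"),
    Step("STIR_TEA", "show teaspoon stirring", "Mix", "teaspoon"),
]


class TeaAssistant:
    """Advances through the steps only when the user performs the suggestion."""

    def __init__(self, steps):
        self.steps = steps
        self.index = 0

    @property
    def done(self):
        return self.index >= len(self.steps)

    def current_cue(self):
        return "finished" if self.done else self.steps[self.index].cue

    def update(self, action_label, nearest_object):
        """Return True if the observation completed the current step."""
        if self.done:
            return False
        step = self.steps[self.index]
        if (action_label == step.expected_action
                and nearest_object == step.target_object):
            self.index += 1
            return True
        return False
```

In this sketch, `update` would be called each time the action recognition module emits a label, and `current_cue` tells the animation module what to show next; an unexpected action simply leaves the state unchanged.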


Overall workflow

The work is divided into six interacting modules:

  1. Finite State Machine → The finite state machine is the abstract definition of the scenes used to represent the tea preparation steps. A finite state machine models behavior through a finite number of states (unique configurations), the transitions between those states and the actions (outputs) within each state. It is represented by a state diagram and/or state transition tables, and it is a common tool in software programming. The flow chart of the finite state machine comprises 2 preliminary states (initial and setting) plus 15 operational states, which manage the different situations the user faces during the ideal sequence of actions for preparing tea.


    State Machine Diagram Structure
  2. Skeleton Acquisition and Action Recognition → In order to understand the user’s actions, a Microsoft Kinect® ToF camera was introduced to classify the following human actions: Reach, Move, Tilt, Mix and Grasp. The work was divided into four steps: data acquisition, training, performance evaluation and real-time implementation. Data are filtered and processed to obtain suitable features, which feed a classifier based on the Random Forest algorithm in Matlab®. For the real-time implementation, the random forest runs as a Python daemon service connected to the Matlab® processing.


    Human skeleton from Kinect
  3. Objects Localization with AR-Toolkit® → One of the most difficult parts of developing an augmented reality application is precisely calculating the user’s viewpoint in real time so that the virtual images are exactly aligned with real-world objects. ARToolKit® uses computer vision techniques to calculate the real camera position and orientation relative to square shapes or flat textured surfaces, allowing the programmer to overlay virtual objects. Using a webcam mounted overhead, ARToolKit® reads the black-and-white paper markers placed on the table and on the objects in order to estimate their positions with respect to the overall reference system. This was implemented in the Unity® framework to feed the Sensor Fusion module.


  4. Animations → The aim of this module is to create interactive animations that guide the user through the activity, taking into account the user’s status and actions. Moreover, the animations take into account the actual location of the objects. In this way, they are generated according to the actual environment rather than simply developed offline and played when necessary. The creation of animations and virtual objects is closely tied to, and controlled by, the finite state machine.


    Overall kitchen animation with all the objects 3D models


  5. Objects Localization with ToF Camera → The objective is to use the depth images acquired by the Microsoft Kinect® ToF camera to both recognize objects and localize them on the table. Furthermore, this module recognizes the kitchen top surface and sets the global reference system accordingly. It feeds the Sensor Fusion module with both the object positions and the quality of the fitted shapes. The corresponding code was implemented in Matlab®.


    Point clouds vs Real Objects
  6. Sensor Fusion → Sensor fusion is the process of combining data from several information sources to provide a complete and robust (measurement) description of a set of variables of interest. It is particularly useful in any application where many measurements must be combined, merged and optimized in order to obtain robust, high-quality information. In our case, the aim was to collect data about object positions from the two Objects Localization modules (AR-Toolkit® and ToF camera) in order to estimate each object’s position and its 2D accuracy in real time. All the measurements were treated as independent random variables, and an analysis of the variance was needed to provide robust information about the actual position and to fuse the two sensors (webcam and Microsoft Kinect®) in a probabilistic way. Both the AR-Toolkit® and the ToF camera measurements are assumed to have a probability density function that varies across the table.


    Representation of the pot position with its uncertainty
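As an illustration of the probabilistic fusion described in the Sensor Fusion module, the following Python sketch combines two independent position estimates by inverse-variance weighting. This is a standard technique for independent measurements and is shown here as an assumption; the text does not state which estimator the project actually used. The function names, the per-axis variance representation and the numeric values are hypothetical.

```python
def fuse_axis(x_a, var_a, x_b, var_b):
    """Fuse two independent measurements of one coordinate.

    Each measurement is weighted by the inverse of its variance, so the
    more certain sensor contributes more to the result.
    """
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused = (w_a * x_a + w_b * x_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)  # never larger than either input variance
    return fused, fused_var


def fuse_position(pos_a, var_a, pos_b, var_b):
    """Fuse two 2D positions axis by axis (axes assumed uncorrelated)."""
    per_axis = [
        fuse_axis(pa, va, pb, vb)
        for pa, va, pb, vb in zip(pos_a, var_a, pos_b, var_b)
    ]
    position = tuple(p for p, _ in per_axis)
    variance = tuple(v for _, v in per_axis)
    return position, variance


# Hypothetical readings for the pot, in metres on the table plane:
# marker-based estimate (webcam) and depth-based estimate (ToF camera).
marker_pos, marker_var = (0.40, 0.20), (0.0004, 0.0004)
tof_pos, tof_var = (0.42, 0.21), (0.0001, 0.0009)
pot_pos, pot_var = fuse_position(marker_pos, marker_var, tof_pos, tof_var)
```

Because the sensor uncertainty is assumed to vary across the table, the variances passed in would themselves depend on where the object is; here they are fixed numbers for simplicity.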

Construction phases (above) and action recognition training (below).


The result of this work was accepted for publication at the IEEE International Workshop on Metrology for Industry 4.0 and IoT, April 16-18, 2018, Brescia, Italy.

The published paper: J. D’Agostini, L. Bonetti, A. Salem, L. Passerini, G. Fiacco, P. Lavanda, E. Motti, M. Stocco, K. T. Gashay, E. G. Abebe, S. M. Alemu, R. Haghani, A. Voltolini, C. Strobbe, N. Covre, G. Santolini, M. Armellini, T. Sacchi, D. Ronchese, C. Furlan, F. Facchinato, L. Maule, P. Tomasin, A. Fornaser, M. De Cecco, An Augmented Reality virtual assistant to help mild cognitive impaired users in cooking, IEEE International Workshop on Metrology for Industry 4.0 and IoT, Brescia, Italy, April 16-18, 2018.


Here is the link to the paper by Hirokazu Kato (the inventor of ARToolKit) that, in part, inspired our project work:

Design of Assistive Tabletop Projector-Camera System for the Elderly with Cognitive and Motor Skill Impairments 

