This document, Deliverable D2.2 “Multimodal, human-centric perception algorithms”, outlines the main achievements of Task T2.3. The goal of this task is to develop computer vision and machine learning algorithms that analyse the scene from the user’s point of view, from a static camera, or from multi-camera setups. The task also aims to integrate methods for analysing and enhancing audio and for transcribing human speech.

The state of the art (SoA) reported in D1.1 has been extended with newer works that are crucial for Human Mesh Recovery (HMR) and image-to-text methods. The importance of 2D pose estimation for downstream tasks, specifically for minimising false positives, is highlighted. Moreover, a multi-view 3D pose estimation method that incorporates Bird’s Eye View tracking for robustness to occlusion has been investigated; with architectural and training modifications, it outperforms the baseline (e.g., +10% Average Precision on the CMU Panoptic dataset). Various architectures have been developed and integrated for HMR, offering solutions for full-body 3D Human Pose and Shape Estimation from a monocular view. Furthermore, baseline architectures have been enhanced to support both lightweight online estimation from a single RGB image and offline video-based estimation. The developed video-based method is more accurate and temporally consistent than the baseline, with up to an 18% improvement in acceleration error. Techniques for localising the user have been integrated, utilising neural networks for feature matching, fiducial markers for global camera pose estimation, and SLAM (IMU and RGB fusion) for user tracking. To facilitate human-computer interaction, a gesture recognition module has been developed.

Next, we expand on the audio processing techniques, specifically audio enhancement to mitigate noise on an industrial floor.
Speech recognition and emotion recognition modules extract, respectively, the spoken text and the user’s emotional state from an audio signal.

Regarding scene description, a methodology has been developed for extracting highlights from a long video presentation, allowing the user to watch the most important sections of a presentation on demand. In addition, an image-to-text pipeline has been developed. This pipeline extracts semantic content from both image and text inputs, providing the user with a textual description of the themes of a conference session or presentation. With an integrated text-to-speech method, the text description of the session is converted to audio, improving the user experience, especially for users with visual impairments. Consequently, this module is a useful asset for improving accessibility for users. The feasibility of Active Learning for improving algorithmic performance is also investigated and validated.

Since mobile devices (e.g., mobile phones and head-mounted displays) typically do not have the resources for computationally expensive calculations, we investigate methodologies for edge-to-cloud communication. The camera stream from the mobile device is sent to the cloud, where the analysis takes place, and the results are returned to the mobile device for rendering. Two solutions are explored for this task. The first utilises Kubernetes, while the second relies on a centralised manager that connects to remote machines, evaluates their GPU utilisation, and assigns specific tasks to them.
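
To make the second edge-to-cloud solution more concrete, the following is a minimal sketch of how a centralised manager might assign tasks based on GPU utilisation. It is an illustration under stated assumptions and not the implementation used in the project: the `Machine` structure, the `assign_tasks` function, the greedy least-utilised policy, the `max_utilisation` threshold, the machine names, and the per-task load estimates are all hypothetical. Utilisation readings are passed in as plain numbers; how they are collected from the remote machines is outside the scope of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A remote worker as seen by the manager (hypothetical structure)."""
    name: str
    gpu_utilisation: float  # fraction in [0, 1], as last reported by the machine
    tasks: list = field(default_factory=list)


def assign_tasks(machines, tasks, max_utilisation=0.9):
    """Greedily give each task to the currently least-utilised machine.

    `tasks` is a list of (task_name, estimated_load) pairs, where the load is
    an assumed estimate of the extra GPU utilisation the task would cause.
    Tasks that would push even the least-utilised machine above
    `max_utilisation` are returned as unassigned.
    """
    unassigned = []
    for task_name, load in tasks:
        target = min(machines, key=lambda m: m.gpu_utilisation)
        if target.gpu_utilisation + load > max_utilisation:
            unassigned.append(task_name)
            continue
        target.tasks.append(task_name)
        target.gpu_utilisation += load
    return unassigned


# Illustrative values only; none of these figures come from the deliverable.
machines = [Machine("gpu-node-a", 0.70), Machine("gpu-node-b", 0.20)]
tasks = [
    ("pose_estimation", 0.30),
    ("speech_recognition", 0.25),
    ("gesture_recognition", 0.15),
    ("emotion_recognition", 0.20),
]
unassigned = assign_tasks(machines, tasks)

for machine in machines:
    print(machine.name, machine.tasks)
print("unassigned:", unassigned)
```

With these illustrative values, the first two tasks go to the lightly loaded machine, the third goes to the other machine once the loads have evened out, and the last task is left unassigned because no machine has enough headroom. A real manager would also need to refresh the utilisation readings periodically and to handle machines that become unreachable.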
Iason Karakostas (Wed,) studied this question.