Spatial Representation Learning for Human-centric Image and Video Understanding

  • Manli Zhu

Abstract

Improving the interpretability and spatial feature representation of deep learning models is a central challenge in spatial representation learning. Despite remarkable advances in hardware that enable the efficient processing of highly complex algorithms, current deep neural networks often operate as black boxes, obscuring how features are represented and learned, which poses a significant trust issue. To bridge this gap, this thesis focuses on improving the interpretability and representation of spatial information, particularly in human-centric image and video understanding, aiming to make deep learning systems more transparent and robust.

This thesis makes four key contributions to spatial representation learning for human motion and action analysis. First, it introduces a spatial attention representation method that highlights the features a deep learning model deems important, enhancing the interpretability of motion analysis in video data. Second, it presents a two-stream deep learning framework that fuses multiple sets of spatial features to capture data patterns from different perspectives, achieving over 90% accuracy in gait analysis. Third, it introduces a skeleton-aware graph convolutional network that models the fine-grained spatial relationships between humans and objects in still images, yielding a more robust system for human action understanding. Finally, it proposes a geometric-feature-enhanced Transformer framework that unifies spatial keypoint learning across object categories for consistent and robust feature representation in human-object interaction detection, outperforming state-of-the-art methods by a significant margin.
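To make the first contribution concrete, the following is a minimal sketch of a generic spatial attention module in PyTorch. The class name, tensor shapes, and design here are illustrative assumptions for exposition, not the thesis's actual implementation.

```python
# Minimal sketch of a spatial attention module (illustrative only; not the
# thesis's code). Assumes a CNN backbone producing (batch, C, H, W) features.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Scores each spatial location and reweights the feature map, so the
    regions the model relies on can be inspected as a heatmap."""
    def __init__(self, in_channels: int):
        super().__init__()
        # A 1x1 convolution maps each location's feature vector to a scalar score.
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor):
        b, _, h, w = x.shape
        logits = self.score(x).view(b, -1)                 # (b, h*w)
        attn = torch.softmax(logits, dim=-1).view(b, 1, h, w)
        # attn is a normalized spatial map; overlaying it on the input frame
        # visualizes which regions drive the prediction.
        return x * attn, attn
```

Normalizing the scores with a softmax over spatial locations yields a proper distribution over the feature map, which is what makes the attention map directly interpretable as a saliency heatmap.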

The findings of this thesis demonstrate significant improvements in the interpretability and robustness of deep learning models for spatial feature representation across different domains and applications. The proposed methods achieve over 90% accuracy in identifying abnormal human movements and surpass state-of-the-art performance by more than 3 mAP in human action analysis, offering in-depth insights into human motion and action analysis. These achievements have a significant impact on human-centric applications, including health monitoring and visual scene understanding.

By bridging critical gaps between theoretical research and practical real-world applications, this thesis contributes to technologies that improve everyday human experiences and interactions. Future research could extend the proposed methods to other domains requiring high interpretability and robustness, potentially broadening the impact of human-centric artificial intelligence applications.
Date of Award: 24 Oct 2024
Original language: English
Awarding Institution
  • Northumbria University
Supervisors: Longzhi Yang, Hubert Shum, Edmond Ho & Shanfeng Hu

Keywords

  • Human-object interaction
  • Attention mechanism
  • Object keypoints
  • Deep learning
  • Feature fusion
