OTM-HC: Enhanced Skeleton-Based Action Representation via One-to-Many Hierarchical Contrastive Learning

Muhammad Usman, Wenming Cao, Zhao Huang, Jianqi Zhong, Ruiya Ji*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Human action recognition has become crucial in computer vision, with growing applications in surveillance, human–computer interaction, and healthcare. Traditional approaches often use broad feature representations, which may miss subtle variations in timing and movement within action sequences. Our proposed One-to-Many Hierarchical Contrastive Learning (OTM-HC) framework maps the input into multi-layered feature vectors, creating a hierarchical contrastive representation that captures various granularities within the temporal and spatial domains of a human skeleton sequence. Using sequence-to-sequence (Seq2Seq) transformer encoders and downsampling modules, OTM-HC distinguishes between multiple levels of action representation, such as the instance, domain, clip, and part levels. Each level contributes significantly to a comprehensive understanding of action representations. The OTM-HC model design is adaptable, ensuring smooth integration with advanced Seq2Seq encoders. We tested the OTM-HC framework across four datasets, demonstrating improved performance over state-of-the-art models. Specifically, OTM-HC achieved improvements of 0.9% and 0.6% on NTU60, 0.4% and 0.7% on NTU120, and 0.7% and 0.3% on PKU-MMD I and II, respectively, surpassing previous leading approaches across these datasets. These results showcase the robustness and adaptability of our model for various skeleton-based action recognition tasks.
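As a rough illustration of the idea described above (not the authors' implementation), the sketch below shows how a Seq2Seq transformer encoder could yield skeleton-sequence embeddings at several temporal granularities via simple downsampling, with an InfoNCE-style contrastive loss applied at every level so that two augmented views of the same sequence agree across all granularities. All module names, dimensions, pooling choices, and the augmentation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalEncoder(nn.Module):
    """Encode a skeleton sequence, then pool it at progressively coarser temporal scales."""

    def __init__(self, in_dim=75, d_model=128, levels=3):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.levels = levels

    def forward(self, x):                       # x: (B, T, in_dim)
        h = self.encoder(self.proj(x))          # (B, T, d_model)
        reps = []
        for _ in range(self.levels):
            reps.append(F.normalize(h.mean(dim=1), dim=-1))   # pooled embedding for this level
            # halve the temporal resolution to form the next (coarser) level
            h = F.avg_pool1d(h.transpose(1, 2), kernel_size=2,
                             ceil_mode=True).transpose(1, 2)
        return reps                              # list of (B, d_model) tensors, one per level


def info_nce(z1, z2, tau=0.1):
    """Standard InfoNCE / NT-Xent loss between two views of a batch."""
    logits = z1 @ z2.t() / tau                   # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)


def one_to_many_loss(encoder, view1, view2):
    # "one-to-many": a single input pair contributes a contrastive term at every level
    reps1, reps2 = encoder(view1), encoder(view2)
    return sum(info_nce(a, b) for a, b in zip(reps1, reps2)) / len(reps1)


if __name__ == "__main__":
    enc = HierarchicalEncoder()
    x = torch.randn(8, 64, 75)                   # 8 sequences, 64 frames, 25 joints x 3 coords
    v1 = x + 0.01 * torch.randn_like(x)          # two lightly perturbed views
    v2 = x + 0.01 * torch.randn_like(x)
    print(one_to_many_loss(enc, v1, v2).item())
```

In this sketch, the per-level embeddings stand in for the instance-, domain-, clip-, and part-level representations described in the abstract; the actual OTM-HC levels are defined by the paper's own downsampling modules rather than plain average pooling.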
Original language: English
Pages (from-to): 2170-2186
Number of pages: 17
Journal: AI (Switzerland)
Volume: 5
Issue number: 4
DOIs
Publication status: Published - 1 Nov 2024

Keywords

  • skeleton-based action representation learning
  • unsupervised learning
  • hierarchical contrastive learning
  • one-to-many
