Multimodal Deep Learning for Stage Classification of Head and Neck Cancer Using Masked Autoencoders and Vision Transformers with Attention-Based Fusion

Anas Turki*, Ossama Alshabrawy, Wai Lok Woo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Head and neck squamous cell carcinoma (HNSCC) is a prevalent and aggressive cancer, and accurate staging under the AJCC system is essential for treatment planning. This study aims to enhance AJCC staging by integrating clinical and imaging data in a multimodal deep learning pipeline. We propose a framework that employs a VGG16-based masked autoencoder (MAE) for self-supervised visual feature learning, enhanced by attention mechanisms (CBAM and BAM), and fuses image and clinical features through an attention-weighted fusion network. Benchmarked on the HNSCC and HN1 datasets, the models achieved approximately 80% accuracy in the four-class setting and approximately 66% in the five-class setting, with notable AUC improvements, especially under BAM. Integrating clinical features significantly enhances stage-classification performance, setting a precedent for robust multimodal pipelines in radiomics-based oncology applications.
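As a rough illustration of the attention-weighted fusion described in the abstract, the minimal PyTorch sketch below combines a pooled image embedding (e.g., from an MAE encoder) with a clinical-feature vector via learned per-modality attention weights. All names, dimensions, and layer choices here (AttentionWeightedFusion, img_dim=512, clin_dim=16, a single scalar score per modality) are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class AttentionWeightedFusion(nn.Module):
    """Fuse an image embedding and a clinical-feature embedding by
    learning softmax attention weights over the two modalities.
    Dimensions and layers are assumptions for illustration only."""

    def __init__(self, img_dim=512, clin_dim=16, hidden=128, n_classes=4):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.clin_proj = nn.Linear(clin_dim, hidden)
        # One scalar attention score per modality, from its projection.
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, img_feat, clin_feat):
        # Stack the projected modalities: (batch, 2, hidden).
        h = torch.stack(
            [torch.relu(self.img_proj(img_feat)),
             torch.relu(self.clin_proj(clin_feat))], dim=1)
        # Softmax over the modality axis yields the fusion weights.
        w = torch.softmax(self.score(h), dim=1)   # (batch, 2, 1)
        fused = (w * h).sum(dim=1)                # attention-weighted sum
        return self.classifier(fused)             # AJCC stage logits

# Example usage with random stand-ins for the real features:
model = AttentionWeightedFusion()
img_feat = torch.randn(8, 512)   # e.g. pooled MAE/ViT image embeddings
clin_feat = torch.randn(8, 16)   # e.g. encoded clinical covariates
logits = model(img_feat, clin_feat)   # shape (8, 4), one logit per stage

The softmax over the modality axis lets the network shift weight between imaging and clinical evidence per patient, which is one common way to realize the clinical-feature gains the abstract reports.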
Original language: English
Article number: 2115
Number of pages: 14
Journal: Cancers
Volume: 17
Issue number: 13
DOIs
Publication status: Published - 24 Jun 2025

Keywords

  • head and neck cancer
  • AJCC staging
  • vision transformer
  • masked autoencoder
  • multimodal fusion
  • radiomics
