Towards text-refereed multi-modal image fusion by cross-modality interaction

Qilei Li, Wenhao Song, Mingliang Gao*, Wenzhe Zhai, Qiang Zhou*, Zhao Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Multi-modal image fusion aims to generate a fused image that combines the advantages of source images from different modalities. The fused image can significantly facilitate high-level vision tasks, e.g., image segmentation and object detection. However, most existing fusion methods focus on preserving the structure and detailed representation of the fused image but fail to integrate the high-level semantic information in the source images. To address this problem, we propose a text-guided multi-modal image fusion framework, termed Cross-Modality Interaction (CMI)-Fusion. The proposed model leverages the robust capabilities of a large-scale foundation model, i.e., Contrastive Language–Image Pre-training (CLIP), to achieve efficient interaction between image details and text prompts. Specifically, a Dual Attention Feature Extraction (DAFE) module is designed to extract representative visual and semantic features. Moreover, a cross-modality Image-Text Interaction (ITI) module is introduced to enable dynamic interaction between the image features and the corresponding text features. Extensive experiments on various multi-modal datasets demonstrate that the proposed CMI-Fusion better preserves structural details and semantic content than state-of-the-art methods. The code is available at https://github.com/songwenhao123/CMI-Fusion.
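The abstract describes the ITI module as a cross-modality interaction between image features and CLIP text features. The following is a minimal sketch of that general pattern, not the authors' implementation: image tokens attend to text-token embeddings via cross-attention, with a residual update. The module name, dimensions, and the assumption of a frozen CLIP text encoder supplying the text tokens are all illustrative; consult the linked repository for the actual architecture.

```python
# Illustrative sketch (NOT the published CMI-Fusion code): cross-attention in
# which image features (queries) attend to text embeddings (keys/values),
# e.g. the token outputs of a frozen CLIP text encoder.
import torch
import torch.nn as nn

class ImageTextInteraction(nn.Module):
    """Toy cross-modality image-text interaction block."""
    def __init__(self, img_dim: int = 256, txt_dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, img_dim)  # align text width to image width
        self.attn = nn.MultiheadAttention(img_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N_img, img_dim) flattened spatial features of the image
        # txt_tokens: (B, N_txt, txt_dim) text-token embeddings (assumed CLIP outputs)
        txt = self.txt_proj(txt_tokens)
        attended, _ = self.attn(query=img_tokens, key=txt, value=txt)
        return self.norm(img_tokens + attended)  # residual update of image features

if __name__ == "__main__":
    iti = ImageTextInteraction()
    img = torch.randn(2, 64 * 64, 256)  # stand-in for extracted visual features
    txt = torch.randn(2, 77, 512)       # stand-in for CLIP text tokens (77 = context length)
    print(iti(img, txt).shape)          # torch.Size([2, 4096, 256])
```

The residual form lets the text prompt modulate, rather than replace, the visual features, which is one common way such text-conditioning blocks are built.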

Original language: English
Article number: 110073
Pages (from-to): 1-18
Number of pages: 18
Journal: Signal Processing
Volume: 237
Early online date: 6 May 2025
DOIs
Publication status: Published - 1 Dec 2025

Keywords

  • Image fusion
  • Infrared and visible image fusion
  • Medical image fusion
  • Text-guided model
