Abstract
Multi-modal image fusion aims to generate a fused image that combines the advantages of source images from different modalities. The fused image can significantly facilitate high-level vision tasks, e.g., image segmentation and object detection. However, most existing fusion methods focus on preserving the structure and detailed representation of the fused images while failing to integrate the high-level semantic information of the source images. To address this problem, we propose a text-guided multi-modal image fusion framework, termed Cross-Modality Interaction (CMI)-Fusion. The proposed model leverages the robust capabilities of a large-scale foundation model, i.e., Contrastive Language–Image Pre-training (CLIP), to achieve efficient interaction between image details and text prompts. Specifically, a Dual Attention Feature Extraction (DAFE) module is designed to extract representative visual and semantic features. Moreover, a cross-modality Image-Text Interaction (ITI) module is introduced to achieve dynamic interaction between the image and corresponding text features. Extensive experiments on various multi-modal datasets demonstrate that the proposed CMI-Fusion better retains image structural details and semantic content than state-of-the-art methods. The code is available at https://github.com/songwenhao123/CMI-Fusion.
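As a rough illustration of how a text prompt could steer visual features in such a framework, the following is a minimal PyTorch sketch of a cross-modality interaction block in the spirit of the ITI module: image tokens attend to text embeddings (e.g., projected CLIP features) via cross-attention. The class name `ImageTextInteraction`, the feature dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ImageTextInteraction(nn.Module):
    """Hypothetical cross-modality block: image tokens query text embeddings,
    so the fused visual features are modulated by the text prompt."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens:  (B, N, C) flattened visual features from the source images
        # text_tokens: (B, T, C) text-prompt embeddings (e.g., projected CLIP features)
        attended, _ = self.cross_attn(img_tokens, text_tokens, text_tokens)
        x = self.norm1(img_tokens + attended)   # residual connection + norm
        return self.norm2(x + self.ffn(x))      # feed-forward refinement

# Toy usage: a 32x32 feature map flattened to 1024 tokens, a 77-token text prompt
iti = ImageTextInteraction(dim=256)
out = iti(torch.randn(2, 1024, 256), torch.randn(2, 77, 256))
print(out.shape)  # torch.Size([2, 1024, 256])
```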
| Original language | English |
|---|---|
| Article number | 110073 |
| Pages (from-to) | 1-18 |
| Number of pages | 18 |
| Journal | Signal Processing |
| Volume | 237 |
| Early online date | 6 May 2025 |
| DOIs | |
| Publication status | Published - 1 Dec 2025 |
Keywords
- Image fusion
- Infrared and visible image fusion
- Medical image fusion
- Text-guided model