Parameter Efficient Fine-Tuning for Multi-modal Generative Vision Models with Möbius-Inspired Transformation

Haoran Duan, Shuai Shao, Bing Zhai*, Tejal Shah, Jungong Han, Rajiv Ranjan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

The rapid development of multimodal generative vision models has attracted considerable scientific interest. Notable advancements, such as OpenAI’s ChatGPT and Stable Diffusion, demonstrate the potential of combining multimodal data for content generation. Nonetheless, customising these models to specific domains or tasks is challenging due to computational costs and data requirements. Conventional fine-tuning methods consume substantial processing resources, motivating the development of parameter-efficient fine-tuning techniques such as adapter modules, low-rank factorization, and orthogonal fine-tuning. These approaches selectively update a subset of model parameters, reducing training requirements while maintaining high-quality results. Orthogonal fine-tuning, regarded as a reliable technique, preserves semantic relationships in weight space but is limited in expressive power. To overcome these constraints, we propose a simple yet effective transformation method inspired by Möbius geometry, which replaces conventional orthogonal transformations in parameter-efficient fine-tuning. This strategy improves the adaptability and expressiveness of fine-tuning, allowing it to capture more data patterns. Our approach, supported by theoretical analysis and empirical validation, outperforms existing methods, demonstrating competitive improvements in generation quality for key generative tasks.
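The abstract does not give the concrete form of the Möbius-inspired transformation, so the following is only a minimal PyTorch sketch of the general idea in the spirit of orthogonal fine-tuning: a frozen pretrained weight matrix is re-parameterised through a small trainable Möbius-style map f(w) = (Aw + b) / (cᵀw + d) that reduces to the identity at initialisation. The module names (MobiusAdapter, MobiusLinear) and the specific parameterisation are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a Mobius-style re-parameterisation for
# parameter-efficient fine-tuning; NOT the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MobiusAdapter(nn.Module):
    """Applies f(w) = (A w + b) / (c^T w + d) to each column of a
    frozen weight matrix; initialised so that f is the identity."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.A = nn.Parameter(torch.eye(dim))      # starts as identity
        self.b = nn.Parameter(torch.zeros(dim, 1))
        self.c = nn.Parameter(torch.zeros(dim, 1))
        self.d = nn.Parameter(torch.ones(1))
        self.eps = eps

    def forward(self, weight: torch.Tensor) -> torch.Tensor:
        # weight: (out_features, in_features); each column is transformed.
        num = self.A @ weight + self.b              # (out, in)
        den = self.c.T @ weight + self.d            # (1, in), broadcast over rows
        return num / (den + self.eps)


class MobiusLinear(nn.Module):
    """Frozen pretrained linear layer whose weight is passed through the
    trainable Mobius adapter on every forward pass."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # keep pretrained weights frozen
        self.adapter = MobiusAdapter(base.out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.adapter(self.base.weight)          # adapted weight matrix
        return F.linear(x, w, self.base.bias)
```

In such a setup, only the adapter parameters (A, b, c, d per wrapped layer) would be trained, keeping the trainable parameter count small relative to full fine-tuning, while the identity initialisation ensures the adapted model starts from the pretrained behaviour.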
Original language: English
Pages (from-to): 1-14
Number of pages: 14
Journal: International Journal of Computer Vision
Early online date: 13 Mar 2025
DOIs
Publication status: E-pub ahead of print - 13 Mar 2025

Keywords

  • Generative vision model
  • Multi-modal learning
  • Parameter efficient fine-tuning
