This is the official implementation of the paper "TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion". Paper Link
"To generate appropriate fusion results for a specific scenario, existing methods cannot realize it or require expensive retraining. The same goal can be achieved by simply adjusting the focused objectives of textual description in our paradigm."
- For the first time, the text modality is introduced to the image fusion field.
- A benchmark dataset.
- A textual attention assessment.
Train Set [Images&Text]: Google Drive
Train Set [Pre-gen Association Maps]: Google Drive
Test Set: Google Drive
Folder structure:
/dataset
--/IVT_train
----/ir
------/1.png
----/vis
------/1.png
----/text
------/1_1.txt
----/association
------/IVT_LLVIP_2000_imageIndex_1_textIndex_1
--------/Final_Finetuned_BinaryInterestedMap.png
/TextFusion
--/main_trainTextFusion.py
--/net.py
--/main_test_rgb_ir.py
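Before training, it can help to confirm that the downloaded files match the layout above. The following is a minimal sketch, not part of the official code: the helper name `collect_samples` is hypothetical, and it assumes that every text file is named `<imageIndex>_<textIndex>.txt` and that every association folder uses the `IVT_LLVIP_2000_imageIndex_<i>_textIndex_<j>` pattern, as inferred from the single example shown in the tree.

```python
from pathlib import Path
import re

# Assumed naming pattern, inferred from the example "1_1.txt" in the folder tree.
TEXT_NAME = re.compile(r"^(\d+)_(\d+)\.txt$")
MAP_NAME = "Final_Finetuned_BinaryInterestedMap.png"


def collect_samples(train_root):
    """Pair each text file with its ir/vis images and association map.

    Returns (samples, problems): samples is a list of
    (ir, vis, text, association_map) paths, problems is a list of messages
    describing text files whose companions could not be found.
    """
    root = Path(train_root)
    samples, problems = [], []
    for text_path in sorted((root / "text").glob("*.txt")):
        match = TEXT_NAME.match(text_path.name)
        if match is None:
            problems.append(f"{text_path.name}: unexpected file name")
            continue
        image_idx, text_idx = match.groups()
        ir = root / "ir" / f"{image_idx}.png"
        vis = root / "vis" / f"{image_idx}.png"
        # Assumption: the "IVT_LLVIP_2000" prefix is the same for every folder.
        assoc = (
            root
            / "association"
            / f"IVT_LLVIP_2000_imageIndex_{image_idx}_textIndex_{text_idx}"
            / MAP_NAME
        )
        missing = [
            p.relative_to(root).as_posix() for p in (ir, vis, assoc) if not p.is_file()
        ]
        if missing:
            problems.append(f"{text_path.name}: missing {', '.join(missing)}")
        else:
            samples.append((ir, vis, text_path, assoc))
    return samples, problems


if __name__ == "__main__":
    found, issues = collect_samples("dataset/IVT_train")
    print(f"{len(found)} complete samples, {len(issues)} problems")
    for issue in issues[:10]:
        print("  ", issue)
```

Run it from the directory that contains `dataset`; any reported problem points to a text file whose image pair or association map is absent.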
This step assumes you have already downloaded the pre-generated association maps, images, and corresponding textual descriptions from the links above and placed them in the "IVT_train" folder.
Run the following command to start training:
python main_trainTextFusion.py
The trained models and corresponding loss values will be saved in the "models" folder.
(The code to generate the association map on your own is available in this repository)
For the RGB and infrared image fusion (e.g., LLVIP):
python main_test_rgb_ir.py
Tip: If you are comparing TextFusion with a purely appearance-based method, you can leave the "description" empty for a relatively fair comparison.
For the grayscale and infrared image fusion (e.g., TNO):
python main_test_gray_ir.py
Please refer to this folder (textual_attention_metrics).
- Python 3.8.3
- Torch 2.1.2
- torchvision 0.16.2
- opencv-python 4.8.1.78
- tqdm 4.66.5
- ftfy 6.2.3
- regex
- matplotlib
- timm
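To compare an existing environment against the versions listed above, a small check along these lines may be useful. This is an illustrative sketch rather than a script shipped with the repository: the function name `check_versions` and the `PINS` table are hypothetical, and it assumes the pip distribution names are `torch`, `torchvision`, `opencv-python`, `tqdm`, `ftfy`, `regex`, `matplotlib`, and `timm`. The Python version itself is not checked.

```python
from importlib import metadata

# Versions copied from the list above; None means no version is specified there.
PINS = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "opencv-python": "4.8.1.78",
    "tqdm": "4.66.5",
    "ftfy": "6.2.3",
    "regex": None,
    "matplotlib": None,
    "timm": None,
}


def check_versions(pins, lookup=metadata.version):
    """Return {package: status}, where status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Drop local build tags such as "+cu121" before comparing.
        base_version = installed.split("+")[0]
        if expected is None or base_version == expected:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (installed {installed}, expected {expected})"
    return report


if __name__ == "__main__":
    for package, status in check_versions(PINS).items():
        print(f"{package}: {status}")
```

A mismatch does not necessarily mean the code will fail; the listed versions are simply the ones this README reports.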
- 2025-1-4: The implementation of the textual attention metric is now available.
- 2024-11-10: This work has been accepted by Information Fusion.
- 2024-4-30: The code for generating the association maps is now available!
- 2024-3-14: The training code is available, and the corresponding pre-gen association maps have been uploaded to Google Drive.
- 2024-3-5: The testing set of our IVT dataset is available now.
- 2024-2-12: The pre-trained model and test files are available now!
- 2024-2-8: The training set of our IVT dataset is available now.
If you have any questions, please contact me at chunyang_cheng@163.com.
If this work is helpful to you, please cite it as:
@article{CHENG2025102790,
title = {TextFusion: Unveiling the power of textual semantics for controllable image fusion},
journal = {Information Fusion},
volume = {117},
pages = {102790},
year = {2025},
issn = {1566-2535},
doi = {https://doi.org/10.1016/j.inffus.2024.102790},
url = {https://www.sciencedirect.com/science/article/pii/S1566253524005682},
author = {Chunyang Cheng and Tianyang Xu and Xiao-Jun Wu and Hui Li and Xi Li and Zhangyong Tang and Josef Kittler},
}
or the original arXiv version:
@article{cheng2023textfusion,
title={TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion},
author={Cheng, Chunyang and Xu, Tianyang and Wu, Xiao-Jun and Li, Hui and Li, Xi and Tang, Zhangyong and Kittler, Josef},
journal={arXiv preprint arXiv:2312.14209},
year={2023}
}
Our dataset is annotated based on the LLVIP dataset:
@inproceedings{jia2021llvip,
title={LLVIP: A visible-infrared paired dataset for low-light vision},
author={Jia, Xinyu and Zhu, Chuang and Li, Minzhen and Tang, Wenqi and Zhou, Wenli},
booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
pages={3496--3504},
year={2021}
}