Guiding Instruction-based Image Editing via Multimodal Large Language Models

This repo contains the code for Guiding Instruction-based Image Editing via Multimodal Large Language Models (ICLR'24 Spotlight)

Overview

MGIE is an implementation of
"Guiding Instruction-based Image Editing via Multimodal Large Language Models"
Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan
in International Conference on Learning Representations (ICLR) 2024

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training.

Requirements

conda create -n mgie python=3.10 -y
conda activate mgie
conda update -n base -c defaults conda setuptools -y
conda install -c conda-forge git git-lfs ffmpeg vim htop ninja gpustat -y
conda clean -a -y

pip install -U pip cmake cython==0.29.36 pydantic==1.10 numpy
pip install -U gdown pydrive2 wget jupyter jupyterlab jupyterthemes ipython
pip install -U sentencepiece transformers diffusers tokenizers datasets gradio==3.37 accelerate evaluate git+https://github.com/openai/CLIP.git
pip install -U https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchvision-0.13.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchaudio-0.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl
pip install -U deepspeed

# git clone this repo
cd ml_mgie
git submodule update --init --recursive
cd LLaVA
pip install -e .
pip install -U https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchvision-0.13.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchaudio-0.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl
pip install -U ninja flash-attn==1.0.2
pip install -U pydrive2 gdown wget

cd ..
cp mgie_llava.py LLaVA/llava/model/llava.py
cp mgie_train.py LLaVA/llava/train/train.py
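
After installation, it can help to confirm that the pinned packages resolved to the versions listed above. The following is a minimal sketch and not part of this repo: the names `PINS`, `installed_version`, and `check_pins` are hypothetical, and the pins are copied from the commands above. The comparison is a loose prefix match, since pip resolves `==1.10` to a release such as `1.10.0` and the torch wheels report versions like `1.12.0+cu113`.

```python
from importlib import metadata

# Pins copied from the install commands above (hypothetical helper table).
# Prefix matching is deliberate and loose: "1.12.0" accepts "1.12.0+cu113".
PINS = {
    "cython": "0.29.36",
    "pydantic": "1.10",
    "gradio": "3.37",
    "torch": "1.12.0",
    "torchvision": "0.13.0",
    "torchaudio": "0.12.0",
    "flash-attn": "1.0.2",
}


def installed_version(name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Map each package whose version is missing or mismatched to what was found."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found is None or not found.startswith(wanted):
            problems[name] = found
    return problems


if __name__ == "__main__":
    for name, found in check_pins(PINS).items():
        print(f"{name}: expected {PINS[name]}*, found {found}")
```

Run inside the activated mgie environment, the script prints one line per package that is missing or at an unexpected version, and prints nothing when every pin matches.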

Quick Start

Put the official LLaVA-7B weights in _ckpt/LLaVA-7B-v1 and download the pre-trained checkpoint (trained on IPr2Pr + MagicBrush) to _ckpt/mgie_7b

demo.ipynb
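
Before opening the notebook, a quick check that both checkpoint directories are in place can save a failed run. This is a minimal sketch under the assumption that the directories sit relative to the repo root as written above; `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the script checks only that each directory exists and is non-empty, not that the files inside are valid.

```python
from pathlib import Path

# Directories named in the Quick Start text above.
EXPECTED_DIRS = ["_ckpt/LLaVA-7B-v1", "_ckpt/mgie_7b"]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected directories that are absent or empty under root."""
    root = Path(root)
    missing = []
    for rel in expected:
        path = root / rel
        # An empty directory is treated as missing: nothing was downloaded yet.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_dirs(".")
    if gaps:
        print(f"missing or empty: {gaps}")
    else:
        print("all checkpoint directories present")
```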

Notices: Apple's rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. Apple makes no representations with regards to LLaMa or any other third party software, which are subject to their own terms.

Usage

Data

Download the CLIP-filtered IPr2Pr dataset and process it (including the summarized expressive instructions) into _data

process_data.ipynb

There are examples to help prepare the data

Train

Put Vicuna-7B in _ckpt/vicuna-7b-v1.1 and LLaVA-7B in _ckpt/LLaVA-7B-v1

WANDB_DISABLED='true' torchrun --nnodes=1 --nproc_per_node=8 --master_port=7122 LLaVA/llava/train/train_mem.py \
    --model_name_or_path ./_ckpt/vicuna-7b-v1.1 --version v1 \
    --vision_tower openai/clip-vit-large-patch14 --mm_vision_select_layer -2 --mm_use_im_start_end True \
    --bf16 True --tf32 True \
    --output_dir _snapshot/mgie \
    --num_train_epochs 40 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 \
    --dataloader_num_workers 2 --gradient_accumulation_steps 1 \
    --evaluation_strategy 'no' --save_strategy 'steps' --save_steps 2000 --save_total_limit 10 \
    --learning_rate 5e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type 'cosine' \
    --logging_steps 1 --model_max_length 512 \
    --gradient_checkpointing True --lazy_preprocess True
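
The command above launches 8 processes on one node, each with a per-device batch of 4 and no gradient accumulation. Assuming plain data-parallel training (each process sees a different slice of every batch), the effective batch size per optimizer step can be worked out as in this small sketch. The function name `effective_batch_size` is hypothetical; the constants are copied from the command.

```python
# Values copied from the torchrun command above.
NNODES = 1
NPROC_PER_NODE = 8
PER_DEVICE_TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 1
SAVE_STEPS = 2000


def effective_batch_size(nnodes, nproc_per_node, per_device, grad_accum):
    """Samples consumed per optimizer step under data-parallel training."""
    return nnodes * nproc_per_node * per_device * grad_accum


batch = effective_batch_size(
    NNODES, NPROC_PER_NODE, PER_DEVICE_TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS
)
print(batch)               # 32
print(batch * SAVE_STEPS)  # 64000 samples between saved checkpoints
```

When training on fewer GPUs, one way to keep the same effective batch size is to raise --gradient_accumulation_steps accordingly, for example 2 processes with a per-device batch of 4 and 4 accumulation steps also gives 32.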

Inference

Extract the trained checkpoint into _ckpt/mgie_7b

extract_ckpt.ipynb

Run our demo

demo.ipynb

Citation

@inproceedings{fu2024mgie,
  author = {Tsu-Jui Fu and Wenze Hu and Xianzhi Du and William Yang Wang and Yinfei Yang and Zhe Gan},
  title = {{Guiding Instruction-based Image Editing via Multimodal Large Language Models}}, 
  booktitle = {International Conference on Learning Representations (ICLR)}, 
  year = {2024} 
}

Acknowledgement

LLaVA: the codebase we built upon

