BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models (CVPR 2024)
Fengyuan Shi,
Jiaxi Gu,
Hang Xu,
Songcen Xu,
Wei Zhang,
Limin Wang
```bash
conda create -n bivdiff python=3.10.9
conda activate bivdiff
bash install.txt
```
Our framework currently supports two text-to-video diffusion models (VidRD and ZeroScope) and five downstream image diffusion models (ControlNet, T2I-Adapter, InstructPix2Pix, Prompt2Prompt, Stable Diffusion Inpainting). These models may also need Stable Diffusion v1.5 and Stable Diffusion v2.1 (put CLIP under the Stable Diffusion 2.1 directory). All pre-trained weights are downloaded to the checkpoints/ directory. The final file tree looks like this:
Note: remove the safetensors files.
```
checkpoints
├── stable-diffusion-v1-5
├── stable-diffusion-2-1-base
│   ├── clip
│   └── ...
├── ControlNet
│   ├── control_v11f1p_sd15_depth
│   ├── control_v11p_sd15_canny
│   └── control_v11p_sd15_openpose
├── InstructPix2Pix
│   └── instruct-pix2pix
├── StableDiffusionInpainting
│   └── stable-diffusion-inpainting
├── T2IAdapter
│   ├── t2iadapter_depth_sd15v2
│   └── t2iadapter_canny_sd15v2
├── VidRD
│   └── ModelT2V.pth
├── ZeroScope
│   └── zeroscope_v2_576w
└── ... (more image or video diffusion models)
```
You can add more image or video models according to your needs, following the instructions here.
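If you fetch weights from the Hugging Face Hub, snapshot_download can place them directly into the tree above. The sketch below is only a convenience: the Hub repo IDs are assumed to be the usual upstream sources and should be checked against the official download instructions.

```python
# Sketch: fetching a few of the weights above with huggingface_hub.
# The Hub repo IDs are assumptions (common upstream sources), not taken from this repo.
from huggingface_hub import snapshot_download

snapshot_download("runwayml/stable-diffusion-v1-5", local_dir="checkpoints/stable-diffusion-v1-5")
snapshot_download("stabilityai/stable-diffusion-2-1-base", local_dir="checkpoints/stable-diffusion-2-1-base")
snapshot_download("lllyasviel/control_v11f1p_sd15_depth", local_dir="checkpoints/ControlNet/control_v11f1p_sd15_depth")
snapshot_download("timbrooks/instruct-pix2pix", local_dir="checkpoints/InstructPix2Pix/instruct-pix2pix")
snapshot_download("cerspense/zeroscope_v2_576w", local_dir="checkpoints/ZeroScope/zeroscope_v2_576w")
```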
```bash
# 1. Controllable Video Generation
# ControlNet + VidRD
python inference.py --config-name="controllable_video_generation_with_controlnet_and_vidrd"
# ControlNet + ZeroScope
python inference.py --config-name="controllable_video_generation_with_controlnet_and_zeroscope"
# T2I-Adapter + VidRD
python inference.py --config-name="controllable_video_generation_with_t2iadapter_and_vidrd"

# 2. Video Editing
# InstructPix2Pix + VidRD
python inference.py --config-name="video_editing_with_instruct_pix2pix_and_vidrd"
# InstructPix2Pix + ZeroScope
python inference.py --config-name="video_editing_with_instruct_pix2pix_and_zeroscope"
# Prompt2Prompt + VidRD
python inference.py --config-name="video_editing_with_instruct_prompt2prompt_and_vidrd"

# 3. Video Inpainting
# StableDiffusionInpainting + VidRD
python inference.py --config-name="video_inpainting_with_stable_diffusion_inpainting_and_vidrd"

# 4. Video Outpainting
# StableDiffusionInpainting + VidRD
python inference.py --config-name="video_outpainting_with_stable_diffusion_inpainting_and_vidrd"
```
We decouple the implementations of the image and video diffusion models, so it is easy to add new image and video models with minor modifications. The inference pipeline in inference.py consists of the following steps (a sketch of the Read Video step is given after the list):
1. Load Models
2. Construct Pipelines
3. Read Video
4. Frame-wise Video Generation
5. Mixed Inversion
6. Video Temporal Smoothing
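As a concrete illustration of step 3, the sketch below shows one possible way to read and sample the input frames. It is not the repo's actual implementation; the function name and the imageio/PIL choice are assumptions, and the default arguments mirror the config fields shown later in this README.

```python
# Illustrative sketch of "3. Read Video" (not the code used in inference.py):
# sample `video_length` frames, one every `frame_rate` frames, and resize them.
import imageio.v3 as iio
from PIL import Image

def read_video(video_path, video_length=8, frame_rate=4, width=512, height=512):
    frames = list(iio.imiter(video_path, plugin="pyav"))   # list of (H, W, 3) uint8 frames
    sampled = frames[::frame_rate][:video_length]
    return [Image.fromarray(f).resize((width, height)) for f in sampled]
```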
To add a new image diffusion model, all you need to do is implement infer.py so that it returns the video generated by the image diffusion model. First, create a Your_IDM directory in the models folder and add infer.py under it. Then implement the infer_Your_IDM function in infer.py. For example:
```python
# BIVDiff/inference.py
from models.Your_IDM.infer import infer_Your_IDM

def infer(video, generator, config, latents=None):
    model_name = config.Model.idm
    prompt = config.Model.idm_prompt
    output_path = config.Model.output_path
    height = config.Model.height
    width = config.Model.width
    if model_name == "Your_IDM":
        # return the video generated by your IDM
        return infer_Your_IDM(...)
```

```python
# BIVDiff/models/Your_IDM/infer.py
def infer_Your_IDM(...):
    idm_model = initialized_func(model_path)
    # return an n-frame video as a list of frames
    frames = [idm_model(...) for i in range(video_length)]
    return frames
```
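As a concrete but purely illustrative instantiation, an InstructPix2Pix-style editor built on diffusers could be wrapped as below; the function name and arguments are placeholders, not the repo's actual InstructPix2Pix integration.

```python
# Illustrative only: a possible infer_Your_IDM for an InstructPix2Pix-style editor
# built on diffusers. Argument names are placeholders, not the repo's API.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

def infer_instruct_pix2pix(video_frames, prompt, model_path, generator,
                           num_inference_steps=50, guidance_scale=7.5):
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        model_path, torch_dtype=torch.float16).to("cuda")
    frames = []
    for frame in video_frames:  # each frame is a PIL.Image
        edited = pipe(prompt, image=frame,
                      num_inference_steps=num_inference_steps,
                      guidance_scale=guidance_scale,
                      generator=generator).images[0]
        frames.append(edited)
    return frames
```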
You can generate an n-frame video sequentially, as in the code above, which is easy to implement without modifying the image model code but time-consuming. Alternatively, you can inflate the image diffusion model to a video model and merge the temporal dimension into the batch dimension (please refer to Tune-A-Video and ControlVideo) to generate all n frames simultaneously and much faster.
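As a rough illustration of that batched variant (not the repo's code), the temporal dimension can be folded into the batch dimension before calling a diffusers-style image U-Net; all names below are placeholders.

```python
# Illustrative sketch: fold time into batch so an image U-Net denoises all frames at once.
# Assumes a diffusers-style UNet2DConditionModel; names are placeholders.
from einops import rearrange

def denoise_all_frames(idm_unet, latents, timestep, text_emb):
    # latents: (b, c, f, h, w) video latents; text_emb: (b, l, d) prompt embeddings
    b, c, f, h, w = latents.shape
    latents = rearrange(latents, "b c f h w -> (b f) c h w")
    text_emb = text_emb.repeat_interleave(f, dim=0)          # one prompt copy per frame
    noise_pred = idm_unet(latents, timestep, encoder_hidden_states=text_emb).sample
    return rearrange(noise_pred, "(b f) c h w -> b c f h w", f=f)
```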
If your image diffusion model is hard to integrate into our framework, you can instead load a GIF file generated by your IDM:
```python
# 4. frame-wise video generation
from PIL import Image, ImageSequence

frames_by_idm = []
gif = Image.open("./data/your_case.gif")
i = 0
for frame in ImageSequence.Iterator(gif):
    frame.save("frame%d.png" % i)
    frames_by_idm.append(Image.open("frame%d.png" % i).convert("RGB"))
    i += 1
```
Note: to keep it simple for users to add new image and video models, the open-source repo generates videos with image models sequentially. However, we modified ControlNet and InstructPix2Pix for parallel generation in the old repo. We have provided the videos generated in this parallel manner (in BIVDiff/data/results_parallel), and you can use the method above to reproduce the ControlNet and InstructPix2Pix results.
To add a new video diffusion model, all you need to do is provide the U-Net, text encoder, and text tokenizer of the video diffusion model in the 1. Load Models step of inference.py. For example:
```python
# BIVDiff/inference.py
video_unet = None
video_text_encoder = None
video_text_tokenizer = None

if config.Model.vdm == "Your_Model_Name":
    vdm_model = initialized_func(model_path)
    video_unet = vdm_model.unet
    video_text_encoder = vdm_model.text_encoder
    video_text_tokenizer = vdm_model.text_tokenizer
```
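For instance, if your video model ships as a diffusers pipeline (as ZeroScope does), the components can simply be taken from the loaded pipeline; the snippet below is a sketch that assumes the standard diffusers attribute names.

```python
# Sketch: pulling the required components out of a diffusers text-to-video pipeline
# (e.g. ZeroScope). Assumes the standard diffusers attribute names.
from diffusers import TextToVideoSDPipeline

vdm_pipe = TextToVideoSDPipeline.from_pretrained("./checkpoints/ZeroScope/zeroscope_v2_576w")
video_unet = vdm_pipe.unet
video_text_encoder = vdm_pipe.text_encoder
video_text_tokenizer = vdm_pipe.tokenizer
```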
```yaml
# add config.yaml in the configs folder
hydra:
  output_subdir: null
  run:
    dir: .

VDM:
  vdm_name:
    # other settings of vdm_name

IDM:
  idm_name:
    mixing_ratio:  # in [0, 1]
    # other settings of idm_name

MixedInversion:
  idm: "SD-v15"
  pretrained_model_path: "./checkpoints/stable-diffusion-v1-5"

Model:
  idm: idm_name
  vdm: vdm_name
  idm_prompt:
  vdm_prompt:
  video_path:
  output_path: "./outputs/"

  # compatible with IDM and VDM
  video_length: 8
  width: 512
  height: 512
  frame_rate: 4
  seed: 42
  guidance_scale: 7.5
  num_inference_steps: 50
```
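These YAML files are consumed through Hydra (hence the --config-name flag in the commands above). The snippet below is a minimal sketch of such an entry point, not the actual body of inference.py.

```python
# Minimal Hydra entry-point sketch (not the actual inference.py).
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="configs",
            config_name="controllable_video_generation_with_controlnet_and_vidrd")
def main(config: DictConfig) -> None:
    # Hydra resolves --config-name into this DictConfig
    print(config.Model.idm, config.Model.vdm, config.Model.video_length)

if __name__ == "__main__":
    main()
```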
Example prompts from the result gallery (the result videos themselves are omitted here):

- Controllable Video Generation (Depth / Canny / Pose): "A person on a motorcycle does a burnout on a frozen lake" (depth), "A silver jeep car is moving on the winding forest road" (canny), "An astronaut moonwalks on the moon" (pose); "A brown spotted cow is walking in heavy rain" (depth), "A bear walking through a snow mountain" (canny), "Iron Man moonwalks in the desert" (pose); "A red car moves in front of buildings", "Iron Man moonwalks on the beach"
- Video Editing, on the source video "A man is moonwalking": style transfer ("Make it Minecraft", "Make it minecraft style"), object replacement ("Replace the man with Spider Man", "Replace the man with a little boy"), background replacement ("Change the background to stadium")
- Video Inpainting: erase object (empty prompt "") and replace object ("A sports car is moving on the road"), each shown as source video, mask, and output video
- Video Outpainting: shown as source video, masked video, and output video
If you make use of our work, please cite our paper.
```bibtex
@InProceedings{Shi_2024_CVPR,
    author    = {Shi, Fengyuan and Gu, Jiaxi and Xu, Hang and Xu, Songcen and Zhang, Wei and Wang, Limin},
    title     = {BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {7393-7402}
}
```
This repository borrows heavily from Diffusers, Tune-A-Video, and ControlVideo. Thanks for their contributions!