BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models (CVPR 2024)
Fengyuan Shi,
Jiaxi Gu,
Hang Xu,
Songcen Xu,
Wei Zhang,
Limin Wang
```bash
conda create -n bivdiff python=3.10.9
conda activate bivdiff
bash install.txt
```
Our framework currently supports two text-to-video diffusion models (VidRD and ZeroScope) and five downstream image diffusion models (ControlNet, T2I-Adapter, InstructPix2Pix, Prompt2Prompt, Stable Diffusion Inpainting). These models may also need Stable Diffusion v1.5 and Stable Diffusion v2.1 (put CLIP under the Stable Diffusion 2.1 directory). All pre-trained weights are downloaded to the checkpoints/ directory. The final file tree looks like this:
Note: remove the safetensors files.
```
checkpoints
├── stable-diffusion-v1-5
├── stable-diffusion-2-1-base
│   ├── clip
│   └── ...
├── ControlNet
│   ├── control_v11f1p_sd15_depth
│   ├── control_v11p_sd15_canny
│   └── control_v11p_sd15_openpose
├── InstructPix2Pix
│   └── instruct-pix2pix
├── StableDiffusionInpainting
│   └── stable-diffusion-inpainting
├── T2IAdapter
│   ├── t2iadapter_depth_sd15v2
│   └── t2iadapter_canny_sd15v2
├── VidRD
│   └── ModelT2V.pth
├── ZeroScope
│   └── zeroscope_v2_576w
└── ... (more image or video diffusion models)
```
You can add more image or video models according to your needs, following the instructions here.
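If you fetch weights from the Hugging Face Hub, snapshot_download can place them directly into the tree above. The sketch below is only a convenience: the Hub repo IDs are assumed to be the usual upstream sources and should be checked against the official download instructions.

```python
# Sketch: fetching a few of the weights above with huggingface_hub.
# The Hub repo IDs are assumptions (common upstream sources), not taken from this repo.
from huggingface_hub import snapshot_download

snapshot_download("runwayml/stable-diffusion-v1-5", local_dir="checkpoints/stable-diffusion-v1-5")
snapshot_download("stabilityai/stable-diffusion-2-1-base", local_dir="checkpoints/stable-diffusion-2-1-base")
snapshot_download("lllyasviel/control_v11f1p_sd15_depth", local_dir="checkpoints/ControlNet/control_v11f1p_sd15_depth")
snapshot_download("timbrooks/instruct-pix2pix", local_dir="checkpoints/InstructPix2Pix/instruct-pix2pix")
snapshot_download("cerspense/zeroscope_v2_576w", local_dir="checkpoints/ZeroScope/zeroscope_v2_576w")
```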
```bash
# 1. Controllable Video Generation
# ControlNet + VidRD
python inference.py --config-name="controllable_video_generation_with_controlnet_and_vidrd"
# ControlNet + ZeroScope
python inference.py --config-name="controllable_video_generation_with_controlnet_and_zeroscope"
# T2I-Adapter + VidRD
python inference.py --config-name="controllable_video_generation_with_t2iadapter_and_vidrd"

# 2. Video Editing
# InstructPix2Pix + VidRD
python inference.py --config-name="video_editing_with_instruct_pix2pix_and_vidrd"
# InstructPix2Pix + ZeroScope
python inference.py --config-name="video_editing_with_instruct_pix2pix_and_zeroscope"
# Prompt2Prompt + VidRD
python inference.py --config-name="video_editing_with_instruct_prompt2prompt_and_vidrd"

# 3. Video Inpainting
# StableDiffusionInpainting + VidRD
python inference.py --config-name="video_inpainting_with_stable_diffusion_inpainting_and_vidrd"

# 4. Video Outpainting
# StableDiffusionInpainting + VidRD
python inference.py --config-name="video_outpainting_with_stable_diffusion_inpainting_and_vidrd"
```
We decouple the implementations of the image and video diffusion models, so it is easy to add new image and video models with minor modifications. The inference pipeline in inference.py consists of the following steps (a sketch of the Read Video step is given after the list):
1. Load Models
2. Construct Pipelines
3. Read Video
4. Frame-wise Video Generation
5. Mixed Inversion
6. Video Temporal Smoothing
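As a concrete illustration of step 3, the sketch below shows one possible way to read and sample the input frames. It is not the repo's actual implementation; the function name and the imageio/PIL choice are assumptions, and the default arguments mirror the config fields shown later in this README.

```python
# Illustrative sketch of "3. Read Video" (not the code used in inference.py):
# sample `video_length` frames, one every `frame_rate` frames, and resize them.
import imageio.v3 as iio
from PIL import Image

def read_video(video_path, video_length=8, frame_rate=4, width=512, height=512):
    frames = list(iio.imiter(video_path, plugin="pyav"))   # list of (H, W, 3) uint8 frames
    sampled = frames[::frame_rate][:video_length]
    return [Image.fromarray(f).resize((width, height)) for f in sampled]
```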
To add a new image diffusion model, all you need to do is implement infer.py so that it returns the video generated by the image diffusion model. First, create a Your_IDM directory in the models folder and add infer.py under it. Then implement the infer_Your_IDM function in infer.py. For example:
```python
# BIVDiff/inference.py
from models.Your_IDM.infer import infer_Your_IDM

def infer(video, generator, config, latents=None):
    model_name = config.Model.idm
    prompt = config.Model.idm_prompt
    output_path = config.Model.output_path
    height = config.Model.height
    width = config.Model.width
    if model_name == "Your_IDM":
        # return the video generated by your IDM
        return infer_Your_IDM(...)
```

```python
# BIVDiff/models/Your_IDM/infer.py
def infer_Your_IDM(...):
    idm_model = initialized_func(model_path)
    # return an n-frame video as a list of frames
    frames = [idm_model(...) for i in range(video_length)]
    return frames
```
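As a concrete but purely illustrative instantiation, an InstructPix2Pix-style editor built on diffusers could be wrapped as below; the function name and arguments are placeholders, not the repo's actual InstructPix2Pix integration.

```python
# Illustrative only: a possible infer_Your_IDM for an InstructPix2Pix-style editor
# built on diffusers. Argument names are placeholders, not the repo's API.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

def infer_instruct_pix2pix(video_frames, prompt, model_path, generator,
                           num_inference_steps=50, guidance_scale=7.5):
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        model_path, torch_dtype=torch.float16).to("cuda")
    frames = []
    for frame in video_frames:  # each frame is a PIL.Image
        edited = pipe(prompt, image=frame,
                      num_inference_steps=num_inference_steps,
                      guidance_scale=guidance_scale,
                      generator=generator).images[0]
        frames.append(edited)
    return frames
```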
You can generate an n-frame video sequentially, as in the code above, which is easy to implement without modifying the image model code but time-consuming. Alternatively, you can inflate the image diffusion model to a video model and merge the temporal dimension into the batch dimension (please refer to Tune-A-Video and ControlVideo) to generate all n frames simultaneously and much faster.
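As a rough illustration of that batched variant (not the repo's code), the temporal dimension can be folded into the batch dimension before calling a diffusers-style image U-Net; all names below are placeholders.

```python
# Illustrative sketch: fold time into batch so an image U-Net denoises all frames at once.
# Assumes a diffusers-style UNet2DConditionModel; names are placeholders.
from einops import rearrange

def denoise_all_frames(idm_unet, latents, timestep, text_emb):
    # latents: (b, c, f, h, w) video latents; text_emb: (b, l, d) prompt embeddings
    b, c, f, h, w = latents.shape
    latents = rearrange(latents, "b c f h w -> (b f) c h w")
    text_emb = text_emb.repeat_interleave(f, dim=0)          # one prompt copy per frame
    noise_pred = idm_unet(latents, timestep, encoder_hidden_states=text_emb).sample
    return rearrange(noise_pred, "(b f) c h w -> b c f h w", f=f)
```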
If your image diffusion model is hard to integrate into our framework, you can instead load a GIF file generated by your IDM:
```python
# 4. frame-wise video generation
from PIL import Image, ImageSequence

frames_by_idm = []
gif = Image.open("./data/your_case.gif")
i = 0
for frame in ImageSequence.Iterator(gif):
    frame.save("frame%d.png" % i)
    frames_by_idm.append(Image.open("frame%d.png" % i).convert("RGB"))
    i += 1
```
Note: to keep it simple for users to add new image and video models, the open-source repo generates videos with image models sequentially. However, we modified ControlNet and InstructPix2Pix for parallel generation in the old repo. We have provided the videos generated in this parallel manner (in BIVDiff/data/results_parallel), and you can use the method above to reproduce the ControlNet and InstructPix2Pix results.
To add a new video diffusion model, all you need to do is provide the U-Net, text encoder, and text tokenizer of the video diffusion model in the 1. Load Models step of inference.py. For example:
```python
# BIVDiff/inference.py
video_unet = None
video_text_encoder = None
video_text_tokenizer = None

if config.Model.vdm == "Your_Model_Name":
    vdm_model = initialized_func(model_path)
    video_unet = vdm_model.unet
    video_text_encoder = vdm_model.text_encoder
    video_text_tokenizer = vdm_model.text_tokenizer
```
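For instance, if your video model ships as a diffusers pipeline (as ZeroScope does), the components can simply be taken from the loaded pipeline; the snippet below is a sketch that assumes the standard diffusers attribute names.

```python
# Sketch: pulling the required components out of a diffusers text-to-video pipeline
# (e.g. ZeroScope). Assumes the standard diffusers attribute names.
from diffusers import TextToVideoSDPipeline

vdm_pipe = TextToVideoSDPipeline.from_pretrained("./checkpoints/ZeroScope/zeroscope_v2_576w")
video_unet = vdm_pipe.unet
video_text_encoder = vdm_pipe.text_encoder
video_text_tokenizer = vdm_pipe.tokenizer
```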
```yaml
# add config.yaml in the configs folder
hydra:
  output_subdir: null
  run:
    dir: .

VDM:
  vdm_name:
    # other settings of vdm_name

IDM:
  idm_name:
    mixing_ratio:  # in [0, 1]
    # other settings of idm_name

MixedInversion:
  idm: "SD-v15"
  pretrained_model_path: "./checkpoints/stable-diffusion-v1-5"

Model:
  idm: idm_name
  vdm: vdm_name
  idm_prompt:
  vdm_prompt:
  video_path:
  output_path: "./outputs/"

  # compatible with IDM and VDM
  video_length: 8
  width: 512
  height: 512
  frame_rate: 4
  seed: 42
  guidance_scale: 7.5
  num_inference_steps: 50
```
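These YAML files are consumed through Hydra (hence the --config-name flag in the commands above). The snippet below is a minimal sketch of such an entry point, not the actual body of inference.py.

```python
# Minimal Hydra entry-point sketch (not the actual inference.py).
import hydra
from omegaconf import DictConfig

@hydra.main(version_base=None, config_path="configs",
            config_name="controllable_video_generation_with_controlnet_and_vidrd")
def main(config: DictConfig) -> None:
    # Hydra resolves --config-name into this DictConfig
    print(config.Model.idm, config.Model.vdm, config.Model.video_length)

if __name__ == "__main__":
    main()
```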
Example prompts from the result gallery (the result videos themselves are omitted here):

- Controllable Video Generation (Depth / Canny / Pose): "A person on a motorcycle does a burnout on a frozen lake" (depth), "A silver jeep car is moving on the winding forest road" (canny), "An astronaut moonwalks on the moon" (pose); "A brown spotted cow is walking in heavy rain" (depth), "A bear walking through a snow mountain" (canny), "Iron Man moonwalks in the desert" (pose); "A red car moves in front of buildings", "Iron Man moonwalks on the beach"
- Video Editing, on the source video "A man is moonwalking": style transfer ("Make it Minecraft", "Make it minecraft style"), object replacement ("Replace the man with Spider Man", "Replace the man with a little boy"), background replacement ("Change the background to stadium")
- Video Inpainting: erase object (empty prompt "") and replace object ("A sports car is moving on the road"), each shown as source video, mask, and output video
- Video Outpainting: shown as source video, masked video, and output video
If you make use of our work, please cite our paper.
```bibtex
@InProceedings{Shi_2024_CVPR,
    author    = {Shi, Fengyuan and Gu, Jiaxi and Xu, Hang and Xu, Songcen and Zhang, Wei and Wang, Limin},
    title     = {BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {7393-7402}
}
```
This repository borrows heavily from Diffusers, Tune-A-Video, and ControlVideo. Thanks for their contributions!