This repository contains the official implementation and data for "VideoPhy: Evaluating Physical Commonsense for Video Generation".
[Webpage] [Paper] [Test Dataset 🤗] [Leaderboard] [AutoEvaluator] [Twitter]
Illustrative figure for VideoPhy. Generated by DALL·E 3.
10 Oct 2024: We have added CogVideoX-2B/5B and Dream Machine (LumaAI) human predictions to the test data.
11 Oct 2024: We ran our auto-evaluator for video-text alignment and physical commonsense on the videos from the test data. We release the auto-eval scores here.
Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, the best-performing model, Pika, generates videos that adhere to both the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that video generative models are far from accurately simulating the physical world. Finally, we supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.
We evaluate 12 closed and open text-to-video generative models on the VideoPhy dataset with human annotation. We report the percentage of testing prompts for which the T2V models generate videos that adhere to the conditioning caption and exhibit physical commonsense. We abbreviate semantic adherence as SA and physical commonsense as PC. SA=1 signifies that the video aligns with the given text semantics, while PC=1 indicates alignment with intuitive physics. Rankings are based on the SA=1, PC=1 scores.
| # | Model | Source | SA=1, PC=1 | SA=1 | PC=1 |
|---|---|---|---|---|---|
| 1 | CogVideoX-5B 🥇 | Open | 39.6 | 63.3 | 53.0 |
| 2 | Pika 🥈 | Closed | 19.7 | 41.1 | 36.5 |
| 3 | VideoCrafter2 🥉 | Open | 19.0 | 48.5 | 34.6 |
| 4 | CogVideoX-2B | Open | 18.6 | 47.2 | 34.1 |
| 5 | LaVIE | Open | 15.7 | 48.7 | 28.0 |
| 6 | Luma Dream Machine | Closed | 13.6 | 61.9 | 21.8 |
| 7 | Lumiere-T2I2V | Closed | 12.5 | 48.5 | 25.0 |
| 8 | SVD-T2I2V | Open | 11.9 | 42.4 | 30.8 |
| 9 | ZeroScope | Open | 11.9 | 30.2 | 32.6 |
| 10 | Lumiere-T2V | Closed | 9.0 | 38.4 | 27.9 |
| 11 | Gen-2 (Runway) | Closed | 7.6 | 26.6 | 27.2 |
| 12 | OpenSora | Open | 4.9 | 18.0 | 23.5 |
We also evaluate the 12 closed and open text-to-video generative models on the VideoPhy dataset with our auto-evaluator (see our paper for more details). We report the percentage of testing prompts for which the T2V models generate videos that adhere to the conditioning caption and exhibit physical commonsense. We abbreviate semantic adherence as SA and physical commonsense as PC. SA=1 signifies that the video aligns with the given text semantics, while PC=1 indicates alignment with intuitive physics. Rankings are based on the average of the SA=1 and PC=1 scores.
| # | Model | PC=1 | SA=1 | Avg. |
|---|---|---|---|---|
| 1 | CogVideoX-5B 🥇 | 41 | 57 | 49 |
| 2 | Luma Dream Machine 🥈 | 30 | 53 | 41.5 |
| 3 | VideoCrafter2 🥉 | 36 | 47 | 41 |
| 4 | LaVIE | 36 | 45 | 41 |
| 5 | CogVideoX-2B | 39 | 40 | 39 |
| 6 | Lumiere-T2I2V | 25 | 46 | 35 |
| 7 | SVD-T2I2V | 34 | 37 | 35 |
| 8 | ZeroScope | 42 | 27 | 34 |
| 9 | Lumiere-T2V | 31 | 35 | 33 |
| 10 | Pika | 33 | 25 | 29 |
| 11 | Gen-2 (Runway) | 31 | 26 | 29 |
| 12 | OpenSora | 35 | 21 | 28 |
We provide one good and one bad example of physical commonsense for several models.
Pika:
VideoCrafter2:
LaVIE:
- Create the conda environment:
```bash
conda create -n videophy python=3.10
conda activate videophy
```
- Install the dependencies:
```bash
pip install -r requirements.txt
```
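Optionally, run a quick sanity check. This is just a sketch and assumes PyTorch is among the pinned requirements, which the inference and training scripts below rely on:
```python
# Quick sanity check: confirm PyTorch imports and report whether a GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```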
In this work, we also propose an auto-evaluator for our dataset. Specifically, it is trained on the videos generated by the nine video generative models for the training prompts.
The model checkpoint is publicly available on 🤗 Model.
- Download the model checkpoint to your local machine:
```bash
git lfs install
git clone https://huggingface.co/videophysics/videocon_physics
```
- Consider the custom captions and videos present in the `examples` folder:
  - It contains a few generated videos ending with the `.mp4` extension.
  - It contains an `examples.csv` file with two columns, `videopath` and `caption` (a minimal sketch for constructing such a file follows this list).
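The sketch below builds such a file. The video file names and captions are hypothetical placeholders; only the `videopath` and `caption` column names come from the repository, and the output path should match whatever you pass to `--input_csv` in the command below:
```python
# Minimal sketch: build a two-column CSV for the auto-evaluator.
# The video file names and captions below are placeholders, not files shipped with the repo.
import pandas as pd

rows = [
    {"videopath": "examples/marble_ramp.mp4",
     "caption": "A marble rolls down a slanted wooden surface."},
    {"videopath": "examples/water_glass.mp4",
     "caption": "Water is poured into a glass until it overflows."},
]
pd.DataFrame(rows).to_csv("examples/example.csv", index=False)
```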
- Run the following command to generate two multimodal templates for testing semantic adherence (text adherence or video-text alignment) and physical commonsense:
```bash
python utils/prepare_data.py --input_csv examples/example.csv --output_folder examples/
```
The above command will create two files: `sa_testing.csv` and `physics_testing.csv`.
- Run the following command to assess semantic adherence:
```bash
CUDA_VISIBLE_DEVICES=0 python videocon/training/pipeline_video/entailment_inference.py --input_csv examples/sa_testing.csv --output_csv examples/videocon_physics_sa_testing.csv --checkpoint <dir_for_downloaded_ckpt/videocon_physics/>
```
The above command will output a `videocon_physics_sa_testing.csv` file with semantic adherence scores from the VideoCon-Physics model.
- Similarly, run this command for physical commonsense scores:
```bash
CUDA_VISIBLE_DEVICES=0 python videocon/training/pipeline_video/entailment_inference.py --input_csv examples/physics_testing.csv --output_csv examples/videocon_physics_pc_testing.csv --checkpoint <dir_for_downloaded_ckpt/videocon_physics/>
```
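Once both runs finish, the two output CSVs can be combined into a single joint metric. The snippet below is a rough sketch, assuming both files keep one row per video in the same order and store the model score in a single column; the column name and the 0.5 threshold are assumptions rather than guarantees of the scripts:
```python
# Hedged sketch: combine SA and PC scores into a joint "semantics AND physics" rate.
# Assumptions: one row per video, same row order in both files, score stored in SCORE_COL.
import pandas as pd

SCORE_COL = "entailment_score"  # placeholder; check the actual column name in the output CSVs
THRESHOLD = 0.5                 # placeholder decision threshold

sa = pd.read_csv("examples/videocon_physics_sa_testing.csv")
pc = pd.read_csv("examples/videocon_physics_pc_testing.csv")

sa_pass = sa[SCORE_COL] >= THRESHOLD
pc_pass = pc[SCORE_COL] >= THRESHOLD

print(f"SA=1: {sa_pass.mean():.1%}")
print(f"PC=1: {pc_pass.mean():.1%}")
print(f"SA=1, PC=1: {(sa_pass & pc_pass).mean():.1%}")
```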
- We release the training data on the Hugging Face dataset hub: train_dataset (the snippet after this list shows one way to load it).
- Each instance in the dataset consists of `video_url`, `caption`, `states_of_matter`, `complexity`, `sa` (semantic adherence), and `pc` (physical commonsense).
- There are close to 4,600 instances in the dataset. We are in the process of getting the requisite approvals to release the Lumiere data publicly, and will update the training data once we have them.
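To inspect the released data programmatically, something along these lines should work with the `datasets` library; `<hf_dataset_id>` is a placeholder for the train_dataset repository linked above:
```python
# Sketch: load and inspect the VideoPhy training data with Hugging Face `datasets`.
# Replace "<hf_dataset_id>" with the train_dataset repository linked above.
from datasets import load_dataset

train = load_dataset("<hf_dataset_id>", split="train")
print(train)  # row count and column names

example = train[0]
for key in ("video_url", "caption", "states_of_matter", "complexity", "sa", "pc"):
    print(key, "->", example[key])
```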
- Download the mPLUG-Owl-Video and VideoCon models that will be finetuned with our data to get VideoCon-Physics. Specifically, download them from Hugging Face to your local machine: mPLUG-Owl-Video and VideoCon.
- Now, prepare the data for model finetuning. Specifically, create data of the form highlighted in train_example.csv. Each instance has `videopath, caption, sa, pc`, where `sa` is the semantic adherence label and `pc` is the physical commonsense label.
- Run the following command to convert the data into a form that is amenable to instruction-tuning for VideoCon:
```bash
python utils/prepare_train_data.py --input_csv examples/train_example.csv --output_folder examples/
```
It will generate a file named `videocon_format_train.csv`. You can do the same for the validation data; in our experiments, a few hundred instances from the train data were reserved for validation.
Note: Update the `videopath` column in the train and val CSVs to absolute paths on the local machine (see the sketch below).
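One way to handle both notes above (rewriting `videopath` to absolute paths and holding out a few hundred instances for validation) is a short pandas pass. The file names and the 200-row validation size here are illustrative:
```python
# Sketch: make `videopath` absolute and carve out a small validation split.
# File names and the 200-row validation size are illustrative, not fixed by the repo.
import os
import pandas as pd

df = pd.read_csv("examples/videocon_format_train.csv")
df["videopath"] = df["videopath"].apply(os.path.abspath)

val = df.sample(n=200, random_state=0)   # "a few hundred" held-out instances
train = df.drop(val.index)

train.to_csv("examples/videocon_format_train.csv", index=False)
val.to_csv("examples/videocon_format_val.csv", index=False)
```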
- Add relevant information to the config video.yaml. Specifically, you have to add the absolute paths to the train and val files. In addition, add the path to the base mPLUG-Owl-Video model.
- Set the appropriate training details in the train_it.sh script. Specifically, set the location of the base mPLUG-Owl-Video model in the `pretrained-ckpt` argument and the `pytorch_model.bin` location of the VideoCon model in the `finetuned-ckpt` argument. Also set the experiment name and output directory `SAVE_NAME`. Note that the LoRA parameters are identical to those of VideoCon, so that the VideoCon checkpoint can be loaded seamlessly. Further, we clarify that VideoCon-Physics is trained on top of VideoCon.
- Run the following commands to launch the training:
```bash
cd videocon/training
bash scripts/train_it.sh
```
- After training, you can find the checkpoints in the `SAVE_NAME` output directories.
- To enable efficient inference, we merge and unload the LoRA modules using the model_unload_merge.py script. You will need to add the absolute locations of the original mPLUG-Owl-Video model and the `pytorch_model.bin` of the VideoCon-Physics checkpoint at the respective places in the script. Then run the following command:
```bash
python videocon/training/pipeline_video/model_unload_merge.py
```
This command may take a few minutes to run (a generic sketch of the merge-and-unload pattern is included below for reference).
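The sketch uses the `peft` library's merge-and-unload API. The repository's model_unload_merge.py handles the mPLUG-Owl-Video specifics, so treat the model class and paths here as placeholders rather than the script's actual implementation:
```python
# Hedged sketch of the generic PEFT merge-and-unload pattern (not the repo's exact script).
# The base model class and all paths are placeholders.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("<path_to_mplug_owl_video>")
lora_model = PeftModel.from_pretrained(base, "<path_to_videocon_physics_adapter>")

merged = lora_model.merge_and_unload()  # fold the LoRA weights into the base weights
merged.save_pretrained("<trained_videocon_physics_merged_output_dir>")
```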
- Now, move `config.json`, `preprocessor_config.json`, `tokenizer.model`, `generation_config.json`, and `tokenizer_config.json` to the same directory as the one where you save the merged model checkpoint (say, `trained_videocon_physics_merged_output_dir`); a small copy script is sketched below.
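Both directories in this sketch are placeholders; point them at wherever the original configuration files and the merged checkpoint live on your machine:
```python
# Sketch: copy the tokenizer/config files next to the merged checkpoint.
# Both directories are placeholders; point them at your local paths.
import shutil

SRC = "<dir_with_original_mplug_owl_video_configs>"
DST = "<trained_videocon_physics_merged_output_dir>"

for name in (
    "config.json",
    "preprocessor_config.json",
    "tokenizer.model",
    "generation_config.json",
    "tokenizer_config.json",
):
    shutil.copy(f"{SRC}/{name}", f"{DST}/{name}")
```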
- Now, you can follow the instructions under the inference section using the directory of the merged model as the `checkpoint` argument.
```bibtex
@article{bansal2024videophy,
  title={VideoPhy: Evaluating Physical Commonsense for Video Generation},
  author={Bansal, Hritik and Lin, Zongyu and Xie, Tianyi and Zong, Zeshun and Yarom, Michal and Bitton, Yonatan and Jiang, Chenfanfu and Sun, Yizhou and Chang, Kai-Wei and Grover, Aditya},
  journal={arXiv preprint arXiv:2406.03520},
  year={2024}
}
```