[ECCV 2024] Leveraging Temporal Contextualization for Video Action Recognition
Minji Kim†, Dongyoon Han, Taekyung Kim*, Bohyung Han*
(†Work done during an internship at NAVER AI Lab, *corresponding authors)
NAVER AI LAB
Official PyTorch implementation of the ECCV 2024 paper "Leveraging Temporal Contextualization for Video Action Recognition"
We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. Specifically, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames and summarizes it into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices.
- (2024/09/26): Jupyter notebook demo released. Try TC-CLIP with your custom videos!
- (2024/07/24): Code and pretrained models are released.
- (2024/07/02): TC-CLIP is accepted at ECCV 2024! 🎉
Prior works consider temporal cues during the encoding process via (a) Cross-Frame Attention with CLS token interactions or (b) Temporal Window Expansion, which adds adjacent frame tokens to key-value pairs. However, the former lacks patch-level details, while the latter limits the range of temporal interactions. (c) Joint Space-Time Attention allows full interactions across all tokens, but exhibits weak discriminability due to sparse attention on background regions and faces extrapolation challenges (see the paper for details). (d) Temporal Contextualization (Ours) aggregates pivotal tokens from a broader range into key-value pairs, successfully focusing on informative regions across all frames.
TC-CLIP: A novel video understanding framework that leverages holistic video information within its encoding process.
- Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC allows global interactions by summarizing informative tokens from the entire video into context tokens and leveraging them during the feature encoding process.
- Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
- Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.
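The three TC steps described above can be illustrated with a toy sketch in plain Python. This is not the repository's implementation: the per-token saliency scores are assumed to be given, the merge rule (averaging every `num_context`-th seed token) is only a placeholder for the model's actual summarization, and all function names are hypothetical.

```python
from typing import List

Token = List[float]   # one patch embedding
Frame = List[Token]   # all patch tokens of one frame


def select_seed_tokens(frame: Frame, scores: List[float], k: int) -> Frame:
    """Step 1: keep the k highest-scoring tokens of a frame (original order)."""
    top = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)[:k]
    return [frame[i] for i in sorted(top)]


def mean_token(tokens: List[Token]) -> Token:
    """Element-wise average of a non-empty list of tokens."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def summarize(seeds: List[Token], num_context: int) -> List[Token]:
    """Step 2 (placeholder rule): merge seeds from all frames into context tokens."""
    groups = [seeds[i::num_context] for i in range(num_context)]
    return [mean_token(g) for g in groups if g]


def contextualized_key_values(
    video: List[Frame], scores: List[List[float]], k: int, num_context: int
) -> List[Frame]:
    """Step 3: append the shared context tokens to every frame's key-value set."""
    seeds = [
        tok
        for frame, frame_scores in zip(video, scores)
        for tok in select_seed_tokens(frame, frame_scores, k)
    ]
    context = summarize(seeds, num_context)
    return [frame + context for frame in video]


# Two frames, three 2-D patch tokens each, with made-up saliency scores.
video = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    [[3.0, 1.0], [0.0, 0.0], [1.0, 3.0]],
]
scores = [[0.9, 0.1, 0.8], [0.7, 0.2, 0.6]]
key_values = contextualized_key_values(video, scores, k=2, num_context=2)
print(len(key_values[0]))  # 3 patch tokens + 2 context tokens per frame
```

In this sketch each frame attends to its own patch tokens plus a small set of tokens summarizing the whole video, which is the intuition behind aggregating pivotal tokens into key-value pairs.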
We use CLIP ViT-B/16 for all experiments below. All the checkpoints can be downloaded at this link.
- (LLM) denotes that the models use LLM-rephrased category names from FROSTER. Note that experiments on the SSv2 dataset do not involve LLM-rephrasing.
- (P) denotes that the models are first pretrained on Kinetics-400 and subsequently fine-tuned on each dataset. Otherwise, models are directly fine-tuned from CLIP. See Appendix A in the paper.
Scripts | HMDB-51 | UCF-101 | Kinetics-600 | Ckpt |
---|---|---|---|---|
TC-CLIP | 54.2 ± 0.7 | 82.9 ± 0.6 | 75.8 ± 0.5 | Link |
TC-CLIP (LLM) | 56.0 ± 0.3 | 85.4 ± 0.8 | 78.1 ± 1.0 | Link |
Scripts | HMDB-51 (K=2 / K=4 / K=8 / K=16) | UCF-101 (K=2 / K=4 / K=8 / K=16) | SSv2 (K=2 / K=4 / K=8 / K=16) | Ckpt |
---|---|---|---|---|
TC-CLIP | 57.3 / 62.3 / 67.3 / 68.6 | 85.9 / 89.9 / 92.5 / 94.6 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
TC-CLIP (LLM) | 58.6 / 63.3 / 65.5 / 68.8 | 86.8 / 90.1 / 92.0 / 94.3 | 7.3 / 8.6 / 9.3 / 14.0 | Link |
TC-CLIP (P) | 65.3 / 68.5 / 71.4 / 73.0 | 94.1 / 95.6 / 96.6 / 97.3 | 8.7 / 10.1 / 12.1 / 15.2 | Link |
Scripts | K-400 (Base / Novel / HM) | HMDB-51 (Base / Novel / HM) | UCF-101 (Base / Novel / HM) | SSv2 (Base / Novel / HM) | Ckpt |
---|---|---|---|---|---|
TC-CLIP | 78.9 / 63.6 / 70.4 | 73.3 / 54.1 / 62.2 | 95.5 / 78.0 / 85.9 | 17.5 / 13.4 / 15.2 | Link |
TC-CLIP (LLM) | 79.1 / 65.4 / 71.6 | 73.3 / 59.1 / 65.5 | 95.4 / 81.6 / 88.0 | 17.5 / 13.4 / 15.2 | Link |
TC-CLIP (P) | N/A | 79.4 / 58.3 / 67.2 | 97.5 / 84.5 / 90.5 | 19.6 / 15.6 / 17.4 | Link |
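The HM column is the harmonic mean of the base and novel accuracies, the usual summary metric in base-to-novel evaluation. A minimal sketch of that computation follows; the function name is illustrative and not part of this repository. Since the reported values are presumably computed from unrounded accuracies, recomputing HM from the rounded table entries can differ in the last digit.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base and novel accuracy (both in percent)."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * base_acc * novel_acc / total


# K-400 row of TC-CLIP: Base 78.9, Novel 63.6
print(round(harmonic_mean(78.9, 63.6), 1))
```

The harmonic mean penalizes a large gap between the two accuracies, so a model cannot score well by excelling on base classes alone.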
Scripts | K-400 (Top-1) | K-400 (Top-5) | Ckpt |
---|---|---|---|
TC-CLIP | 85.2 | 96.9 | Link |
Please follow the instructions in INSTALL.md.
Please follow the instructions in DATASETS.md for data preparation.
The organization of configurations in this project is outlined in CONFIG.md.
The basic usage of the commands for training and evaluation is outlined below. For detailed instructions on all experimental setups, please refer to TRAIN_EVAL.md.
For all experiments in our main paper, we provide example training commands in the scripts/train folder. The basic usage of the training command is as follows:
# Basic usage:
torchrun --nproc_per_node=4 main.py -cn ${protocol} \
data=${protocol}_${dataset_name} output=${your_ckpt_saving_path} trainer=${trainer_name}
# Example:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=zero_shot_k400 output=ckpt/zero_shot_k400_tc_clip trainer=tc_clip
Note:
- Arguments are passed as key=value, without the -- prefix used by Python's native argparse.
- Here, ${protocol} refers to the chosen protocol (e.g., zero_shot, few_shot), ${dataset_name} refers to the specific dataset under the chosen protocol, ${your_ckpt_saving_path} is the path where checkpoints will be saved, and ${trainer_name} is the name of the model.
- The main_testing function is called at the end of training to evaluate the accuracy with the best checkpoint.
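When launching many runs, it can help to assemble the training command programmatically. The following is a small sketch, not part of this repository: the helper name `build_train_command` is hypothetical, and it only reproduces the key=value argument pattern shown above.

```python
import shlex
from typing import Dict, List


def build_train_command(
    protocol: str,
    dataset_name: str,
    output: str,
    trainer_name: str,
    nproc_per_node: int = 4,
) -> List[str]:
    """Return the torchrun argument list for one training run."""
    # Overrides are plain key=value tokens, with no leading "--".
    overrides: Dict[str, str] = {
        "data": f"{protocol}_{dataset_name}",
        "output": output,
        "trainer": trainer_name,
    }
    command = ["torchrun", f"--nproc_per_node={nproc_per_node}", "main.py", "-cn", protocol]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


train_cmd = build_train_command(
    "zero_shot", "k400", "ckpt/zero_shot_k400_tc_clip", "tc_clip"
)
print(shlex.join(train_cmd))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues in output paths.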
We provide example evaluation commands in the scripts/eval folder. The basic usage of the evaluation command is as follows:
# Basic usage:
torchrun --nproc_per_node=4 main.py -cn ${protocol} \
data=${protocol}_${dataset_name} output=${your_result_saving_path} \
trainer=${trainer_name} eval=test resume=${ckpt_path}
# Example:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=zero_shot_k400 output=/PATH/TO/OUTPUT \
trainer=tc_clip eval=test resume=ckpt/zero_shot_k400_tc_clip/best.pth
Note:
- Set eval=test or eval=val for evaluation-only mode.
- Specify the checkpoint path in resume=${ckpt_path}.
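As a hedged illustration of the two notes above, the sketch below assembles an evaluation-only command and rejects any eval mode other than test or val before anything is launched. The helper name `build_eval_command` is hypothetical and not part of this repository.

```python
import shlex
from typing import List

VALID_EVAL_MODES = ("test", "val")


def build_eval_command(
    protocol: str,
    dataset_name: str,
    output: str,
    trainer_name: str,
    ckpt_path: str,
    eval_mode: str = "test",
    nproc_per_node: int = 4,
) -> List[str]:
    """Return the torchrun argument list for an evaluation-only run."""
    if eval_mode not in VALID_EVAL_MODES:
        raise ValueError(f"eval must be one of {VALID_EVAL_MODES}, got {eval_mode!r}")
    return [
        "torchrun", f"--nproc_per_node={nproc_per_node}", "main.py",
        "-cn", protocol,
        f"data={protocol}_{dataset_name}",
        f"output={output}",
        f"trainer={trainer_name}",
        f"eval={eval_mode}",     # evaluation-only mode
        f"resume={ckpt_path}",   # checkpoint to evaluate
    ]


eval_cmd = build_eval_command(
    "zero_shot", "k400", "/PATH/TO/OUTPUT", "tc_clip",
    ckpt_path="ckpt/zero_shot_k400_tc_clip/best.pth",
)
print(shlex.join(eval_cmd))
```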
If you have any questions, please create an issue on this repository or contact us at taekyung.k@navercorp.com and minji@snu.ac.kr.
This project is built upon ViFi-CLIP and borrows features from FROSTER and ToMe. We sincerely thank the authors for these great codebases.
TC-CLIP
Copyright (c) 2024-present NAVER Cloud Corp.
CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)
If you find TC-CLIP useful in your research, please consider citing our paper:
@inproceedings{kim2024tcclip,
title={Leveraging Temporal Contextualization for Video Action Recognition},
author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
booktitle={European Conference on Computer Vision (ECCV)},
year={2024}
}