Leveraging Temporal Contextualization for Video Action Recognition

[ECCV 2024] Leveraging Temporal Contextualization for Video Action Recognition
Minji Kim†, Dongyoon Han, Taekyung Kim*, Bohyung Han*
(†Work done during an internship at NAVER AI Lab, *corresponding authors)

NAVER AI Lab

Official PyTorch implementation of the ECCV 2024 paper "Leveraging Temporal Contextualization for Video Action Recognition"

Abstract

We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. To be specific, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames for the summarization into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices.

Updates

(2024/09/26): Jupyter notebook demo released. Try TC-CLIP with your custom videos!
(2024/07/24): Code and pretrained models are released.
(2024/07/02): TC-CLIP is accepted at ECCV 2024! 🎉

🚀 Highlights

❗ Motivation: insufficient token interactions in recent temporal modeling approaches

Prior works consider temporal cues during the encoding process via (a) Cross-Frame Attention with CLS token interactions or (b) Temporal Window Expansion, which adds adjacent frame tokens to the key-value pairs. However, the former lacks patch-level details, while the latter limits the range of temporal interactions. (c) Joint Space-Time Attention allows full interactions across all tokens but exhibits weak discriminability due to sparse attention on the backgrounds, which reflects extrapolation challenges (see the paper for details). (d) Temporal Contextualization (ours) aggregates pivotal tokens from a broader range into key-value pairs, successfully focusing on informative regions across all frames.

✨ Temporally Contextualized CLIP (TC-CLIP)

A novel video understanding framework that leverages holistic video information within its encoding process.

Temporal Contextualization (TC): Unlike prior approaches that access only a limited number of tokens, TC allows global interactions by summarizing informative tokens from the entire video into context tokens and leveraging them during the feature encoding process.
Video-conditional Prompting (VP): Based on the summarized context tokens from the visual domain, VP generates instance-level textual prompts that compensate for the lack of textual semantics in action recognition datasets.
Solid performance: TC-CLIP achieves state-of-the-art performance across zero-shot, few-shot, base-to-novel, and fully-supervised settings on five video action recognition benchmarks.
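The TC idea described above can be illustrated with a toy sketch: pick the highest-scoring tokens in each frame, merge similar ones across frames into a few context tokens, and append those context tokens to the keys of a given frame. This is only a sketch under stated assumptions, not the official implementation: the per-token scores stand in for whatever informativeness measure the model uses, the greedy cosine-similarity merging stands in for the actual summarization step, and all function names are hypothetical.

```python
import math
from typing import List

Token = List[float]


def select_seed_tokens(tokens: List[Token], scores: List[float], k: int) -> List[Token]:
    """Keep the k highest-scoring tokens of one frame, in their original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_into_context(seeds: List[Token], num_context: int) -> List[Token]:
    """Greedily average the most similar pair until num_context tokens remain.

    Assumption: a simple stand-in for the summarization step, not the paper's method.
    """
    groups = [(list(t), 1) for t in seeds]  # (running mean, group size)
    while len(groups) > max(num_context, 1):
        pairs = [(i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))]
        i, j = max(pairs, key=lambda p: cosine(groups[p[0]][0], groups[p[1]][0]))
        (a, na), (b, nb) = groups[i], groups[j]
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(a, b)]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append((merged, na + nb))
    return [mean for mean, _ in groups]


def keys_with_context(video: List[List[Token]], scores: List[List[float]],
                      frame_idx: int, k: int, num_context: int) -> List[Token]:
    """Keys for one frame: its own tokens plus context tokens from all frames."""
    seeds: List[Token] = []
    for frame_tokens, frame_scores in zip(video, scores):
        seeds.extend(select_seed_tokens(frame_tokens, frame_scores, k))
    return video[frame_idx] + merge_into_context(seeds, num_context)


# Two frames, three 2-D tokens each (made-up numbers for illustration only).
video = [[[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
         [[4.0, 0.0], [0.0, 4.0], [6.0, 6.0]]]
scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.6]]
keys = keys_with_context(video, scores, frame_idx=0, k=2, num_context=3)
print(len(keys))  # 3 frame tokens + 3 context tokens
```

In this toy input the tokens [2, 2] from the first frame and [6, 6] from the second point in the same direction, so they are merged into one context token that every frame can attend to.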

📁 Models

We use CLIP ViT-B/16 for all experiments below. All the checkpoints can be downloaded at this link.

(LLM) denotes that the models are using LLM-rephrased category names from FROSTER. Note that experiments on the SSv2 dataset do not involve LLM-rephrasing.
(P) denotes that the models are first pretrained on Kinetics-400 and subsequently fine-tuned on each dataset. Otherwise, models are directly fine-tuned from CLIP. See Appendix A in the paper.

Zero-shot action recognition

Scripts	HMDB-51	UCF-101	Kinetics-600	Ckpt
TC-CLIP	54.2 ± 0.7	82.9 ± 0.6	75.8 ± 0.5	Link
TC-CLIP (LLM)	56.0 ± 0.3	85.4 ± 0.8	78.1 ± 1.0	Link

Few-shot action recognition

Scripts	HMDB-51	UCF-101	SSv2	Ckpt
	K=2 / K=4 / K=8 / K=16	K=2 / K=4 / K=8 / K=16	K=2 / K=4 / K=8 / K=16
TC-CLIP	57.3 / 62.3 / 67.3 / 68.6	85.9 / 89.9 / 92.5 / 94.6	7.3 / 8.6 / 9.3 / 14.0	Link
TC-CLIP (LLM)	58.6 / 63.3 / 65.5 / 68.8	86.8 / 90.1 / 92.0 / 94.3	7.3 / 8.6 / 9.3 / 14.0	Link
TC-CLIP (P)	65.3 / 68.5 / 71.4 / 73.0	94.1 / 95.6 / 96.6 / 97.3	8.7 / 10.1 / 12.1 / 15.2	Link

Base-to-novel generalization

Scripts	K-400	HMDB-51	UCF-101	SSv2	Ckpt
	Base / Novel / HM	Base / Novel / HM	Base / Novel / HM	Base / Novel / HM
TC-CLIP	78.9 / 63.6 / 70.4	73.3 / 54.1 / 62.2	95.5 / 78.0 / 85.9	17.5 / 13.4 / 15.2	Link
TC-CLIP (LLM)	79.1 / 65.4 / 71.6	73.3 / 59.1 / 65.5	95.4 / 81.6 / 88.0	17.5 / 13.4 / 15.2	Link
TC-CLIP (P)	N/A	79.4 / 58.3 / 67.2	97.5 / 84.5 / 90.5	19.6 / 15.6 / 17.4	Link
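
In this protocol, HM conventionally denotes the harmonic mean of the base-class and novel-class accuracies, which penalizes a large gap between the two. The short sketch below recomputes it for one row of the table; the function name is hypothetical, and values recomputed from the rounded table entries may differ from the reported HM in the last digit.

```python
def harmonic_mean(base: float, novel: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base + novel == 0:
        return 0.0
    return 2 * base * novel / (base + novel)


# K-400 entries of the TC-CLIP row: Base = 78.9, Novel = 63.6
print(round(harmonic_mean(78.9, 63.6), 1))  # 70.4
```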

Fully-supervised action recognition

Scripts	K-400 (Top-1)	K-400 (Top-5)	Ckpt
TC-CLIP	85.2	96.9	Link

🔨 Environments

Installation

Please follow the instructions in INSTALL.md.

Data preparation

Please follow the instructions in DATASETS.md for data preparation.

Configuration

The organization of configurations in this project is outlined in CONFIG.md.

💫 Training and Evaluation

The basic usage of the commands for training and evaluation is outlined below. For detailed instructions on all experimental setups, please refer to TRAIN_EVAL.md.

Training for TC-CLIP

For all experiments in our main paper, we provide example training commands in the scripts/train folder. The basic usage of the training command is as follows:

# Basic usage:
torchrun --nproc_per_node=4 main.py -cn ${protocol} \
data=${protocol}_${dataset_name} output=${your_ckpt_saving_path} trainer=${trainer_name}

# Example:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=zero_shot_k400 output=ckpt/zero_shot_k400_tc_clip trainer=tc_clip

Note:

Options are passed as key=value pairs with no leading --, unlike Python's native argparse.
Here, ${protocol} refers to the chosen protocol (e.g., zero_shot, few_shot), ${dataset_name} refers to the specific dataset under the chosen protocol, ${your_ckpt_saving_path} is the path where checkpoints will be saved, and ${trainer_name} is the name of the model.
The main_testing function is called at the end of training to evaluate the accuracy with the best checkpoint.
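
When launching several runs, it can help to assemble the training command above programmatically. The following is a minimal sketch that only rearranges the arguments shown in the basic usage; the helper name build_train_command and its parameters are hypothetical and not part of this repository.

```python
import shlex
from typing import List


def build_train_command(protocol: str, dataset_name: str, output_dir: str,
                        trainer_name: str, num_gpus: int = 4) -> List[str]:
    """Assemble the torchrun training command as an argument list."""
    return [
        "torchrun", f"--nproc_per_node={num_gpus}", "main.py",
        "-cn", protocol,
        f"data={protocol}_{dataset_name}",  # key=value pairs, no leading --
        f"output={output_dir}",
        f"trainer={trainer_name}",
    ]


cmd = build_train_command("zero_shot", "k400", "ckpt/zero_shot_k400_tc_clip", "tc_clip")
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues if the output path contains spaces.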

Evaluation for TC-CLIP

We provide example evaluation commands in the scripts/eval folder. The basic usage of the evaluation command is as follows:

# Basic usage:
torchrun --nproc_per_node=4 main.py -cn ${protocol} \
data=${protocol}_${dataset_name} output=${your_result_saving_path} \
trainer=${trainer_name} eval=test resume=${ckpt_path}

# Example:
torchrun --nproc_per_node=4 main.py -cn zero_shot \
data=zero_shot_k400 output=/PATH/TO/OUTPUT \
trainer=tc_clip eval=test resume=ckpt/zero_shot_k400_tc_clip/best.pth

Note:

Set eval=test or eval=val for evaluation-only mode.
Specify the checkpoint path in resume=${ckpt_path}.
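
The two notes above can be enforced before a job is launched. The sketch below is a hypothetical wrapper (build_eval_command is not part of this repository) that rejects any eval value other than test or val and requires a checkpoint path, then assembles the same arguments as the evaluation example.

```python
from typing import List


def build_eval_command(protocol: str, dataset_name: str, output_dir: str,
                       trainer_name: str, ckpt_path: str,
                       split: str = "test", num_gpus: int = 4) -> List[str]:
    """Assemble the evaluation-only torchrun command as an argument list."""
    if split not in ("test", "val"):
        raise ValueError(f"eval must be 'test' or 'val', got {split!r}")
    if not ckpt_path:
        raise ValueError("a checkpoint path is required for evaluation")
    return [
        "torchrun", f"--nproc_per_node={num_gpus}", "main.py",
        "-cn", protocol,
        f"data={protocol}_{dataset_name}",
        f"output={output_dir}",
        f"trainer={trainer_name}",
        f"eval={split}",        # switches to evaluation-only mode
        f"resume={ckpt_path}",  # checkpoint to evaluate
    ]


cmd = build_eval_command("zero_shot", "k400", "/PATH/TO/OUTPUT", "tc_clip",
                         "ckpt/zero_shot_k400_tc_clip/best.pth")
print(" ".join(cmd))
```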

☎️ Contact

If you have any questions, please create an issue on this repository or contact taekyung.k@navercorp.com and minji@snu.ac.kr.

👍 Acknowledgements

This project is built upon ViFi-CLIP and borrows features from FROSTER and ToMe. We sincerely thank the authors for these great codebases.

🔒 License

TC-CLIP
Copyright (c) 2024-present NAVER Cloud Corp.
CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/)

📌 Citation

If you find TC-CLIP useful in your research, please consider citing our paper:

@article{kim2024tcclip,
  title={Leveraging Temporal Contextualization for Video Action Recognition},
  author={Kim, Minji and Han, Dongyoon and Kim, Taekyung and Han, Bohyung},
  journal={European Conference on Computer Vision (ECCV)},
  year={2024}
}
