
Towards Label Efficient Fine-grained Action Recognition

This repository contains the research and development of a fine-grained action recognition pipeline for sports. This project was developed as part of the 696DS course at the University of Massachusetts Amherst, in collaboration with Microsoft Corporation.

The list of collaborators can be found below.

UMass Amherst - Abhishek Lalwani, Prajakti Kapade, Nishtha Nayar, Akhil Ayyanki, Fabien Delattre

Microsoft Corporation - Apurva Gandhi, Dhruvil Gala, Soundar Srinivasan

Since our work builds on top of the existing ActionCLIP architecture, the training and inference guidelines for our work remain the same as for ActionCLIP (see the relevant sections below).

The dataset we use for our research is FineGym, an extremely fine-grained gymnastics dataset that provides coarse-to-fine annotations both temporally and semantically. This allows us to experiment with various degrees of granularity while still working with the same dataset.

Overall, four components are required to set up a training/inference pipeline for ActionCLIP on FineGym or any other custom dataset:

  • Config file - specifies the hyper-parameters as well as the dataset and the corresponding label map.
    • A sample config file for the FineGym dataset can be found in the configs folder (linked here for reference).
  • Text file specifying the data and the corresponding labels (see the sketch after this list for one way to generate it).
    • A sample file can be seen here. Every row in the file contains 3 entries:
      • Path to the folder containing all the frames of the video; the folder name identifies the video. For example, if the video is 30 FPS and 3 seconds long, the folder will contain 90 images covering all of the frames, named '00000.png' through '00089.png'.
      • Number of frames in that video // 2 (integer division).
      • Label for that video. (Keep in mind that FineGym provides varying levels of annotation for the same video, so the same video can have different labels depending on the annotation granularity; refer to the label maps below for more details.)
  • A CSV file specifying the label map.
    • A sample file can be seen here.
  • The actual video data, in the form of discrete frames saved as images (with their locations specified in the text file mentioned above).
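
To make the expected annotation format concrete, below is a minimal sketch of how such a text file could be generated from a directory of per-video frame folders and a label-map CSV. The directory layout, the CSV column order, and the function and path names are illustrative assumptions, not part of this repository.

# Hypothetical helper: build the annotation text file described above.
# Assumes one sub-folder of frames per video and a label-map CSV with rows
# of the form "<class_id>,<class_name>"; adapt both to your own layout.
import csv
import os

def load_label_map(csv_path):
    """Map class name -> integer class id from the label-map CSV."""
    with open(csv_path, newline="") as f:
        return {name: int(class_id) for class_id, name in csv.reader(f)}

def write_annotation_file(frames_root, video_labels, out_path):
    """Write one line per video: <frame_folder> <num_frames // 2> <label_id>."""
    with open(out_path, "w") as out:
        for video_id, label_id in sorted(video_labels.items()):
            folder = os.path.join(frames_root, video_id)
            num_frames = len([f for f in os.listdir(folder) if f.endswith(".png")])
            out.write(f"{folder} {num_frames // 2} {label_id}\n")

# Example usage (all names below are placeholders):
# labels = load_label_map("label_map.csv")
# write_annotation_file("frames/", {"video_0001": labels["balance_beam"]}, "train.txt")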

Once you have all these components, you can use the training/inference instructions given below to train your own ActionCLIP model on your data.


Overview

[Figure: overview of the ActionCLIP framework]

Content

  • Prerequisites
  • Data Preparation
  • Updates
  • Pretrained Models
  • Testing
  • Training
  • Contributors
  • Citing ActionCLIP
  • Acknowledgments

Prerequisites

The code is built with the following libraries:

  • PyTorch >= 1.8
  • wandb
  • RandAugment
  • pprint
  • tqdm
  • dotmap
  • yaml
  • csv

For video data pre-processing, you may need ffmpeg.

See INSTALL.md for more detailed information about the required libraries.
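
As a rough illustration of how the yaml and dotmap packages listed above are typically combined, here is a minimal sketch of loading a training config into a DotMap for attribute-style access. This is an assumed usage pattern, not necessarily this repository's exact code.

# Illustrative sketch (assumed usage, not the repo's exact code): load a YAML
# config into a DotMap so nested hyper-parameters can be read as attributes.
import yaml
from dotmap import DotMap

with open("./configs/k400/k400_train.yaml") as f:
    config = DotMap(yaml.safe_load(f))

# e.g. print(config.network.arch)  # assuming the config defines a network.arch key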

Data Preparation

We first need to extract videos into frames for fast reading. Please refer to the TSN repo for a detailed guide to data pre-processing. We have successfully trained on Kinetics, UCF101, HMDB51, and Charades.
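
As one possible way to produce frames in the '00000.png' naming scheme used above, you can call ffmpeg from Python. This is a hedged sketch with placeholder paths and frame rate, not the TSN repo's exact procedure.

# Hedged sketch: dump a video's frames as zero-padded PNGs ('00000.png', ...).
# The input/output paths and the target FPS are placeholder assumptions.
import os
import subprocess

def extract_frames(video_path, out_dir, fps=30):
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",           # resample to a fixed frame rate
            "-start_number", "0",          # number frames from 00000.png
            os.path.join(out_dir, "%05d.png"),
        ],
        check=True,
    )

# extract_frames("videos/sample.mp4", "frames/sample")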

Updates

  • We now support single-crop validation (including zero-shot) on Kinetics-400, UCF101, and HMDB51. See MODEL_ZOO.md for more information on the pretrained models.
  • We now support model training on Kinetics-400, UCF101, and HMDB51 with 8, 16, and 32 frames. See configs/README.md for more information on the training configs.
  • We now support model training on your own datasets. See configs/README.md for details.

Pretrained Models

Training video models is computationally expensive, so we provide some pretrained models here. A larger set of trained models is available in the ActionCLIP MODEL_ZOO.md.

Kinetics-400

We experiment with ActionCLIP using different backbones (we choose Transf as our final visual prompt since it obtains the best results) and input-frame configurations on Kinetics-400. Below is a list of the pre-trained models we provide (see Table 6 of the paper). Note that the training log for the 8-frame ViT-B/32 model is provided in ViT32_8F_K400.log.

model       n-frame   top-1 Acc (single-crop)   top-5 Acc (single-crop)   checkpoint
ViT-B/32    8         78.36%                    94.25%                    link (pwd: b5ni)
ViT-B/16    8         81.09%                    95.49%                    link (pwd: hqtv)
ViT-B/16    16        81.68%                    95.87%                    link (pwd: dk4r)
ViT-B/16    32        82.32%                    96.20%                    link (pwd: 35uu)

HMDB51 & UCF101

On the HMDB51 and UCF101 datasets, accuracy (with Kinetics-400 pretraining) is reported under the accurate setting.

HMDB51

model       n-frame   top-1 Acc (single-crop)   checkpoint
ViT-B/16    32        76.2%                     link

UCF101

model       n-frame   top-1 Acc (single-crop)   checkpoint
ViT-B/16    32        97.1%                     link

Testing

To test the downloaded pretrained models on Kinetics, HMDB51, or UCF101, you can run scripts/run_test.sh. For example:

# test
bash scripts/run_test.sh  ./configs/k400/k400_test.yaml

Zero-shot

We provide several examples of zero-shot validation on Kinetics-400, UCF101, and HMDB51.

  • To do zero-shot validation on Kinetics from CLIP pretrained models, you can run:
# zero-shot
bash scripts/run_test.sh  ./configs/k400/k400_ft_zero_shot.yaml
  • To do zero-shot validation on UCF101 and HMDB51 from Kinetics pretrained models, you first need to prepare the k400 pretrained model and then run:
# zero-shot
bash scripts/run_test.sh  ./configs/hmdb51/hmdb_ft_zero_shot.yaml

Training

We provide several examples of training ActionCLIP with this repo:

  • To train on Kinetics from CLIP pretrained models, you can run:
# train 
bash scripts/run_train.sh  ./configs/k400/k400_train.yaml
  • To train on HMDB51 from Kinetics400 pretrained models, you can run:
# train 
bash scripts/run_train.sh  ./configs/hmdb51/hmdb_train.yaml
  • To train on UCF101 from Kinetics400 pretrained models, you can run:
# train 
bash scripts/run_train.sh  ./configs/ucf101/ucf_train.yaml

You can find more training details in configs/README.md.

Contributors

ActionCLIP is written and maintained by Mengmeng Wang and Jiazheng Xing.

Citing ActionCLIP

If you find ActionCLIP useful in your research, please cite our paper.

Acknowledgments

Our code is based on CLIP and STM.
