Key-point Detection based Online Real-Time Spatio-Temporal Action Localization

Kalana Abeywardena, Sakuna Jayasundara, Sachira Karunasena, Shechem Sumanthiran, Dr. Peshala Jayasekara, Dr. Ranga Rodrigo

Real-time and online action localization in a video is a critical yet highly challenging problem. Accurate action localization requires utilization of both temporal and spatial information. Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow, making them both unsuitable for real-time, online applications. To accomplish activity localization under highly challenging real-time constraints, we propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions. We then introduce a tube-linking algorithm that maintains the continuity of action tubes temporally in the presence of occlusions. Further, we eliminate the need for a two-stream architecture by combining temporal and spatial information into a cascaded input to a single network, allowing the network to learn from both types of information. Temporal information is efficiently extracted using a structural similarity index map as opposed to computationally intensive optical flow. Despite the simplicity of our approach, our lightweight end-to-end architecture achieves state-of-the-art frame-mAP on the challenging UCF101-24 dataset, demonstrating a performance gain of 6.4% over the previous best online methods. We also achieve state-of-the-art video-mAP results compared to both online and offline methods. Moreover, our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.

Proposed Architecture

Highlights

  • Utilize key-point-based detection architecture for the first time for the task of ST action localization, which reduces model complexity and inference time over traditional anchor-box-based approaches.
  • Demonstrate that the explicit computation of OF is unnecessary and that the SSIM index map obtains sufficient inter-frame temporal information.
  • Provide a single network with both spatial and temporal information, allowing it to extract the necessary information through discriminative learning.
  • An efficient tube-linking algorithm that extrapolates the tubes for a short period using past detections for real-time deployment.
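
To make the SSIM-based temporal cue from the highlights concrete, here is a minimal NumPy sketch. It is not the repository's implementation: it assumes grayscale frames, a uniform 7x7 window instead of a Gaussian one, and channel-wise concatenation as the way the SSIM map is cascaded with the RGB frame. The function names `box_mean`, `ssim_map`, and `cascade` are hypothetical.

```python
import numpy as np


def box_mean(img, k=7):
    """Mean over a k x k window (k odd), using edge padding and an integral image."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    ii[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    total = ii[k:k + h, k:k + w] - ii[:h, k:k + w] - ii[k:k + h, :w] + ii[:h, :w]
    return total / (k * k)


def ssim_map(prev_gray, curr_gray, k=7, data_range=255.0):
    """Per-pixel structural similarity between two grayscale frames."""
    x = prev_gray.astype(np.float64)
    y = curr_gray.astype(np.float64)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = box_mean(x, k), box_mean(y, k)
    var_x = box_mean(x * x, k) - mu_x ** 2
    var_y = box_mean(y * y, k) - mu_y ** 2
    cov = box_mean(x * y, k) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def cascade(frame_rgb, prev_gray, curr_gray):
    """Assumed cascading scheme: append the SSIM map as a fourth channel."""
    smap = ssim_map(prev_gray, curr_gray)
    rgb = frame_rgb.astype(np.float64) / 255.0
    return np.concatenate([rgb, smap[..., None]], axis=-1)
```

Regions that do not change between frames score close to 1, while moving regions score lower, which is the inter-frame information the network can learn from without computing optical flow.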

Table of Contents

  1. Installation
  2. Datasets
  3. Training CenterNet
  4. Saving Detections
  5. Online Tube Generation
  6. Performance
  7. Citation
  8. References

Installation

The code was tested on Ubuntu 18.04, with Anaconda Python 3.7 and PyTorch v1.4.0. NVIDIA GPUs are needed for both training and testing. After installing Anaconda:

  1. [Optional but recommended] create a new conda environment and activate the environment.
conda create --name CenterNet python=3.7
conda activate CenterNet
  2. Install PyTorch 1.4.0:
conda install pytorch=1.4.0 torchvision -c pytorch

According to the original repository, spatial localization performance can drop slightly when cuDNN batch normalization is enabled. To disable it, open torch/nn/functional.py, find the line containing torch.batch_norm, and replace torch.backends.cudnn.enabled with False.

  3. Install COCOAPI:
# COCOAPI=/path/to/clone/cocoapi
git clone https://github.com/cocodataset/cocoapi.git $COCOAPI
cd $COCOAPI/PythonAPI
make
python setup.py install --user
  4. Install the requirements:
pip install -r requirements.txt
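
As a quick sanity check after the installation steps above, the following sketch reports the Python version and whether the main packages can be found. The helper name `check_environment` is hypothetical and not part of this repository; `pycocotools` is the module that the COCOAPI build installs.

```python
import importlib.util
import sys


def check_environment(required=("torch", "torchvision", "pycocotools")):
    """Report the Python version and whether each required package is importable."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in required:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

A `False` entry means the corresponding installation step should be repeated inside the active conda environment.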

Datasets

We evaluate our framework on two datasets, UCF101-24 and J-HMDB21. UCF101-24 is a subset of the UCF101[1] dataset with ST labels, containing 3207 untrimmed videos across 24 action classes; a video may contain multiple instances of the same action class. J-HMDB21 is a subset of the HMDB51[2] dataset containing 928 temporally trimmed videos across 21 actions, each with a single action instance.

Download the datasets and extract the frames. Place the extracted frames in rgb-images in the respective dataset directory in Datasets. The data directory should look as follows:

Sample directory tree for J-HMDB21

Training CenterNet

Setting up the CenterNet

When setting up CenterNet, we followed the instructions in its official repository. The modified CenterNet scripts for action detection are provided in this repository.

To install DCNv2, follow the instructions below:

  1. Build NMS
cd CenterNet/src/lib/external
#python setup.py install
python setup.py build_ext --inplace

If building the NMS extension fails with an "invalid numeric argument /Wno-cpp" error, comment out the following parameter in setup.py: #extra_compile_args=["-Wno-cpp", "-Wno-unused-function"]. The setup.py provided in this repository already includes this change.

  2. Clone and build the original DCNv2
cd CenterNet/src/lib/models/networks
rm -rf DCNv2
git clone https://github.com/CharlesShang/DCNv2

After cloning the original DCNv2, navigate to the directory, open the CUDA source file, and replace the extern declaration of the THCState pointer with a lazily initialized one:

cd DCNv2
vim src/cuda/dcn_v2_cuda.cu
"""
// extern THCState *state;
THCState *state = at::globalContext().lazyInitCUDA();
"""

Finally, execute the following command to build DCNv2:

python setup.py build develop

Training from scratch

To train from scratch on either the UCF101-24 or the J-HMDB21 dataset, run the following command.

CUDA_VISIBLE_DEVICES=0 python main_SMDouble.py --dataset <dataset> --gpus <gpu id> --exp_id <save dir name> --task doubleSM --num_epochs <epochs (default: 60)> --variant <variation (default: 1)>

Resuming from saved checkpoint

To resume training from the last saved checkpoint, run the following command.

CUDA_VISIBLE_DEVICES=0 python main_SMDouble.py --dataset <dataset> --gpus <gpu id> --exp_id <save dir name> --task doubleSM --num_epochs <epochs (default: 60)> --variant <variation (default: 1)> --resume

To resume training from a specific saved checkpoint, run the following command.

CUDA_VISIBLE_DEVICES=0 python main_SMDouble.py --dataset <dataset> --gpus <gpu id> --exp_id <save dir name> --task doubleSM --num_epochs <epochs (default: 60)> --variant <variation (default: 1)> --resume --load_model <path to the saved model>

Transfer Learning using the best checkpoint

To apply transfer learning from a pre-trained checkpoint, run the following command.

CUDA_VISIBLE_DEVICES=0 python main_SMDouble.py --dataset <dataset> --gpus <gpu id> --exp_id <save dir name> --task doubleSM --num_epochs <epochs (default: 60)> --variant <variation (default: 1)> --load_model /path/to/checkpoint
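
The training, resume, and transfer-learning commands differ only in their trailing flags, so they can be assembled from one helper when launching runs from Python. This is a sketch, not a script shipped with the repository: `build_train_command` and `run_training` are hypothetical names, and only flags shown in the commands above are used.

```python
import os
import subprocess


def build_train_command(dataset, gpus, exp_id, num_epochs=60, variant=1,
                        resume=False, load_model=None):
    """Assemble the main_SMDouble.py argument list for one training run."""
    cmd = ["python", "main_SMDouble.py",
           "--dataset", dataset, "--gpus", str(gpus), "--exp_id", exp_id,
           "--task", "doubleSM", "--num_epochs", str(num_epochs),
           "--variant", str(variant)]
    if resume:
        cmd.append("--resume")  # continue from a saved checkpoint
    if load_model:
        cmd += ["--load_model", str(load_model)]  # specific or pre-trained checkpoint
    return cmd


def run_training(cmd, visible_devices="0"):
    """Run the command with CUDA_VISIBLE_DEVICES set, as in the shell examples."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices)
    return subprocess.run(cmd, env=env, check=True)
```

For example, `build_train_command("<dataset>", 0, "my_run", resume=True)` yields the resume command, and adding `load_model` yields the variant that resumes from a specific checkpoint.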

The pre-trained model checkpoints trained on the J-HMDB21 and UCF101-24 datasets can be downloaded from here. Place the checkpoints at ./CenterNet/exp/$DATASET_NAME/dla34/rgb/$CHKPT_NAME to be compatible with the directory path definitions in the CenterNet scripts.
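
The expected checkpoint location can also be built programmatically. A small sketch, assuming the layout stated above; the helper name `checkpoint_path` and the example arguments are placeholders, not names defined by the repository.

```python
from pathlib import Path


def checkpoint_path(dataset_name, checkpoint_name, root="./CenterNet/exp"):
    """Return <root>/<dataset_name>/dla34/rgb/<checkpoint_name>."""
    return Path(root) / dataset_name / "dla34" / "rgb" / checkpoint_name


if __name__ == "__main__":
    # Placeholder values; substitute your dataset directory and checkpoint file.
    target = checkpoint_path("DATASET_NAME", "CHKPT_NAME")
    print(target.as_posix())
```

Calling `target.parent.mkdir(parents=True, exist_ok=True)` before copying the downloaded file creates any missing directories.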

Saving Detections

For evaluation, the spatial detections need to be saved as .mat files. First, navigate to ./Save Detections/ and execute the following command:

CUDA_VISIBLE_DEVICES=0 python SaveDetections.py --dataset <dataset> --ngpu <gpu id> --exp_id <save dir name> --task doubleSM --frame_gap <default:1> --variant <default:1> --load_model /path/to/checkpoint --result_root /path/to/detections

Online Tube Generation

After the spatial detections are saved for each video, the action tubes and paths are generated using the proposed online tube generation algorithm and its variation. Both are based on the original implementation, which is also provided for comparison. The code can be found in ./online-tubes/.

  • To run the code, you will need to install MATLAB. You can install a free trial for testing purposes. If you run the scripts from the command line, make sure the MATLAB installation path is added to the conda environment.
  • If you only have command-line privileges, you can install Octave and run the tube generation with it.
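
For readers who want the intuition behind online tube linking before opening the MATLAB scripts, here is a simplified Python sketch. It is not the algorithm in ./online-tubes/: it assumes single-class detections given as (x1, y1, x2, y2) boxes, greedy matching by IoU with each tube's last box, and extrapolation by holding the last box for a few frames when no detection matches. The names `iou`, `link_frame`, `iou_thr`, and `max_missed` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_frame(tubes, detections, iou_thr=0.3, max_missed=5):
    """Extend the tubes with one frame of detections and return the tube list."""
    unused = list(detections)
    for tube in tubes:
        if tube["missed"] > max_missed:
            continue  # tube has been closed after too many misses
        last = tube["boxes"][-1]
        best = max(unused, key=lambda d: iou(last, d), default=None)
        if best is not None and iou(last, best) >= iou_thr:
            tube["boxes"].append(best)
            tube["missed"] = 0
            unused.remove(best)
        else:
            # No match (for example, an occlusion): hold the last box.
            tube["boxes"].append(last)
            tube["missed"] += 1
    for det in unused:
        tubes.append({"boxes": [det], "missed": 0})  # start a new tube
    return tubes
```

Feeding the frames one at a time keeps the procedure online: each call only needs the current detections and the tubes built so far.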

Executing with MATLAB

  1. Navigate to the respective directory:
cd ./online-tubes/EXP_without_BBOX
  2. In I01onlineTubes.m and utils/initDatasetOpts.m, set the paths to the saved detections and to the directory where the results should be saved.
  3. Execute I01onlineTubes.m. When executing from the command line:
matlab -batch "I01onlineTubes"

Executing with Octave

  1. Navigate to the respective directory:
cd ./online-tubes/EXP_without_BBOX
  2. In I01onlineTubes.m and utils/initDatasetOpts.m, set the paths to the saved detections and to the directory where the results should be saved.
  3. Execute I01onlineTubes.m. When executing from the command line:
octave I01onlineTubes.m

The current scripts can raise errors when run in Octave. This is caused by the -v7.3 argument passed to the save() function in the MATLAB scripts. Removing the -v7.3 argument from the save() calls lets the scripts run without errors.

Performance

We describe our experimental results and compare them with state-of-the-art offline and online methods that use either RGB or both RGB and OF inputs. For comparison, we also present results on action localization using only the appearance (A) information extracted from a single frame. The results of our proposed method in the table below demonstrate that we achieve state-of-the-art performance.

ST action localization results (v-mAP) on UCF101-24 and J-HMDB21 datasets
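
| Method | UCF101-24 f-mAP@0.5 | UCF101-24 v-mAP@0.2 | UCF101-24 v-mAP@0.5 | UCF101-24 v-mAP@0.75 | UCF101-24 v-mAP@0.5:0.95 | J-HMDB21 f-mAP@0.5 | J-HMDB21 v-mAP@0.2 | J-HMDB21 v-mAP@0.5 | J-HMDB21 v-mAP@0.75 | J-HMDB21 v-mAP@0.5:0.95 | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Saha et al.[3] | - | 66.6 | 36.4 | 7.9 | 14.4 | - | 72.6 | 71.5 | 43.3 | 40.0 | 4 |
| Peng et al.[4] | 65.7 | 72.9 | - | - | - | 58.5 | 74.3 | 73.1 | - | - | - |
| Zhang et al.[5] | 67.7 | 74.8 | 46.6 | 16.7 | 21.9 | 37.4 | - | - | - | - | 37.8 |
| ROAD+AF[6] | - | 73.5 | 46.3 | 15.0 | 20.4 | - | 70.8 | 70.1 | 43.7 | 39.7 | 7 |
| ROAD+RTF[6] | - | 70.2 | 43.0 | 14.5 | 19.2 | - | 66.0 | 63.9 | 35.1 | 34.4 | 28 |
| ROAD (A)[6] | - | 69.8 | 40.9 | 15.5 | 18.7 | - | 60.8 | 59.7 | 37.5 | 33.9 | 40 |
| Ours (A) | 71.8 | 70.2 | 44.3 | 16.6 | 20.6 | 51.2 | 59.3 | 59.2 | 48.2 | 41.2 | 52.9 |
| Ours | 74.7 | 72.7 | 43.1 | 16.8 | 20.2 | 50.5 | 58.9 | 58.4 | 49.5 | 40.6 | 41.8 |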
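
The v-mAP thresholds in the results above (0.2, 0.5, 0.75) refer to the overlap between a predicted and a ground-truth action tube. The sketch below uses a commonly used definition, temporal IoU multiplied by the mean spatial IoU over the shared frames; that this is exactly what the evaluation scripts in this repository compute is an assumption. Tubes are represented here as dictionaries mapping a frame index to an (x1, y1, x2, y2) box, and the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU of two tubes given as {frame_index: box} dicts."""
    frames_a, frames_b = set(tube_a), set(tube_b)
    shared = frames_a & frames_b
    if not shared:
        return 0.0
    temporal = len(shared) / len(frames_a | frames_b)
    spatial = sum(box_iou(tube_a[f], tube_b[f]) for f in shared) / len(shared)
    return temporal * spatial
```

Under this definition, a detected tube counts as correct at threshold 0.5 only if both its temporal extent and its boxes agree well with the ground truth.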

In the table below, we analyze the inference time of each module in the framework, and the overall inference time, for different variations of our pipeline. Evidently, any preprocessing has an impact on the inference time. The SS-map achieves a balance between run time and accuracy compared with the other variations in the framework.

Inference Run Time Analysis

| Framework Module | Ours | A + DSIM | A + I_{t-1} | A | A + RTF | A + AF |
|---|---|---|---|---|---|---|
| Temporal INFO EXT (ms) | 5.0 | 5.0 | - | - | 7.0 | 110.0 |
| Detection network (ms) | 16.4 | 16.4 | 16.4 | 16.4 | 16.4 | 16.4 |
| Tube generation time (ms) | 2.5 | 2.5 | 2.5 | 2.5 | 3.0 | 3.0 |
| Overall (ms) | 23.9 | 23.9 | 18.9 | 18.9 | 26.4 | 129.4 |
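
The overall times and the FPS figures reported earlier follow directly from the per-module times. The short sketch below redoes that arithmetic for four of the variants; treating a "-" entry as 0 ms (no temporal extraction step) is an assumption of this sketch.

```python
# Per-module inference times in milliseconds, copied from the run-time analysis.
MODULE_TIMES_MS = {
    "Ours": {"temporal": 5.0, "detection": 16.4, "tubes": 2.5},
    "A": {"temporal": 0.0, "detection": 16.4, "tubes": 2.5},
    "A + RTF": {"temporal": 7.0, "detection": 16.4, "tubes": 3.0},
    "A + AF": {"temporal": 110.0, "detection": 16.4, "tubes": 3.0},
}


def overall_ms(times):
    """Total per-frame latency: the modules run one after another."""
    return sum(times.values())


def fps(times):
    """Frames per second implied by the per-frame latency."""
    return 1000.0 / overall_ms(times)


if __name__ == "__main__":
    for name, times in MODULE_TIMES_MS.items():
        print(f"{name}: {overall_ms(times):.1f} ms, {fps(times):.1f} FPS")
```

For the full pipeline this gives 23.9 ms per frame, or about 41.8 FPS, and for the appearance-only variant 18.9 ms, or about 52.9 FPS, matching the FPS column of the results.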

Citation

@article{abeywardena2021korsal,
  title={KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal Action Localization},
  author={Abeywardena, Kalana and Sumanthiran, Shechem and Jayasundara, Sakuna and Karunasena, Sachira and Rodrigo, Ranga and Jayasekara, Peshala},
  journal={arXiv preprint arXiv:2111.03319},
  year={2021}
}
@INPROCEEDINGS{10288973,
  author={Abeywardena, Kalana and Sumanthiran, Shechem and Jayasundara, Sakuna and Karunasena, Sachira and Rodrigo, Ranga and Jayasekara, Peshala},
  booktitle={2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE)}, 
  title={KORSAL: Key-Point Based Online Real-Time Spatio-Temporal Action Localization}, 
  year={2023},
  volume={},
  number={},
  pages={279-284},
  doi={10.1109/CCECE58730.2023.10288973}}

References

[1] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[2] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In International Conf. on Computer Vision (ICCV), pages 3192–3199, December 2013.

[3] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016.

[4] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream r-cnn for action detection. In ECCV, pages 744–759. Springer, 2016.

[5] Dejun Zhang, Linchao He, Zhigang Tu, Shifu Zhang, Fei Han, and Boxiong Yang. Learning motion representation for real-time spatio-temporal action localization. Pattern Recognition, 103:107312, 2020.