The official code of "CSTA: CNN-based Spatiotemporal Attention for Video Summarization" [paper] [arXiv]
- Model overview
- Updates
- Requirements
- Data
- Pre-trained models
- Training
- Inference
- Generate summary videos
- Citation
- Acknowledgement
- [2024.03.24] Create a repository.
- [2024.05.21] Update the code and pre-trained models.
- [2024.07.18] Upload the code to generate summary videos, including custom videos.
- [2024.07.21] Update the KTS code for full frames of videos.
- [2024.07.23] Update the code to use only the CPU.
- [2024.12.30] Add tqdm to show progress while generating summary videos.
- (Planned) [2025.01.??] Add detailed explanations and comments for the code.
Ubuntu | GPU | CUDA | cuDNN | conda | python |
---|---|---|---|---|---|
20.04.6 LTS | NVIDIA GeForce RTX 4090 | 12.1 | 8902 | 4.9.2 | 3.8.5 |
h5py | numpy | scipy | torch | torchvision | tqdm |
---|---|---|---|---|---|
3.1.0 | 1.19.5 | 1.5.2 | 2.2.1 | 0.17.1 | 4.61.0 |
conda create -n CSTA python=3.8.5
conda activate CSTA
git clone https://github.com/thswodnjs3/CSTA.git
cd CSTA
pip install -r requirements.txt
Link: Dataset
The two benchmark video summarization datasets (SumMe and TVSum), preprocessed and stored in HDF5 (h5py) format.
Download the datasets and put them in the data/ directory.
The directory structure must look like this:
├── data
    ├── eccv16_dataset_summe_google_pool5.h5
    └── eccv16_dataset_tvsum_google_pool5.h5
You can see the details of both datasets below.
Link: Weights
You can download our pre-trained CSTA weights.
There are 5 weights for the SumMe dataset and 5 for the TVSum dataset (1 weight for each split).
As shown in the paper, we ran every experiment 10 times without fixing the seed, but uploaded only one representative model for your convenience.
The uploaded weights were obtained with seed 123456, and their results are almost identical to those reported in our paper.
Put the 5 SumMe weights in weights/SumMe and the 5 TVSum weights in weights/TVSum.
The directory structure must look like this:
├── weights
    ├── SumMe
        ├── split1.pt
        ├── split2.pt
        ├── split3.pt
        ├── split4.pt
        └── split5.pt
    └── TVSum
        ├── split1.pt
        ├── split2.pt
        ├── split3.pt
        ├── split4.pt
        └── split5.pt
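Before running inference, it can help to confirm that all ten weight files are in place. The sketch below is a hypothetical helper, not part of this repository. It assumes only the weights/<dataset>/split<N>.pt layout described in this section.

```python
from pathlib import Path
from typing import List

# Layout described in this README: 5 splits for each of the two datasets.
DATASETS = ("SumMe", "TVSum")
NUM_SPLITS = 5


def expected_weight_paths(root: str = "weights") -> List[Path]:
    """Build the ten paths weights/<dataset>/split<N>.pt."""
    return [
        Path(root) / dataset / f"split{split}.pt"
        for dataset in DATASETS
        for split in range(1, NUM_SPLITS + 1)
    ]


def missing_weights(root: str = "weights") -> List[Path]:
    """Return the expected weight files that do not exist under root."""
    return [path for path in expected_weight_paths(root) if not path.is_file()]
```

`missing_weights()` returns an empty list when the layout is complete. Otherwise it lists the files that still need to be downloaded or moved.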
You can train the final version of our models with the command below.
python train.py
Detailed explanations for all configurations will be updated later.
As shown in the paper, we ran every experiment 10 times without fixing the seed, so we cannot be sure which seeds reproduce the reported results.
Even if you set the seed to 123456, the same seed used for our pre-trained models, your results may differ because the Adaptive Average Pooling layer is non-deterministic.
To my knowledge, non-deterministic operations can produce different results even with the same seed. You can see details here.
However, setting the seed to 123456 should give results similar to those of the pre-trained models, so I hope this is helpful.
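For readers who want to limit run-to-run variation, here is a minimal seeding sketch. `set_seed` is a hypothetical helper, and how train.py actually seeds its generators is not shown in this README. The NumPy and PyTorch calls are skipped when those packages are not installed. With `warn_only=True`, PyTorch emits a warning instead of raising an error when it reaches an operation that has no deterministic implementation.

```python
import random


def set_seed(seed: int = 123456) -> None:
    """Seed the random number generators that are available (hypothetical helper)."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed

    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed; nothing more to do

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Warn (rather than fail) on operations without a deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
```

Seeding makes the random number streams repeatable, but it does not make a non-deterministic operation deterministic, so small differences between runs can remain.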
You can see the final performance of the models with the command below.
python inference.py
All weight files must be located as described above.
You can generate summary videos using our models.
You can use either videos from public datasets or custom videos.
With the code below, you can apply our pre-trained models to raw videos to produce summary videos.
python generate_video.py --input_is_file True or False
--file_path 'path to input video'
--dir_path 'directory of input videos'
--ext 'video file extension'
--save_path 'path to save summary video'
--weight_path 'path to loaded weights'
e.g.
1) Using a directory
python generate_video.py --input_is_file False --dir_path './videos' --ext 'mp4' --save_path './summary_videos' --weight_path './weights/SumMe/split4.pt'
2) Using a single video file
python generate_video.py --input_is_file True --file_path './videos/Jumps.mp4' --save_path './summary_videos' --weight_path './weights/SumMe/split4.pt'
The explanation of the arguments is as follows.
If you pass a directory of videos and change the 'ext' argument, you must also modify the 'fourcc' variable in the 'produce_video' function in 'generate_video.py'.
The same change is needed when you pass a single video file whose extension is not 'mp4'.
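As a rough guide for that change, the sketch below assumes 'produce_video' writes frames with OpenCV's `cv2.VideoWriter`, which this README does not confirm. The extension-to-codec table lists common pairings, not values taken from the repository. `fourcc_code` is a hypothetical helper that packs a tag the same way `cv2.VideoWriter_fourcc(*tag)` does, so the example runs without OpenCV installed.

```python
# Common container-to-codec pairings (an assumption, not taken from this repo).
# Whether a codec actually works depends on the local OpenCV/FFmpeg build.
FOURCC_BY_EXT = {
    "mp4": "mp4v",
    "avi": "XVID",
}


def fourcc_code(tag: str) -> int:
    """Pack a four-character codec tag into the integer OpenCV expects."""
    if len(tag) != 4:
        raise ValueError("a FourCC tag must have exactly four characters")
    # Same packing as cv2.VideoWriter_fourcc(*tag): first character in the lowest byte.
    return int.from_bytes(tag.encode("ascii"), "little")


def fourcc_for_extension(ext: str) -> int:
    """Look up the codec tag for a file extension such as 'mp4' or '.avi'."""
    tag = FOURCC_BY_EXT[ext.lower().lstrip(".")]
    return fourcc_code(tag)
```

If OpenCV is installed, `cv2.VideoWriter_fourcc(*"mp4v")` gives the same integer as `fourcc_code("mp4v")`.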
1. input_is_file (bool): True or False
Indicates whether the input is a file or a directory.
If this is True, the 'file_path' argument is required.
If this is False, the 'dir_path' and 'ext' arguments are required.
2. file_path (str) e.g. './SumMe/Jumps.mp4'
The path of the video file.
This is only used when 'input_is_file' is True.
3. dir_path (str) e.g. './SumMe'
The path of the directory where video files are located.
This is only used when 'input_is_file' is False.
4. ext (str) e.g. 'mp4'
The file extension of the video files.
This is only used when 'input_is_file' is False.
5. sample_rate (int) e.g. 15
The interval between selected frames in a video.
For example, if the video has 30 fps, it will become 2 fps with a sample_rate of 15.
6. save_path (str) e.g. './summary_videos'
The path where the summary videos are saved.
7. weight_path (str) e.g. './weights/SumMe/split4.pt'
The path where the model weights are loaded from.
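The sample_rate arithmetic from item 5 can be sketched as follows. Both helpers are hypothetical, and starting the selection at frame 0 is an assumption, not something this README states.

```python
from typing import List


def effective_fps(fps: float, sample_rate: int) -> float:
    """Frame rate after keeping one frame out of every sample_rate frames."""
    return fps / sample_rate


def sampled_frame_indices(n_frames: int, sample_rate: int) -> List[int]:
    """Indices of the kept frames, assuming selection starts at frame 0."""
    return list(range(0, n_frames, sample_rate))


print(effective_fps(30, 15))          # 2.0
print(sampled_frame_indices(60, 15))  # [0, 15, 30, 45]
```

A 30 fps video with a sample_rate of 15 therefore contributes 2 frames per second of footage to the model.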
We referenced the KTS code from DSNet.
However, they applied KTS to downsampled videos (2 fps), which can result in different shot change points and sometimes make it impossible to summarize videos.
We revised it to compute change points over all frames of the video.
If you find our code or our paper useful, please star this repo and cite the following paper:
@inproceedings{son2024csta,
title={CSTA: CNN-based Spatiotemporal Attention for Video Summarization},
author={Son, Jaewon and Park, Jaehun and Kim, Kwangsu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18847--18856},
year={2024}
}
We sincerely thank the authors of PosENet and RR-STG, who responded very kindly to our requests.
Below are the papers we referenced for the code.
A2Summ - paper, code
CA-SUM - paper, code
DSNet - paper, code
iPTNet - paper
MSVA - paper, code
PGL-SUM - paper, code
PosENet - paper, code
RR-STG - paper
SSPVS - paper, code
STVT - paper, code
VASNet - paper, code
VJMHT - paper, code
@inproceedings{he2023a2summ,
title = {Align and Attend: Multimodal Summarization with Dual Contrastive Losses},
author={He, Bo and Wang, Jun and Qiu, Jielin and Bui, Trung and Shrivastava, Abhinav and Wang, Zhaowen},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023}
}
@inproceedings{10.1145/3512527.3531404,
author = {Apostolidis, Evlampios and Balaouras, Georgios and Mezaris, Vasileios and Patras, Ioannis},
title = {Summarizing Videos Using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames},
year = {2022},
isbn = {9781450392389},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3512527.3531404},
doi = {10.1145/3512527.3531404},
pages = {407--415},
numpages = {9},
keywords = {frame diversity, frame uniqueness, concentrated attention, unsupervised learning, video summarization},
location = {Newark, NJ, USA},
series = {ICMR '22}
}
@article{zhu2020dsnet,
title={DSNet: A Flexible Detect-to-Summarize Network for Video Summarization},
author={Zhu, Wencheng and Lu, Jiwen and Li, Jiahao and Zhou, Jie},
journal={IEEE Transactions on Image Processing},
volume={30},
pages={948--962},
year={2020}
}
@inproceedings{jiang2022joint,
title={Joint video summarization and moment localization by cross-task sample transfer},
author={Jiang, Hao and Mu, Yadong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={16388--16398},
year={2022}
}
@inproceedings{ghauri2021MSVA,
title={Supervised Video Summarization via Multiple Feature Sets with Parallel Attention},
author={Ghauri, Junaid Ahmed and Hakimov, Sherzod and Ewerth, Ralph},
booktitle={IEEE International Conference on Multimedia and Expo (ICME)},
year={2021}
}
@INPROCEEDINGS{9666088,
author = {Apostolidis, Evlampios and Balaouras, Georgios and Mezaris, Vasileios and Patras, Ioannis},
title = {Combining Global and Local Attention with Positional Encoding for Video Summarization},
booktitle = {2021 IEEE International Symposium on Multimedia (ISM)},
month = {December},
year = {2021},
pages = {226--234}
}
@InProceedings{islam2020position,
title={How much Position Information Do Convolutional Neural Networks Encode?},
author={Islam, Md Amirul and Jia, Sen and Bruce, Neil},
booktitle={International Conference on Learning Representations},
year={2020}
}
@article{zhu2022relational,
title={Relational reasoning over spatial-temporal graphs for video summarization},
author={Zhu, Wencheng and Han, Yucheng and Lu, Jiwen and Zhou, Jie},
journal={IEEE Transactions on Image Processing},
volume={31},
pages={3017--3031},
year={2022},
publisher={IEEE}
}
@inproceedings{li2023progressive,
title={Progressive Video Summarization via Multimodal Self-supervised Learning},
author={Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Drummond, Tom},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={5584--5593},
year={2023}
}
@article{hsu2023video,
title={Video summarization with spatiotemporal vision transformer},
author={Hsu, Tzu-Chun and Liao, Yi-Sheng and Huang, Chun-Rong},
journal={IEEE Transactions on Image Processing},
year={2023},
publisher={IEEE}
}
@misc{fajtl2018summarizing,
title={Summarizing Videos with Attention},
author={Jiri Fajtl and Hajar Sadeghi Sokeh and Vasileios Argyriou and Dorothy Monekosso and Paolo Remagnino},
year={2018},
eprint={1812.01969},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{li2022video,
title={Video Joint Modelling Based on Hierarchical Transformer for Co-summarization},
author={Li, Haopeng and Ke, Qiuhong and Gong, Mingming and Zhang, Rui},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2022},
publisher={IEEE}
}