Implementation of "Explore-And-Match".
Prerequisites:
- python == 3.8.11
- cuda == 10.2
- torch == 1.8.0
- torchvision == 0.9.0
- numpy == 1.20.3
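The pinned versions above can be installed with, for example, conda and pip (environment name and exact commands are assumptions; the CUDA 10.2 build of torch may require the official PyTorch wheel index):

```shell
# Sketch of one possible environment setup (names are illustrative, adjust as needed).
conda create -n explore-and-match python=3.8.11 -y
conda activate explore-and-match
pip install torch==1.8.0 torchvision==0.9.0 numpy==1.20.3
```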
Dataset splits:
- ActivityNet (train/val/test)
- Charades (train/test: 5338/1334)
Request videos (ActivityNet): the videos must be requested for download via the form below:
https://docs.google.com/forms/d/e/1FAIpQLSeKaFq9ZfcmZ7W0B0PbEhfbTHY41GeEgwsa7WobJgGUhn4DTQ/viewform
After downloading, merge 'v1-2' and 'v1-3' into a single folder named 'videos'.
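Merging the two folders can be done with plain shell. The sketch below first creates placeholder 'v1-2'/'v1-3' folders with dummy files so it is self-contained; with the real download, only the last two commands are needed:

```shell
# Simulate the two downloaded ActivityNet folders (placeholder files for illustration only).
mkdir -p v1-2 v1-3
touch v1-2/clip_a.mp4 v1-3/clip_b.mp4

# Merge 'v1-2' and 'v1-3' into a single 'videos' folder, as described above.
mkdir -p videos
mv v1-2/* v1-3/* videos/
```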
Download annotations (ActivityNet Captions):
https://cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip
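The ActivityNet Captions annotations pair each video with segment timestamps and caption sentences. The sketch below illustrates the JSON schema with a made-up entry (field names follow the dataset release; the video id and values are placeholders):

```python
import json

# Placeholder entry mirroring the ActivityNet Captions schema: each video id
# maps to its duration (seconds), a list of [start, end] timestamps, and a
# parallel list of caption sentences.
sample_json = """
{
  "v_example": {
    "duration": 82.7,
    "timestamps": [[0.8, 19.9], [17.4, 60.8]],
    "sentences": ["A person enters the room.", "The person starts dancing."]
  }
}
"""

anns = json.loads(sample_json)
for vid, ann in anns.items():
    # Timestamps and sentences are aligned one-to-one.
    for (start, end), sent in zip(ann["timestamps"], ann["sentences"]):
        print(f"{vid} [{start:.1f}s-{end:.1f}s]: {sent}")
```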
Download videos (Charades).
Download annotations (Charades):
- C3D
- CLIP
Extract a constant number of frames (64/128/256) per video:
bash preprocess/get_constant_frames_per_video.sh
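The script above reduces every video to a fixed number of frames. A minimal sketch of uniform index sampling (the actual script's sampling strategy may differ):

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list:
    """Pick `num_samples` uniformly spaced frame indices in [0, total_frames)."""
    if total_frames <= num_samples:
        # Fewer frames than requested: keep them all.
        return list(range(total_frames))
    if num_samples == 1:
        return [0]
    # Evenly spread indices from the first frame to the last.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

# e.g. sample 64 indices from a 1000-frame video
indices = uniform_frame_indices(1000, 64)
```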
For the CLIP encodings, change 'val_1' to 'val' and 'val_2' to 'test':
bash preprocess/get_clip_features.sh
Set {dataset} to either 'activitynet' or 'charades'.
Train:
bash train_{dataset}.sh
Test:
bash test_{dataset}.sh
For detailed configuration options, refer to lib/configs.py.
@article{woo2022explore,
title={Explore and Match: End-to-End Video Grounding with Transformer},
author={Woo, Sangmin and Park, Jinyoung and Koo, Inyong and Lee, Sumin and Jeong, Minki and Kim, Changick},
journal={arXiv preprint arXiv:2201.10168},
year={2022}
}