diff --git a/dataset_dataloading/README.md b/dataset_dataloading/README.md
index e602dd8..f85fcff 100755
--- a/dataset_dataloading/README.md
+++ b/dataset_dataloading/README.md
@@ -4,8 +4,8 @@ The section includes the csv files listing the data samples in Panda-70M and the
 **[Note] Please use the video2dataset tool from this repository to download the dataset, as the video2dataset from [the official repository](https://github.com/iejMac/video2dataset) cannot work with our csv format for now.**
 
 ## Data Splitting and Download Link
-| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space|
-|-----------------|----------|-----------------|-----------|----------------|--------------|
+| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space |
+|-----------------|----------|-----------------|-----------|----------------|---------------|
 | Training (full) | [link](https://drive.google.com/file/d/1DeODUcdJCEfnTjJywM-ObmrlVg-wsvwz/view?usp=sharing) (2.01 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
 | Training (10M) | [link](https://drive.google.com/file/d/1Lrsb65HTJ2hS7Iuy6iPCmjoc3abbEcAX/view?usp=sharing) (381 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
 | Training (2M) | [link](https://drive.google.com/file/d/1jWTNGjb-hkKiPHXIbEA5CnFwjhA-Fq_Q/view?usp=sharing) (86.5 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
@@ -44,16 +44,20 @@ video2dataset --url_list="" \
HTTP Error 403: Forbidden
- Your IP got blocked. Please use proxy for downloading. Refer this issue.
+ Your IP got blocked. Use a proxy for downloading. Please refer to this issue.
HTTP Error 429: Too Many Requests
- Your download request reaches a limitation. Please slow down the download speed. Refer this issue.
+ Your download requests have reached a rate limit. Slow down the download by reducing `processes_count` and `thread_count` in the config file. Please refer to this issue.
In the json file:
"status": "failed_to_download" & "error_message":
"[Errno 2] No such file or directory: '/tmp/...'"
The YouTube video has been set to private or removed. Please skip this sample. + +
YouTube said: ERROR - Precondition check failed
+ Your yt-dlp version is out of date; install a nightly version. Please refer to this issue.
+
 ### Dataset Format
@@ -82,6 +86,7 @@ output-folder
 - Meta information includes matching score (confidence score of each video-caption pair), caption, video title / description / categories / subtitles, to name but a few.
 - **[Note 1]** The dataset is unshuffled and the clips from a same long video would be stored into a shard. Please manually shuffle them if needed.
 - **[Note 2]** The videos are resized into 360 px height. You can change `download_size` in the [config](./video2dataset/video2dataset/configs/panda_70M.yaml) file to get different video resolutions.
+- **[Note 3]** The videos are downloaded with audio by default. You can set `download_audio` to `False` in the [config](./video2dataset/video2dataset/configs/panda_70M.yaml) file to skip the audio and speed up the download.
 
 ## Acknowledgements
 The code for data downloading is built upon [video2dataset](https://github.com/iejMac/video2dataset).
diff --git a/dataset_dataloading/video2dataset/video2dataset/configs/panda_70M.yaml b/dataset_dataloading/video2dataset/video2dataset/configs/panda_70M.yaml
index 6be887f..caddad3 100755
--- a/dataset_dataloading/video2dataset/video2dataset/configs/panda_70M.yaml
+++ b/dataset_dataloading/video2dataset/video2dataset/configs/panda_70M.yaml
@@ -3,7 +3,7 @@ subsampling: {}
 reading:
     yt_args:
         download_size: 360
-        download_audio_rate: 44100
+        download_audio: True
         yt_metadata_args:
             writesubtitles: True
             subtitleslangs: ['en']
@@ -21,4 +21,4 @@ distribution:
     processes_count: 32
     thread_count: 32
     subjob_size: 10000
-    distributor: "multiprocessing"
\ No newline at end of file
+    distributor: "multiprocessing"
diff --git a/dataset_dataloading/video2dataset/video2dataset/data_reader.py b/dataset_dataloading/video2dataset/video2dataset/data_reader.py
index 03d14c8..3a23717 100755
--- a/dataset_dataloading/video2dataset/video2dataset/data_reader.py
+++ b/dataset_dataloading/video2dataset/video2dataset/data_reader.py
@@ -166,6 +166,7 @@ def __init__(self, yt_args, tmp_dir, encode_formats):
         self.metadata_args = yt_args.get("yt_metadata_args", {})
         self.video_size = yt_args.get("download_size", 360)
         self.audio_rate = yt_args.get("download_audio_rate", 44100)
+        self.download_audio = yt_args.get("download_audio", False)
         self.tmp_dir = tmp_dir
         self.encode_formats = encode_formats
 
@@ -177,9 +178,9 @@ def __call__(self, url):
         modality_paths = {}
 
         video_format_string = (
-            f"wv*[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}/"
-            f"w[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}/"
-            f"bv/b[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}"
+            f"wv*[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba' if self.download_audio else ''}/"
+            f"w[height>={self.video_size}][ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba' if self.download_audio else ''}/"
+            f"bv/b[ext=mp4]{'[codec=avc1]' if self.specify_codec else ''}{'+ba' if self.download_audio else ''}"
         )
         audio_fmt_string = (
             f"wa[asr>={self.audio_rate}][ext=m4a] / ba[ext=m4a]" if self.audio_rate > 0 else "ba[ext=m4a]"
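
To see what the `download_audio` flag changes, the selector construction from the `data_reader.py` hunk can be reproduced in isolation. This is a minimal sketch, not the repository's code: `build_video_format` is a hypothetical helper name, and its parameters stand in for the `self.video_size`, `self.specify_codec`, and `self.download_audio` attributes shown in the diff.

```python
def build_video_format(video_size=360, specify_codec=False, download_audio=True):
    """Rebuild the three-way fallback selector from the diff (hypothetical helper)."""
    # Optional suffixes, copied verbatim from the f-strings in the diff.
    codec = "[codec=avc1]" if specify_codec else ""
    audio = "+ba" if download_audio else ""
    # Alternatives are separated by "/" and tried left to right.
    return (
        f"wv*[height>={video_size}][ext=mp4]{codec}{audio}/"
        f"w[height>={video_size}][ext=mp4]{codec}{audio}/"
        f"bv/b[ext=mp4]{codec}{audio}"
    )


if __name__ == "__main__":
    # With audio: every alternative asks for the best audio stream to be merged in.
    print(build_video_format(download_audio=True))
    # Without audio: the selector is identical to the pre-change version.
    print(build_video_format(download_audio=False))
```

With `download_audio=False` the string matches the removed lines of the hunk exactly, so setting the flag to `False` in the config restores the previous video-only behavior.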