Dense Video Captioning on raw input videos #11
Comments
Hi. Thanks for your interest in our work! I am afraid there are several obstacles that are hard to overcome before fulfilling your request, namely:
Therefore, we are not supporting this feature 🙁. At the same time, let me provide some advice on how to try to mitigate the issues.
On a good note, we made the rest of the ingredients more accessible. Specifically:
Of course, let me know if you have any further questions about the procedure via e-mail or in Issues in this repo. |
Hi, thanks for your prompt and detailed reply. Regarding the single-video prediction in BMT, I already used the video_features repo and the scripts you provided. Running that code is very convenient thanks to the detailed steps you have provided. But the work that I am currently pursuing requires me to generate descriptions grounded in the audio transcript. That is why I was trying to get this code running, as it also takes speech into account. |
Thank you very much, I will try this script, and hopefully I can run the code for a single video. |
Sorry to bother you again. I tried that script, and it is working fine. I have a few other questions.
|
No problem. Thanks for your questions.
```python
import h5py

# create empty hdf5 files on your disk
h5py.File(hdf5_vggish_features_path, 'w').close()
h5py.File(hdf5_i3d_features_path, 'w').close()
```
I will assume that you extracted the features for your custom videos into some folder. This means that you may have several feature files per video, e.g. `video_1_vggish.npy`, `video_1_rgb.npy`, and `video_1_flow.npy` for a video called `video_1`. First, collect the unique video paths:
```python
import os

# returns an unsorted list of filenames from `features_path`
list_of_files = os.listdir(features_path)
# join the filenames with the parent directory (making sure non-`.npy` files are ignored)
list_of_paths = [os.path.join(features_path, fname) for fname in list_of_files if fname.endswith('.npy')]
# we only care about the beginning of each file path (e.g. `/some_path/video_1`)
paths = [path.replace('_vggish.npy', '').replace('_rgb.npy', '').replace('_flow.npy', '') for path in list_of_paths]
# we expect 3 duplicate paths per video, so remove the duplicates
paths = list(set(paths))
```
Then, open the files, read the features from the numpy files, and append them to the hdf5 files in a for-loop:
```python
import numpy as np

# start context managers ('a' == append) for both files
with h5py.File(hdf5_vggish_features_path, 'a') as hd5vgg, h5py.File(hdf5_i3d_features_path, 'a') as hd5i3d:
    for path in paths:
        # construct the per-modality paths
        vggish_path = f'{path}_vggish.npy'
        rgb_path = f'{path}_rgb.npy'
        flow_path = f'{path}_flow.npy'
        # load the numpy files
        vggish = np.load(vggish_path)
        rgb = np.load(rgb_path)
        flow = np.load(flow_path)
        # extract the video name from the path (relying on the rgb path only):
        # os.path.split() returns a (parent dir path, filename) pair, so take the filename
        # and remove the '_rgb.npy' part (`video_1_rgb.npy` -> `video_1`)
        video_name = os.path.split(rgb_path)[-1].replace('_rgb.npy', '')
        # append the features to the hdf5 files
        # VGGish
        hd5vgg.create_dataset(f'{video_name}/vggish_features', vggish.shape, data=vggish)
        # RGB
        hd5i3d.create_dataset(f'{video_name}/i3d_features/rgb', rgb.shape, data=rgb)
        # Flow
        hd5i3d.create_dataset(f'{video_name}/i3d_features/flow', flow.shape, data=flow)
```
Please note that I provide this code just for guidance, as I neither compiled it nor tested it locally. Adapt it to your needs. That's it. If you like, you can update the code of …
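If you want to double-check the result before moving on, here is a quick, equally untested sketch for inspecting the hdf5 layout (the group names follow the datasets created above; `video_1` and the printed shapes are only illustrative and depend on your videos and extraction settings):
```python
import h5py

# peek inside the freshly created hdf5 files (read-only)
with h5py.File(hdf5_vggish_features_path, 'r') as f:
    print(list(f.keys()))                        # e.g. ['video_1', 'video_2']
    print(f['video_1/vggish_features'].shape)    # e.g. (num_audio_segments, 128) for VGGish
with h5py.File(hdf5_i3d_features_path, 'r') as f:
    print(f['video_1/i3d_features/rgb'].shape)   # e.g. (num_stacks, 1024) per I3D stream
    print(f['video_1/i3d_features/flow'].shape)  # should match the rgb shape
```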
```python
import torch
from torch.utils.data import DataLoader
from dataset.dataset import ActivityNetCaptionsIteratorDataset

val_dataset = ActivityNetCaptionsIteratorDataset(
    '<s>', '</s>', '<blank>', 1,
    28, './data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5', 'i3d',
    False, False,
    './data/sub_activitynet_v1-3.vggish.hdf5', 'vggish',
    False, False,
    './data/train_meta.csv', './data/val_1_meta.csv', './data/val_2_meta.csv',
    torch.device('cuda:0'), 'val_1', 'subs_audio_video',
    False, props_are_gt=True, get_full_feat=False
)
val_loader = DataLoader(val_dataset, collate_fn=val_dataset.dont_collate)
```
Next, you will want to update the `predict_1by1_for_TBoard` function, for example:
```python
import pandas as pd
import h5py


def predict_1by1_for_TBoard(your_meta_path, vggish_hdf5_features_path, i3d_hdf5_features_path,
                            vid_ids_list, val_loader, decoder, model, max_len=100):
    '''
    your_meta_path: path to your .csv
    *_hdf5_features_path: paths to your hdf5 files
    vid_ids_list: the ids which will be used to filter the meta file, e.g. ['video_1', 'video_2']
    val_loader: the dataloader object defined above
    decoder: just pass the greedy_decoder function
    model: pass the pre-trained model
    max_len: the longest caption allowed; generation stops once it is exceeded
    '''
    # for a dataframe example see `./data/val_1_meta.csv`. Make sure your video_id corresponds to the
    # filenames w/o extension (`video_1_rgb.npy` -> `video_1`) that you used to create the hdf5 files,
    # because load_multimodal_features_from_h5() will use them;
    # they should also be present in the `vid_ids_list` variable
    meta = pd.read_csv(your_meta_path, sep='\t')
    # re-define the hdf5 files with your custom ones
    feat_h5_audio = h5py.File(vggish_hdf5_features_path, 'r')
    feat_h5_video = h5py.File(i3d_hdf5_features_path, 'r')

    feature_names = val_loader.dataset.feature_names
    device = val_loader.dataset.device
    start_idx = val_loader.dataset.start_idx
    end_idx = val_loader.dataset.end_idx
    pad_idx = val_loader.dataset.pad_idx
    modality = val_loader.dataset.modality

    text = ''
    for vid_id in vid_ids_list:
        meta_subset = meta[meta['video_id'] == vid_id]
        text += f'\t {vid_id} \n'

        for (video_id, cap, start, end, duration, category, subs, phase, idx) in meta_subset.values:
            feature_names_list = val_loader.dataset.features_dataset.feature_names_list
            train_subs_vocab = val_loader.dataset.train_subs_vocab

            # rgb is padded with pad_idx; flow is padded with 0s: they are expected to be summed later
            video_stack_rgb, video_stack_flow, audio_stack = load_multimodal_features_from_h5(
                feat_h5_video, feat_h5_audio, feature_names_list, video_id, start, end, duration
            )
            subs_stack = encode_subs(train_subs_vocab, idx, meta, start_idx, end_idx)

            video_stack_rgb = video_stack_rgb.unsqueeze(0).to(device)
            video_stack_flow = video_stack_flow.unsqueeze(0).to(device)
            audio_stack = audio_stack.unsqueeze(0).to(device)
            subs_stack = subs_stack.unsqueeze(0).to(device)

            stack = video_stack_rgb + video_stack_flow, audio_stack, subs_stack

            trg_ints = decoder(model, stack, max_len, start_idx, end_idx, pad_idx, modality)
            trg_ints = trg_ints.cpu().numpy()[0]
            trg_words = [val_loader.dataset.train_vocab.itos[i] for i in trg_ints]
            en_sent = ' '.join(trg_words)

            text += f'\t P sent: {en_sent} \n'
            text += f'\t P proposals: {start//60:.0f}:{start%60:02.0f} {end//60:.0f}:{end%60:02.0f} '

        text += '\t \n'

    return text
```
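To give an idea of how this function could be wired up, here is a rough, untested sketch: the checkpoint path and the way the pre-trained model is restored are assumptions, `greedy_decoder` is the decoding function mentioned in the docstring, and the meta/hdf5 paths are just placeholders:
```python
import torch

# the checkpoint path and its structure are assumptions -- adapt to how you obtained/saved the model
ckpt = torch.load('./path/to/pretrained_mdvc.pt', map_location='cpu')
model = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt
model = model.to(val_loader.dataset.device).eval()

predictions = predict_1by1_for_TBoard(
    './data/my_custom_meta.csv',      # your tab-separated meta file (placeholder path)
    './data/my_custom_vggish.hdf5',   # vggish hdf5 created earlier (placeholder path)
    './data/my_custom_i3d.hdf5',      # i3d hdf5 created earlier (placeholder path)
    ['video_1', 'video_2'],           # ids present both in the meta file and in the hdf5 files
    val_loader, greedy_decoder, model,
)
print(predictions)
```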
Please use … Again, I haven't tested the code, and I really hope it will work. Let me know if you have any further questions. |
Thank you very much |
Thanks a lot for your help; this code works for predicting captions for a single video, and the results are also good. |
Great 🎉! Yep, I also noticed it today 🙂. If you have a working example that you are comfortable with sharing, just type it here, or we may also discuss how to form a pull request. I think it would be interesting even for YouTube videos only. |
Sure! |
🙂 I meant the script which takes a subs file, VGGish and I3D features and outputs a set of predictions for, at least, the GT proposals. |
Oh, my bad!
Since I didn't have ground-truth proposals for this module of my project, I made some changes (put "none" where I didn't have anything, etc.). I also made some changes to the timestamp parsing, because that is the format my ASR output has.
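For reference, such a meta file could be built along the lines of the following untested sketch. The column order mirrors the tuple unpacked from `meta_subset.values` in `predict_1by1_for_TBoard` above; the column names (apart from `video_id`, which the function filters on) and all concrete values are only placeholders, so check `./data/val_1_meta.csv` for the exact format:
```python
import pandas as pd

rows = [
    # video_id, caption, start, end, duration, category, subs, phase, idx
    ('video_1', 'none', 0.0, 12.5, 60.0, 'none', 'hello and welcome to the show', 'val_1', 0),
    ('video_1', 'none', 12.5, 30.0, 60.0, 'none', 'today we talk about video captioning', 'val_1', 1),
]
# the column names here are assumptions; only `video_id` is referenced by name in the function
columns = ['video_id', 'caption', 'start', 'end', 'duration', 'category_32', 'subs', 'phase', 'idx']
meta = pd.DataFrame(rows, columns=columns)
# the prediction function reads the meta file with sep='\t', so save it the same way
meta.to_csv('./data/my_custom_meta.csv', sep='\t', index=False)
```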
And I created `run.py` in the same folder (/MDVC).
Then I call |
Cool thanks. |
Hi Vladimir, |
Hi, unfortunately, we don't have a PyTorch implementation. |
Thanks for your prompt reply. So, did you use their TensorFlow implementation to generate the proposals and then use them for the captioning? |
Nope, we just used the predicted proposals. |
Could you please share your code for dense video captioning on raw input videos? Thanks! |
Unfortunately, I don't have the code for this, as I switched to the BMT code provided by @v-iashin (https://github.com/v-iashin/BMT#single-video-prediction). |
Thanks for providing the code. How about the results of dense video caption generation? Could it be used in practical applications? Thanks! |
Hi Keerti, I hope you are doing great! |
Hi @taruntiwarihp, unfortunately, I don't have the code regarding MDVC. As I mentioned above, I switched to BMT myself. All I had for MDVC is mentioned in the issues above. |
No problem @harpavatkeerti, I can do it myself. Once I've done it, I'll share it with you. |
Does anyone have code to run MDVC on our own videos? |
This seems like nice work. I wanted to test it on custom input videos. It would be very helpful if you could provide a script for generating video captions for a raw input video.