There is an increasing trend in the research community toward video processing using artificial intelligence. Trending tasks include:
- Video classification.
- Video content description.
- Video question answering (VQA).
The main idea is to generate descriptions for unconstrained videos, which can be used in video retrieval, navigation for the blind, and video subtitling.
We use the Microsoft Research Video to Text (MSVD) dataset.
We extracted the visual features of the dataset using:
- VGG-16 (as in the original paper): gdrive link
Here is our architecture.
We have trained the model using different techniques:
- Base model, as in the Sequence to Sequence -- Video to Text paper: gdrive link
- Using dropout on features: gdrive link
- Using temporal attention: gdrive link
- Using both dropout and temporal attention: gdrive link
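The two techniques listed above, dropout on the input features and temporal attention over frames, can be sketched together as below. This is a minimal illustration under assumed dimensions (4096-d frame features, a 512-d decoder state), not the repository's actual model; the `TemporalAttention` module name and the additive scoring form are our own choices for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: additive temporal attention, where the decoder state
# scores each frame and the context vector is their weighted sum.
class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=128):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, dec_state):
        # frame_feats: (batch, T, feat_dim); dec_state: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(frame_feats)
            + self.state_proj(dec_state).unsqueeze(1)
        )).squeeze(-1)                   # unnormalized scores, (batch, T)
        alpha = F.softmax(e, dim=1)      # attention weights over the frames
        context = (alpha.unsqueeze(-1) * frame_feats).sum(dim=1)
        return context, alpha            # (batch, feat_dim), (batch, T)

drop = nn.Dropout(p=0.5)                 # dropout applied to the features
attn = TemporalAttention(feat_dim=4096, hidden_dim=512)
feats = drop(torch.randn(2, 8, 4096))    # batch of 2 videos, 8 frames each
ctx, alpha = attn(feats, torch.randn(2, 512))
```

At each decoding step the context vector `ctx` would be fed to the decoder alongside the previous word, letting the model attend to different frames for different words.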
From these experiments, we found that the best results come from combining temporal attention and dropout. Our model outperforms the original paper's model on all metrics used, as shown in the following table:
Contributions are always welcome!
Please read the contribution guidelines first.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.