-
I was trying to implement your model but ran into a problem. The text encodings for each word are of length 773 (768+5) and the audio input is of length 64 (depending on which feature extractor is used). Concatenating these gives an input vector of 837 for each frame, whereas the paper (Section 4.4) states "..Speech-encoding dimensionality 124 for each ..". I went through the implementation too, but didn't find any form of dimensionality reduction (PCA or anything else) applied to the input vectors (text or audio). Please let me know how you arrived at an input dimensionality of 124, because taking it as 837 results in a huge network.
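For concreteness, here is a minimal sketch of the per-frame concatenation described above, assuming 773-dimensional word-aligned text encodings (768 + 5) and 64-dimensional audio features; the tensor names and shapes are illustrative assumptions, not taken from the repository:

```python
import torch

# Hypothetical features for a sequence of T frames (shapes are assumptions)
T = 100
text_feats = torch.randn(T, 768 + 5)   # per-frame text encodings, 773 dims
audio_feats = torch.randn(T, 64)       # per-frame audio features, 64 dims

# Concatenating along the feature axis gives the 837-dim input per frame
frames = torch.cat([text_feats, audio_feats], dim=-1)
print(frames.shape)                     # torch.Size([100, 837])
```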
-
Hi @Shrey-55,
My understanding is that the first thing that happens is that each frame (the high-dimensional text+audio vector) is passed through a feed-forward network that encodes it into a 124-dimensional vector. (If you are familiar with CNNs, this can alternatively be seen as a "1x1 convolution".) I don't know where in the code this happens, but the paper does include a description of this dimensionality reduction.
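For illustration, a minimal sketch of such a per-frame projection in PyTorch, assuming an 837-dimensional concatenated frame vector and a 124-dimensional target; the module name, layer choice, and activation are my assumptions, not the authors' actual code:

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Hypothetical per-frame projection of the concatenated text+audio vector.

    Maps each 837-dim frame (773 text dims + 64 audio dims) to 124 dims with a
    position-wise linear layer. Because the same weights are applied to every
    frame independently, this is equivalent to a 1x1 convolution over time.
    """

    def __init__(self, in_dim: int = 773 + 64, out_dim: int = 124):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 837) -> (batch, time, 124)
        return self.act(self.proj(frames))


encoder = FrameEncoder()
x = torch.randn(2, 100, 837)   # 2 sequences, 100 frames each
z = encoder(x)
print(z.shape)                 # torch.Size([2, 100, 124])
```

The same mapping can also be written as `nn.Conv1d(837, 124, kernel_size=1)` applied to a (batch, channels, time) tensor, which is why the feed-forward and "1x1 convolution" views coincide.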