-
❓ Questions and Help
Hello, thanks for the great work. I would like to use the model as a feature extractor. How can I get the output of the encoder? Thank you!
Replies: 9 comments
-
Well, this is definitely a nice feature to have in a future V2. How are you planning to use the encoder?
-
Hi all, @wuxx1624 did you manage to get it to work? I also want to use this model as a feature extractor: basically I would like to take the features from the output of the encoder (N_FRAMES@25fps x 512) and use them for another task.

I can load the model and access the encoder using "model.encoder", but I cannot see what other preprocessing is done before running through the encoder, namely the STFT details. Basically I would like to take a look at the model's forward() function.

Please excuse my potential stupidity and overall ignorance of jit as a format, but is there a way for me to look at the actual code from the jit file? Thanks a lot in advance.
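For what it's worth, TorchScript modules expose their decompiled source via the `.code` property, so one way to look at the forward() of a jit file is something like the following (the `model.jit` path is a placeholder for the actual checkpoint):

```python
import torch

# Load the torchscript checkpoint; 'model.jit' is a placeholder path.
model = torch.jit.load('model.jit', map_location='cpu')

# TorchScript modules expose their decompiled forward() source via .code,
# and the lower-level IR via .graph.
print(model.code)          # forward() of the top-level module
print(model.encoder.code)  # forward() of the encoder submodule, if present
```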
-
@miraodasilva here is the audio normalization: mravanelli/SincNet#74 (comment)
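The linked comment describes amplitude normalization; a minimal sketch of that kind of peak normalization (the eps value is an assumption, so verify the exact recipe against the linked thread) might be:

```python
import torch

def normalize_audio(wav: torch.Tensor) -> torch.Tensor:
    """Scale a waveform to unit peak amplitude.

    A sketch of the amplitude normalization discussed in the linked
    SincNet comment; the eps constant is an assumption added here to
    avoid division by zero on silent inputs.
    """
    return wav / (wav.abs().max() + 1e-9)
```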
-
I see, so I should:

1. normalize the audio as described in the linked comment
2. compute the STFT of the normalized waveform
3. run the result through model.encoder

Is this correct? If so, should I use the hamming window for the STFT? Should I use torch.stft? Again, it would be great if I could just take a look at the forward function or the STFT implementation, as even small differences/imprecisions in the implementation can yield drastically different results in my experience. Thanks.
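A minimal sketch of what such a torch.stft call could look like, assuming a Hamming window and the hop_length of 160 mentioned later in this thread; n_fft=512 is a guess and should be checked against the actual model (which, per a later reply, uses the nvidia STFT implementation rather than torch.stft):

```python
import torch

wav = torch.randn(1, 16000)  # 1 second of 16 kHz audio

# n_fft=512 is an assumption; hop_length=160 matches the hop
# quoted later in this thread.
n_fft = 512
hop_length = 160
window = torch.hamming_window(n_fft)

spec = torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                  window=window, return_complex=True)
mag = spec.abs()   # magnitude spectrogram
print(mag.shape)   # (1, n_fft // 2 + 1, n_frames) -> (1, 257, 101)
```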
-
@miraodasilva you can use this auto normalisation and recover the encoder/decoder from the torchscript. I can get 1:1 output :) But I will not share the code, because @snakers4 doesn't want to release it. If you want, though, you can really recover the exact PyTorch code. I have even trained this model on LibriSpeech and reported the result somewhere in the tickets. If you use a ReZero-ed version of the encoder and transformer, you can also reach much faster convergence.
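For context, ReZero (Bachlechner et al., 2020) gates each residual branch with a learnable scalar initialized to zero, so every block starts out as the identity, which is what speeds up convergence. A minimal sketch of such a block, with a placeholder sublayer standing in for the encoder/transformer sublayers mentioned above:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block with a ReZero gate: x + alpha * sublayer(x)."""

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        # Learnable scalar initialized to zero -> the block is the
        # identity at the start of training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.sublayer(x)
```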
-
If I follow the steps above (using the nvidia STFT as mentioned in the thread), I get a 10x512x13 output for a 10x16000 input. Is this what you obtain as well, @tugstugi? I was under the impression I was supposed to get features at 25 fps, i.e. 10x512x25 for this input. Thanks a lot in advance.
-
16000 samples / 160 (hop_length) = 100 frames, and 100 / 8 (the encoder's 8x time reduction) = 12.5, which rounds up to 13, so 13 is correct. In other words, the encoder output is at 12.5 fps, not 25.
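Making the arithmetic explicit (the ceil rounding is an assumption about how the encoder's strided layers pad, but it matches the observed 13 frames):

```python
import math

# 16000 / 160 = 100 STFT frames, 100 / 8 = 12.5,
# which the encoder's strided layers round up to 13.
n_frames = math.ceil(16000 / 160 / 8)
print(n_frames)  # 13
```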
-
This is strange
I was reluctant to mention it, but yes, it is no big deal. I assume everyone knows about it.
-
This release solves this.