feat(model): add segmentation model based on self-supervised representation #1362
Conversation
Added the WavLM-Base model, which replaces the SincNet feature extraction model within the PyanNet architecture (loaded outside of the class from HuggingFace.co).
# Loading the model from HuggingFace (requires git lfs to load the .bin checkpoint)
# model = AutoModel.from_pretrained('/content/drive/MyDrive/PyanNet/wavlm-base')

model = AutoModel.from_pretrained('microsoft/wavlm-base')
This is definitely the reason why the training complains about a GPU/CPU mismatch. The WavLM module should be instantiated in __init__ and assigned as an attribute of the model. Read this carefully to understand why.
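A minimal sketch of that suggestion (the class and attribute names are illustrative, not the actual PyanNet code): instantiating WavLM inside __init__ registers it as a submodule, so model.to(device) and pytorch-lightning move it together with the rest of the network.

import torch.nn as nn
from transformers import AutoModel

class SegmentationModel(nn.Module):  # illustrative stand-in for PyanNet
    def __init__(self):
        super().__init__()
        # Instantiated in __init__ and assigned as an attribute: WavLM is now
        # a registered submodule and follows the model across devices.
        self.wavlm = AutoModel.from_pretrained("microsoft/wavlm-base")

    def forward(self, waveform):
        # (batch, samples) -> (batch, frames, 768) for wavlm-base
        return self.wavlm(waveform).last_hidden_state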
…including layer selection. Created a block (replacing the old WavLM one) called "selfsup.py" which loads and applies a specific SSL torchaudio model, depending on PyanNet's input parameter. The user can now also choose the specific layer to be used for feature extraction, as sketched below. Ex: seg_model = PyanNet(task=seg, model="HUBERT_BASE", layer=5) loads the "HUBERT_BASE" model and selects the 6th layer for feature extraction. If layer is not specified, the first one (layer 0) is used automatically. All available models are listed at: https://pytorch.org/audio/main/pipelines.html
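A hedged sketch of what such a block boils down to with the torchaudio pipelines API (the actual selfsup.py internals are not shown in this thread):

import torch
import torchaudio

# Resolve the bundle from its name, as in model="HUBERT_BASE" above
bundle = getattr(torchaudio.pipelines, "HUBERT_BASE")
ssl_model = bundle.get_model()

waveform = torch.randn(1, 16000)  # dummy 1-second mono signal at 16 kHz
with torch.no_grad():
    # extract_features returns one tensor per transformer layer
    features, _ = ssl_model.extract_features(waveform, num_layers=6)
layer_5 = features[5]  # 6th layer, matching layer=5 in the example above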
…class. Can use pre-trained SSL models from HuggingFace through the PyanHugg class. Tested (and working) models are:
- "microsoft/wavlm-base"
- "microsoft/wavlm-large"
- "facebook/hubert-base-ls960"
- "facebook/wav2vec2-base-960h"
The class supports model and layer selection (as well as a cache location for the downloaded model and configuration file), as sketched below. Ex: seg_model = PyanHugg(task=seg, selfsupervised={'model': 'microsoft/wavlm-base', 'layer': 2, 'cache': 'mod_location/'})
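An illustrative sketch of that HuggingFace path, under the assumption that PyanHugg selects one hidden layer from a transformers model (its actual internals are not shown here):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/wavlm-base", cache_dir="mod_location/")

waveform = torch.randn(1, 16000)  # dummy 1-second mono signal at 16 kHz
with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)
# hidden_states holds the transformer input followed by each encoder layer's
# output; index 2 roughly matches 'layer': 2 in the example above
features = outputs.hidden_states[2]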
lstm = merge_dict(self.LSTM_DEFAULTS, lstm)
lstm["batch_first"] = True
linear = merge_dict(self.LINEAR_DEFAULTS, linear)
if (selfsupervised["model"] == "sincnet"):
I would remove support for SincNet completely to avoid any confusion.
Can load a fairseq checkpoint from a pretrained model (which is converted to the torchaudio wav2vec2 format):
model = wav2vec2_model(**config)
model.load_state_dict(ordered_dict)  # assign the state dict to the model

if finetune:
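A self-contained sketch of that loading path, assuming the fairseq checkpoint has already been converted into a torchaudio-style config and state dict (the conversion step itself is not shown in this thread, and the file name and dictionary keys are hypothetical):

import torch
from torchaudio.models import wav2vec2_model

# Hypothetical file produced by a fairseq -> torchaudio conversion step
checkpoint = torch.load("converted_wav2vec2.pt", map_location="cpu")

model = wav2vec2_model(**checkpoint["config"])    # rebuild the architecture
model.load_state_dict(checkpoint["state_dict"])   # assign the converted weights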
Can you explain to me why this is necessary?
I am talking about switching to eval mode in the case where WavLM is frozen.
My understanding was that .eval() was relevant when certain modules, such as Dropout layers, are present in the model that we want to switch to inference mode (which is the case for WavLM). I had read this post, which recommended using both.
That said, from memory, I did not notice any change in the features between .eval() and no_grad. Apparently, .eval() also consumes more memory than no_grad... so why not remove it. It was also there to study the differences between .eval() and .train() when I wanted to see what was happening during finetuning.
model.eval() and torch.no_grad() play two very different roles.
torch.no_grad() disables gradient computation for the layers concerned, so it is useful when you want to freeze part of the network. model.eval() switches layers that behave differently during training (e.g. dropout, which randomly disables some weights, or batchnorm, which computes a running average of the data flowing through it) into inference mode to remove that randomness.
In short, you should use neither model.eval() nor model.train() to control whether or not you finetune the feature extraction part. Only use torch.no_grad() (to freeze) or not (to finetune). Switching between eval and train mode is handled automatically by pytorch-lightning during the validation and training phases.
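A minimal sketch of that recommendation (the wrapper and attribute names are illustrative, not pyannote code): the only switch between frozen and finetuned is torch.no_grad(); train/eval mode is left to pytorch-lightning.

import torch
import torch.nn as nn

class SSLFeatureExtractor(nn.Module):  # hypothetical wrapper, for illustration
    def __init__(self, ssl_model: nn.Module, finetune: bool = False):
        super().__init__()
        self.ssl_model = ssl_model
        self.finetune = finetune

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        if self.finetune:
            return self.ssl_model(waveform)  # gradients flow: finetuned
        with torch.no_grad():  # gradients off: frozen, no .eval()/.train() needed
            return self.ssl_model(waveform)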
if finetune:  # finetuning not working
    print("Self-supervised model is unfrozen.")
    # config['encoder_ff_interm_dropout'] = 0.3
    config['encoder_layer_norm_first'] = True
Can you explain?
The issue regarding the finetuning of WavLM seemed similar to a normalization issue that occurred during the feature extraction process. If the gradient is computed during training, validation extracts feature vectors that are almost identical across the frames of the input audio. I assume that this might be the reason why validation does not seem to improve (or change) during training (but I might be completely wrong on this...). Since this problem looked similar to the one I encountered with WavLM a few months back (which has since been fixed), where the features were also identical across frames, I tried applying a normalization step to see whether the behavior of the extracted features would change. That did not seem to be the case... It is one of the many things I tried but forgot to remove when pushing the code ^^
feat(model): add segmentation model based on self-supervised representation (#1362) Co-authored-by: Hervé BREDIN <hbredin@users.noreply.github.com>