In this project, we try to identify the dominant speaker in YouTube videos. Frames from these videos are classified into 7 categories: 6 personalities and 1 noise class (frames in which the speaker is absent). The 6 personalities are -
Videos of these 6 personalities are taken from YouTube in 720p quality. Frames are extracted from these videos at 1 fps using ffmpeg.
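The extraction step could be scripted roughly as follows. This is a minimal sketch, not the project's actual script: the helper names, the `frame_%05d.jpg` output pattern, and the example paths are assumptions made for illustration. It relies only on ffmpeg's `fps` video filter and expects `ffmpeg` to be on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_dir, fps=1):
    """Return the ffmpeg argument list that samples `fps` frames per second."""
    # Hypothetical naming scheme: frame_00001.jpg, frame_00002.jpg, ...
    pattern = str(Path(out_dir) / "frame_%05d.jpg")
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", pattern]


def extract_frames(video_path, out_dir, fps=1):
    """Create `out_dir` if needed and run ffmpeg; raises if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, out_dir, fps), check=True)
```

For example, `extract_frames("speaker_video.mp4", "frames/speaker")` would write one JPEG per second of video into `frames/speaker` (both paths are placeholders).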
Since every video has a unique speaker, we first try to solve this problem using face recognition. We used the OpenFace library to compute face embeddings.
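How the embeddings are turned into a prediction is not described here. One simple possibility, shown purely as an illustrative assumption, is a nearest-centroid classifier with a distance cut-off that falls back to the noise class. OpenFace embeddings are 128-dimensional; the sketch uses tiny 2-D vectors, and the function names and the `max_distance` threshold are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings_by_person):
    """Map each person to the mean of their training embeddings."""
    return {name: centroid(vs) for name, vs in embeddings_by_person.items()}


def classify(embedding, centroids, max_distance):
    """Return the closest person, or "noise" if no centroid is close enough."""
    best_name, best_dist = "noise", math.inf
    for name, center in centroids.items():
        dist = math.dist(embedding, center)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else "noise"
```

Frames in which no face is detected at all would also go to the noise class under this scheme.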
Since the problem is basically one of visual object recognition, we also tried transfer learning with a CNN pre-trained on ImageNet. We did two types of fine-tuning on the CNN -
- The weights of the pre-trained CNN are used for initialization, and the parameters of all layers are updated.
- Only the parameters of the last layer of the CNN are updated, while the rest of the layers are frozen.
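The difference between the two settings can be illustrated with a framework-free toy sketch. Everything here is hypothetical: the `Layer` class, the layer names, and the weight values are stand-ins, not the network or framework used in this project. The point is only that frozen layers keep their pre-trained weights during an update step.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for one CNN layer."""
    name: str
    weights: list
    trainable: bool = True


def load_pretrained():
    # Placeholder for loading ImageNet weights; the values are made up.
    return [
        Layer("conv1", [0.5, -0.2]),
        Layer("conv2", [0.1, 0.3]),
        Layer("fc", [0.7, -0.4]),
    ]


def configure(layers, mode):
    """mode="full": update every layer; mode="last": freeze all but the last."""
    if mode not in ("full", "last"):
        raise ValueError("mode must be 'full' or 'last'")
    for index, layer in enumerate(layers):
        layer.trainable = mode == "full" or index == len(layers) - 1
    return layers


def sgd_step(layers, grads, lr=0.1):
    """Apply one gradient-descent step, skipping frozen layers."""
    for layer in layers:
        if layer.trainable:
            layer.weights = [
                w - lr * g for w, g in zip(layer.weights, grads[layer.name])
            ]
```

In a real deep-learning framework, the same effect is usually achieved by disabling gradient computation for the frozen parameters rather than by skipping them in the update loop.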
We used data augmentation to reduce over-fitting of the models. We randomly cropped each frame and flipped it horizontally. We also included face images of these personalities in the training data so that the CNN does not memorize the background of the frames. These faces were extracted using the OpenFace library.
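The crop-and-flip augmentation can be sketched as below. This is an illustration under stated assumptions: frames are represented as nested lists of pixel values to keep the example dependency-free, and the crop size and the 0.5 flip probability are hypothetical, since the actual values are not given here. A real pipeline would typically use an image library instead.

```python
import random


def random_crop(frame, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the frame."""
    height, width = len(frame), len(frame[0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in frame[top:top + crop_h]]


def horizontal_flip(frame):
    """Mirror the frame left to right."""
    return [row[::-1] for row in frame]


def augment(frame, crop_h, crop_w, rng=None, flip_prob=0.5):
    """Randomly crop the frame, then flip it with probability flip_prob."""
    rng = rng or random.Random()
    out = random_crop(frame, crop_h, crop_w, rng)
    if rng.random() < flip_prob:
        out = horizontal_flip(out)
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when comparing runs with and without augmentation.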
Technique | Accuracy |
---|---|
With data augmentation | 62.92% |
Without data augmentation | 53.20% |

Technique | Accuracy |
---|---|
With data augmentation | 69.46% |
Without data augmentation | 62.27% |