Speech command recognition (Keyword Spotting)

In this project we use the Speech Commands dataset, which contains short (one-second) audio clips of English commands, stored as WAV files. More specifically, version 0.02 of the dataset contains 105,829 utterances of 35 short words, recorded by thousands of different people. It was released on April 11th, 2018 under the Creative Commons BY 4.0 license and was collected via crowdsourcing, through AIY by Google. Some of these words are "yes", "no", "up", "down", "left", "right", "on", "off", "stop" and "go".
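As a minimal illustration, a clip can be read and decoded with TensorFlow as follows (the file path is hypothetical; clips in the dataset are sampled at 16 kHz):

```python
import tensorflow as tf

# Read and decode one WAV clip from the dataset (the path is hypothetical)
wav_bytes = tf.io.read_file("speech_commands_v0.02/yes/0a7c2a8d_nohash_0.wav")
waveform, sample_rate = tf.audio.decode_wav(wav_bytes, desired_channels=1)

# Clips are sampled at 16 kHz, so one second corresponds to 16000 samples
print(sample_rate.numpy())   # 16000
print(waveform.shape)        # (16000, 1) for a full one-second clip
```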

This project was developed as the final project of the course Human Data Analytics.

Notebooks

  • Data Analysis And Preprocessing Inspection
    This notebook loads and prepares the dataset, splitting it into training, validation, and test sets. It also explores the dataset with some plots and introduces the functions used to pre-process the data, such as noise addition (a minimal sketch of this kind of augmentation appears after this list).

  • Keyword Spotting: general training notebook
    This notebook defines the general training and testing pipeline and describes the validation metrics used. The training is performed with our baseline model cnn-one-fpool3, taken from [Arik17].

  • Bayesian Optimization and Feature Comparison with CNN
    This notebook is used to train our custom CNN models. With the first of these models we perform a Bayesian optimization (see the Bayesian-optimization sketch after this list), inspect the importance of dropout and batch normalization, carry out a feature comparison, and study the effect of data augmentation on the training set.

  • Keyword Spotting: ResNet architecture and Triplet Loss implementation
    In this notebook we experiment with ResNet models for the keyword spotting task. We start by implementing a simple ResNet architecture inspired by [Tang18] and then, motivated by [Vygon21], we modify this model and train it to produce a meaningful embedded representation of the input signals. We finally use k-NN to perform classification on these intermediate representations (see the triplet-loss sketch after this list).

  • Keyword Spotting: a neural attention model for speech command recognition
    This notebook implements an attention model for speech command recognition. It is a modification of a demo notebook prepared by the authors of the paper A neural attention model for speech command recognition.

  • Keyword Spotting: Conformer
    In this notebook we use the audio_classification_models library to implement a baseline Conformer architecture inspired by [Gulati20]. This model combines convolutional neural networks and Transformers to get the best of both worlds, modeling both the local and the global features of an audio sequence in a parameter-efficient way (see the Conformer-block sketch after this list). In detail, we use only one Conformer block in order to reduce the number of model parameters. Moreover, we tune the hyperparameters by means of Bayesian optimization in order to find, among the models with fewer than 2M parameters, the one that achieves the best accuracy.

  • Keyword Spotting: GAN-based classification
    In this notebook we attempt to implement a GAN-based classifier inspired by the paper GAN-based Data Generation for Speech Emotion Recognition. Unfortunately, to date we have not been able to figure out how to properly train the generator and the discriminator in this specific setting, so we cannot currently evaluate this approach.
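As referenced in the first notebook above, one simple pre-processing augmentation is mixing scaled background noise into a clip. A minimal sketch (the function name `add_noise` and the mixing scheme are illustrative; the notebook's actual implementation may differ):

```python
import numpy as np

def add_noise(waveform, noise, noise_factor=0.1):
    """Mix a noise signal into a waveform at a given relative amplitude."""
    # Trim or repeat the noise so it matches the clip length
    noise = np.resize(noise, waveform.shape)
    augmented = waveform + noise_factor * noise
    # Keep the result in the valid [-1, 1] range for audio
    return np.clip(augmented, -1.0, 1.0)

# Example: corrupt a one-second, 16 kHz clip with white noise
clip = np.zeros(16000, dtype=np.float32)            # placeholder waveform
white = np.random.randn(16000).astype(np.float32)
noisy = add_noise(clip, white, noise_factor=0.05)
```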
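The Bayesian optimization mentioned for the custom CNN notebook can be reproduced with a library such as keras-tuner. The sketch below assumes a hypothetical `build_model` search space over filters, dropout and learning rate, with 40×98 MFCC-like input features; the notebook's actual model and search space may differ:

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    """Hypothetical search space: number of filters, dropout rate, learning rate."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(hp.Int("filters", 16, 64, step=16), 3,
                               activation="relu", input_shape=(40, 98, 1)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(hp.Float("dropout", 0.0, 0.5)),
        tf.keras.layers.Dense(35, activation="softmax"),   # 35 keywords
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            hp.Float("lr", 1e-4, 1e-2, sampling="log")),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=20)
# tuner.search(train_ds, validation_data=val_ds, epochs=10)
```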
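The embedding-based classification of the ResNet/triplet-loss notebook can be sketched as follows, here using TensorFlow Addons' `TripletSemiHardLoss` and scikit-learn's k-NN as stand-ins (the `encoder` below is a hypothetical placeholder, not the ResNet used in the notebook):

```python
import tensorflow as tf
import tensorflow_addons as tfa
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical encoder mapping spectrograms to L2-normalized embeddings
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(40, 98, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation=None),            # embedding size 64
    tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1)),
])

# Triplet loss pulls same-class embeddings together and pushes others apart
encoder.compile(optimizer="adam", loss=tfa.losses.TripletSemiHardLoss())
# encoder.fit(train_ds, epochs=10)

# Classification on the embedded representations with k-NN
# knn = KNeighborsClassifier(n_neighbors=5)
# knn.fit(encoder.predict(x_train), y_train)
# y_pred = knn.predict(encoder.predict(x_test))
```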
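Finally, a minimal functional-API sketch of a single Conformer block, as a simplified stand-in for the `audio_classification_models` implementation (relative positional encoding and other details of [Gulati20] are omitted; `DepthwiseConv1D` requires TensorFlow ≥ 2.9):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conformer_block(x, dim=144, num_heads=4, conv_kernel=31, ff_mult=4, rate=0.1):
    """One Conformer block in the style of [Gulati20] (illustrative only)."""
    # First half-step feed-forward module (Macaron style)
    y = layers.LayerNormalization()(x)
    y = layers.Dense(dim * ff_mult, activation="swish")(y)
    y = layers.Dense(dim)(layers.Dropout(rate)(y))
    x = layers.Add()([x, layers.Rescaling(0.5)(y)])

    # Multi-head self-attention module with residual connection
    y = layers.LayerNormalization()(x)
    y = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=dim // num_heads)(y, y)
    x = layers.Add()([x, layers.Dropout(rate)(y)])

    # Convolution module: pointwise conv + GLU, depthwise conv, batch norm
    y = layers.LayerNormalization()(x)
    y = layers.Conv1D(2 * dim, 1)(y)
    y = layers.Lambda(lambda t: t[..., :dim] * tf.sigmoid(t[..., dim:]))(y)  # GLU
    y = layers.DepthwiseConv1D(conv_kernel, padding="same")(y)
    y = layers.Activation("swish")(layers.BatchNormalization()(y))
    y = layers.Dropout(rate)(layers.Conv1D(dim, 1)(y))
    x = layers.Add()([x, y])

    # Second half-step feed-forward module, then a final layer norm
    y = layers.LayerNormalization()(x)
    y = layers.Dense(dim * ff_mult, activation="swish")(y)
    y = layers.Dense(dim)(layers.Dropout(rate)(y))
    return layers.LayerNormalization()(layers.Add()([x, layers.Rescaling(0.5)(y)]))

# Example usage on a (time, features) sequence:
# inputs = tf.keras.Input(shape=(98, 144))
# outputs = conformer_block(inputs)
```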

Utils

Demo App

In this repository you can find a demo application that can be run as a Python script with `python demo_ks.py`. It allows you to select the model you want to use and, once started, detects the words of the Speech Commands dataset through the microphone (or any chosen input device).

You can also find a notebook that can be used to play some commands from the dataset, in order to test the application with non-real-time signals.
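A minimal sketch of the real-time loop such a demo could run, assuming a hypothetical saved Keras model `model.h5` that classifies one-second, 16 kHz waveforms (the actual `demo_ks.py` may load models and pre-process audio differently):

```python
import numpy as np
import sounddevice as sd
import tensorflow as tf

SAMPLE_RATE = 16000   # Speech Commands clips are sampled at 16 kHz
DURATION = 1.0        # one-second window, as in the dataset

model = tf.keras.models.load_model("model.h5")   # hypothetical model file

while True:
    # Record one second from the default input device (blocking)
    audio = sd.rec(int(SAMPLE_RATE * DURATION), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    # Add the batch dimension and classify the clip
    probs = model.predict(audio.reshape(1, -1), verbose=0)[0]
    print("Detected class:", np.argmax(probs), "with probability", probs.max())
```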

Collaborators

  • Daniele Ninni
  • Nicola Zomer