Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279-283. https://doi.org/10.1109/LSP.2017.2657381
The UrbanSound8K dataset contains 8732 labeled sound excerpts (<= 4 s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy. All excerpts are taken from field recordings uploaded to www.freesound.org.
8732 audio files of urban sounds (see description above) in WAV format. The sampling rate, bit depth, and number of channels are the same as those of the original file uploaded to Freesound (and hence may vary from file to file).
The UrbanSound8K dataset used for model training can be downloaded from the following link: https://urbansounddataset.weebly.com/
Urban_data_preprocess.ipynb: pre-processes the data and applies data augmentation
Urban_nn_model.ipynb: runs 10-fold cross-validation on the original data using a simple NN
Urban_cnn_model.ipynb: runs 10-fold cross-validation on the original and augmented data using a CNN
Urban_data_generator.ipynb: contains a data generator for training the CNN on augmented data
10-fold cross-validation accuracy for the NN on original data: 57.43%
10-fold cross-validation accuracy for the CNN on original data: 62.61%
10-fold cross-validation accuracy for the CNN on augmented data: 63.90%
Extend the dataset further by using different augmentation parameters
Apply hyperparameter optimization and test different network architectures
Librosa was used for data preprocessing and feature extraction.
In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum").
A mel-scaled spectrogram.
In music, the term chroma feature or chromagram closely relates to the twelve different pitch classes. Chroma-based features, which are also referred to as "pitch class profiles", are a powerful tool for analyzing music whose pitches can be meaningfully categorized (often into twelve categories) and whose tuning approximates the equal-tempered scale. One main property of chroma features is that they capture harmonic and melodic characteristics of music, while being robust to changes in timbre and instrumentation.
A chromagram from a waveform or power spectrogram.
Constant-Q chromagram.
The chroma variant “Chroma Energy Normalized” (CENS).