
Language-Classification-of-Audio

Classifying the language spoken in audio clips of native speakers using audio augmentation techniques and Convolutional Neural Networks. MP3 files are scraped from audio-lingua and saved to AWS S3.

Project Outline

  1. Web Crawler
  2. AWS Storage
  3. Audio Augmentation
  4. Data Representation
  5. CNN

Web Crawler/Scraper

I am collecting data from a website called audio-lingua, which hosts a database of audio clips of native speakers across roughly 8 languages. I collected data for the most popular languages: English, Russian, French, Spanish, Chinese, and German.

Below is an example of the download pages:

[Screenshot: audio-lingua download page]

Using the Python packages requests and Beautiful Soup, we can iterate through the different web pages and navigate the HTML with Beautiful Soup to find the download links. I collect the links in a CSV of strings and then iteratively download the files to AWS.

Code Snippet:

[Screenshot: web crawler code snippet]
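
Since the original snippet is only available as a screenshot, here is a minimal sketch of the crawler, assuming a paginated listing page whose download links end in `.mp3`. The base URL, pagination pattern, and page count are illustrative, not the exact values used in this repo.

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://audio-lingua.ac-versailles.fr"  # assumed base URL

def collect_mp3_links(listing_url):
    """Fetch one listing page and return all .mp3 download links found in it."""
    response = requests.get(listing_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.endswith(".mp3"):
            # Make relative links absolute before saving them
            links.append(href if href.startswith("http") else BASE_URL + "/" + href.lstrip("/"))
    return links

# Iterate over listing pages and collect every link into a CSV of strings
with open("mp3_links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for page in range(1, 6):  # hypothetical number of listing pages
        page_url = f"{BASE_URL}/?page={page}"  # hypothetical pagination scheme
        for link in collect_mp3_links(page_url):
            writer.writerow([link])
```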

AWS Storage

For storage we use S3, creating a bucket with different file paths (key prefixes) for our different audio files. [Screenshot: S3 bucket structure]

I am using the boto3 package to access AWS.
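
A sketch of how the scraped links could be downloaded and pushed to S3 with boto3 follows; the bucket name and the language prefix are hypothetical.

```python
import csv
import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "language-audio-clips"  # hypothetical bucket name

with open("mp3_links.csv") as f:
    for row in csv.reader(f):
        url = row[0]
        filename = url.rsplit("/", 1)[-1]
        audio = requests.get(url, timeout=60).content
        # Store each language under its own prefix, e.g. "french/clip.mp3"
        # (the prefix is hard-coded here purely for illustration)
        s3.put_object(Bucket=BUCKET, Key=f"french/{filename}", Body=audio)
```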

Audio Augmentation

Augmentation is one of the most important steps when working with audio or image data, because the signal behind a given class is never static.

  • Audio is dynamic, and the same sentence can be expressed at many different frequencies.
  • To deal with this, we transform the audio in different ways, including Time Mask, High Pass Filter, White Noise, Pitch Scaling, Polarity Inversion, and Random Gain
    • Using NumPy to apply random transformations to the audio data (see the sketch after the function descriptions below)
    • This increases the number of samples we have to work with and helps the model generalize better, making it less sensitive to small changes in the audio clips

Augmentation Functions

Time Mask - Randomly set values to 0, randomly blocking off some of the time periods to reduce overfitting

High Pass Filter - Keep only the higher-frequency content, attenuating the low-frequency components of the sound

White Noise - A constant random signal added underneath the other sounds, like the familiar background hiss

Pitch Scaling - Change the pitch at which the audio clips are represented

Polarity Inversion - A simple transformation, multiplying all values by -1
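
Below is a minimal NumPy sketch of several of these augmentations applied to a 1-D waveform array; the function names and parameter values are illustrative, and pitch scaling and the high pass filter would typically use a DSP library such as librosa or scipy rather than plain NumPy.

```python
import numpy as np

def time_mask(signal, max_width=4000):
    """Randomly zero out a contiguous block of samples."""
    out = signal.copy()
    width = np.random.randint(1, max_width)
    start = np.random.randint(0, len(out) - width)
    out[start:start + width] = 0
    return out

def white_noise(signal, noise_level=0.005):
    """Add a constant low-level random signal underneath the audio."""
    return signal + noise_level * np.random.randn(len(signal))

def random_gain(signal, low=0.5, high=1.5):
    """Scale the whole waveform by a random factor."""
    return signal * np.random.uniform(low, high)

def polarity_inversion(signal):
    """Multiply every sample by -1."""
    return signal * -1.0
```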

Visual Examples - Random Transformation

-Time Mask: [Plot: waveform with a randomly masked time segment]

-Noise and Gain: add some base white noise, plus gain that increases the absolute amplitude of the signal

Original Signal:

[Plot: original waveform]

Augmented Signal:

[Plot: waveform after noise and gain]

As you can see, the scale of the values has changed, and there is some noise and gain causing slightly larger fluctuations in the audio file.

Data Representation

We represent data for the CNN as "images"

  • The images below show a Mel Spectrogram and MFCCs. The Mel Spectrogram is a close representation of the audio that humans hear, one that highlights sound energy at the frequencies we perceive.
    • The frequencies we hear and the ones a dog hears are not the same! The mel scale does a better job of representing the values that matter to human hearing
  • The arrays that define these spectrograms are fed into a Convolutional Neural Network
    • In essence, we are using computer vision techniques to classify audio clips by the language spoken

[Plot: Mel Spectrogram]

[Plot: MFCC]
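
A sketch of how a clip could be turned into these Mel Spectrogram / MFCC "images" using librosa (the file path and parameter values here are illustrative, not the repo's exact settings):

```python
import librosa
import numpy as np

signal, sr = librosa.load("clip.mp3", sr=22050)  # illustrative path and sample rate

# Mel spectrogram: energy per mel-scaled frequency band over time
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, as typically fed to a CNN

# MFCCs: a compact summary of the spectral envelope
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)

print(mel_db.shape, mfcc.shape)  # 2-D arrays, treated as single-channel images
```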

Convolutional Neural Network Model

The main consideration is ensuring that the model generalizes well. Because audio is so dynamic, this can be an extremely difficult task for a CNN.

  • Beyond simple words, classifying audio representations into languages based on their spectral qualities can be difficult, if not impossible.
  • To do this, we need to force the model to generalize
    • Adding regularization
    • Adding dropout
    • Audio augmentations
    • One-hot encoding labels with smoothing
      • Change the 1 labels to 0.9 and the others to 0.1. This helps generalization since the model is never pushed to be 100% sure about an answer (see the sketch below)
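
A small sketch of the smoothed one-hot labels described above, using the 0.9 / 0.1 values from the text (the function name and class count are illustrative):

```python
import numpy as np

def smooth_one_hot(label_index, num_classes=6, on_value=0.9, off_value=0.1):
    """One-hot label where the true class gets 0.9 and every other class gets 0.1."""
    labels = np.full(num_classes, off_value)
    labels[label_index] = on_value
    return labels

print(smooth_one_hot(2))  # e.g. French -> [0.1 0.1 0.9 0.1 0.1 0.1]
```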

Here is the model of the convolutional network:

[Screenshot: model definition code]

Visualization of the model: [Diagram: model architecture]

As you can see, we have deep convolutional layers and max pooling layers, followed by a flattening layer and deep dense layers with regularization.

  • Research has shown that increasing the number of filters in each successive convolutional layer improves performance
  • We use "same" padding so that the inputs and outputs of the convolution layers keep consistent dimensions
  • We also use MaxPooling and eventually flatten the outputs from the convolutional layers
    • We run these flattened values through dense layers to produce the final class scores
  • We use Categorical Cross Entropy as our loss, which is cross entropy applied to the softmax outputs
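
A Keras sketch of this kind of architecture follows; the layer sizes, regularization strength, and input shape are illustrative rather than the exact values used in the repo.

```python
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES = 6  # English, Russian, French, Spanish, Chinese, German

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),  # spectrogram "image", single channel
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),  # more filters deeper in
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # regularization
    layers.Dropout(0.5),                                     # dropout
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Categorical cross entropy applied to the softmax outputs
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```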
