
Simple Voice Activity Detector (VAD) using MFCC features based on FPGA Kintex 7

Introduction

The Voice Activity Detector (VAD) is a system that detects the presence or absence of human speech segments in an input audio signal; it is widely used in speech processing systems. To classify the signal into speech and non-speech, the system first converts the input audio samples into features that carry information about speech, and then classifies the segments based on those features using machine learning approaches.

This is my first, experimental version of the VAD on the FPGA Kintex 7. This version is not optimized in terms of FPGA resources, and I used a fairly simple model for the classification task. However, the large resource reserves of the Kintex make it possible to try different algorithm structures that are not resource-optimized.

The need to implement this algorithm on an FPGA is closely related to building a complete speech recognition system for smart home tasks. In further work I will use more complex deep learning models for speech classification and a more optimized structure for the FPGA realization.

You can find the new version with BiLSTM and CNN-BiLSTM here.

Description

This project does not include HDL code for converting the analog signal from the microphone with an ADC (Analog-to-Digital Converter). I used a 16-bit, 16 kHz ADC with an I2S interface to obtain digital samples.

The project's code starts by receiving data from the ADC over the I2S interface.

Next, I built a pipeline for extracting features from the time-series signal, based on the Python library python_speech_features. As features for the model inputs I used MFCCs. (You can learn more here.)

As the speech-segment solver I used a DNN with a fixed structure. The solver takes scaled MFCC features as input and returns a label for each segment: speech present or absent. I trained the model and used the resulting weights and biases to develop HDL code with Vivado HLS and C++.
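For illustration, a minimal sketch of such a solver, assuming a small fully connected network in Keras (the layer sizes here are illustrative, not necessarily the ones used in the project):

from tensorflow import keras

# Hypothetical solver: a small fully connected binary classifier over
# 39 inputs (13 MFCC + 13 delta + 13 delta-delta, scaled)
model = keras.Sequential([
    keras.layers.Input(shape=(39,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # P(speech) per frame
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, ...)  # X_train: (n_frames, 39) scaled features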

As the device input I used a MEMS microphone connected to the ADC board, which in turn is connected to the Kintex 7 FPGA pins.

As the device output I used an LED indicator connected to the Kintex 7 FPGA pins to monitor the VAD result.

Project structure

This repository contains source files for the whole VAD system on the FPGA and a Jupyter Notebook that presents the model's results, the details of feature extraction, and the common processing pipeline. I also used the Vivado HLS tool to implement the DNN model on the FPGA using C++; it is a simple example of describing the model architecture. The final result is a user IP core that is integrated into the Vivado project.

The repository also contains the saved model weights and biases, which are used to implement the model in the FPGA fabric via High-Level Synthesis. This makes it possible to change the parameters and structure of the machine learning model.

Below you can get acquainted with the common structure of this repository.

Python: MFCC processing pipeline and DNN modeling

In the Python part you can find a Jupyter Notebook for testing the model. I also store the model instance with the calculated weights and biases for use in the implementation, along with some metrics to evaluate the model; you can study them and make changes to improve the model. All calculations are done with the python_speech_features library, but for the HDL realization we need to go through the MFCC pipeline step by step.

MFCC pipeline:

Steps to get MFCC:

  1. Pre-emphasis or filtering:

A pre-emphasis filter is useful in several ways: it balances the frequency spectrum, since high frequencies usually have smaller magnitudes than lower frequencies; it avoids numerical problems during the FFT; and it can improve the signal-to-noise ratio (SNR).

y(t) = x(t) - alpha * x(t - 1),

where x(t) is the original (input) signal,

y(t) is the pre-emphasized (output) signal,

alpha is the filter coefficient; typical values are 0.95 - 0.97.

Below are the graphs of the signal before and after filtering:

Original signal Pre-emphasized signal
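A minimal Python sketch of this step (PREEMPHASIS_COEF here is an assumed typical value; the project keeps its own constants in the notebook):

import numpy as np

PREEMPHASIS_COEF = 0.97  # assumed typical value

def pre_emphasis(signal, alpha=PREEMPHASIS_COEF):
    # y(t) = x(t) - alpha * x(t - 1); the first sample is passed through unchanged
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])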

  2. Framing:

After pre-emphasis, we need to split the signal into short-time frames. The rationale behind this step is that frequencies in a signal change over time, so in most cases it does not make sense to do the Fourier transform across the entire signal: we would lose the frequency contours of the signal over time. To avoid that, we can safely assume that frequencies in a signal are stationary over a very short period of time. Therefore, by doing a Fourier transform over each short-time frame, we can obtain a good approximation of the frequency contours of the signal by concatenating adjacent frames.

Below is the graph of a single frame: Single frame
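A sketch of framing, assuming a 25 ms frame length and a 10 ms frame step at 16 kHz (these particular values are assumptions):

import numpy as np

def frame_signal(signal, frame_len=0.025, frame_step=0.010, fs=16000):
    # Split a 1-D signal into overlapping frames of frame_len seconds,
    # shifted by frame_step seconds (the signal is assumed long enough)
    flen, fstep = int(round(frame_len * fs)), int(round(frame_step * fs))
    n_frames = 1 + (len(signal) - flen) // fstep
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    return signal[idx]   # shape: (n_frames, flen)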

  3. Window:

After framing the signal, we apply a window function to each frame (In our case it's Hamming window function).

There are several reasons why we need to apply a window function to the frames, notably to counteract the assumption made by the FFT that the data is infinite, and to reduce spectral leakage.
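A sketch of the windowing step (the Hamming window from NumPy, applied per frame):

import numpy as np

def apply_window(frames):
    # Multiply each frame by a Hamming window to reduce spectral leakage
    return frames * np.hamming(frames.shape[1])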

  4. Fourier-Transform and Power Spectrum:

After windowing each frame, we can take an N-point FFT of each frame to calculate the frequency spectrum, which is also called the Short-Time Fourier Transform (STFT), and then compute the power spectrum using the following equation:

P_i = |FFT(x_i)|^2 / N,

where x_i is the i-th frame of the signal x(t),

N is the number of FFT points.

Below is the graph of the power spectrum:

Power spectrum
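A sketch of this computation, assuming a 512-point FFT (the actual NFFT is set in the notebook):

import numpy as np

def power_spectrum(frames, nfft=512):
    # P_i = |FFT(x_i)|^2 / NFFT for each windowed frame (one-sided spectrum)
    mag = np.abs(np.fft.rfft(frames, nfft))
    return (mag ** 2) / nfft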

  5. Energy on frame: After calculating the power spectrum, we compute the energy per frame, which is appended to the MFCC vector later.

To calculate the energy of a frame, we use the following equation:

where P_i is the power spectrum of the i-th frame,

FrameLength is the number of samples in the frame.

The graph of the energy over the frames is shown below: Energy of the signal

Eventually, we apply the log operation to the energy array.
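A sketch of this step, under the assumption that the frame energy is taken as the sum of the power spectrum (as python_speech_features does when appendEnergy=True):

import numpy as np

def frame_log_energy(pspec):
    # Sum the power spectrum over frequency bins for each frame, then take the log;
    # a small epsilon avoids log(0) on silent frames
    energy = np.sum(pspec, axis=1)
    return np.log(np.where(energy == 0, np.finfo(float).eps, energy))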

  6. Filter Banks:

The final step in computing the filter banks is applying triangular filters on a Mel scale to the power spectrum to extract frequency bands. The Mel scale aims to mimic the non-linear human ear perception of sound by being more discriminative at lower frequencies and less discriminative at higher frequencies. We can convert between Hertz (f) and Mel (m) using the following equations:

m = 2595 * log10(1 + f / 700),   f = 700 * (10^(m / 2595) - 1)

Each filter in the filter bank is triangular, having a response of 1 at the center frequency and decreasing linearly towards 0 until it reaches the center frequencies of the two adjacent filters, where the response is 0, as shown in this figure:

Filter bank on a Mel-Scale

This can be modeled by the following equation (taken from here):

After applying the filter bank to the power spectrum of the signal, we obtain the following spectrogram:

Spectrogram of the Signal
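A sketch of building and applying the triangular Mel filter bank (26 filters, a 512-point FFT and 16 kHz are assumed defaults here):

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(nfilt=26, nfft=512, fs=16000):
    # Filter center frequencies are equally spaced on the Mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(1, nfilt + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    return fbank

# Applied as: fbank_energies = pspec @ mel_filterbank().T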

  7. Mel-frequency Cepstral Coefficients (MFCCs):

It turns out that filter bank coefficients computed in the previous step are highly correlated, which could be problematic in some machine learning algorithms. Therefore, we can apply Discrete Cosine Transform (DCT) to decorrelate the filter bank coefficients and yield a compressed representation of the filter banks.

The resulting MFCCs: MFCCs
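A sketch of this step, assuming SciPy's orthonormal DCT-II and keeping the first 13 coefficients (the number used in this project):

import numpy as np
from scipy.fftpack import dct

def mfcc_from_fbank(fbank_energies, numcep=13):
    # Log-compress the filter bank energies, decorrelate them with a DCT-II
    # and keep only the first `numcep` coefficients per frame
    log_fbank = np.log(fbank_energies)
    return dct(log_fbank, type=2, axis=1, norm="ortho")[:, :numcep]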

  8. Deltas:

Also known as differential and acceleration coefficients. The MFCC feature vector describes only the power spectral envelope of a single frame, but speech also carries information in the dynamics, i.e. the trajectories of the MFCC coefficients over time. It turns out that calculating the MFCC trajectories and appending them to the original feature vector increases ASR performance quite a bit (if we have 12 MFCC coefficients, we would also get 12 delta coefficients, which combine to give a feature vector of length 24).

To calculate the delta coefficients, the following formula is used:

d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),

where d_t is the delta coefficient of frame t, computed in terms of the static coefficients c_{t-N} to c_{t+N}. A typical value for N is 2. Delta-Delta (acceleration) coefficients are calculated in the same way, but from the deltas rather than the static coefficients.

The resulting combined MFCC and Delta-Delta features are shown below: MFCC + Deltas
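A sketch of the delta computation under this definition (edge frames are padded by repeating the first and last frames, which mirrors what python_speech_features.delta does):

import numpy as np

def compute_delta(feat, N=2):
    # d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    denom = 2 * sum(n ** 2 for n in range(1, N + 1))
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    deltas = np.zeros_like(feat, dtype=float)
    for t in range(feat.shape[0]):
        deltas[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                        for n in range(1, N + 1)) / denom
    return deltas

# Delta-Delta (acceleration) coefficients: compute_delta(compute_delta(mfcc_feat))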

C++: Vivado HLS implementation

Vivado HLS is a quick and quite simple way to implement the DNN model on the FPGA. Once the model is trained, you can extract its weights and biases and then code the predict() function (forward propagation) in C++.

After making sure the C++ code works correctly, we can go to the next step: Vivado HLS generates an archive with your model implementation, which you can use in the Vivado project.
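For reference, the forward propagation that the predict() function has to reproduce can be sketched in Python as follows (this assumes dense layers with ReLU hidden activations and a single sigmoid output; it is a reference sketch, not the project's C++ code):

import numpy as np

def predict(x, weights, biases):
    # x: 39 scaled MFCC + delta + delta-delta features for one frame
    # weights/biases: per-layer arrays exported from the trained model
    a = np.asarray(x, dtype=np.float32)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ W + b)                 # hidden layers: ReLU
    logit = (a @ weights[-1] + biases[-1]).item()      # single output unit assumed
    p = 1.0 / (1.0 + np.exp(-logit))                   # output layer: sigmoid
    return int(p > 0.5)                                # 1 = speech, 0 = no speech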

Below are the resources required by the module and the structure of its input/output ports:

Utilization Estimates

Interface

This is not an optimized design method, but I used the Kintex 7 and did not worry about resources in this test project, since it has a really large number of logic cells.

I also attached links to tutorials on how to use Vivado HLS:

How to add the generated module to the project is described later.

FPGA: VAD realization

I divided the Vivado project into parts and attached the image below to make the structure easier to follow.

Project structure

Vivado project structure:

  • Common modules: the main file (Vega_submain.v) and the I2S receiver (capture_audio_sample.v)
  • MFCC feature pipeline: the 'Calculation MFCC features' part computes the MFCC features and deltas for the model's inputs
  • Machine learning pipeline: the DNN module, an IP core generated with Vivado HLS
  • IP cores: all IP cores used in this project
  • TestBench: a file (tb_Vega_submain.vhd) to test the whole project

An instance of the LED indication driven by the DNN solver signals:

VAD_module: Vega_submain
    port map (
        g_fast_clk  => S_AXI_ACLK,
        bclk        => bclk_in,
        wclk        => wclk_in,
        d_audio     => d_audio_in,
        DNN_Done    => DNN_Done,
        DNN_Result  => DNN_Result
    );
        
process(DNN_Done) begin
    if rising_edge(DNN_Done) then
        if (DNN_Result = '0') then
            Marker_On_LED_1 <= '0';
        else
            Marker_On_LED_1 <= '1';
        end if;
    end if;
end process;

Add this in the main part and connect it to an FPGA output pin.

Data

I worked with a preprocessed dataset based on data from this Kaggle competition. I processed WAV files with a sampling rate of 16000 Hz.

Requirements

Hardware:

  • Kintex 7 (xc7ktffg900-2) on a factory board (processing unit)
  • 16-bit ADC with I2S interface on a factory board (analog audio conversion)
  • MEMS microphone or mini-jack for debug audio from a PC (analog audio input)
  • 500 Ohm resistor (to connect the LED)
  • Red LED (for indication)

Software:

  • Anaconda (Jupyter Notebook)
  • Visual Studio 2019
  • Vivado 2016.2
  • Vivado HLS 2016.2

Machine learning model processing results

In the modeling part we can select a satisfactory model structure for the VAD. In DNN modeling.ipynb I prepare the data to fit the model and evaluate the results of several DNN structures. The mean binary classification accuracy fluctuates between 0.8 and 0.9 depending on the structure. I used accuracy and ROC-AUC metrics to evaluate the models.
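A sketch of how such an evaluation can be done with scikit-learn (the variable names here are assumptions about the notebook):

from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model, X_val, y_val):
    # X_val: (n_frames, 39) scaled features, y_val: 0/1 speech labels (assumed names)
    p = model.predict(X_val).ravel()
    return accuracy_score(y_val, p > 0.5), roc_auc_score(y_val, p)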

Below you can see plots with the results of the VAD algorithm on a real validation signal and ROC-AUC curves for several evaluated models.

VAD result

ROC-AUC curve

In DNN modeling.ipynb I also print the model weights as C++ arrays to simplify transferring them into the C++ code. You can see that the DNN results are not perfect and there are quite a few Type I errors (false positives) and Type II errors (false negatives), but it is acceptable for me.
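A sketch of such an export (the exact formatting in the notebook may differ; this is just one way to do it):

import numpy as np

def to_cpp_array(name, arr):
    # Flatten a weight matrix into a C++ float initializer list
    flat = np.asarray(arr, dtype=np.float32).ravel()
    body = ", ".join(f"{v:.8f}f" for v in flat)
    return f"const float {name}[{flat.size}] = {{{body}}};"

# Example: print(to_cpp_array("W1", model.layers[0].get_weights()[0]))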

Simulation of the FPGA realization

The correct operation of the architecture is checked by driving the model's input with a counter that increases by 1 at each step (cycle). Comparing the Python realization with the FPGA simulation gives partial confirmation that the algorithm operates correctly. So let's compare the results for the MFCC, delta, and DNN processing. Below are images of the FPGA simulation and a comparison with the output of the Python frameworks.

Example Python code of the counter processing (for first frame):

import numpy as np
from python_speech_features import mfcc, delta

# SAMPLE_RATE, FRAME_LENGTH, FRAME_STEP, NFEATURES, NFILTERS, NFFT,
# PREEMPHASIS_COEF, CEPLIFTER, APPEND_ENERGY and WINDOW_FUNCTION are the
# feature-extraction constants defined in the notebook
test_input = np.linspace(0, 65535, 65536)
test_features = mfcc(test_input,
                     SAMPLE_RATE,
                     winlen=FRAME_LENGTH,
                     winstep=FRAME_STEP,
                     numcep=NFEATURES,
                     nfilt=NFILTERS,
                     nfft=NFFT,
                     lowfreq=0,
                     highfreq=None,
                     preemph=PREEMPHASIS_COEF,
                     ceplifter=CEPLIFTER,
                     appendEnergy=APPEND_ENERGY,
                     winfunc=WINDOW_FUNCTION)
d_test_features = delta(test_features, N=2)
d2_test_features = delta(d_test_features, N=2)
test_features_deltas = np.hstack((test_features, d_test_features, d2_test_features))
pred = model.predict(test_features_deltas)

Results of the counter processing (for first frame):

MFCC result. First frame of the counter: 10.45426599 21.31225632 16.06946506 15.04146288 16.36829557 16.91198305 18.26702311 18.23718373 18.41393675 17.85163217 17.92255124 17.04174921 16.01008549
First order of  delta MFCC result. First frame of the counter: 0.39108477 0.40589755 0.27888866 0.58250356 0.61608484 0.76784735 0.82442396 0.90298802 0.9289716  0.95048831 0.93405364 0.90662207 0.85006181
Second order of  delta MFCC result. First frame of the counter: 0.05386912 0.02714361 0.06751415 0.07705067 0.10155177 0.1143419 0.12899826 0.13723392 0.143734   0.14520386 0.14398903 0.13859931 0.13072916
DNN prediction result. First frame of the counter: 0

Start of the resulting MFCC + Delta-Delta conversion: Start conversion

End of the resulting MFCC + Delta-Delta conversion: End conversion

tvalid_SVM - valid flag of the MFCC + delta-delta MFCC processing result.

SVM - result of the MFCC + delta-delta MFCC processing.

DNN interface and processing: DNN processing

Output result of the DNN: DNN result

mfcc_ce0 - latches the input data on each clock (output).

ap_clk - clock signal (input).

ap_rst - reset signal: 1'b1 - active, 1'b0 - inactive (input).

ap_start - valid signal asserted while the input data is being transferred (input).

ap_done - valid signal asserted when the DNN result can be taken (output).

ap_idle - asserted while the DNN is waiting for data (output).

ap_ready - same as ap_done (output).

ap_return[31:0] - output data in integer format; we expect 1 or 0 (output).

ap_address[5:0] - counter of the input data; counts up to 39 (output).

mfcc_q0[31:0] - input data port in float32 format (input).

So you need to take the prediction ap_return[31:0] from the DNN module when the ap_done valid signal is asserted. You can also use ap_done as a latch if you want to use the DNN result in further processing.

Demonstration and results

In the table you can see the resources the FPGA spent on the DNN, mainly LUTs and multipliers (DSPs): DNN used resources

After synthesizing the whole project, we get the bitstream file for programming the device.

Below you can see the results of the algorithm running in real time (video sample for test):

Kintex7 Demo