Implements Wideband Audio Waveform Evaluation networks or WAWEnets.
This WAWEnets implementation produces one or more speech quality or intelligibility values for each input speech signal without using reference speech signals. WAWEnets are convolutional networks and they have been trained using full-reference objective speech quality and speech intelligibility values as well as subjective scores.
The `.pt` model files in `./wawenets/weights` are plain PyTorch model files, and are suitable for creating new traced JIT files for C++ or ONNX in the future.
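For reference, here is a minimal sketch of how one of these weights files could be turned into a traced JIT file. The model class, import path, and weights filename below are assumptions for illustration, not the repo's verified API:

```python
import torch

# hypothetical import path and class name; substitute the model class this
# repo actually defines
from wawenets.model import WAWEnet

model = WAWEnet()
# assumes the .pt file holds a state dict; it may instead hold a full module
model.load_state_dict(torch.load("wawenets/weights/mode1.pt"))
model.eval()

# 3 seconds of audio at 16,000 smp/sec -> one (batch, channel, 48000) tensor
example_input = torch.zeros(1, 1, 48000)

traced = torch.jit.trace(model, example_input)
traced.save("wawenet_traced.pt")  # loadable from C++ via torch::jit::load
```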
Details can be found in the ICASSP 2020 WAWEnets paper [1] and the follow-up article [6].
If you need to cite our work, please use the following:
@INPROCEEDINGS{9054204,
  author={A. A. {Catellier} and S. D. {Voran}},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Wawenets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality},
  year={2020},
  volume={},
  number={},
  pages={331-335},
}
In order to run the WAWEnets Python code, some initial setup is required. Please follow the instructions below to prepare your machine and environment.
SoX is an audio processing library and CLI tool useful for format conversions, padding, and trimming, among other things.
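For example, once installed, a single SoX call can convert a stereo 48k smp/sec file to mono 16k smp/sec; the filenames here are placeholders, and SoX inserts the needed rate and remix effects automatically when the output format differs from the input:

sox input_48k_stereo.wav -r 16000 -c 1 output_16k_mono.wav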
To install SoX on a Debian-based Linux, use the apt package manager:
apt install sox
On macOS the easiest way to install SoX is by using brew. Follow the instructions to install brew, then use brew to install SoX:
brew install sox
In order to install SoX on Windows, follow the instructions on the SoX SourceForge page.
The Python WAWEnets implementation relies on ITU-T STL executables in order to resample audio files and measure speech levels. We use a few STL utilities (`actlev`, `filter`, and `sv56demo`) for some functions that are also available in `torchaudio`, because this allows us to be reasonably certain that the audio processing steps are the same among all WAWEnets implementations (C++, MATLAB, etc.).
First we must compile the STL executables. To do this, clone the STL repo and then follow the build procedure.
After the build procedure is complete, return to the WAWEnets Python implementation.
Create a copy of `config.yaml.template` named `config.yaml`:
cp wawenets/config/config.yaml.template wawenets/config/config.yaml
Edit `config.yaml` to point to the `bin` dir where the STL tools have been compiled, e.g. `/path/to/STL/bin`.
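The template documents the expected keys. Purely as an illustration (the key name below is hypothetical; copy the real one from the template), the edited file might contain a line like:

stl_bin_path: /path/to/STL/bin   # hypothetical key name; use the template's actual key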
One way to install the Python libraries required to run the Python version of WAWEnets is using Anaconda (or Miniconda). Once Anaconda or Miniconda is installed, use the following commands to set up and activate a new conda env:
conda env create -f wenets_env.yaml
conda activate wenets_dist
After the Anaconda environment has been created and activated, execute the following code to install and test the `wawenets` package:
cd wawenets
poetry install
pytest
After successfully completing the above steps, it should be possible to run the following command:
python wawenets_cli.py --help
and see its output:
Usage: wawenets_cli.py [OPTIONS]
the CLI interface Python WAWEnets produces quality or intelligibility
estimates for specified speech files.
Options:
-m, --mode INTEGER specifies a WAWEnet mode, default is 1
-i, --infile TEXT either a .wav file or a .txt file where each line
specifies a suitable .wav file. if the latter, files
will be processed in sequence. [required]
-l, --level BOOLEAN whether or not contents of a given .wav file should
be normalized. default is True.
-s, --stride INTEGER stride (in samples @16k samp/sec) on which to make
predictions. default is 48,000, meaning if a .wav
file is longer than 3 seconds, the model will
generate a prediction for neighboring 3-second
segments.
-c, --channel INTEGER specifies a channel to use if .wav file has multiple
channels. default is 1 using indices starting at 1
-o, --output TEXT path where a CSV file containing predictions should
be written. default is None, and results are printed
to stdout
--help Show this message and exit.
`infile` is either a .wav file or a .txt file where each line specifies the path to a suitable .wav file. In this second case, the listed .wav files will be processed in sequence. NOTE: when using a .txt file to specify which .wav files to process, the software will always process the first channel of each file.
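For example, to process a single file in the default mode, or a list of files with results written to a CSV (the filenames here are placeholders):

python wawenets_cli.py -i speech.wav

python wawenets_cli.py -m 1 -i file_list.txt -o predictions.csv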
A suitable `.wav` file must:
- be uncompressed
- have sample rate 8, 16, 24, 32, or 48k smp/sec.
- contain at least 3 seconds of speech
To best match the designed scope of WAWEnets, the `.wav` file should have a speech activity factor of roughly 0.5 or greater and an active speech level near 26 dB below the clipping points of +/- 1.0 (see the level normalization feature below). The native sample rate for WAWEnets is 16k smp/sec, so files with rates of 8, 24, 32, or 48k smp/sec are converted internally before processing.
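As a quick sanity check, here is a minimal sketch (not part of the package) that tests the first two requirements with Python's standard wave module; the speech activity and level conditions still need a tool like `sv56demo` or `actlev`:

```python
import wave

SUITABLE_RATES = {8000, 16000, 24000, 32000, 48000}

def looks_suitable(path: str) -> bool:
    """Check sample rate and duration; wave only opens uncompressed PCM,
    so a compressed file raises an error before reaching the return."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate in SUITABLE_RATES and duration >= 3.0

print(looks_suitable("speech.wav"))  # placeholder filename
```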
`-m M` specifies a WAWEnet mode. The integer M specifies the WAWEnet trained using a specific full-reference target.

- `-m 1`: WAWEnet trained using WB-PESQ [2] target values (default)
- `-m 2`: WAWEnet trained using POLQA [3] target values
- `-m 3`: WAWEnet trained using PEMO [4] target values
- `-m 4`: WAWEnet trained using STOI [5] target values
- `-m 5`: WAWEnet trained using seven objective targets: WB-PESQ, POLQA, STOI, PEMO, ViSQOL3 (c310), ESTOI, and SIIBGauss [6]
- `-m 6`: WAWEnet trained using four subjective targets (mos, noi, col, dis) and seven objective targets (WB-PESQ, POLQA, STOI, PEMO, ViSQOL3 (c310), ESTOI, and SIIBGauss) [6]
`-l L` specifies internal level normalization of `.wav` file contents to 26 dB below clipping:

- `-l 0`: normalization off
- `-l 1`: normalization on (default)
`-s S` specifies the segment step (stride) and is an integer with value 1 or greater. Default is `-s 48000`. WAWEnet requires a full 3 seconds of signal to generate a result, so if a `.wav` file is longer than 3 seconds, multiple results may be produced. S specifies the number of samples to move ahead in the speech file when extracting the next segment. The default value of 48,000 gives zero overlap between segments; with this default, any input less than 6 seconds long will produce one result (based on just the first 3 seconds), and a 6-second input will produce two results. With `-s 24000`, for example, segment overlap will be 50%, a 4.5-second input will produce two results, and a 6-second input will produce three results.
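To make the arithmetic concrete, here is a small sketch (not code from the package) that reproduces the segment counts described above:

```python
SAMPLE_RATE = 16000                 # WAWEnet's native rate, smp/sec
SEGMENT_SAMPLES = 3 * SAMPLE_RATE   # 48,000 samples per 3-second segment

def segment_starts(duration_seconds, stride=48000):
    """Start times in seconds of each 3-second segment for a given stride."""
    total_samples = int(duration_seconds * SAMPLE_RATE)
    starts = range(0, total_samples - SEGMENT_SAMPLES + 1, stride)
    return [s / SAMPLE_RATE for s in starts]

print(segment_starts(6.0))           # [0.0, 3.0]: two results
print(segment_starts(4.5, 24000))    # [0.0, 1.5]: two results, 50% overlap
print(segment_starts(6.0, 24000))    # [0.0, 1.5, 3.0]: three results
```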
`-c C` specifies a channel number to use when the input speech is in a multi-channel `.wav` file. Default is `-c 1`. NOTE: when using a .txt file to specify which .wav files to process, the software will always process the first channel of each file.
`-o 'myFile.txt'` specifies a text file that captures WAWEnet results on a new line for each speech input processed. If the file exists it will be appended to. The extension `.txt` will be added as needed. Default is that no `.txt` file is generated.
The output for each of the N speech signals processed is in the format:
[row] [wavfile] [channel] [sample_rate] [duration] [level_normalization] [segment_step_size] [WAWEnet_mode] [segment_number] [start_time] [stop_time] [active_level] [speech_activity] [model_prediction]
where:
- `row`: an identifier for the current row of output
- `wavfile`: the filename that has been processed
- `channel`: the channel of `wavfile` that has been processed
- `sample_rate`: native sample rate of `wavfile`
- `duration`: duration of `wavfile` in seconds
- `level_normalization`: reflects whether `wavfile` was normalized during processing
- `segment_step_size`: reflects the segment step (stride) used to process `wavfile`
- `WAWEnet_mode`: the mode `wavfile` has been processed with
- `segment_number`: a zero-based index that indicates which segment of `wavfile` was processed
- `start_time`: the time in seconds where the current segment began within `wavfile`
- `stop_time`: the time in seconds where the current segment ended within `wavfile`
- `active_level`: active speech level of the specified segment of `wavfile` in dB below overload
- `speech_activity`: the speech activity factor of the specified segment of `wavfile`
- `model_prediction`: output value produced by WAWEnet for the specified segment of `wavfile`
Internally, `pandas` is used to generate the text output. If the `-o` option is specified, `pandas` generates a CSV and writes it to the given file path.
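For downstream analysis, here is a minimal sketch of loading the results back into `pandas`, assuming the CSV columns carry the field names listed above:

```python
import pandas as pd

results = pd.read_csv("predictions.csv")  # placeholder output path

# one common summary: the mean prediction per input file across segments
per_file = results.groupby("wavfile")["model_prediction"].mean()
print(per_file)
```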
Inside `wawenets/wawenet_trainer` you will find code that will train WAWEnets. Unfortunately, we are not able to share any data to train on, but you can easily build your own dataset and use this code to train a WAWEnet customized for your application.
WAWEnets accept `.wav` files that have a sample rate of 16,000 samples/second and are exactly 3 seconds long. Put your `.wav` files in a specific location, and use that location for the `train.py` argument `--data_root_path`.
The training code will read either `pandas` dataframe-style JSON files or CSVs. These dataframes should have the following columns:

- `filename`: the name of the file described by this row
- `split`: either `TRAIN`, `TEST`, `VAL`, or `UNSEEN`; which part of the training process should this file be used for?
- `impairment`: what speech processing impairment does this file exhibit?
- `datasetLanguage`: what language are the talkers in this dataset speaking?
- `[TARGET_NAME]`: include any target values you'd like to imitate
- `[FILE_METADATA]`: (optional) any metadata you'd like to perhaps act on later

Any dataframe in this format can be used as the argument `--csv_path`.
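As an illustration, here is a minimal sketch that builds a dataframe with these columns and writes it to a CSV; the filenames, impairment labels, and the "mos" target column are invented for the example:

```python
import pandas as pd

rows = [
    {"filename": "clip_0001.wav", "split": "TRAIN",
     "impairment": "codec", "datasetLanguage": "en", "mos": 4.2},
    {"filename": "clip_0002.wav", "split": "VAL",
     "impairment": "noise", "datasetLanguage": "en", "mos": 2.9},
]

df = pd.DataFrame(rows)
df.to_csv("my_dataset.csv", index=False)  # pass this path via --csv_path
```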
One way to install the Python libraries required to train WAWEnets is using Anaconda (or Miniconda). Once Anaconda or Miniconda is installed, use the following commands to set up and activate a new conda env:
conda env create -f wenets_train_env.yaml
conda activate wenets_train
After the Anaconda environment has been created and activated, execute the following code to install the `wawenets` package:
cd wawenets
poetry install
The training entrypoint is `train.py`. It has extensive options, all exposed to the command line via arguments:
python train.py [ARGS]
There are preset configurations that will define most of these options for you. Using `generic_regime` for the `--training_regime` argument is a good start.
Train your net!
python train.py --training_regime generic_regime --csv_path /path/to/csv --data_root_path /path/to/data
By default, results will be logged to `~/wenets_training_artifacts`, and they will include dataframe result summaries as well as 2D histograms showing predictions vs. actual values.
[1] A. A. Catellier and S. D. Voran, "WAWEnets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 331-335.

[2] ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ)," Geneva, 2001.

[3] ITU-T Recommendation P.863, "Perceptual objective listening quality analysis," Geneva, 2018.

[4] R. Huber and B. Kollmeier, "PEMO-Q — A new method for objective audio quality assessment using a model of auditory perception," IEEE Trans. ASLP, vol. 14, no. 6, pp. 1902-1911, Nov. 2006.

[5] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. ASLP, vol. 19, no. 7, pp. 2125-2136, Sep. 2011.

[6] A. A. Catellier and S. D. Voran, "Wideband Audio Waveform Evaluation Networks: Efficient, Accurate Estimation of Speech Qualities," arXiv preprint, Jun. 2022.