This is a PyTorch implementation of FFTNet, as described here. Work in progress.

Quick Start

  1. Install requirements.
pip install -r requirements.txt
  2. Download the CMU_ARCTIC dataset.

  3. Train the model and save it. The default parameters are nearly the same as in the original paper. Pass the --preprocess flag the first time you run training.

python train.py \
    --preprocess \
    --wav_dir your_downloaded_wav_dir \
    --data_dir preprocessed_feature_dir \
    --model_file saved_model_name
  4. Use the trained model to decode/reconstruct a wav file from the MCC features.
python decode.py \
    --infile wav_file \
    --outfile reconstruct_file_name \
    --data_dir preprocessed_feature_dir \
    --model_file saved_model_name

FFTNet_generator and FFTNet_vocoder are two files I used to check that the model works, using the torchaudio yesno dataset.

Current result

Some decoded files are in the samples folder.

Differences from paper

  • window size: 400 in the paper >> depends on minimum_f0 here (because I use pyworld to get the f0 and MCC coefficients)

TODO

  • Zero padding.
  • Injected noise.
  • Voiced/unvoiced conditional sampling.
  • Post-synthesis denoising.

Notes

  • I combine the two 1x1 convolution kernels into one 1x2 dilated kernel. This removes a redundant bias parameter and speeds up the model overall.
  • According to the author, the channel size in the middle layers is 128, not 256.
  • My model gets stuck at the beginning (loss around 4.x) for thousands of steps, then drops very quickly to 2.6 ~ 3.0. Using a smaller learning rate helps a little.
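The first note can be checked numerically. Below is a minimal, framework-free sketch (plain Python lists rather than PyTorch tensors, so it is not the repository's code) that compares the two formulations: two 1x1 convolutions applied to the two halves of the input and summed, versus one kernel-size-2 dilated convolution with a single bias. All function names and sizes here are made up for illustration.

```python
import random

def conv1x1(x, w, b):
    """1x1 convolution. x: [T][C_in], w: [C_out][C_in], b: [C_out]."""
    return [[sum(wi * xi for wi, xi in zip(row, frame)) + bias
             for row, bias in zip(w, b)]
            for frame in x]

def two_branch_layer(x, w_l, b_l, w_r, b_r, shift):
    """Split the input in two, apply a 1x1 conv to each part, and sum."""
    left = conv1x1(x[:-shift], w_l, b_l)
    right = conv1x1(x[shift:], w_r, b_r)
    return [[a + c for a, c in zip(fl, fr)] for fl, fr in zip(left, right)]

def dilated_k2_layer(x, w_l, w_r, b, dilation):
    """One kernel-size-2 dilated conv: taps at t and t + dilation, one bias."""
    w = [rl + rr for rl, rr in zip(w_l, w_r)]  # [C_out][2 * C_in]
    pairs = [x[t] + x[t + dilation] for t in range(len(x) - dilation)]
    return conv1x1(pairs, w, b)

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
T, C_IN, C_OUT, SHIFT = 8, 3, 2, 4
x = rand_matrix(rng, T, C_IN)
w_l, w_r = rand_matrix(rng, C_OUT, C_IN), rand_matrix(rng, C_OUT, C_IN)
b_l = [rng.uniform(-1, 1) for _ in range(C_OUT)]
b_r = [rng.uniform(-1, 1) for _ in range(C_OUT)]
b = [p + q for p, q in zip(b_l, b_r)]  # the two biases collapse into one

ref = two_branch_layer(x, w_l, b_l, w_r, b_r, SHIFT)
merged = dilated_k2_layer(x, w_l, w_r, b, SHIFT)
max_diff = max(abs(p - q) for fr, fm in zip(ref, merged) for p, q in zip(fr, fm))
print("max abs difference:", max_diff)
```

The two outputs agree up to floating-point rounding, which is why the second bias vector is redundant: only the sum of the two biases ever reaches the output.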

Variations of FFTNet

Radix-N FFTNet

Use the flag --radixs to specify each layer's radix.

# a radix-4 FFTNet with 1024 receptive field
python train.py --radixs 4 4 4 4 4

The original FFTNet uses a radix-2 structure. In my experiments, a radix-4 network (and even a radix-8 one) still achieves similar results, and because it has fewer layers, it runs faster.
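The arithmetic behind the radix choice can be sketched as follows. This assumes that each layer splits its input into as many parts as its radix, so the receptive field is the product of the radixes and a layer's dilation is the product of the radixes of the layers after it. The helper names are hypothetical and are not taken from train.py.

```python
from math import prod

def receptive_field(radixs):
    """Receptive field of the stack: the product of all layer radixes."""
    return prod(radixs)

def layer_dilations(radixs):
    """Dilation of each layer, from the input side to the output side."""
    dilations = []
    remaining = prod(radixs)
    for r in radixs:
        remaining //= r  # each layer splits its input into r parts
        dilations.append(remaining)
    return dilations

for name, radixs in [("radix-2", [2] * 10), ("radix-4", [4] * 5)]:
    print(name, len(radixs), "layers,",
          "receptive field", receptive_field(radixs),
          "dilations", layer_dilations(radixs))
```

Both configurations cover 1024 samples, but the radix-4 stack needs 5 layers where the radix-2 stack needs 10.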

Transposed FFTNet

Fig. 2 in the paper can be redrawn as a dilated structure with kernel size 2 (which also means radix 2).

If we draw all the lines;

and transpose the graph so that the arrows go backward, you'll find a WaveNet dilated structure.

Add the --transpose flag to get a simplified version of WaveNet.

# a WaveNet-like model without gated/residual/skip units
python train.py --transpose

In my experiments, the transposed models are easier to train and reach a slightly lower training loss than FFTNet.
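One way to picture the transposition is as a reversal of the dilation order: the FFTNet stack goes from the largest dilation to the smallest, while the WaveNet-style stack goes from the smallest to the largest. That reading is an assumption based on the description above, not something checked against the --transpose code path. Under that assumption, the sketch below (with a hypothetical helper name) shows that both orderings see the same input samples, so only the wiring differs.

```python
def input_taps(dilations):
    """Input offsets, relative to the output sample, visible to a stack of
    kernel-size-2 dilated layers. `dilations` is ordered from the layer
    nearest the input to the layer nearest the output."""
    taps = {0}
    for d in reversed(dilations):
        # each layer looks at its current position and at one tap d steps back
        taps = taps | {t - d for t in taps}
    return taps

fftnet_like = [4, 2, 1]             # large to small dilation
wavenet_like = fftnet_like[::-1]    # small to large dilation

print(sorted(input_taps(fftnet_like)))
print(sorted(input_taps(wavenet_like)))
print(input_taps(fftnet_like) == input_taps(wavenet_like))
```

With three radix-2 layers, both stacks cover the same 8 consecutive input samples.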