An unofficial PyTorch implementation of the model described in "LipNet: End-to-End Sentence-level Lipreading" by Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. Based on the official Torch implementation.
First, create symbolic links to where you store images and alignments in the data
folder:
mkdir data
ln -s PATH_TO_ALIGNS data/align
ln -s PATH_TO_IMAGES data/images
Then run the program:
python3 train_lipnet.py
This trains on the "unseen speakers" split. To train on the "overlapped speakers" split:
python3 train_lipnet.py --test_overlapped
The overlapped speakers file list we use (list_overlapped.json
) is exported directly from the authors' Torch implementation release here.
To monitor training progress:
tensorboard --logdir logs
The images
folder should be organised as:
├── s1
│ ├── bbaf2n
│ │ ├── mouth_000.png
│ │ ├── mouth_001.png
...
And the align
folder:
├── s1
│ ├── bbaf2n.align
│ ├── bbaf3s.align
│ ├── bbaf4p.align
...
That's it! You can specify the GPU to use in the program where the environment variable CUDA_VISIBLE_DEVICES
is set. Feel free to play around with the parameters.
- Python 3.x
- PyTorch 1.1+ (for native CTCLoss and TensorBoard support; we highly recommend using nightly builds, because PyTorch CTC is quite buggy and often fixes are not reflected in due course.)
- tensorboardX (if you are not using PyTorch 1.1+, or your TensorFlow version is incompatible with native PyTorch Tensorboard support)
- ctcdecode (for beam search decoding)
- torchsummary
- progressbar2
- editdistance
- scikit-image
- torchvision
- pillow
TODO
- Add saliency visualisation
- Add preprocessing code