This is a machine learning natural language processing (NLP) project submission for the Global PyTorch Summer Hackathon! #PTSH19. For the documentation in French, click here!
- French-English translation service using natural language processing (NLP), with a vision of connecting people through language and advancing a barrier-free society for bilingual speakers
- As a Canadian citizen, ensuring respect for English and French as the official languages of Canada, with equality of status, rights, and privileges
- Customer Service
- Chatbots taking over repetitive easy-to-automate human jobs
- Ex: Bank tellers, cashiers, or sales associates
- Legal Industry
- NLP used to automate or summarize long and mundane documents
- Ex: One legal case has an average of 400-500 pages
- Financial Industry
- Reduce the manual processing required to retrieve corporate data
- Ex: Information from financial reports, press releases, or news articles
PyTorch
- Open-source deep learning research platform that provides maximum flexibility and speed, with tensors that live on the GPU to accelerate computation
Fairseq
- Sequence modeling toolkit written in PyTorch
- Train custom models for Neural Machine Translation (NMT), as well as summarization, language modeling, and other text generation tasks (see the usage sketch below)
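As a quick illustration of what Fairseq provides out of the box, the sketch below loads a pre-trained Transformer checkpoint through torch.hub and translates a sentence. It assumes the transformer.wmt14.en-fr checkpoint listed in the Fairseq documentation and an environment with fairseq, sacremoses, and subword-nmt installed; this project's own model translates in the opposite (French-to-English) direction.

```python
import torch

# Load a pre-trained Fairseq Transformer via torch.hub (checkpoint name taken
# from the Fairseq documentation; downloading it requires an internet connection).
en2fr = torch.hub.load('pytorch/fairseq', 'transformer.wmt14.en-fr',
                       tokenizer='moses', bpe='subword_nmt')
en2fr.eval()  # disable dropout for deterministic inference

# Translate an English sentence to French; this project trains the reverse direction.
print(en2fr.translate('I am not scared of dying.'))
```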
Transformer Machine Learning Model without Sequence-Aligned RNNs or CNNs
- Machine translation Transformer model from "Attention Is All You Need", using encoder-decoder attention mechanisms in a sequence-to-sequence model that features stacked self-attention layers
- Transduction model relying on self-attention layers to compute input and output representations, where the attention function maps a query and a set of key-value pairs to an output vector computed as a weighted sum of the values (see the sketch below)
Image of Transformer model. The encoder maps sequence X_n (x_1, x_2 ... x_n) --> sequence Z_n (z_1, z_2 ... z_n). From Z_n, the decoder generates sequence Y_m (y_1, y_2 ... y_m) one element at a time. Image Source
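A minimal PyTorch sketch of the scaled dot-product attention described above (an illustration of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, not the project's actual model code):

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)  # attention distribution over the keys
    return weights @ value, weights

# Toy example: 5 source positions, 3 target positions, model dimension 8.
q = torch.randn(3, 8)
k = torch.randn(5, 8)
v = torch.randn(5, 8)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([3, 8]) torch.Size([3, 5])
```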
Measure translation speed
- Record the translation time once the machine learning system is shown a sentence, in order to quantify results
- "On the WMT 2014 English-to-French translation task (a widely used benchmark metric for judging the accuracy of machine translation), attention model establishes a BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature"
Image: the Transformer model scores high on the BLEU scale with low training cost. Image Source
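BLEU compares model translations against reference translations; below is a minimal sketch of computing a corpus-level score with the sacrebleu package (the hypothesis and reference strings are made-up examples, not project results):

```python
import sacrebleu

# Made-up hypothesis/reference pair; in practice these come from the test set.
hypotheses = ['I am not scared of dying.']
references = [['I am not afraid of dying.']]  # one list per set of references

# Corpus-level BLEU on a 0-100 scale.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```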
Gating to control the flow of hidden units (see the sketch below)
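The convolutional translation model gates its hidden units with gated linear units (GLUs), which split each activation into values and gates and pass forward only what the gates allow. A minimal PyTorch sketch of the gating operation itself (not the full convolutional block):

```python
import torch
import torch.nn.functional as F

# A convolution in the model produces 2 * hidden_dim channels per position.
# F.glu splits them into values A and gates B and returns A * sigmoid(B),
# so the gates control how much of each hidden unit flows forward.
hidden_dim = 16
x = torch.randn(1, 10, 2 * hidden_dim)   # (batch, sequence, 2 * hidden_dim)
gated = F.glu(x, dim=-1)                 # (batch, sequence, hidden_dim)
print(gated.shape)                       # torch.Size([1, 10, 16])
```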
Multi-Hop Attention Functionality
- Self-attention layers, where all of the keys, values, and queries come from the same input
- The CNN encoder creates a vector for each word to be translated, and the CNN decoder translates words while PyTorch computations are made simultaneously. The network has two decoder layers, and attention is computed in each layer. Refer to the image and the sketch below.
Image of Multi-hop Attention tensor computations where green lines represent attention paid to each French word. Image Source
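A toy illustration of multi-hop attention, in which each decoder hop recomputes a dot-product attention distribution over the encoder's per-word vectors (hypothetical shapes and a plain loop, not the fconv architecture itself):

```python
import torch

def attend(decoder_state, encoder_states):
    # Dot-product attention of a single decoder state over all encoder states.
    scores = encoder_states @ decoder_state   # (src_len,)
    weights = torch.softmax(scores, dim=-1)   # attention over source words
    context = weights @ encoder_states        # weighted sum of encoder vectors
    return context, weights

src_len, dim = 6, 8
encoder_states = torch.randn(src_len, dim)  # one vector per French source word
decoder_state = torch.randn(dim)

# "Multi-hop": each decoder layer pays attention to the source again.
for hop in range(2):
    context, weights = attend(decoder_state, encoder_states)
    decoder_state = decoder_state + context  # feed the context into the next hop
    print(f'hop {hop}: strongest attention on source position {weights.argmax().item() + 1}')
```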
Statistical machine translation WMT 2014 French-English benchmark with a corpus size of 2.3 GB and 36 million sentence pairs. The dataset is too big to include in the repo; download and extract it to /data/iwslt14/ to replace iwslt14.en.txt and iwslt14.fr.txt
For French-English translations, the order of words matters, and words can be added during translation: "Black cat" translates to "chat noir", and "not" translates to "ne ___ pas". Refer to the image below:
Dataset includes: Commoncrawl, Europarl-v7, Giga, News-commentary, and Undoc data
Image of sentence sequence prediction. Image Source
$ cd data/
$ bash prepare-iwslt14.sh
$ cd ..
$ TEXT=data/iwslt14.tokenized.fr-en
Binarize data
$ fairseq preprocess -sourcelang fr -targetlang en \
-trainpref $TEXT/train -validpref $TEXT/valid -testpref $TEXT/test \
-thresholdsrc 3 -thresholdtgt 3 -destdir data-bin/iwslt14.tokenized.fr-en \
-workers 60
Train a new CNN model (dropout rate of 0.2) with fairseq train:
$ mkdir -p trainings/fconv
$ fairseq train -sourcelang fr -targetlang en -datadir data-bin/iwslt14.tokenized.fr-en \
-model fconv -nenclayer 4 -nlayer 3 -dropout 0.2 -optim nag -lr 0.25 -clip 0.1 \
-momentum 0.99 -timeavg -bptt 0 \
-savedir trainings/fconv
$ DATA=data-bin/iwslt14.tokenized.fr-en
$ fairseq generate-lines -sourcedict $DATA/dict.fr.th7 -targetdict $DATA/dict.en.th7 \
-path trainings/fconv/model_best_opt.th7 -beam 10 -nbest 2
| [target] Dictionary: 24738 types
| [source] Dictionary: 35474 types
> Je ne crains pas de mourir.
Source: Je ne crains pas de mourir.
Original_Sentence: Je ne crains pas de mourir.
Hypothesis: -0.23804219067097 I am not scared of dying.
Attention_Maxima: 2 2 3 4 5 6 7 8 9
Hypothesis: -0.23861141502857 I am not scared of dying.
Attention_Maxima: 2 2 3 4 5 7 6 7 9 9
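The Attention_Maxima lines list, for each generated token, the source position that received the most attention. Given an attention matrix, that is simply an argmax per output step; a small sketch (with a made-up matrix, assuming rows index output tokens and positions are reported 1-based):

```python
import torch

# Made-up attention matrix: one row per output token, one column per source token.
attention = torch.softmax(torch.randn(9, 7), dim=-1)

# Source position each output token attends to most strongly (1-based positions).
attention_maxima = attention.argmax(dim=-1) + 1
print(attention_maxima.tolist())
```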
Step-by-step visualization of the encoder-decoder network's attention matrix as it works through a sentence translation. Use the matplotlib library to display the matrix via plt.matshow(attention) (a sketch follows below the image):
Image of attention matrix: input steps vs. output steps for the sample sentence "Je ne crains pas de mourir."
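A minimal plotting sketch, assuming attention is an output-steps x input-steps matrix of weights collected during decoding (the tokens and weights below are placeholders):

```python
import matplotlib.pyplot as plt
import torch

# Placeholder attention weights: rows = output (English) steps,
# columns = input (French) steps; each row sums to 1.
input_tokens = ['Je', 'ne', 'crains', 'pas', 'de', 'mourir', '.']
output_tokens = ['I', 'am', 'not', 'scared', 'of', 'dying', '.']
attention = torch.softmax(torch.randn(len(output_tokens), len(input_tokens)), dim=-1)

fig, ax = plt.subplots()
cax = ax.matshow(attention.numpy(), cmap='bone')  # same display as plt.matshow(attention)
fig.colorbar(cax)
ax.set_xticks(range(len(input_tokens)))
ax.set_xticklabels(input_tokens, rotation=90)
ax.set_yticks(range(len(output_tokens)))
ax.set_yticklabels(output_tokens)
plt.show()
```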
- "Attention Is All You Need": https://arxiv.org/abs/1706.03762
- Fairseq Technical Documentation: https://fairseq.readthedocs.io/en/latest/models.html#module-fairseq.models.transformer
- A fast, batched Bi-RNN(GRU) encoder & attention decoder implementation in PyTorch: https://github.com/AuCson/PyTorch-Batch-Attention-Seq2seq
- Gated-Attention Architectures for Task-Oriented Language Grounding: https://github.com/devendrachaplot/DeepRL-Grounding
- Sequence to Sequence models with PyTorch: https://github.com/MaximumEntropy/Seq2Seq-PyTorch
- PyTorch implementation of OpenAI's Finetuned Transformer Language Model paper "Improving Language Understanding by Generative Pre-Training": https://github.com/huggingface/pytorch-openai-transformer-lm
- IBM's PyTorch seq-to-seq model: https://github.com/IBM/pytorch-seq2seq
- Translation with Sequence to Sequence Network and Attention: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py
- "Convolutional Sequence to Sequence Learning": https://arxiv.org/abs/1705.03122
- Data processing scripts: https://www.dagshub.com/Guy/fairseq/src/67af40c9cca0241d797be13ae557d59c3732b409/data
- Beyond "How May I Help You?": https://medium.com/airbnb-engineering/beyond-how-may-i-help-you-fd6a0d385d02