This repository contains materials for the Natural Language Processing course.
Tip #1:
Downloading the entire repository can take considerable time. A single folder can be downloaded via DownGit.
Tip #2:
Sometimes GitHub fails to render a notebook. In that case use nbviewer — it works like a charm!
Tip #3:
If nbviewer fails to find a notebook that GitHub renders just fine, try appending ?flush_cache=true
to the end of the nbviewer link to force it to re-fetch the file.
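For example (the user, repository, and notebook path here are hypothetical placeholders):

```
https://nbviewer.org/github/<user>/<repo>/blob/master/week01/notebook.ipynb?flush_cache=true
```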
Week | What | When |
---|---|---|
1 | Tasks in NLP, text preprocessing (tokenization, normalization: stemming, lemmatization), feature extraction (Bag-of-Words, Bag-of-N-grams, TF-IDF; see the sketch after the table), word embeddings (one-hot, matrix factorization, word2vec, CBOW, Skip-gram, GloVe). | 10.03.2021 |
2 | Embeddings: recap (word2vec), usage in unsupervised translation; cosine distance; RNNs, CNNs, n-grams, and their usage examples. | 17.03.2021 |
3 | Recap: RNN; LSTM, gates in LSTM; RNNs as encoders for sequential data; vanishing and exploding gradient problems. | 24.03.2021 |
4 | Neural Machine Translation (NMT): problem statement, historical overview, statistical MT, beam search, BLEU/perplexity scores; encoder-decoder architecture, attention. | 31.03.2021 |
5 | Recap: attention in seq2seq; Transformer architecture, self-attention. | 07.04.2021 |
6 | Recap: self-attention; positional encoding, layer normalization, the decoder in the Transformer. | 14.04.2021 |
7 | OpenAI Transformer (pre-training a decoder for language modeling), ELMo (deep contextualized word representations), BERT. | 21.04.2021 |
8 | ULMFiT, Transformer-XL, Question Answering (SQuAD, SberQuAD, ODQA), GPT. | 28.04.2021 |
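A minimal sketch of the week 1 feature-extraction techniques (Bag-of-Words, Bag-of-N-grams, TF-IDF), assuming scikit-learn is installed; the toy corpus and parameter choices are illustrative and not part of the course materials:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag-of-Words: raw token counts per document.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# Bag-of-N-grams: count unigrams and bigrams together.
ngrams = CountVectorizer(ngram_range=(1, 2))
print(ngrams.fit_transform(corpus).toarray())

# TF-IDF: reweight counts so terms shared by every document matter less.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```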
Additional materials:
- word embeddings:
  - Word Embeddings (by Lena Voita)
  - Word2vec tutorial
  - Illustrated word2vec (by Jay Alammar)
- CNNs:
- LSTM and PoS tagging:
- Transformers:
  - Illustrated Transformer (by Jay Alammar)
  - The Annotated Transformer (by Harvard NLP group)
- BERT:
  - The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) (by Jay Alammar)
  - Simple tutorial for distilling BERT (by Paul Gladkov)
  - Huggingface Transformers (see the loading sketch after this list)
- Question Answering and TTS:
  - SberQuAD — Russian Reading Comprehension Dataset: Description and Analysis
  - GPT-3 for Russian language
  - Voice cloning
  - Tacotron 2 Demo (by NVIDIA)
  - Voice datasets (by Mozilla)
  - Speech recognition and synthesis (ASR and TTS) (by DeepPavlov)
  - Russian Open Speech To Text (STT/ASR) Dataset
  - DeepSpeech 0.6: Mozilla’s Speech-to-Text Engine Gets Fast, Lean, and Ubiquitous (by Reuben Morais)
  - Open Domain Question Answering Skill on Wikipedia (by DeepPavlov)
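A minimal sketch of loading a pre-trained BERT with the Huggingface Transformers library linked above; the checkpoint name is the standard public one, and the example sentence is illustrative:

```python
from transformers import AutoModel, AutoTokenizer

# Load the standard public English BERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize one sentence and extract contextual token embeddings.
inputs = tokenizer("NLP course notebooks render nicely in nbviewer.", return_tensors="pt")
outputs = model(**inputs)

# Shape is (batch, tokens, hidden); hidden size is 768 for bert-base.
print(outputs.last_hidden_state.shape)
```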