Building & training GPT from scratch, based on Andrej Karpathy's tutorial "Let's build GPT: from scratch, in code, spelled out".
Dataset: tiny-shakespeare (the original with slight modifications).
- basic_bigramLM.py : built a basic bigram language model with a generate method to get things rolling.
- tutorial.ipynb : worked through the basic attention mechanism using tril, masked_fill, and softmax, plus notes on attention (a sketch of the pattern follows this section).
- LMwithAttention.py : extended the model with a single attention head, token embeddings, and positional embeddings.
- AttentionBlock.py : built a single attention head.
- LM_multihead_attention_ffwd.ipynb : extended the model to multiple attention heads concatenated together, with a separate feed-forward layer before the lm_head.
- tutorialGPT.ipynb : created the transformer block: layering, residual connections, better loss evaluation, dropout, and layernorm.
Used a character-level tokenizer. Trained two versions with different configurations to better understand the impact of hyperparameters such as n_embed and num_heads.
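For reference, a minimal sketch of the causal self-attention head (tril mask, masked_fill, softmax) and the transformer block with residual connections and pre-layernorm, as covered in the notebooks above. The hyperparameter values and the class names Head and Block are illustrative and may not match the files here exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# illustrative hyperparameters, not necessarily the values used in the notebooks
n_embed, head_size, block_size, dropout = 384, 64, 256, 0.1

class Head(nn.Module):
    """One head of causal self-attention: tril mask -> masked_fill -> softmax."""
    def __init__(self):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # lower-triangular mask so each position only attends to earlier positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * head_size ** -0.5        # (B, T, T) scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                             # attention weights
        wei = self.dropout(wei)
        return wei @ v                                           # (B, T, head_size)

class Block(nn.Module):
    """Transformer block: multi-head attention + feed-forward,
    each wrapped in a pre-layernorm and a residual connection."""
    def __init__(self, n_heads=6):
        super().__init__()
        self.heads = nn.ModuleList([Head() for _ in range(n_heads)])
        self.proj = nn.Linear(n_heads * head_size, n_embed)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),
        )
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        sa = torch.cat([h(self.ln1(x)) for h in self.heads], dim=-1)
        x = x + self.proj(sa)           # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x
```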
Used a byte-pair encoding (BPE) tokenizer for the full model below:
- gpt.py : the full GPT model.
- dataset.py : torch Dataset.
- build_tokenizer.py : BPE tokenizer built from scratch with huggingface tokenizers, similar to GPT-2, saved at tokenizer (a sketch follows this list).
- train.py : training script containing the optimizer, config, loss function, train loop, validation loop, and model saving.
- generate.py : generates text by loading the model on CPU (see the sketch after the results table below).
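A rough sketch of how a GPT-2-style byte-level BPE tokenizer can be trained and saved with the huggingface tokenizers library. The corpus file name, vocab size, and output directory are assumptions for illustration and may not match build_tokenizer.py.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# assumed paths and settings; build_tokenizer.py may differ
corpus_file = "input.txt"        # tiny-shakespeare text
out_dir = "tokenizer"            # directory the vocab/merges are saved to
os.makedirs(out_dir, exist_ok=True)

# byte-level BPE (as in GPT-2), so no character ever falls out of vocabulary
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[corpus_file],
    vocab_size=8000,             # assumed; GPT-2 itself uses ~50k
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model(out_dir)    # writes vocab.json and merges.txt

# quick round-trip check
ids = tokenizer.encode("To be, or not to be").ids
print(ids, tokenizer.decode(ids))
```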
Results from two training runs of the full model:

| Version | n_embed | n_heads | head_size | n_layers | lr   | attn_dropout | block_dropout | Train Loss | Valid Loss |
|---------|---------|---------|-----------|----------|------|--------------|---------------|------------|------------|
| V1      | 384     | 12      | 32        | 4        | 6e-4 | 0.1          | 0.1           | 4.0204     | 6.2131     |
| V2      | 384     | 6       | 64        | 3        | 5e-4 | 0.2          | 0.2           | 3.9331     | 5.9705     |
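For completeness, a hedged sketch of loading the trained model onto CPU and sampling from it. The checkpoint path, the GPT class and its constructor, the assumption that the model returns (logits, loss), and the sampling parameters are all illustrative and may differ from generate.py.

```python
import torch
from tokenizers import ByteLevelBPETokenizer
from gpt import GPT                      # assumed class name from gpt.py

device = torch.device("cpu")
model = GPT()                            # assumes config defaults are baked into the class
state = torch.load("model.pt", map_location=device)   # assumed checkpoint path
model.load_state_dict(state)
model.eval()

tok = ByteLevelBPETokenizer("tokenizer/vocab.json", "tokenizer/merges.txt")
idx = torch.tensor([tok.encode("ROMEO:").ids], dtype=torch.long, device=device)

block_size = 256                         # assumed context length
with torch.no_grad():
    for _ in range(200):                 # generate 200 new tokens
        logits, _ = model(idx[:, -block_size:])   # assumes model returns (logits, loss)
        probs = torch.softmax(logits[:, -1, :], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)

print(tok.decode(idx[0].tolist()))
```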
As always, an incredible tutorial by Andrej!