ShakespeareGPT

understanding language modeling by training a small GPT on Shakespeare plays.

building & training GPT from scratch, based on Andrej Karpathy's tutorial: Let's build GPT: from scratch, in code, spelled out.

dataset: tiny-shakespeare (the original with slight modifications).

tutorialGPT (following the video)

  • basic_bigramLM.py : built a basic bigram model with generate to get things rolling.
  • tutorial.ipynb : understood the basic attention mechanism using tril, masked_fill, softmax + notes on attention (see the sketch after this list).
  • LMwithAttention.py : continued the model, now with a single attention head, token embeddings, and positional embeddings.
  • AttentionBlock.py : built a single attention head.
  • LM_multihead_attention_ffwd.ipynb : continued the model to now have multiple attention heads concatenated, and a separate feed-forward layer before lm_head.
  • tutorialGPT.ipynb : created the transformer block, layering, residual connections, better loss evaluation, dropout, layernorm.
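
for reference, here's a minimal sketch of the tril / masked_fill / softmax pattern these files build up to: one causal self-attention head in PyTorch. names and defaults are illustrative, not the repo's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """one causal self-attention head (illustrative sketch)."""
    def __init__(self, n_embed, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # scaled dot-product scores: (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        # mask out the future, then normalize each row with softmax
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        return wei @ v  # (B, T, head_size)
```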

Character Level GPT

used a character-level tokenizer. Trained two versions with different configurations to better understand the impact of hyperparameters such as n_embed and num_heads.
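
a character-level tokenizer amounts to a bijection between characters and integer ids; a minimal sketch (the file path and variable names are assumptions, not necessarily the repo's):

```python
# the vocabulary is simply every unique character in the corpus
with open("input.txt") as f:  # assumed path to the tiny-shakespeare text
    text = f.read()

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]             # string -> token ids
decode = lambda ids: "".join(itos[i] for i in ids)  # token ids -> string

assert decode(encode("ROMEO:")) == "ROMEO:"
```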

ShakespeareGPT

used a byte-pair encoding (BPE) tokenizer.

  • gpt.py : the full GPT model.
  • dataset.py : torch dataset.
  • build_tokenizer.py : BPE tokenizer built from scratch using huggingface tokenizers, similar to GPT-2; saved at tokenizer (see the sketch after this list).
  • train.py : training script containing the optimizer, config, loss function, train loop, validation loop, and model saving.
  • generate.py : generates text by loading the model on CPU (see the generation sketch after this list).
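
a GPT-2-style byte-level BPE can be trained with huggingface tokenizers along these lines; the vocab size, input path, and special token below are assumptions, not necessarily what build_tokenizer.py uses:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8192,                   # assumed size, not the repo's value
    special_tokens=["<|endoftext|>"],  # GPT-2-style end-of-text token
)
tokenizer.train(files=["input.txt"], trainer=trainer)  # tiny-shakespeare text
tokenizer.save("tokenizer/tokenizer.json")  # directory must already exist
```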
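and a sketch of the CPU generation loop: sample one token at a time, always cropping the context to block_size. the model-loading lines are commented out since GPT, config, and the checkpoint path are assumed names, not confirmed from the repo:

```python
import torch
import torch.nn.functional as F

# model = GPT(config)  # assumed names from gpt.py / train.py
# model.load_state_dict(torch.load("model.pt", map_location="cpu"))
# model.eval()

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """append max_new_tokens sampled tokens to idx, shape (B, T)."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # crop to the context window
        logits = model(idx_cond)                     # assumed to return (B, T, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)  # next-token distribution
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, idx_next], dim=1)
    return idx
```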

Versions

  •   V1
      n_embed = 384
      n_heads = 12
      head_size = 32
      n_layers = 4
      lr = 6e-4
      attn_dropout = 0.1
      block_dropout = 0.1
    
      Train Loss: 4.020419597625732
      Valid Loss: 6.213085174560547
    
  •   V2
      n_embed = 384
      n_heads = 6
      head_size = 64
      n_layers = 3
      lr = 5e-4
      attn_dropout = 0.2
      block_dropout = 0.2
    
      Train Loss: 3.933095216751099
      Valid Loss: 5.970513820648193

note: both versions keep n_heads × head_size = n_embed (12 × 32 = 6 × 64 = 384); V2 trades fewer, wider heads, one less layer, and stronger dropout for slightly lower train and valid losses.

as always, an incredible tutorial by Andrej!