Code mainly taken from: nanoGPT
NanoGPT is a decoder-only, transformer-based language model. Following its architecture, each subsection below covers the essential concepts needed to understand how nanoGPT works.
- The encoder (tokenizer) takes a string and outputs a list of integers (token ids).
- This is done in the preparation phase (`prepare.py`), which produces `.bin` files for training and testing.
- In the code, OpenAI's `tiktoken` tokenizer is used (see the sketch below).
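As a rough sketch of what this preparation step does (file names and the exact flow are illustrative, not a copy of `prepare.py`):

```python
import numpy as np
import tiktoken

# Encode raw text with OpenAI's GPT-2 BPE tokenizer and dump the token ids
# to a .bin file that train.py can read later. Paths here are illustrative.
enc = tiktoken.get_encoding("gpt2")
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()
ids = enc.encode_ordinary(text)            # list of integers (token ids)
ids = np.array(ids, dtype=np.uint16)       # GPT-2 vocab (50257) fits in uint16
ids.tofile("train.bin")
print(f"wrote {len(ids)} tokens to train.bin")
```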
- Token embeddings: represent tokens as rows of a learned embedding matrix.
- Positional embeddings (PE)
  - Since the Transformer processes the entire input sequence in parallel, the information about each token's position needs to be encoded inside its embedding. That is why a positional embedding layer is used, which adds a position vector of the same dimension to the token embedding:
    `x = self.transformer.drop(tok_emb + pos_emb)`
  - Absolute Positional Embeddings
    - Assign a unique vector to each position in the input sequence to encode positional information into the model.
    - These are nanoGPT's default embeddings.
We can consider three kinds of positional embeddings, as mentioned in the 04.12 talk: the absolute embeddings above, plus the following two alternatives.
- RoPE (Rotary Positional Embeddings)
  - Encode positional information by applying rotational transformations to the query and key vectors in attention.
- Relative Positional Embeddings
  - Instead of encoding the absolute position, focus on the relative distances between tokens in a sequence.
| Feature | Absolute Positional Embeddings | Relative Positional Embeddings | RoPE |
|---|---|---|---|
| Position representation | Unique absolute position | Relative distances between positions | Relative information via rotation |
| Scalability | Limited in some fixed implementations | Generalizable to varying lengths | Well-suited for long sequences |
| Computational efficiency | Simple | Higher complexity | Highly efficient |
| Use cases | Standard Transformers | NLP tasks with context sensitivity | Large models like GPTs |
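As a rough, self-contained illustration of the RoPE idea (not the exact code behind the `--pos_embd=rope` option in this repo): pairs of channels in each query/key vector are rotated by an angle that grows with the token's position, so dot products between them depend only on relative offsets.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: query or key tensor of shape (batch, n_head, seq_len, head_dim).
    # Each pair of channels (x1, x2) is rotated by a position-dependent angle.
    B, H, T, D = x.shape
    half = D // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()                                       # broadcast over B, H
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage: rotate queries and keys (not values) before the attention dot product.
q = torch.randn(1, 4, 16, 64)
k = torch.randn(1, 4, 16, 64)
q, k = apply_rope(q), apply_rope(k)
```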
Each transformer block contains the following parts; the output of one transformer block is passed as the input to the next block.
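A sketch of how these parts are wired together, following nanoGPT's pre-LayerNorm residual layout (this is not the exact class signature in `model.py`; `attn` and `mlp` stand for its `CausalSelfAttention` and `MLP` modules, and `nn.LayerNorm` replaces the custom LayerNorm with optional bias):

```python
import torch.nn as nn

class Block(nn.Module):
    # Sketch of a transformer block: LayerNorm -> attention -> residual add,
    # then LayerNorm -> MLP -> residual add.
    def __init__(self, n_embd, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = attn
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = mlp

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # attention sub-layer + residual connection
        x = x + self.mlp(self.ln_2(x))    # MLP sub-layer + residual connection
        return x
```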
- Layer normalization (LayerNorm): an operation that normalizes each token's embedding vector separately (i.e., across the embedding dimension).
- It helps improve the stability of the model during training.
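A tiny illustration of what layer normalization does to an activation matrix (toy shapes, not the model's code):

```python
import torch

# Each token's C-dimensional vector is normalized to zero mean and unit
# variance, then scaled and shifted by learned parameters (initially 1 and 0).
T, C = 4, 8                       # sequence length, embedding size (toy values)
x = torch.randn(T, C)
ln = torch.nn.LayerNorm(C)
y = ln(x)
print(y.mean(dim=-1))                  # ~0 for every token
print(y.var(dim=-1, unbiased=False))   # ~1 for every token
```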
Self-attention is the phase where the tokens (the columns of our input embedding matrix) "talk" to each other:
- Q (query): what am I looking for
- K (key): what do I contain
- V (value): what I pass along if I am attended to
- The attention scores are the scaled dot products Q·Kᵀ, which are softmaxed and used to weight the values V.
In the code, a single linear layer produces Q, K and V in one shot:
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
In the code, Flash Attention (PyTorch's `scaled_dot_product_attention`) is used to accelerate the attention computation when it is available:
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
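When `scaled_dot_product_attention` is not available, the attention is computed manually with an explicit causal mask; a simplified sketch of that fallback (the real code also applies attention dropout):

```python
import math
import torch
import torch.nn.functional as F

def manual_causal_attention(q, k, v):
    # q, k, v: tensors of shape (B, n_head, T, head_dim).
    T = q.size(-2)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))   # (B, nh, T, T) scores
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~mask, float('-inf'))               # block attention to future tokens
    att = F.softmax(att, dim=-1)                              # attention weights
    return att @ v                                            # weighted sum of values
```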
After the self-attention and another layer normalization comes the MLP (multi-layer perceptron), a simple neural network with two linear layers. There are three steps inside the MLP (see the sketch below):
- A linear transformation with a bias, expanding to a vector of length `4 * config.n_embd`.
- A GELU activation function (element-wise), which introduces some non-linearity into the model.
- A linear transformation with a bias, projecting back to a vector of length `config.n_embd`.
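A sketch of the corresponding MLP module, close to the one in nanoGPT's `model.py` (bias and dropout settings come from the config):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)  # expand to 4 * n_embd
        self.gelu    = nn.GELU()                                                      # element-wise non-linearity
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)  # project back to n_embd
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        return self.dropout(self.c_proj(self.gelu(self.c_fc(x))))
```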
The training loss is the cross-entropy between the predicted logits and the targets, computed in `model.py`:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
The optimizer is AdamW:
optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
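A sketch of how `optim_groups` is typically built in nanoGPT's `configure_optimizers` (weight decay only on 2-D parameters such as matmul weights and embeddings; the exact code in this fork may differ):

```python
import torch

def build_optimizer(model, learning_rate, weight_decay, betas):
    # Split parameters: 2-D tensors (weights, embeddings) get weight decay,
    # 1-D tensors (biases, LayerNorm gains) do not.
    params = [p for p in model.parameters() if p.requires_grad]
    decay_params   = [p for p in params if p.dim() >= 2]
    nodecay_params = [p for p in params if p.dim() < 2]
    optim_groups = [
        {'params': decay_params,   'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0},
    ]
    return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas)
```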
Always activate the environment first with `source venv/bin/activate` and install the packages from `requirements.txt`.
If there are dependency import errors, try `pip install transformers datasets tiktoken tqdm wandb numpy`.
- download the PyTorch nightly build the first time:
pip install \
--pre torch torchvision torchaudio \
--extra-index-url https://download.pytorch.org/whl/nightly/cpu
- prepare the training data
cd data
python prepare.py
# need to input the txt path, e.g. `emotion/robot_human_tag/59k_eachconv_eot.txt`
- train the model
  - `--device` can be set to `cpu` or, specifically on MacBooks, `mps`.
  - `max_iters` in train.py is set to 30100, so it practically runs forever; use `^C` to stop the training at any time.
cd ./src/nanoGPT
# Training for the first time
time python train.py \
  --data_dir=data/emotion/with_gpt_data/ \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps
# Notes on other flags:
#   --block_size defaults to 64
#   --init_from=resume   -> continue training from a saved checkpoint
#   --pos_embd=rope      -> change the positional embedding
- When the training starts, hit `^A` so that later we can copy all the logs to this website and get the log graph (we will definitely improve the logging later).
- pos_embd options: default, rope, relative
Configure `init_from` based on where the trained model is saved.
cd ./src/models/nanoGPT
python chat.py block_size=64/withoutemotion/wholeConversation
python chat.py block_size=64/withoutemotion/singleConversation
python chat.py block_size=64/withemotion
python chat.py block_size=64/withoutemotion/singleConversation_withGPTdata
python chat.py block_size=64/withcontext
python chat.py block_size=256/singleConversation_withGPTdata
TODO: Update the evaluation here
cd ./src/models/nanoGPT
python evaluation.py
When introducing Relative Positional Embeddings to an existing model, the structure of the model changes, potentially causing compatibility issues when loading old checkpoint files. Specifically, older checkpoints will not contain the new parameter `transformer.relative.relative_embeddings.weight`, leading to errors during the loading process.
To address this issue, the following function was created in `ckpt_update.ipynb` to update checkpoints by adding the missing parameter (a sketch is shown after the list below).
- Purpose
  The function ensures that old checkpoint files can be loaded into the updated model without errors by checking for and adding the new parameter `transformer.relative.relative_embeddings.weight` if it is missing.
- How It Works
  - Checking the state dictionary: the function inspects the `state_dict` of the checkpoint for the key `transformer.relative.relative_embeddings.weight`.
  - Adding the missing parameter: if the key is not present, the function initializes the missing parameter as a zero tensor with the same shape as the corresponding weight in the model.
  - Updating the checkpoint: the modified `state_dict` is then reassigned to the checkpoint, ensuring compatibility with the updated model.
- Example Output
  When a missing parameter is detected and added, the function prints a message to indicate the addition: `Adding transformer.relative.relative_embeddings.weight to checkpoint.`
- File Naming Convention
  - Old checkpoints (`ckpt_original.pt`): checkpoint files saved before introducing relative positional embeddings; they do not contain the `transformer.relative.relative_embeddings.weight` parameter.
  - Updated checkpoints (`ckpt.pt`): after running the `update_checkpoint` function, the updated checkpoint files include the required parameter and can be loaded into the new model without errors.
- Integration
  This function should be applied to all old checkpoint files before loading them into the updated model. This guarantees that the model's structure matches the expected state and avoids runtime errors.
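A minimal sketch of what such an `update_checkpoint` function can look like (the exact code lives in `ckpt_update.ipynb` and may differ; the `'model'` key for the state dict follows nanoGPT's checkpoint convention):

```python
import torch

def update_checkpoint(ckpt_path, model, out_path=None):
    # Load an old checkpoint and add the missing relative-embedding parameter
    # so it can be loaded into the updated model without errors.
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    state_dict = checkpoint['model']                      # nanoGPT stores the model weights here
    key = 'transformer.relative.relative_embeddings.weight'
    if key not in state_dict:
        print(f"Adding {key} to checkpoint.")
        # Zero tensor with the same shape as the corresponding weight in the model.
        state_dict[key] = torch.zeros_like(model.state_dict()[key])
    checkpoint['model'] = state_dict
    torch.save(checkpoint, out_path or ckpt_path)
```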