Implementation of an autoregressive language model (GPT-like) using an improved Transformer and DeepSpeed pipeline parallelism.
The Transformer used in this repository attempts to improve on the standard architecture with the additional modules below.
Name | Description | Link |
---|---|---|
ReZero | ReZero Is All You Need | link |
Explicit Sparse Transformer | Concentrated Attention Through Explicit Selection | link |
Macaron Architecture | Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View | link |
RealFormer | Residual Attention | link |
ALiBi Position Embedding | Effective relative positional encoding (Attention with Linear Biases) | |
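For illustration, here is a minimal PyTorch sketch (not the repository's actual modules; the class names below are hypothetical) of two of these ideas: explicit sparse attention, which keeps only the top-k attention scores per query before the softmax, and a ReZero residual connection, which scales the sublayer output by a learnable scalar initialized to zero.

```python
import torch
import torch.nn as nn

class SparseTopkSelfAttention(nn.Module):
    """Self-attention that keeps only the top-k scores per query (Explicit Sparse Transformer idea)."""
    def __init__(self, d_model, n_heads, topk=8):
        super().__init__()
        self.n_heads, self.d_head, self.topk = n_heads, d_model // n_heads, topk
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, attn_mask=None):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if attn_mask is not None:          # e.g. a causal mask and/or an ALiBi bias added to the scores
            scores = scores + attn_mask
        # Explicit sparse selection: drop everything below the k-th largest score per query.
        kth = scores.topk(min(self.topk, t), dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < kth, float("-inf"))
        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)

class ReZeroBlock(nn.Module):
    """Residual wrapper: output = x + alpha * sublayer(x), with alpha learned and initialized to 0."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))   # ReZero: each block starts as the identity

    def forward(self, x, **kwargs):
        return x + self.alpha * self.sublayer(x, **kwargs)

# e.g. one decoder sublayer: block = ReZeroBlock(SparseTopkSelfAttention(d_model=2048, n_heads=16))
```

In the repository these ideas are combined into the `ReZeroSparseTopkDecoder` layers that appear in the pipeline partitioning log further down.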
model_name | n_params | n_layer | d_model | n_heads | vocab_size | max_seq_len | learning_rate |
---|---|---|---|---|---|---|---|
GPT-X 1B | 1B | 20 | 2048 | 16 | 22000 | 1024 | 2.0 x 10^-4 |
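For reference, the GPT-X 1B row maps onto a configuration roughly like the following (field names are illustrative, not the repository's actual argument names):

```python
from dataclasses import dataclass

@dataclass
class GPTX1BConfig:
    # Values from the GPT-X 1B row above; field names are illustrative.
    n_layer: int = 20
    d_model: int = 2048
    n_heads: int = 16
    vocab_size: int = 22000
    max_seq_len: int = 1024
    learning_rate: float = 2.0e-4
```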
DeepSpeed is a deep learning training optimization library, providing the means to train massive billion-parameter models at scale.
You can train the 1B GPT-X model using DeepSpeed pipeline parallelism on two V100 GPUs (16GB each).
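A rough sketch of how such a two-stage pipeline run can be wired up with DeepSpeed (this is not the repository's training script; the toy decoder layer and the config file name are placeholders):

```python
import deepspeed
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class ToyDecoderLayer(nn.Module):
    """Stand-in for the repository's ReZeroSparseTopkDecoder, just to keep the sketch self-contained."""
    def __init__(self, d_model):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.alpha = nn.Parameter(torch.zeros(1))   # ReZero-style residual scale

    def forward(self, x):
        return x + self.alpha * self.ff(x)

deepspeed.init_distributed()                        # normally run under the deepspeed launcher

d_model, vocab_size, n_layer = 2048, 22000, 20      # from the GPT-X 1B table above

# A pipeline model is a flat list of layers; DeepSpeed splits the list across stages.
layers = (
    [LayerSpec(nn.Embedding, vocab_size, d_model)]
    + [LayerSpec(ToyDecoderLayer, d_model) for _ in range(n_layer)]
    + [LayerSpec(nn.LayerNorm, d_model), LayerSpec(nn.Linear, d_model, vocab_size)]
)

model = PipelineModule(
    layers=layers,
    num_stages=2,                    # one stage per V100
    partition_method="parameters",   # balance stages by parameter count, as in the log below
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",         # placeholder; see the config sketch after the partitioning log
)

# Launched with the DeepSpeed launcher, e.g. `deepspeed --num_gpus=2 train.py`.
# engine.train_batch(data_iter=...) then runs one full pipeline schedule
# (all micro-batches forward/backward plus an optimizer step).
```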
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:00:06.0 Off | 0 |
| N/A 42C P0 44W / 250W | 16076MiB / 16130MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... On | 00000000:00:07.0 Off | 0 |
| N/A 45C P0 168W / 250W | 16060MiB / 16130MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 29525 C /home/ubuntu/anaconda3/bin/python 16065MiB |
| 1 29528 C /home/ubuntu/anaconda3/bin/python 16049MiB |
+-----------------------------------------------------------------------------+
```
```
[2021-12-31 12:24:20,042] [INFO] [engine.py:93:__init__] CONFIG: micro_batches=4 micro_batch_size=1
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=1 STAGE=1 LAYERS=12 [11, 23) STAGE_PARAMS=548560916 (548.561M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:20,094] [INFO] [engine.py:151:__init__] RANK=0 STAGE=0 LAYERS=11 [0, 11) STAGE_PARAMS=550653972 (550.654M) TOTAL_PARAMS=1099214888 (1099.215M) UNIQUE_PARAMS=1099214888 (1099.215M)
[2021-12-31 12:24:08,793] [INFO] [module.py:365:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=11
0: Embedding
1: ReZeroSparseTopkDecoder
2: ReZeroSparseTopkDecoder
3: ReZeroSparseTopkDecoder
4: ReZeroSparseTopkDecoder
5: ReZeroSparseTopkDecoder
6: ReZeroSparseTopkDecoder
7: ReZeroSparseTopkDecoder
8: ReZeroSparseTopkDecoder
9: ReZeroSparseTopkDecoder
10: ReZeroSparseTopkDecoder
stage=1 layers=12
11: ReZeroSparseTopkDecoder
12: ReZeroSparseTopkDecoder
13: ReZeroSparseTopkDecoder
14: ReZeroSparseTopkDecoder
15: ReZeroSparseTopkDecoder
16: ReZeroSparseTopkDecoder
17: ReZeroSparseTopkDecoder
18: ReZeroSparseTopkDecoder
19: ReZeroSparseTopkDecoder
20: ReZeroSparseTopkDecoder
21: LayerNorm
22: Linear
loss: cross_entropy
```
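The `micro_batches=4 micro_batch_size=1` line in the log corresponds to DeepSpeed's batch settings. A minimal config along these lines might look as follows (the optimizer and fp16 entries are illustrative, not taken from the log):

```python
# With 2 GPUs forming 2 pipeline stages, the data-parallel degree is 1, so
# train_batch_size = micro_batch_size (1) * gradient_accumulation_steps (4) * 1 = 4.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,                        # micro_batch_size=1 in the log
    "gradient_accumulation_steps": 4,                           # micro_batches=4 in the log
    "optimizer": {"type": "Adam", "params": {"lr": 2.0e-4}},    # lr from the GPT-X 1B table; illustrative
    "fp16": {"enabled": True},                                  # illustrative
}
```

Such a dict can be saved as `ds_config.json` or passed directly via `deepspeed.initialize(config=ds_config, ...)`.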
- ReZero
- RealFormer (Residual Attention)
- Macaron architecture
- Macaron architecture with layer scale 0.5 (see the sketch after this list)
- Explicit Sparse Transformer
- PyTorch Lightning
- DeepSpeed training on a single GPU, with wandb logging applied
- DeepSpeed pipeline parallel training on 2 V100 GPUs with 16GB memory
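A minimal sketch of the Macaron layer mentioned above, where the attention sublayer is sandwiched between two half-step feed-forward sublayers, each scaled by 0.5 (module and argument names are illustrative):

```python
import torch.nn as nn

class MacaronBlock(nn.Module):
    """Macaron layer: half-step FFN -> self-attention -> half-step FFN, each FFN residual scaled by 0.5."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                  nn.GELU(), nn.Linear(d_ff, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_ff),
                                  nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, attn_mask=None):
        x = x + 0.5 * self.ffn1(x)                       # first half-step FFN
        h = self.attn_norm(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)   # self-attention in the middle
        x = x + a
        x = x + 0.5 * self.ffn2(x)                       # second half-step FFN
        return x
```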
GPT-3 has 175B parameters, and model size is important for few-shot learning. In this repository, I try to pretrain a language model as large as possible on 2 V100 GPUs.
model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size (tokens) | learning_rate |
---|---|---|---|---|---|---|---|
GPT-3 175B | 175B | 96 | 12288 | 96 | 128 | 3.2M | 0.6 x 10^-4 |
GPT-3 13B | 13B | 40 | 5140 | 40 | 128 | 2M | 1.0 x 10^-4 |
GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2 x 10^-4 |
GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6 x 10^-4 |
GPT-3 1.3B | 1.3B | 24 | 2048 | 24 | 128 | 1M | 2.0 x 10^-4 |
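The n_params column can be sanity-checked with the usual rough estimate of about 12 · n_layer · d_model² parameters in the Transformer blocks plus vocab_size · d_model for the token embedding (the GPT-3 vocabulary size of roughly 50,257 is an assumption here):

```python
def approx_gpt_params(n_layer, d_model, vocab_size=50257):
    """Rough decoder-only parameter count: ~12 * n_layer * d_model^2 plus the token embedding."""
    return 12 * n_layer * d_model ** 2 + vocab_size * d_model

print(f"{approx_gpt_params(24, 2048) / 1e9:.2f}B")    # ~1.31B, close to the GPT-3 1.3B row
print(f"{approx_gpt_params(96, 12288) / 1e9:.1f}B")   # ~174.6B, close to the GPT-3 175B row
```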
- `AttributeError: module 'deepspeed' has no attribute 'zero'`: reinstall DeepSpeed.
- `UserWarning: CUDA initialization: The NVIDIA driver on your system is too old`: reinstall PyTorch to match your CUDA version. My solution (V100 GPU, CUDA 10.1): `pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html`
- Can't find the `CUDA_HOME` path: reinstall CUDA.
- Transformer
- DeepSpeed
- ReZero
- Explicit Sparse Transformer
- Macaron Architecture
- RealFormer: Residual Attention
- DeepSpeed
- Pipeline Parallelism