- This repository is dedicated to cataloging affordable yet powerful large language models (LLMs).
- It provides insights into the latest models, including parameter counts, fine-tuning datasets and techniques, and hardware requirements/costs.
- With this repository, you can quickly find the essential information you need to choose and run an affordable LLM.
- EleutherAI: GPT-J, GPT-Neo, GPT-NeoX, Pythia / Dolly
- Hugging Face BigScience: BLOOM / BELLE, Phoenix
- Meta: OPT, Galactica, LLaMA / Phoenix, Alpaca, Vicuna
- LAION AI: Open-Assistant / HuggingChat
- Tsinghua: GLM / ChatGLM-6B
- Cerebras: Cerebras-GPT
- BlinkDL: RWKV
- Microsoft: DeepSpeedChat
- ColossalAI: ColossalChat
- Google: BERT, T5, Flan, Switch Transformers, LaMDA, FLAN-T5, PaLM, PaLM-E
- DeepMind: Chinchilla, Gopher, Sparrow
- Anthropic: Claude
- OpenAI: GPT-1, GPT-2, GPT-3, WebGPT, InstructGPT, ChatGPT, GPT-4
Project | Base Model | Data | Fine-tune | Hardware / Cost |
---|---|---|---|---|
Stanford/Alpaca | LLaMA-7B | 52K instruction-following examples, generated in self-instruct style using text-davinci-003 (see the data-generation sketch below the table) | SFT | 3 hours on 8 80GB A100s, ~$500 (data) + ~$100 (training) |
NLPCloud/instruct-gpt-j | GPT-J-6B | 52K Alpaca | SFT | fp16 model deploys well on a 16GB Tesla T4 |
LianjiaTech/BELLE | BLOOMZ-7B1-mt | 2M Chinese instructions generated Alpaca-style | SFT | 8-bit GPTQ quantization runs on a 12GB GPU |
LianjiaTech/BELLE | LLaMA-7B | same as above | SFT | 4-bit ggml quantization works well on an M1 Mac |
Alpaca-LoRA | LLaMA-7B | 52K Alpaca; updated to the MSFT LLaMA-GPT4 dataset | SFT with LoRA | a few hours on a single RTX 4090 (24GB) |
Databricks/Dolly-v1-6B | GPT-J-6B | 52K Alpaca | SFT | |
Databricks/Dolly-v2-12B | Pythia-12B | databricks-dolly-15k, generated by Databricks employees in the capability domains from the InstructGPT paper | SFT | about 3.5 hours on 8 V100s with fp16 for 1 epoch |
GPT4All | LLaMA-7B | ~800k GPT-3.5-Turbo Generations | SFT with LoRA | |
HIT&HFL/Chinese-LLaMA-Alpaca | LLaMA-7B/13B | about 2M Chinese and English examples | adds 20K Chinese SentencePiece tokens to the vocab to improve Chinese decoding efficiency; uses DeepSpeed ZeRO-2 | pre-training on a 20GB general Chinese corpus on 16 A100s; SFT with LoRA on 16 A100s |
HIT&HFL/Chinese-LLaMA-Plus-7B | LLaMA-7B | LLaMA re-pre-trained on a larger (120GB) general corpus, then fine-tuned with a 4M-instruction dataset | SFT with LoRA (larger rank) | |
THUDM/ChatGLM-6B | | | | |
LLaMA-Adapter | LLaMA-7B | 52K Alpaca | SFT with LLaMA-Adapter | reduces training from 3 hours to 1 hour; trains 1.2M parameters instead of 7B |
FastChat/Vicuna | LLaMA-7B/13B | 70K user-shared conversations gathered from ShareGPT.com | SFT, 40x larger dataset and 4x longer sequences than Alpaca | 4/8 A100s, $140/$300 for training the 7B/13B models; impresses GPT-4 with ~90% of ChatGPT quality |
BAIR/Koala | LLaMA-13B | around 60K dialogues shared by users on ShareGPT; Human ChatGPT Comparison Corpus (HC3); open-source data... | SFT with JAX/Flax | 2 epochs in 6 hours on 8 A100s; beats ChatGPT on 180 real user queries |
Baize | LLaMA-7B/13B/30B | 100K dialogues generated by letting ChatGPT chat with itself; QA and healthcare datasets | SFT with LoRA | runs on 80GB A100s |
Firefly | bloom-1b4/2b6-zh | 1.1M instructions built from 23 Chinese NLP tasks; BELLE-0.5M-cn | vocab reduced from 250K to 46K tokens; SFT | |
Arxiv Chat | built on ChatGPT (QA), LangChain (main logic), and h2oai (UI) | | | |
huggingface/StackLLaMA | LLaMA-7B | Stack Exchange dataset (10M < N < 100M) | SFT + RLHF | (2+8) bytes/param × 7B ≈ 70GB, so an 80GB A100 works fine; LoRA/PEFT makes a 50-60B model on a single A100 possible |
MSFT/LLaMA-GPT4 | LLaMA-7B | 52K Alpaca prompts with responses regenerated by GPT-4 | SFT, RM | |
MSFT/DeepSpeed Chat | | | supports SFT, RM, RLHF | efficiency and affordability |
ColossalAI/ColossalChat | | | supports SFT, RM, RLHF | quick preview |
Phoenix | LLaMA-7B/13B | a vast collection of popular multilingual open-source datasets | SFT | |
fudan/MOSS-003 | MOSS-16B | ~1.1M text-davinci-003-generated self-instruct examples, including ~300K plugin examples (text-to-image, equations, etc.) | SFT | fp16 fine-tuning on 2 A100s, or 4/8-bit fine-tuning on a single 3090 |
replit/replit-code-v1-3b | 2.7B params | entirely code, 525B tokens | | 10 days of training; benchmarks better than Codex |
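
Several rows above (Alpaca, BELLE, MOSS) build their training data by prompting an OpenAI model in self-instruct style. A minimal sketch of that idea, assuming the legacy `openai` Python SDK (pre-1.0), an API key in the environment, and a hypothetical list of seed instructions:

```python
# Self-instruct-style data generation sketch (assumptions: legacy openai<1.0
# SDK, OPENAI_API_KEY set in the environment; seed tasks are hypothetical).
import json
import openai

seed_tasks = [
    "Give three tips for staying healthy.",
    "Explain what a binary search tree is.",
]

PROMPT = (
    "You are generating instruction-following training data.\n"
    "Example instructions:\n{examples}\n"
    "Write one new, diverse instruction and a high-quality response.\n"
    'Return JSON with keys "instruction" and "output".'
)

def generate_example(seeds):
    prompt = PROMPT.format(examples="\n".join(f"- {s}" for s in seeds))
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # Alpaca itself used text-davinci-003
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return json.loads(resp["choices"][0]["message"]["content"])

dataset = [generate_example(seed_tasks) for _ in range(3)]
with open("self_instruct_sample.json", "w") as f:
    json.dump(dataset, f, indent=2, ensure_ascii=False)
```

The real pipelines add deduplication, quality filtering, and far larger seed pools; this only illustrates the prompt-and-collect loop.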
- SFT: raw full fine-tuning, LoRA, PEFT; Chinese vocab extension; instruction datasets generated with ChatGPT/GPT-4, or human-labeled datasets such as databricks-dolly-15k (a LoRA SFT sketch follows this list);
- RM: GPT-4 assigns scores using its ability to judge response quality; open-source preference datasets;
- RLHF: DeepSpeedChat / ColossalChat;
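
Most of the LoRA rows in the table reduce to the same recipe: load the base model in fp16, inject low-rank adapters, and run SFT on an instruction dataset. A minimal sketch using `transformers` + `peft`; the checkpoint path, target modules, and hyperparameters are illustrative assumptions, not taken from any specific project:

```python
# LoRA SFT sketch (assumptions: transformers + peft installed; the checkpoint
# path is a hypothetical placeholder; hyperparameters are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "path/to/llama-7b-hf"  # hypothetical: any LLaMA/GPT-J-style causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)

# Inject low-rank adapters into the attention projections; only these small
# matrices receive gradients while the 7B base weights stay frozen.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names as in LLaMA/GPT-J attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

From here, a standard `transformers` Trainer loop over any of the instruction datasets above completes the SFT stage; only the adapter weights (a few MB) need to be saved.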
Data & Model Parallel
- Data Parallel
- Tensor Parallel
- Pipeline Parallel
- Zero Redundancy Optimizer (ZeRO) (DeepSpeed, often combined with CPU offloading; see the config sketch after this list)
- Sharded DDP (FSDP)
- Mixture-of-Experts (MoE)
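
ZeRO is usually switched on through a DeepSpeed config rather than model code changes. A minimal sketch of a ZeRO stage-2 setup with CPU optimizer offloading; the toy model, batch sizes, and learning rate are illustrative stand-ins:

```python
# ZeRO-2 + CPU optimizer offload sketch (assumptions: deepspeed installed and
# the script launched with the deepspeed launcher; the toy model, batch sizes,
# and learning rate are illustrative stand-ins for a real LLM setup).
import torch
import deepspeed

model = torch.nn.Sequential(  # stand-in for a real transformer
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},  # 16-bit mixed precision
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "zero_optimization": {
        "stage": 2,                              # shard optimizer states + gradients
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in CPU RAM
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Launched with the `deepspeed` launcher across several GPUs, stage 2 shards optimizer states and gradients over ranks, and the CPU offload trades GPU memory for host RAM and PCIe bandwidth.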
Param Efficient
- LoRA
- PEFT
- Gradient Checkpointing
- Offloading (ZeRO)
- Memory-Efficient Optimizers
- 16-bit mixed precision
- 8-bit: bitsandbytes / triton (see the quantized-loading sketch after this list)
- 4-bit: gptq / ggml
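
8-bit and 4-bit loading are what let the 7B-13B entries above run on consumer GPUs. A minimal inference sketch using `transformers` with `bitsandbytes`; the model name is just an example from the table, and 4-bit loading assumes a recent enough transformers/bitsandbytes:

```python
# Quantized inference sketch (assumptions: bitsandbytes installed and a
# recent transformers; the model name is just an example from the table).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(name)

# 8-bit weights roughly halve memory vs fp16 (~7-8GB for a 7B model).
model = AutoModelForCausalLM.from_pretrained(name, load_in_8bit=True, device_map="auto")

# 4-bit alternative on newer transformers/bitsandbytes:
# model = AutoModelForCausalLM.from_pretrained(name, load_in_4bit=True, device_map="auto")

inputs = tokenizer("What is ZeRO offloading?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```

For GPTQ or ggml formats (as in the BELLE rows), the dedicated gptq and llama.cpp toolchains are used instead of `load_in_*`.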
- TruthfulQA: the benchmark comprises 817 questions spanning 38 categories, including health, law, finance, and politics.
- Chinese-LLaMA-Alpaca: the Chinese benchmark contains 10 tasks with 20 examples each.
- https://github.com/EleutherAI/lm-evaluation-harness: EleutherAI's few-shot evaluation harness for language models (see the sketch after this list).
- MMLU: English LLM evaluation benchmark covering 57 subjects.
- https://github.com/Felixgithub2017/MMCU: zero/few-shot evaluation on 15 Chinese tasks, covering medicine, law, psychology, and education.
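
A minimal sketch of running one of these benchmarks through the lm-evaluation-harness Python API; the model type string, task name, and result keys are assumptions that vary between harness versions:

```python
# Evaluation sketch (assumptions: lm_eval installed from the
# EleutherAI/lm-evaluation-harness repo; the model type "hf-causal", the task
# name "truthfulqa_mc", and the result keys may differ between versions).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=EleutherAI/pythia-1.4b",
    tasks=["truthfulqa_mc"],  # TruthfulQA multiple-choice task
    num_fewshot=0,
)
print(results["results"])
```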