Keeping Track of Affordable LLMs

  • This repository is dedicated to organizing affordable but powerful large language models (LLMs).
  • The repository provides key details on the latest models, including parameter counts, fine-tuning datasets and techniques, and hardware requirements.
  • With this repository, you can quickly and easily find the essential information for running and fine-tuning LLMs on a budget.

Base Models

  • EleutherAI: GPT-J, GPT-Neo, GPT-NeoX, Pythia / Dolly
  • Hugging Face BigScience: BLOOM / BELLE, Phoenix
  • Meta: OPT, Galactica, LLaMA / Phoenix, Alpaca, Vicuna
  • LAION AI: Open-Assistant / HuggingChat
  • Tsinghua: GLM / ChatGLM-6B
  • Cerebras: Cerebras-GPT
  • BlinkDL: RWKV
  • Microsoft: DeepSpeedChat
  • ColossalAI: ColossalChat
  • Google: BERT, T5, Flan, Switch Transformers, LaMDA, FLAN-T5, PaLM, PaLM-E
  • DeepMind: Chinchilla, Gopher, Sparrow
  • Anthropic: Claude
  • OpenAI: GPT-1, GPT-2, GPT-3, WebGPT, InstructGPT, ChatGPT, GPT-4

Model Spec

| Project | Base model | Data | Fine-tune | Hardware / Cost |
| --- | --- | --- | --- | --- |
| Stanford/Alpaca | LLaMA-7B | 52K instruction-following dataset, generated in self-instruct style using text-davinci-003 | SFT | 3 hours on 8× 80GB A100s; $500 (data) + $100 (training) |
| NLPCloud/instruct-gpt-j | GPT-J-6B | 52K Alpaca | SFT | fp16 model deploys well on a 16GB Tesla T4 |
| LianjiaTech/BELLE | BLOOMZ-7B1-mt | 2M Chinese examples generated in the Alpaca style | SFT | 8-bit GPTQ quantization runs on a 12GB GPU |
| LianjiaTech/BELLE | LLaMA-7B | same | SFT | 4-bit ggml quantization works well on M1 Macs |
| Alpaca-LoRA | LLaMA-7B | 52K Alpaca; updated to the MSFT LLaMA-GPT4 dataset | SFT with LoRA | hours on a single RTX 4090 (24GB) |
| Databricks/Dolly-v1-6B | GPT-J-6B | 52K Alpaca | SFT | |
| Databricks/Dolly-v2-12B | Pythia-12B | databricks-dolly-15k, written by Databricks employees across the capability domains from the InstructGPT paper | SFT | about 3.5 hours on 8 V100s with fp16 for 1 epoch |
| GPT4All | LLaMA-7B | ~800K GPT-3.5-Turbo generations | SFT with LoRA | |
| HIT&HFL/Chinese-LLaMA-Alpaca | LLaMA-7B/13B | about 2M Chinese and English examples | adds 20K Chinese SentencePiece tokens to the vocab to improve Chinese decoding efficiency; uses DeepSpeed ZeRO-2 | pretraining on a 20GB general Chinese corpus on 16 A100s; SFT with LoRA on 16 A100s |
| HIT&HFL/Chinese-LLaMA-Plus-7B | LLaMA-7B | re-pretrains LLaMA on a larger (120GB) general corpus; fine-tunes on a 4M instruction dataset | SFT with LoRA (larger rank) | |
| THUDM/ChatGLM-6B | | | | |
| LLaMA-Adapter | LLaMA-7B | 52K Alpaca | SFT with LLaMA-Adapter | cuts training from 3 hours to 1 hour; trains 1.2M parameters instead of 7B |
| FastChat/Vicuna | LLaMA-7B/13B | 70K user-shared conversations gathered from ShareGPT.com | SFT; 40x larger dataset and 4x sequence length | 4/8 A100s, $140/$300 for training; GPT-4 judges it at ~90% of ChatGPT quality |
| BAIR/Koala | LLaMA-13B | around 60K dialogues shared by users on ShareGPT; Human ChatGPT Comparison Corpus (HC3); open-source data... | SFT with JAX/Flax | 2 epochs in 6 hours on 8 A100s; beats ChatGPT on 180 real user queries |
| Baize | LLaMA-7B/13B/30B | 100K dialogues generated by letting ChatGPT chat with itself; QA and healthcare datasets | SFT with LoRA | runs on A100 (80GB) GPUs |
| Firefly | bloom-1b4/2b6-zh | 1.1M instruction examples built from 23 Chinese NLP tasks; BELLE-0.5M-cn | reduces vocab from 250K to 46K tokens; SFT | |
| Arxiv Chat | built on ChatGPT (QA), LangChain (main logic), and h2oai (UI) | | | |
| huggingface/StackLLaMA | LLaMA-7B | Stack Exchange dataset (10M < N < 100M) | SFT + RLHF | (2+8) × 7B = 70GB, so an 80GB A100 works fine; LoRA/PEFT makes a 50-60B model possible on a single A100 |
| MSFT/LLaMA-GPT4 | LLaMA-7B | 52K Alpaca prompts with responses generated by GPT-4 | SFT, RM | |
| MSFT/DeepSpeed Chat | | | supports SFT, RM, RLHF | efficiency and affordability |
| ColossalAI/ColossalChat | | | supports SFT, RM, RLHF | quick preview |
| Phoenix | LLaMA-7B/13B | a vast collection of popular multilingual open-source datasets | SFT | |
| fudan/MOSS-003 | MOSS-16B | ~1.1M text-davinci-003-generated self-instruct examples, including ~300K plugin examples (text-to-image, equations, etc.) | SFT | fp16 fine-tuning on 2 A100s, or 4/8-bit fine-tuning on a single 3090 |
| replit/replit-code-v1-3b | 2.7B, trained from scratch | entirely code, 525B tokens | | 10 days; benchmarks better than Codex |
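
Many of the rows above fine-tune with LoRA on an Alpaca-style instruction dataset. The sketch below shows what that recipe looks like in code; it is a minimal illustration, not any project's actual training script. The checkpoint path, dataset file, and hyperparameters are placeholders, and it assumes the Hugging Face transformers, datasets, and peft libraries.

```python
# Minimal LoRA SFT sketch (Alpaca-style). Paths and hyperparameters are
# placeholders, not the settings used by any project in the table above.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base_model = "path/to/llama-7b"   # placeholder checkpoint path
data_path = "alpaca_data.json"    # placeholder 52K-style instruction file

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)

# Freeze the base weights and train only small low-rank adapter matrices.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

def format_example(ex):
    # Fold instruction / input / output into a single prompt string.
    prompt = (f"### Instruction:\n{ex['instruction']}\n\n"
              f"### Input:\n{ex['input']}\n\n### Response:\n{ex['output']}")
    return tokenizer(prompt, truncation=True, max_length=512)

dataset = load_dataset("json", data_files=data_path)["train"].map(
    format_example, remove_columns=["instruction", "input", "output"])

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(output_dir="alpaca-lora-out", num_train_epochs=3,
                           per_device_train_batch_size=4, fp16=True,
                           learning_rate=2e-4, logging_steps=50),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("alpaca-lora-out")  # saves only the adapter weights
```

Because only the low-rank adapter weights are trained and saved, the trainable parameter count and the final artifact stay small, which is what makes the single-GPU setups quoted in the table feasible.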

Fine-tune Stages

  • SFT: raw full fine-tuning, LoRA, PEFT; Chinese vocabulary extension; instruction datasets generated with ChatGPT/GPT-4, or human-labeled datasets like databricks-dolly-15k;
  • RM: GPT-4 assigns scores using its judging ability; open-source datasets (see the scoring sketch after this list);
  • RLHF: DeepSpeedChat / ColossalChat;
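
A minimal sketch of the RM idea above: a reward model is a language model with a scalar scoring head, used to rank candidate responses before RLHF. The checkpoint path below is a placeholder for any reward model exposed through a sequence-classification head; the prompt and candidates are illustrative only.

```python
# Sketch of the RM stage: score candidate responses with a scalar reward head.
# "path/to/reward-model" is a placeholder for a fine-tuned reward checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rm_name = "path/to/reward-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)
reward_model.eval()

prompt = "Explain LoRA in one sentence."
candidates = [
    "LoRA freezes the base model and trains small low-rank adapter matrices.",
    "LoRA is a kind of fruit.",
]

with torch.no_grad():
    for response in candidates:
        # Encode prompt and response as a pair; the single logit is the reward.
        inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
        score = reward_model(**inputs).logits[0, 0].item()
        print(f"{score:+.3f}  {response}")
```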

Typology of efficient LLM Training

  • Data & Model Parallel

    • Data Parallel
    • Tensor Parallel
    • Pipeline Parallel
    • Zero Redundancy Optimizer (ZeRO) (DeepSpeed; often combined with CPU offloading)
    • Sharded DDP(FSDP)
    • Mixture-of-Experts (MoE)
  • Param Efficient

    • LoRA
    • PEFT
    • Checkpointing
    • Offloading(ZeRO)
    • Memory Efficient Optimizers
    • 16-bit mixed precision
    • 8-bit: bitsandbytes / triton (see the loading sketch after this list)
    • 4-bit: GPTQ / ggml
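
As a concrete example of the 8-bit and 4-bit items above, the sketch below loads a quantized model with bitsandbytes through transformers' BitsAndBytesConfig. The checkpoint path is a placeholder, a CUDA GPU plus the bitsandbytes and accelerate packages are assumed, and the 4-bit variant shown is bitsandbytes NF4 rather than GPTQ or ggml; in practice you would load only one of the two variants.

```python
# Sketch of 8-bit / 4-bit loading via bitsandbytes and transformers.
# "path/to/llama-7b" is a placeholder; requires bitsandbytes, accelerate, and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "path/to/llama-7b"  # placeholder checkpoint

# 8-bit: weights stored in int8, roughly halving memory vs. fp16.
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# 4-bit (NF4): roughly another 2x reduction; compute still runs in fp16.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Affordable LLMs are", return_tensors="pt").to(model_4bit.device)
print(tokenizer.decode(model_4bit.generate(**inputs, max_new_tokens=32)[0]))
```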

Instruction Dataset

LLM evaluation