From 0fe5a1063df85b6d70d9b7a7e29ce950c573ce42 Mon Sep 17 00:00:00 2001
From: Zhenyu REN <2391928709@qq.com>
Date: Mon, 25 Mar 2024 15:05:48 +0800
Subject: [PATCH] Create llm.md

---
 content/post/llm.md | 151 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 content/post/llm.md

diff --git a/content/post/llm.md b/content/post/llm.md
new file mode 100644
index 0000000..f64cd38
--- /dev/null
+++ b/content/post/llm.md
@@ -0,0 +1,151 @@
---
title: "Large Language Model (LLM) Notes"
date: 2024-03-20T15:08:24+08:00
lastmod: 2024-03-20T15:08:24+08:00
draft: false
keywords: []
description: ""
tags: ["notes","math","llm"]
categories: ["notes"]
author: "Ren Zhenyu"

# You can also close(false) or open(true) something for this content.
# P.S. comment can only be closed
comment: true
toc: true
autoCollapseToc: true
postMetaInFooter: true
hiddenFromHomePage: false
# You can also define another contentCopyright. e.g. contentCopyright: "This is another copyright."
contentCopyright: MIT
reward: false
mathjax: true
mathjaxEnableSingleDollar: true
mathjaxEnableAutoNumber: false

# For unlisted posts you might not want the header or footer to show
hideHeaderAndFooter: false

# You can enable or disable out-of-date content warning for individual post.
# Comment this out to use the global config.
#enableOutdatedInfoWarning: false

flowchartDiagrams:
  enable: true
  options: ""

sequenceDiagrams:
  enable: true
  options: ""
typora-copy-images-to: ../../static/llm.assets
typora-root-url: ../../static
---

> Based on [https://llm-course.github.io](https://llm-course.github.io/).

## Basics

### Language Model

A language model assigns a probability to an $n$-gram: $f:V^n \rightarrow R^+$.

A conditional language model assigns a probability to a word given some conditioning context:

$$
g:(V^{n-1},V)\rightarrow R^{+},
$$

$$
p(w_n|w_1,\ldots,w_{n-1}) = g(w_1,\ldots,w_{n-1},w_n) = \frac{f(w_1,\ldots,w_{n})}{f(w_1,\ldots,w_{n-1})}.
$$

A language model is thus a probabilistic model that assigns a probability to every finite sequence (grammatical or not):
$$
P(\text{I am noob})=\underbrace{P(\text{I})*P(\text{am}|\text{I})}_{P(\text{I am})}*P(\text{noob}|\text{I am}).
$$

Three model families:

+ Decoder-only models (GPT-x models).
+ Encoder-only models (BERT, RoBERTa, ELECTRA).
+ Encoder-decoder models (T5, BART).

### Language Modeling with $n$-grams

{{% admonition %}}

**Definition ($n$-gram language model):**

An $n$-gram language model assumes each word depends only on the previous $n-1$ words (a Markov assumption of order $n-1$):
$$
\begin{align}
P_{n\text{-gram}}(w_1,\ldots,w_N)&=P(w_1)P(w_2|w_1)\cdots P(w_i|\underbrace{w_{i-1},\ldots,w_{i-(n-1)}}_{n-1 \text{ words}})\cdots P(w_N|w_{N-1},\ldots,w_{N-(n-1)})\\
&=\prod_{i=1}^N P(w_i|w_{i-1},\ldots,w_{i-(n-1)}).
\end{align}
$$

+ Use an end-of-sentence (EOS) token `</s>` to mark where a sentence ends (bounding its length).
+ Prepend $n-1$ beginning-of-sentence (BOS) tokens `<s>` to each sentence so that $P(w_i|w_{i-1},\ldots,w_{i-(n-1)})$ is well defined for the first $n-1$ words.

***

+ Unigram model ($1$-gram): $P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k)$.
+ Bigram model ($2$-gram): $P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k|w_{k-1})$.
+ Trigram model ($3$-gram): $P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k|w_{k-1},w_{k-2})$.

{{% /admonition %}}

### Evaluation

{{% admonition %}}

**(Intrinsic Evaluation). Perplexity:**

Perplexity is the inverse of the probability of the test set ($\frac{1}{P(\ldots)}$), normalized by the number of tokens $N$ in the test set (the $N$-th root $\sqrt[N]{\ldots}$).

If a language model assigns probability $P(w_1,\ldots,w_N)$ to a test corpus $w_1,\ldots,w_N$, the perplexity of an $n$-gram language model can be written as:
$$
PP(w_1,\ldots,w_N)=\sqrt[N]{1 \over P(w_1,\ldots,w_N)}=\sqrt[N]{1 \over \prod_{i=1}^N P(w_i|w_{i-1},\ldots,w_{i-(n-1)})}.
$$
In log form (the exponential of the negative mean log-likelihood of the words in the sequence):
$$
PPL(w_1,\ldots,w_N)=\exp\left(-\frac{1}{N}\sum_{i=1}^N\log P(w_i|w_{i-1},\ldots,w_{i-(n-1)})\right).
$$

+ Lower perplexity $\rightarrow$ the model assigns higher probability to the unseen test corpus.

{{% /admonition %}}
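To make the $n$-gram definition and the perplexity formula above concrete, here is a minimal sketch (my own illustration, not part of the course notes): it estimates an unsmoothed bigram model from a small hypothetical toy corpus, using the `<s>`/`</s>` padding described above, and evaluates the log-form perplexity of one sentence. All names (`corpus`, `prob`, `perplexity`) are placeholders.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each item is one pre-tokenized sentence.
corpus = [
    ["I", "am", "noob"],
    ["I", "am", "here"],
    ["you", "are", "noob"],
]

BOS, EOS = "<s>", "</s>"
n = 2  # bigram model: condition on the previous n-1 = 1 word


def pad(sentence):
    """Prepend n-1 BOS tokens and append one EOS token, as described above."""
    return [BOS] * (n - 1) + sentence + [EOS]


# Maximum-likelihood estimates: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    padded = pad(sent)
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def prob(prev, cur):
    """Unsmoothed MLE bigram probability; unseen bigrams get probability 0."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]


def perplexity(sentence):
    """exp of the negative mean log-likelihood over the predicted tokens (incl. EOS)."""
    padded = pad(sentence)
    log_probs = []
    for prev, cur in zip(padded, padded[1:]):
        p = prob(prev, cur)
        if p == 0.0:
            return float("inf")  # one unseen bigram makes the whole product zero
        log_probs.append(math.log(p))
    return math.exp(-sum(log_probs) / len(log_probs))


print(prob("I", "am"))                  # 1.0 on this toy corpus
print(perplexity(["I", "am", "noob"]))  # ~1.32: the corpus is tiny and repetitive
```

Note that without smoothing a single unseen bigram drives the perplexity to infinity; real $n$-gram models use smoothing or back-off, which these notes do not cover.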
{{% admonition %}}

**(Extrinsic Evaluation). Word error rate (WER):**
$$
\text{WER} = \frac{\text{Insertions}+\text{Deletions}+\text{Substitutions}}{\text{Actual words in transcript}}.
$$
{{% /admonition %}}


### ChatGPT

+ Phase 1: Pre-training: learn general world knowledge, abilities, etc.
+ Phase 2: Supervised fine-tuning: tailor the model to tasks (unlock some abilities).
+ Phase 3: Reinforcement Learning from Human Feedback (RLHF).
  + [The "hero" behind ChatGPT: a detailed explanation of RLHF](https://huggingface.co/blog/zh/rlhf)

## Resources

ChatGPT API key:

Courses:

Researchers:

+ (Communications) 黄川, Hongyang Du
+ OpenAI and well-known researchers: Lilian Weng, Yao Fu, Jianlin Su

Image $\rightarrow$ text: LLaVA

LLMs for network scheduling: NetLLM

LLM fine-tuning: LoRA (see the sketch below)
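Since the last item points to LoRA for LLM fine-tuning, here is a minimal sketch (my own illustration, not from the notes) of attaching LoRA adapters to a causal language model with the Hugging Face `peft` library. The base model name, target modules, and hyperparameters are placeholder assumptions, not recommendations.

```python
# Minimal LoRA setup sketch using Hugging Face transformers + peft.
# Assumptions: the model name, target_modules, and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "facebook/opt-350m"  # placeholder small causal LM
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```

The wrapped model can then be passed to a standard training loop or trainer; only the small adapter matrices receive gradient updates, which is what makes LoRA cheap compared with full fine-tuning.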