---
title: "Large Language Model (LLM) Notes"
date: 2024-03-20T15:08:24+08:00
lastmod: 2024-03-20T15:08:24+08:00
draft: false
keywords: []
description: ""
tags: ["notes","math","llm"]
categories: ["notes"]
author: "Ren Zhenyu"

# You can also close(false) or open(true) something for this content.
# P.S. comment can only be closed
comment: true
toc: true
autoCollapseToc: true
postMetaInFooter: true
hiddenFromHomePage: false
# You can also define another contentCopyright. e.g. contentCopyright: "This is another copyright."
contentCopyright: MIT
reward: false
mathjax: true
mathjaxEnableSingleDollar: true
mathjaxEnableAutoNumber: false

# For unlisted posts you might not want the header or footer to show
hideHeaderAndFooter: false

# You can enable or disable out-of-date content warning for individual post.
# Comment this out to use the global config.
#enableOutdatedInfoWarning: false

flowchartDiagrams:
enable: true
options: ""

sequenceDiagrams:
enable: true
options: ""
typora-copy-images-to: ../../static/llm.assets
typora-root-url: ../../static
---

<!--more-->

> Based on [https://llm-course.github.io](https://llm-course.github.io/).

## Basics

### Language Model

A language model assigns a probability to an $n$-gram: $f:V^n \rightarrow R^+$.

A conditional language model assigns probability to a word given some conditioning context:

$$
g:(V^{n-1},V)\rightarrow R^{+}.
$$

$$
p(w_n|w_1,\ldots,w_{n-1}) = g(w_1,\ldots,w_{n-1},w_n) = \frac{f(w_1,\ldots,w_{n})}{f(w_1,\ldots,w_{n-1})}.
$$
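
For intuition, here is a minimal Python sketch (a hypothetical toy corpus, bigram case with $n=2$) that estimates $g$ from raw counts $f$:

```python
from collections import Counter

# Hypothetical toy corpus; in practice f(.) is estimated from a large corpus.
corpus = "I am noob I am here I am noob".split()

# f(w_1, ..., w_n) approximated by raw counts of 2-grams and 1-grams.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def cond_prob(word, prev):
    """p(word | prev) = f(prev, word) / f(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(cond_prob("am", "I"))     # 1.0  (every "I" is followed by "am")
print(cond_prob("noob", "am"))  # 2/3
```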

A probabilistic model that assigns a probability to every finite sequence (grammatical or not):
$$
P(\text{I am noob})=\underbrace{P(\text{I})*P(\text{am}|\text{I})}_{P(\text{I am})}*P(\text{noob}|\text{I am}).
$$
+ Decoder-only models (GPT-x models).
+ Encoder-only models (BERT, RoBERTa, ELECTRA).
+ Encoder-decoder models (T5, BART).

### Language Modeling with $n$-grams

{{% admonition%}}

**Definition for $n$-gram:**

An $n$-gram language model assumes each word depends only on the previous $n-1$ words (Markov assumption of order $n-1$):
$$
\begin{align}
P_{ngram}(w_1,\ldots,w_N)&=P(w_1)P(w_2|w_1)\ldots P(w_i|\underbrace{w_{i-1},\ldots,w_{i-(n-1)}}_{n-1 \text{ terms}})\ldots P(w_N|w_{N-1},\ldots,w_{N-(n-1)})\\
&=\prod_{i=1}^N P(w_i|w_{i-1},\ldots,w_{i-(n-1)}).
\end{align}
$$

+ Use an end-of-sentence (EOS) token `</s>` to mark where sentences end and bound their length.
+ Add $n-1$ beginning-of-sentence (BOS) tokens (`<s>`) to the start of each sentence so that $P(w_i|w_{i-1},\ldots,w_{i-(n-1)})$ is well defined for the first $n-1$ words.

***

+ Unigram model ($1$-gram): $P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k)$.
+ Bigram model ($2$-gram): $P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k|w_{k-1})$.
+ Trigram model ($3$-gram): $P(w_1,\ldots,w_i)=\prod_{k=1}^iP(w_k|w_{k-1},w_{k-2})$.

{{% /admonition %}}
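
As a concrete illustration, here is a minimal bigram ($n=2$) sketch in Python, using hypothetical toy sentences, maximum-likelihood counts, and no smoothing:

```python
from collections import Counter

# Hypothetical toy training data; a real model needs a large corpus and smoothing.
sentences = [["I", "am", "noob"], ["I", "am", "here"]]

BOS, EOS = "<s>", "</s>"
n = 2  # bigram model: pad with n-1 BOS tokens and one EOS token.

bigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    padded = [BOS] * (n - 1) + sent + [EOS]
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def prob(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sent):
    """P(w_1, ..., w_N) as a product of bigram conditionals."""
    padded = [BOS] * (n - 1) + sent + [EOS]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        p *= prob(word, prev)
    return p

print(sentence_prob(["I", "am", "noob"]))  # 0.5 on this toy corpus
```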

### Evaluation

{{% admonition %}}

**(Intrinsic Evaluation). Perplexity:**

The inverse ($1\over P(\ldots)$) of the probability of the test set, normalized ($\sqrt[N]{\ldots}$) by the number of tokens ($N$) in the test set.

If an LM assigns probability $P(w_1,\ldots,w_N)$ to a test corpus $w_1,\ldots,w_N$, the perplexity of an $n$-gram language model can be written as:
$$
PP(w_1,\ldots,w_N)=\sqrt[N]{1 \over P(w_1,\ldots,w_N)}=\sqrt[N]{1 \over \prod_{i=1}^N P(w_i|w_{i-1},\ldots,w_{i-(n-1)})}.
$$
Rewriting it in log form (the exponential of the negative mean log-likelihood of all the words in the input sequence):
$$
PPL(w_1,\ldots,w_N)=\exp\left(-\frac{1}{N}\sum_{i=1}^N\log\left(P(w_i|w_{i-1},\ldots,w_{i-(n-1)})\right)\right).
$$

+ Lower perplexity $\rightarrow$ the model assigns a higher probability to the unseen test corpus.

{{% /admonition %}}
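
A minimal Python sketch, using hypothetical per-token probabilities, showing that the direct and log forms give the same value:

```python
import math

# Hypothetical per-token conditional probabilities P(w_i | w_{i-1}, ...)
# assigned by some language model to a test sequence of N = 4 tokens.
token_probs = [0.2, 0.1, 0.4, 0.25]
N = len(token_probs)

# Direct form: N-th root of the inverse sequence probability.
ppl_direct = (1.0 / math.prod(token_probs)) ** (1.0 / N)

# Log form: exp of the negative mean log-likelihood (numerically safer).
ppl_log = math.exp(-sum(math.log(p) for p in token_probs) / N)

print(ppl_direct, ppl_log)  # both ~4.73
```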

{{% admonition %}}

**(Extrinsic Evaluation). Word error rate (WER):**
$$
\text{WER} = \frac{\text{Insertions}+\text{Deletions}+\text{Substitutions}}{\text{Number of words in the reference transcript}}.
$$
{{% /admonition %}}
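
A minimal Python sketch, with hypothetical transcripts, computing WER via a word-level edit (Levenshtein) distance:

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein (edit) distance.

    The minimum number of insertions, deletions, and substitutions needed
    to turn `hypothesis` into `reference`, divided by the reference length.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i am a noob", "i am noob"))  # 1 deletion / 4 words = 0.25
```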


### ChatGPT

+ Phase 1: Pre-training: learn general world knowledge, abilities, etc.
+ Phase 2: Supervised fine-tuning (SFT): tailor the model to specific tasks (unlock some abilities).
+ Phase 3: Reinforcement Learning from Human Feedback (RLHF).
+ [The "contributor" behind ChatGPT: a detailed explanation of RLHF (in Chinese)](https://huggingface.co/blog/zh/rlhf)

## Resources

ChatGPT API key: <http://eylink.cn>

Courses: <https://github.com/mlabonne/llm-course>, <https://llm-course.github.io>

Researchers:
+ (Communications) Chuan Huang, Hongyang Du
+ OpenAI and other well-known researchers: Lilian Weng, Yao Fu, Jianlin Su

Image $\rightarrow$ text: LLaVA

LLMs for network scheduling: NetLLM

LLM fine-tuning: LoRA
