Microsoft Developer: State of GPT - YouTube
Slides - Karpathy.ai
Notes by mk2112
Think of GPT-personalization as an emerging technique for adapting GPTs to your needs and behavioral expectations. A current approach consists of a multi-stage process:
- Pretraining:
- Dataset: Raw internet scraped text, trillions of words with low task-specificity, in high quantity
- Algorithm: Next token prediction
- Result: Base model
- Supervised Finetuning:
- Dataset: Q-A-style behavioral demonstrations (10K to 100K), human-written, high specificity, low quantity
- Algorithm: Next token prediction
- Result: SFT model (this could be deployed)
- Reward Modeling:
- Dataset: Comparisons, may be written by human contractors
- Algorithm: Binary classification (which of two answers did a human label as better?)
- Result: RM model
- Reinforcement Learning:
- Dataset: Prompts (10K to 100K), may be written by human contractors
- Algorithm: Reinforcement Learning (Generate tokens that maximize a perceived reward)
- Result: RL model (this could be deployed)
- Most of the computational work involved in creating aligned LLMs happens here, in the pretraining stage
- 1,000s of GPUs, months of training, millions of dollars in expenses
- The core competency of this step arguably lies in the data-gathering process, e.g. as listed below for LLaMA; we have to gather data and turn it into a unified format
Source: LLaMA: Open and Efficient Foundation Language Models
We've got the text, now what?
Given that GPTs are mathematical models, requiring numeric inputs, we need to find a way to encode our training data meaningfully into a numeric representation. Tools like the OpenAI Tokenizer help with that. Specifically, algorithms like the state-of-the-art Byte Pair Encoding are employed.
The main idea behind a good tokenizer is that the numeric representation it produces is both lossless and unique to the text it represents.
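As a minimal sketch of the idea (illustrative only; LLaMA, for instance, uses a SentencePiece-based BPE tokenizer instead), the tiktoken library exposes the GPT-2 BPE vocabulary and demonstrates the lossless round trip:

```python
# A minimal sketch of BPE tokenization using OpenAI's tiktoken library.
# (Illustrative only; e.g. LLaMA uses a SentencePiece-based BPE tokenizer.)
import tiktoken

enc = tiktoken.get_encoding("gpt2")              # GPT-2's BPE vocabulary (~50K tokens)

text = "Tokenization turns text into integers."
tokens = enc.encode(text)                        # text -> list of token IDs
print(tokens)

# Lossless: decoding the token IDs recovers the exact original string.
assert enc.decode(tokens) == text
```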
Source: LLaMA: Open and Efficient Foundation Language Models
Interestingly, LLaMA, despite its much smaller parameter count (65B), achieves much higher performance than GPT-3 with its 175B parameters. This is due to longer training runs on far more tokens. LLaMA cost roughly $5 million to train, requiring 2,048 NVIDIA A100 GPUs to be run for 21 days. This results in the base model LLaMA.
Now that we have such a setup and have obtained the training dataset, we need to reshape it so that the training process is exposed to the data as efficiently as possible.
We define:
- $B$ as the batch size (e.g. $4$)
- $T$ as the maximum context length of an entry in the batch (e.g. $10$ here)
- a special token $<|endoftext|>$ that is part of the $T$ tokens within each batch entry, denoting the end of a document from the training set
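Below is a minimal sketch of how these definitions play out when packing tokenized documents into a batch; the token IDs are made up, and the $<|endoftext|>$ ID shown is the one from the GPT-2 vocabulary:

```python
# Minimal sketch: packing tokenized documents into a (B, T) batch for
# next-token prediction, with documents delimited by <|endoftext|>.
import torch

B, T = 4, 10                        # batch size, max context length
EOT = 50256                         # <|endoftext|> ID in the GPT-2 vocabulary

# Hypothetical, already-tokenized documents (IDs are made up)
docs = [list(range(1, 11)), list(range(20, 30)),
        list(range(40, 50)), list(range(60, 70))]

# Concatenate all documents into one long stream, delimited by EOT
stream = []
for d in docs:
    stream.extend(d + [EOT])

# Slice the stream into B rows of T+1 tokens; inputs are positions 0..T-1,
# targets are positions 1..T (next-token prediction shifts by one).
x = torch.tensor(stream[: B * (T + 1)]).view(B, T + 1)
inputs, targets = x[:, :-1], x[:, 1:]
print(inputs.shape, targets.shape)  # torch.Size([4, 10]) torch.Size([4, 10])
```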
GPTs (Generative Pre-trained Transformers) are based on the Transformer architecture. Within a size-restricted, moving context window, the Transformer takes in previously seen inputs and evaluates the current input in their context, without the prior inputs from that window losing influence on the interpretation of the current input merely because of their distance from it.
Think of this as feeding contexts, and not (just) individual tokens, into the GPT. Given a token and its provided predecessors: what next token does the model suggest, and what actually comes next in the dataset?
The Transformer-based GPT now gets exposed to contextual, supervised learning. For each position in the sequence, the GPT generates a probability distribution over the entire vocabulary. The comparison between this prediction and the token that actually came next is what drives the improvement. If this is done sensibly and the model actually learns how to adapt its internal distribution representations, you (hopefully) see something like this:
The (gradually) lower the training loss, the (gradually) better.
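As a minimal sketch (with random stand-in tensors, not a real model), the objective boils down to a cross-entropy loss between the predicted distributions and the actual next tokens:

```python
# Minimal sketch of the next-token prediction objective: the model emits
# logits over the vocabulary at every position, and cross-entropy compares
# them with the token that actually came next. Tensors here are random
# stand-ins for real model outputs and data.
import torch
import torch.nn.functional as F

B, T, V = 4, 10, 50257                     # batch size, context length, vocabulary size

logits = torch.randn(B, T, V)              # stand-in for the model's output
targets = torch.randint(0, V, (B, T))      # the tokens that actually came next

loss = F.cross_entropy(logits.view(B * T, V), targets.view(B * T))
print(loss.item())                         # ~ln(V) ≈ 10.8 for an untrained model
```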
With all that money and time spent on this large-scale exposure of the model to data, we have ... no attained task-specificity. In essence, the model was trained to 'parrot' the training set as well as possible. It can't answer questions, solve tasks or anything like that. But somehow, ChatGPT, LLaMA, Open Assistant etc. can.
To do so, we derive a Question-Answer-style dataset through human contractors.
This is a high-quality, low-quantity dataset. In essence, we just continue the training from above, but now with the question as input and the answer as expected output.
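A minimal sketch of what preparing one such SFT example could look like, assuming a simple hypothetical prompt template and the GPT-2 tokenizer from above; the loss mask ensures that only the answer tokens are trained on:

```python
# Minimal sketch of preparing one SFT example. The prompt template is a
# hypothetical choice; -100 is the conventional "ignore" label used by
# PyTorch's cross_entropy, so the loss only applies to the answer tokens.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

question = "Explain what a tokenizer does."
answer = "A tokenizer maps text to a sequence of integer IDs and back."

prompt = f"Question: {question}\nAnswer: "       # hypothetical template
prompt_ids = enc.encode(prompt)
answer_ids = enc.encode(answer)

input_ids = prompt_ids + answer_ids
labels = [-100] * len(prompt_ids) + answer_ids   # mask out the prompt tokens

assert len(input_ids) == len(labels)
```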
This results in a "Supervised Fine-Tuning" model (SFT model). This model could be published. In practice, though, this is not viewed as sufficient. SFT may not fully capture the complexity and diversity needed for successful fine-tuning, especially for tasks requiring specialized knowledge or nuanced understanding. However, this challenge can be addressed through additional Reward Modeling.
To continue the pipeline, we can expose the SFT model to Reward Modeling. Combined with Reinforcement Learning, this is also known as Reinforcement Learning from Human Feedback (RLHF).
Reward Modeling is based on improving through user feedback based on ranking. The SFT model produces a set of possible answers for a single prompt. The answers then are compared in quality by a human, ranking them from best to worst. Think of this as making sure the model is aligned well.
Between all possible pairs of these potential answers, we do binary classification.
To do so, we lay out the (always identical) prompt concatenated with each of the different responses, and we append a special readout token at the end of each one.
An additional Transformer model, the reward model, then predicts at the position of this readout token how good it thinks the preceding prompt-completion combination is, essentially making a guess about each completion's quality. This scalar prediction at the readout token serves as the quality score for that completion.
Only now does the human-derived ranking come into play. We adjust the predicted rewards toward the actual ranking, nudging some scores up and others down, so that the Transformer learns to prefer the more favored options. We thus obtain a Reward Model for response quality.
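A minimal sketch of the pairwise loss this amounts to, with made-up scalar scores standing in for the reward model's readout-token predictions:

```python
# Minimal sketch of the pairwise ranking loss behind reward modeling:
# for each pair of completions to the same prompt, the scalar reward of
# the human-preferred completion should exceed that of the rejected one.
import torch
import torch.nn.functional as F

# Stand-ins for reward-model predictions at the readout token (one scalar each)
reward_preferred = torch.tensor([1.2, 0.3, 0.8])    # human-preferred completions
reward_rejected  = torch.tensor([0.4, 0.9, -0.2])   # completions ranked lower

# -log(sigmoid(r_preferred - r_rejected)), averaged over the pairs
loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()
print(loss.item())
```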
This additional Reward Model is small in itself, and not really useful on its own. But coupled with the LLM, it shines in what follows now: Reinforcement Learning.
Again, a prompt set is acquired from human contractors: low in quantity, high in quality. We expose our LLM to it, again producing multiple answers per prompt. The thing is, we now keep the Reward Model fixed. It has been trained, and now serves as a reasonably dependable indicator of response quality.
With the predictions of the Reward Model, we obtain a guide by which to reinforce one particular response over the others, making the tokens of the best-ranked answer more likely to occur. This concludes the RLHF pipeline as applied, e.g., to GPT-3.5.
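A deliberately simplified, REINFORCE-style sketch of that idea is shown below; real pipelines such as InstructGPT use PPO together with a KL penalty against the SFT model, which is omitted here:

```python
# Highly simplified, REINFORCE-style sketch of the RL step: tokens the LLM
# actually sampled are made more (or less) likely in proportion to the
# reward the frozen reward model assigns to the whole completion.
import torch

def rl_step(logprobs_of_sampled_tokens: torch.Tensor, reward: float,
            optimizer: torch.optim.Optimizer) -> None:
    # logprobs_of_sampled_tokens: (T,) log-probabilities the policy assigned
    # to the tokens it sampled; reward: scalar from the frozen reward model.
    loss = -(logprobs_of_sampled_tokens.sum() * reward)  # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```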
Interestingly, RLHF-ed models gain in perceived response quality and contextual reference, but tend to play it safe on the entropy side. They become less and less likely to choose possible, yet not most preferred, next-token predictions. This partially stems from maximizing positive feedback in the RM/RL stages, which turns a model risk-averse and makes it favor well-established, commonly accepted responses. This, by the way, is a key indicator for detecting AI-generated text.
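To make "playing it safe on the entropy side" concrete, here is a minimal sketch of the entropy of a next-token distribution; RLHF-ed models tend to concentrate probability mass on a few tokens and thus sit near the low end:

```python
# Minimal sketch: entropy of a next-token distribution. A peaked (low-entropy)
# distribution corresponds to "safe", predictable continuations.
import torch

def next_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)

peaked = torch.tensor([8.0, 0.0, 0.0, 0.0])   # confident, low-entropy prediction
flat   = torch.tensor([1.0, 1.0, 1.0, 1.0])   # uncertain, high-entropy prediction
print(next_token_entropy(peaked))             # ~0.01
print(next_token_entropy(flat))               # ~1.39 (= ln 4, the maximum here)
```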
Human-written sentences are interesting. They reach deep into both the author's and the reader's perception, experience and skillsets.
See how the thought process concerns not just a writing process, but also a process tasked with ensuring factual correctness through tool use and with correcting already written text?
That's ... not how GPTs work. No internal dialogue, no reasoning as such (present, but shallow), no self-correction, no tool-use. A transformer will not reason reasonably.
The notion of self-consistency, independently coming up with several approaches, disregarding some and accepting others, and learning from both, is really remarkable.
Source: Self-Consistency Improves Chain of Thought Reasoning in Language Models
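A minimal sketch of the self-consistency idea, where `sample_completion` is a hypothetical stand-in for an LLM sampling call and the answer parser is deliberately naive:

```python
# Minimal sketch of self-consistency: sample several independent
# chain-of-thought completions and keep the majority final answer.
from collections import Counter

def extract_final_answer(completion: str) -> str:
    # Hypothetical, naive parser: assume the final answer is on the last line.
    return completion.strip().splitlines()[-1]

def self_consistent_answer(prompt: str, sample_completion, n_samples: int = 10) -> str:
    # sample_completion(prompt, temperature) is a stand-in for an LLM call.
    answers = [extract_final_answer(sample_completion(prompt, temperature=0.8))
               for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority vote
```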
GPT-4 can actually reflect on past answers, apologizing for prior, unfavorably sampled responses.
Ideally, though, we shouldn't need to have the model apologize; instead, it should explore the sampling space up front and come up with the best-aligned, highest quality-measured answer.
Source: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
A certain sense of self-reflection is already emerging, as can be seen in AutoGPT.
AutoGPT is an application of GPT-4 and GPT-3.5 that creates an environment of a task creation agent, a task execution agent and a task prioritization agent, working in conjunction to generate, interlink and process self-prioritized tasks. It's a study of the extent to which unsupervised interaction with the environment is possible with current LLMs.
Source: Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions
To get back to the notion of 'How To', we have to be aware that an LLM by itself is fully satisfied by imitation, not by task-specific contribution. Additionally, a training dataset might contain multiple, differently good perspectives on a solution to a potential prompt. The LLM on its own has no way of differentiating the quality of these answers. This is worth remembering when prompting: explicitly asking for a high-quality, expert-level answer nudges the model towards the better modes in its training data.
Interestingly, recent advancements worked towards addressing this.
The token vocabulary of ChatGPT contains special, additional tokens. When the model emits them, an interpreter reads them and, based on them, calls external APIs, fetches the results, and concatenates them with the original prompt. This allows for lifting the restriction of a knowledge cut-off date: data can simply be fetched from the web and added. This approach also reduces the potential for factual inconsistencies, e.g. through integration of a calculator API.
LLMs that incorporate the use of tools are commonly referred to as Retrieval-Augmented Language Models (RALMs).
Source: Toolformer: Language Models Can Teach Themselves to Use Tools
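A minimal sketch of such an interpreter loop for a single tool; the [Calculator(...)] marker syntax is an assumption for illustration, loosely following the Toolformer paper's notation:

```python
# Minimal sketch of a Toolformer-style interpreter: when the model's output
# contains a [Calculator(...)] marker, evaluate the expression and splice the
# result back into the text before generation continues.
import re

def run_tools(generated_text: str) -> str:
    def call_calculator(match: re.Match) -> str:
        expression = match.group(1)
        result = eval(expression, {"__builtins__": {}})   # toy arithmetic only
        return f"{expression} = {result}"
    return re.sub(r"\[Calculator\(([^)]*)\)\]", call_calculator, generated_text)

print(run_tools("That comes out to [Calculator(123 * 456)] units."))
# -> "That comes out to 123 * 456 = 56088 units."
```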
LLM capabilities also depend on the model's working memory, i.e. its context window. We have to place the information relevant to a task into this memory for the model to perform best. Tool use can help here, but how can tools be established most suitably?
Emerging right now, LlamaIndex is a data framework facilitating the integration of custom data sources with LLMs. Serving as a central interface, it enables LLMs to ingest, structure, and access private or domain-specific data. For this, LlamaIndex provides essential building blocks, including data connectors, indexes, and application integrations, streamlining the ingestion, structuring, and integration of data with LLMs. Think of LlamaIndex as a bridge, enhancing both the accessibility and usability of custom data (sources) for (custom) LLM tasks.
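A minimal sketch of that bridge, assuming a recent llama_index release (the import layout changes between versions), a local folder ./my_private_docs, and a configured LLM/embedding backend:

```python
# Minimal sketch of the basic LlamaIndex flow: ingest local documents,
# build an index over them, and query it with an LLM.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./my_private_docs").load_data()  # ingest
index = VectorStoreIndex.from_documents(documents)                  # structure
query_engine = index.as_query_engine()                              # access

print(query_engine.query("What does the internal style guide say about logging?"))
```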
Another emerging application is constrained prompting, i.e. requesting output in a contextually very specific, logically fitting format. The template below (from the Guidance library) fills in a JSON object field by field, constraining each generated value:
{
"id": "{id}",
"description": "{description}",
"name": "{gen('name', stop='"')}",
"age": {gen('age', regex='[0-9]+', stop=',')},
"armor": "{select(options=['leather', 'chainmail', 'plate'], name='armor')}",
"weapon": "{select(options=valid_weapons, name='weapon')}",
"class": "{gen('class', stop='"')}",
"mantra": "{gen('mantra', stop='"')}",
"strength": {gen('strength', regex='[0-9]+', stop=',')},
"items": ["{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}", "{gen('item', list_append=True, stop='"')}"]
}
Source: Guidance-AI, GitHub
Finetuning a model means changing its weights through exposure to a comparatively small dataset, with the aim of inducing task-specificity in a more broadly trained base model. The thing is, the larger the model, the more complex it is to finetune.
But:
- Parameter-Efficient FineTuning (PEFT) is emerging, e.g. with LoRA, which trains only a small set of low-rank adapter parameters and keeps the rest of the model frozen. This still works well and makes finetuning a lot cheaper (see the sketch after this list).
- High-quality base models are emerging, requiring only increasingly light, specific finetuning, which makes the process more efficient
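A minimal sketch of LoRA-based PEFT using Hugging Face's peft library; the base model and hyperparameters are illustrative only:

```python
# Minimal sketch of Parameter-Efficient FineTuning with LoRA: only small,
# low-rank adapter matrices are trained, the base model stays frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")   # small example model

lora_config = LoraConfig(
    r=8,                  # rank of the low-rank update matrices
    lora_alpha=16,        # scaling factor for the updates
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # only the LoRA adapters are trainable
```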
Use cases and things to remember:
- Models may be biased
- Models may fabricate (“hallucinate”) information
- Models may have reasoning errors
- Models may struggle in certain classes of applications, e.g. spelling-related tasks
- Models have knowledge cutoffs (e.g. September 2021)
- Models are susceptible to prompt injection, “jailbreak” attacks, data poisoning attacks,…
Goal 1: Achieve your top possible performance
- Use GPT-4 (Turbo)
- Use prompts with detailed task context, relevant information, instructions
- "what would you tell a task contactor if they can’t email you back?"
- Retrieve and add any relevant context or information to the prompt
- Experiment with prompt engineering techniques (see above)
- Experiment with few-shot examples that are
- relevant to the test case,
- diverse (if appropriate)
- Experiment with tools/plugins to offload tasks difficult for LLMs (calculator, code execution, ...)
- If prompts are well-engineered (work for some time on that): Spend quality time optimizing a pipeline / "chain"
- If you feel confident that you maxed out prompting, consider SFT data collection + finetuning
- Expert / fragile / research zone: consider RM data collection, RLHF finetuning
Goal 2: Optimize costs to maintain performance
- Once you have the top possible performance, attempt cost saving measures (e.g. use GPT-3.5, find shorter prompts, etc.)
Recommendations:
- Use in low-stakes applications, combine with human oversight
- Source of inspiration, suggestions
- Copilots over autonomous agents