Add lit-llama benchmarks (logits, autoregressive generation, lora fine tuning) #1730
Conversation
Some minor nits, otherwise thanks! Let's see how long CI will take now lol
```python
def train(self):
    logits = self.model(*self.example_inputs)
    logits.sum().backward()
    # meh this sucks
```
xd, this might be a good dataset: https://huggingface.co/datasets/OpenAssistant/oasst1
Even finetuning on two examples of questions you make up might not be bad as a sanity check.
Will fix this later, I think. Not needed for dynamo benchmarks.
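For reference, a rough sketch of the sanity check suggested above: overfit on two hand-written prompt/answer pairs and confirm the loss drops. Here `model` and `tokenizer` stand in for the lit-llama objects set up elsewhere in this benchmark, and `eos=True` is an assumption about the tokenizer API; none of this code is in the PR.

```python
import torch
import torch.nn.functional as F

pairs = [
    ("What is 2 + 2?", "4"),
    ("Name a primary color.", "Red"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(10):  # a handful of steps is enough to see the loss fall
    for prompt, answer in pairs:
        ids = tokenizer.encode(prompt + " " + answer, bos=True, eos=True)
        logits = model(ids[:-1].unsqueeze(0))           # next-token prediction
        loss = F.cross_entropy(logits[0], ids[1:].long())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```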
```python
def eval(self):
    self.model.eval()
    with torch.no_grad():
        y = self.model(*self.example_inputs)
```
Do you mind printing the input prompt and the output? It will be nice to do vibe checks later.
Hmm, but I don't want to print it here, because then the detokenization would also count as part of the benchmark?
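One way to square the two (a sketch, not code from this PR): time only the generation call and decode afterwards, so the vibe-check printing never lands inside the measured region. The `tokenizer.decode` usage is an assumption about the lit-llama Tokenizer API.

```python
import time

def timed_generate(model, example_inputs, tokenizer):
    start = time.perf_counter()
    tokens = model(*example_inputs)        # timed: generation only
    elapsed = time.perf_counter() - start
    # Detokenization (and printing) stays outside the timed region.
    print(tokenizer.decode(tokens))
    return elapsed
```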
```python
self.model = GenerationWrapper(self.model)
tokenizer = Tokenizer(os.path.join(LIT_LLAMA_PATH, "checkpoints/lit-llama/tokenizer.model"))
# max_new_tokens matches lit-llama/generate.py
self.example_inputs = (tokenizer.encode("The meaning of life is", bos=True, eos=False, device=device), 50)
```
is 50 the max number of tokens to generate?
yes
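For readers unfamiliar with the parameter: `max_new_tokens` caps how many tokens get appended to the prompt, regardless of prompt length. An illustrative greedy loop, not the actual lit-llama `generate()` implementation:

```python
import torch

@torch.no_grad()
def greedy_generate(model, idx, max_new_tokens=50):
    # idx: 1-D tensor of prompt token ids; at most max_new_tokens are added.
    for _ in range(max_new_tokens):
        logits = model(idx.unsqueeze(0))[0, -1]            # last-position logits
        next_token = torch.argmax(logits, dim=-1, keepdim=True)
        idx = torch.cat([idx, next_token])
    return idx
```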
LGTM. How large is the checkpoint file, and are there any rules on access frequency? If we download it too frequently (every CI workflow and every nightly testing workflow), the server might ban our access.
@xuzhao9 this will be a common workflow for LLM work (SAM is similar today); it might make sense to cache these files in a GitHub artifact, or in an S3 bucket if GitHub has data size limits.
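A minimal sketch of the cache-on-miss pattern being proposed, assuming a stable checkpoint URL; `CHECKPOINT_URL` and `CACHE_DIR` are made-up placeholders, not anything this repo defines.

```python
import os
import urllib.request

CACHE_DIR = os.path.expanduser("~/.cache/torchbenchmark")          # placeholder
CHECKPOINT_URL = "https://example.com/lit-llama/7B/lit-llama.pth"  # placeholder

def cached_checkpoint_path() -> str:
    path = os.path.join(CACHE_DIR, os.path.basename(CHECKPOINT_URL))
    if not os.path.exists(path):
        # Only hit the upstream server on a cache miss, so CI runs
        # don't re-download the checkpoint on every workflow.
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(CHECKPOINT_URL, path)
    return path
```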
Do we have any precedent for hosting it in S3? I am happy to set it up if there is some example of doing it.
I am not sure whether that requires legal review.
So, it seems like we can only run this benchmark on the A100s anyway, so I'm going to disable the A10G configuration.
LGTM, see minor inline comments
```python
class Model(BenchmarkModel):
    task = NLP.LANGUAGE_MODELING
    DEFAULT_EVAL_BSIZE = 1
    DEFAULT_TRAIN_BSIZE = 32
```
Curious why the default train batch size is 32?
I think I should just delete this; it's meaningless. You can't train 7B without some sort of distribution, haha.
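To make the "some sort of distribution" concrete: a minimal sketch assuming PyTorch FSDP and a process group already initialized (e.g. via torchrun). `build_lit_llama` is a hypothetical constructor; none of this appears in the PR.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Without sharding, 7B parameters plus gradients plus AdamW state do not
# fit on a single GPU; FSDP shards all three across ranks. Assumes
# torch.distributed.init_process_group was already called (e.g. by torchrun).
model = build_lit_llama()          # hypothetical constructor
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```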
@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
oh thank god, pr-test is finally passing