
Add lit-llama benchmarks (logits, autoregressive generation, lora fine tuning) #1730

Closed
wants to merge 9 commits

Conversation

@ezyang (Contributor) commented on Jun 12, 2023

Signed-off-by: Edward Z. Yang ezyang@meta.com

@msaroufim (Member) left a comment:

Some minor nits otherwise thanks! Let's see how long CI will take now lol

def train(self):
    logits = self.model(*self.example_inputs)
    logits.sum().backward()
    # meh this sucks
Member:

xd, this might be a good dataset https://huggingface.co/datasets/OpenAssistant/oasst1

Even fine-tuning on two examples of questions you make up might not be bad as a sanity check.

Contributor (author):

Will fix this later, I think. Not needed for dynamo benchmarks.
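(For reference, a minimal sketch of the two-made-up-examples sanity check suggested above. The example texts, loss handling, and optimizer are assumptions for illustration and are not part of this PR; it only reuses the tokenizer.encode call and the model(idx) -> logits convention seen elsewhere in this benchmark.)

# Hypothetical sanity-check fine-tuning on two made-up examples.
# Assumes `tokenizer.encode(text, bos=..., eos=..., device=...)` and a causal
# LM `model(idx) -> logits`, as used elsewhere in this benchmark; the example
# texts and optimizer are illustrative only.
import torch.nn.functional as F

EXAMPLES = [
    "Q: What is the capital of France? A: Paris.",
    "Q: What is 2 + 2? A: 4.",
]

def sanity_finetune_step(model, tokenizer, optimizer, device="cuda"):
    model.train()
    total_loss = 0.0
    for text in EXAMPLES:
        idx = tokenizer.encode(text, bos=True, eos=True, device=device).unsqueeze(0)
        logits = model(idx)
        # Next-token prediction: predict token t+1 from tokens <= t.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            idx[:, 1:].reshape(-1),
        )
        loss.backward()
        total_loss += loss.item()
    optimizer.step()
    optimizer.zero_grad()
    return total_loss / len(EXAMPLES)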

def eval(self):
    self.model.eval()
    with torch.no_grad():
        y = self.model(*self.example_inputs)
Member:

Do you mind printing the input prompt and the output? It will be nice to do vibe checks later.

Contributor (author):

Hmm, but I don't want to print it here, because then the detokenization would also count as part of the benchmark?
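(One way to square that: decode and print outside the timed eval() path. A minimal sketch, assuming the lit-llama Tokenizer's decode method and that the wrapped model returns generated token ids; the helper name print_sample is hypothetical and not part of this PR.)

# Hypothetical vibe check that runs outside the benchmarked eval() path, so
# detokenization is not timed. Assumes `tokenizer` is the lit-llama Tokenizer
# constructed in the snippet below and that self.model returns generated
# token ids.
import torch

def print_sample(self, tokenizer):
    prompt_ids, max_new_tokens = self.example_inputs
    with torch.no_grad():
        output_ids = self.model(prompt_ids, max_new_tokens)
    print("prompt:", tokenizer.decode(prompt_ids))
    print("output:", tokenizer.decode(output_ids))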

self.model = GenerationWrapper(self.model)
tokenizer = Tokenizer(os.path.join(LIT_LLAMA_PATH, "checkpoints/lit-llama/tokenizer.model"))
# max_new_tokens matches lit-llama/generate.py
self.example_inputs = (tokenizer.encode("The meaning of life is", bos=True, eos=False, device=device), 50)
Member:

is 50 the max number of tokens to generate?

Contributor (author):

yes
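(To make the example_inputs shape concrete: a naive greedy sketch of a wrapper that accepts (prompt_ids, max_new_tokens). The PR's actual GenerationWrapper may generate differently; this is illustration only.)

# Illustrative only: a naive greedy wrapper under which (prompt_ids, 50) are
# valid example_inputs, with 50 as the maximum number of new tokens to
# generate. Not the PR's actual GenerationWrapper.
import torch
import torch.nn as nn

class NaiveGenerationWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    @torch.no_grad()
    def forward(self, prompt_ids, max_new_tokens):
        idx = prompt_ids.unsqueeze(0)  # (1, T)
        for _ in range(max_new_tokens):
            logits = self.model(idx)                   # (1, T, vocab_size)
            next_id = logits[:, -1, :].argmax(dim=-1)  # greedy decoding
            idx = torch.cat([idx, next_id.unsqueeze(-1)], dim=1)
        return idx.squeeze(0)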

@xuzhao9 (Contributor) left a comment:

LGTM. How large is the checkpoint file, and are there any rules on access frequency? If we download it too frequently (every CI workflow and every nightly testing workflow), the server might ban our access.

@msaroufim (Member) commented on Jun 12, 2023:

@xuzhao9 this will be a common workflow for LLM work (SAM is similar today); it might make sense to cache these files in a GitHub artifact, or in an S3 bucket if GitHub has data size limits.

@ezyang (author) commented on Jun 13, 2023:

Do we have any precedent for hosting it in S3? I am happy to set it up if there is some example of doing it.

@xuzhao9 (Contributor) commented on Jun 13, 2023:

> Do we have any precedent for hosting it in S3? I am happy to set it up if there is some example of doing it.

I am not sure whether that would require legal review.
The closest example I found is Detectron2, which hosts its model checkpoints at https://dl.fbaipublicfiles.com: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/util/framework/detectron2/__init__.py#L12
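
(A minimal sketch of the kind of download-once-and-cache helper being discussed. The URL and cache directory are placeholders; where the checkpoint ends up hosted (S3, dl.fbaipublicfiles.com, a GitHub artifact) is not decided in this thread.)

# Hypothetical: fetch the checkpoint once and reuse it across CI runs, so the
# hosting server is only hit when the local copy is missing. The URL and the
# cache directory are placeholders, not decisions made in this thread.
import os
import urllib.request

CHECKPOINT_URL = "https://example-bucket.s3.amazonaws.com/lit-llama/7B/lit-llama.pth"  # placeholder
CACHE_DIR = os.path.expanduser("~/.cache/torchbenchmark/lit_llama")

def cached_checkpoint_path() -> str:
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_path = os.path.join(CACHE_DIR, os.path.basename(CHECKPOINT_URL))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(CHECKPOINT_URL, local_path)
    return local_path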

@ezyang (author) commented on Jun 16, 2023:

It seems like we can only run this benchmark on the A100s anyway, so I'm going to disable the A10G configuration.

@xuzhao9 (Contributor) left a comment:

LGTM, see minor inline comments

torchbenchmark/models/lit_llama/__init__.py (outdated comment, resolved)
class Model(BenchmarkModel):
    task = NLP.LANGUAGE_MODELING
    DEFAULT_EVAL_BSIZE = 1
    DEFAULT_TRAIN_BSIZE = 32
Contributor:

Curious why the default train batch size is 32?

Contributor (author):

I think I should just delete this; it's meaningless. You can't train 7B without some sort of distribution, haha.
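
(For context on the "some sort of distribution" point: a minimal FSDP wrap sketch. This is not something the PR adds; it only illustrates how a 7B model's parameters could be sharded across GPUs to make a training step fit.)

# Illustrative only: shard the model across ranks with FSDP so a training
# step fits in memory. This PR does not add distributed training.
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_for_training(model):
    # Assumes the script was launched with torchrun and NCCL is available.
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    return FSDP(model)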

@facebook-github-bot (Contributor):

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ezyang (author) commented on Jul 7, 2023:

oh thank god, pr-test is finally passing

@facebook-github-bot (Contributor):

@ezyang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

@ezyang merged this pull request in 02ff72b.
