Tokenized dataset? #10

Open
joelburget opened this issue Sep 15, 2024 · 1 comment

Comments

@joelburget

I was wondering if it'd be possible to upload the tokenized dataset. I tried following the instructions under the Pretraining header but had trouble installing MegaBlocks due to a CUDA version mismatch. Anyway, I think it would be very helpful to upload the tokenized dataset to Hugging Face to save others the work.

@Muennighoff
Collaborator

Agree that this would be great; @soldni, what do you think? Here are all the S3 paths of the tokenized dataset; can we easily upload them to HF?
