Supporting datatrove tokenized documents with Nanosets #189
Conversation
Ready for a revision! In short: […]

The intention is to invite the user to check out […]

In my setup I am able to run the updated tests, but first it is necessary […]

Edit: Last commit installs […]
Thanks for the contribution. LGTM!
@xrsrke do you think one could merge this? (It could be nice to have the larger-scale training abilities even without brrr.) For us it worked well with this PR.
In this PR, I update the core mechanism of the `Nanosets` to read tokens from the byte files using the `DatatroveFolderDataset`, after preprocessing the data with `datatrove`. To use this PR, we first need to resolve this PR, so we shouldn't run the tests yet (I have updated them, and they all pass in my setup).

In short, `DatatroveFolderDataset` will be responsible for constructing the dataset object from a folder with the files produced by tokenizing the documents with `datatrove`'s `DocumentTokenizer` (a minimal usage sketch follows below), while the `Nanosets` will: […], feed the samples to the `DataLoader`, and […].
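To make the intended usage concrete, here is a minimal sketch of reading a tokenized folder with `DatatroveFolderDataset`. The folder path, sequence length, and `token_size` value are assumptions for illustration, and argument names may differ slightly across `datatrove` versions:

```python
from datatrove.utils.dataset import DatatroveFolderDataset

# Hypothetical folder of files produced by datatrove's DocumentTokenizer.
dataset = DatatroveFolderDataset(
    folder_path="datasets/c4-tokenized",  # assumed output folder of the tokenization step
    seq_len=1024,                         # assumed training sequence length
    token_size=2,                         # assumed: 2 bytes per token id (vocab < 65k)
    shuffle=True,
)

sample = dataset[0]
print(sample["input_ids"].dtype)  # torch.int64: samples come back as torch.LongTensor
```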
I had to create a new collator since `DatatroveFolderDataset` directly returns `torch.LongTensor` and not `np.array`.
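As an illustration of the idea (not the exact collator added in this PR), such a collator can stack the tensors directly, with no `np.array` to `torch.Tensor` conversion. The `sequence_length + 1` sample layout and the output keys below are assumptions:

```python
import dataclasses
import torch

@dataclasses.dataclass
class DatatroveCollatorSketch:
    """Stack torch.LongTensor samples and build shifted labels for causal LM.

    Illustrative only: the real collator also has to produce whatever masks
    and pipeline-parallel metadata the trainer expects, omitted here.
    """

    sequence_length: int

    def __call__(self, examples: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
        # Assumption: each sample carries sequence_length + 1 tokens so that
        # inputs and labels can be shifted by one position.
        input_ids = torch.stack([example["input_ids"] for example in examples])
        return {
            "input_ids": input_ids[:, :-1],  # tokens the model sees
            "label_ids": input_ids[:, 1:],   # next-token targets
        }
```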
Users will need to retokenize their data with `datatrove`. As for the config file, we keep it exactly the same, only renaming `dataset_path` to `dataset_folder` to emphasize that the tokenized documents are now spread across multiple files (see the sketch after the next paragraph).

We still need to slightly update the documentation, but first we need to decide whether we provide a preprocessing script in this repository, add it to the `datatrove` repo, or directly redirect users to `datatrove`.
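Purely to illustrate the rename, a config entry could change along these lines (the surrounding keys and values are hypothetical, not taken from this PR):

```yaml
# Before: a single tokenized file
data:
  dataset:
    dataset_path: datasets/c4-tokenized.bin

# After: a folder of datatrove-tokenized files; only the key is renamed
data:
  dataset:
    dataset_folder: datasets/c4-tokenized
```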
I used a local pipeline similar to the `datatrove` example to tokenize C4 (`examples/tokenize_c4.py`), using only `JsonlReader` and `DocumentTokenizer`.
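For reference, a local pipeline along those lines could look as follows. The paths, tokenizer, and task count are assumptions, and argument names (e.g. `tokenizer_name_or_path`) may differ across `datatrove` versions:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

# Read raw JSONL documents and write the tokenized files that
# DatatroveFolderDataset can later consume.
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/c4-raw"),  # assumed folder of .jsonl(.gz) shards
        DocumentTokenizer(
            output_folder="datasets/c4-tokenized",  # assumed output folder
            tokenizer_name_or_path="gpt2",          # assumed tokenizer
        ),
    ],
    tasks=4,  # assumed local parallelism
)

if __name__ == "__main__":
    executor.run()
```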