Supporting datatrove tokenized documents with Nanosets #189

Merged: 7 commits into huggingface:main on Jul 31, 2024

Conversation

@TJ-Solergibert (Contributor) commented May 31, 2024

In this PR, I update the core mechanism of the Nanosets to read tokens, via the DatatroveFolderDataset, from the byte files produced by preprocessing the data with datatrove. To use this PR, we first need to resolve another pending PR on the datatrove side, so we shouldn't run the tests yet (I have updated them, and they all pass in my setup).

In short, DatatroveFolderDataset will be responsible for constructing the dataset object from a folder with the files produced by tokenizing the documents with datatrove's DocumentTokenizer (a usage sketch follows the list below), while the Nanosets will:

  1. Create different data mixtures from multiple datasets,
  2. Ensure that we NEVER exhaust the samples from the DataLoader, and
  3. Ensure that in each epoch, we consume each sample only once.
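
For intuition, building the dataset object from a tokenized folder looks roughly like the sketch below; the folder path and sequence length are illustrative placeholders, not values from this PR:

```python
# Minimal sketch: construct a dataset from a folder of datatrove-tokenized files.
# Only the basic arguments are shown; path and seq_len are placeholders.
from datatrove.utils.dataset import DatatroveFolderDataset

dataset = DatatroveFolderDataset(
    folder_path="datasets/c4_tokenized",  # hypothetical output folder of DocumentTokenizer
    seq_len=1024,                         # training sequence length (placeholder)
)

# Per the PR description, samples already hold token ids as torch.LongTensor
# (not np.array), which is why a dedicated collator is needed (see below).
sample = dataset[0]
print(type(sample), len(dataset))
```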

I had to create a new collator since DatatroveFolderDataset directly returns torch.LongTensor and not np.array.
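
Conceptually, such a collator only needs to stack the tensors it receives. The sketch below assumes each sample exposes its tokens under an "input_ids" key; this is an illustration of the idea, not the PR's actual collator code:

```python
# Hedged sketch: stack torch.LongTensor samples from DatatroveFolderDataset into
# a batch, without the numpy round trip the previous collator assumed.
import torch
from typing import Dict, List


def collate_datatrove_samples(samples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Each sample is assumed to carry its token ids under "input_ids" as a LongTensor.
    input_ids = torch.stack([sample["input_ids"] for sample in samples])
    return {"input_ids": input_ids}
```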

Users will need to retokenize their data with datatrove. As for the config file, we keep it exactly the same, only renaming dataset_path to dataset_folder to emphasize that the tokenized documents are now in multiple files.
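
To illustrate the rename only (the surrounding keys and values are placeholders, not the actual nanotron config schema):

```python
# Illustrative only: everything besides the renamed key is an assumption.
dataset_args = {
    # before this PR: "dataset_path": "datasets/c4_tokenized"  (a single preprocessed dataset)
    "dataset_folder": "datasets/c4_tokenized",  # now a folder containing the datatrove output files
}
```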

We still need to update the documentation slightly, but first, we need to decide whether we provide a preprocessing script in this repository, add it to the datatrove repo, or directly redirect users to datatrove.

I used a local pipeline similar to the datatrove example (examples/tokenize_c4.py) to tokenize C4, using only JsonlReader and DocumentTokenizer.
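
A rough sketch of that kind of pipeline is shown below; paths, the tokenizer choice, and the number of tasks are placeholders, and argument names may vary slightly across datatrove versions:

```python
# Sketch of a local datatrove tokenization pipeline: read .jsonl documents and
# tokenize them into a folder that can later be consumed by the Nanosets.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

pipeline_executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/c4_jsonl"),  # folder with the raw .jsonl documents (placeholder)
        DocumentTokenizer(
            output_folder="datasets/c4_tokenized",   # folder later passed to the Nanosets
            tokenizer_name_or_path="gpt2",           # any HF tokenizer; placeholder choice
        ),
    ],
    tasks=4,  # number of local workers; placeholder value
)

if __name__ == "__main__":
    pipeline_executor.run()
```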

@TJ-Solergibert (Contributor, Author) commented Jun 1, 2024

Ready for review! In short:

  • Nanosets now support documents tokenized with datatrove's DocumentTokenizer, through DatatroveFolderDataset.
    • Added the datatrove[io,processing] dependency to the nanosets flavour
    • Refactored tools/preprocess_data.py to tokenize documents with datatrove
    • Updated docs/nanoset.md
  • There is one slight change to the config file: dataset_path --> dataset_folder
  • Added a custom Nanosets collator, as DatatroveFolderDataset already produces torch.LongTensor
  • Refactored all tests

The intention is to invite users to check out datatrove to develop their preprocessing workflows. Even so, I have included a very basic script to tokenize HF Datasets from the Hub or local jsonl files. It would be great to have @guipenedo's approval, mainly of tools/preprocess_data.py and docs/nanoset.md!

In my setup I am able to run the updated tests, but first it is necessary to merge the datatrove PR mentioned above and have an updated release of datatrove (or temporarily install it from source).

Edit: Last commit installs datatrove from source, tests passing in my setup!

@xrsrke
Member

xrsrke left a comment

Thanks for the contribution. LGTM!

@martinjaggi

@xrsrke do you think one could merge this? (It could be nice to have the larger-scale training abilities even without brrr.) For us it worked well with this PR.

@xrsrke merged all changes into huggingface:main in 6ad5994 on Jul 31, 2024