Supporting datatrove tokenized documents with Nanosets #189

Merged: 7 commits into huggingface:main on Jul 31, 2024

Conversation

@TJ-Solergibert (Contributor) commented May 31, 2024

In this PR, I update the core mechanism of the Nanosets to read tokens, via the DatatroveFolderDataset, from the byte files produced by preprocessing the data with datatrove. To use this PR, we first need to resolve another pending PR on the datatrove side, so we shouldn't run the tests yet (I have updated them, and they all pass in my setup).

In short, DatatroveFolderDataset will be responsible for constructing the dataset object from a folder with the files produced by tokenizing the documents with datatrove's DocumentTokenizer (a usage sketch follows the list below), while the Nanosets will:

  1. Create different data mixtures from multiple datasets,
  2. Ensure that we NEVER exhaust the samples from the DataLoader, and
  3. Ensure that in each epoch, we consume each sample only once.
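
For intuition, building the dataset object from a tokenized folder looks roughly like the sketch below; the folder path and sequence length are illustrative placeholders, not values from this PR:

```python
# Minimal sketch: construct a dataset from a folder of datatrove-tokenized files.
# Only the basic arguments are shown; path and seq_len are placeholders.
from datatrove.utils.dataset import DatatroveFolderDataset

dataset = DatatroveFolderDataset(
    folder_path="datasets/c4_tokenized",  # hypothetical output folder of DocumentTokenizer
    seq_len=1024,                         # training sequence length (placeholder)
)

# Per the PR description, samples already hold token ids as torch.LongTensor
# (not np.array), which is why a dedicated collator is needed (see below).
sample = dataset[0]
print(type(sample), len(dataset))
```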

I had to create a new collator since DatatroveFolderDataset directly returns torch.LongTensor and not np.array.
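
Conceptually, such a collator only needs to stack the tensors it receives. The sketch below assumes each sample exposes its tokens under an "input_ids" key; this is an illustration of the idea, not the PR's actual collator code:

```python
# Hedged sketch: stack torch.LongTensor samples from DatatroveFolderDataset into
# a batch, without the numpy round trip the previous collator assumed.
import torch
from typing import Dict, List


def collate_datatrove_samples(samples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Each sample is assumed to carry its token ids under "input_ids" as a LongTensor.
    input_ids = torch.stack([sample["input_ids"] for sample in samples])
    return {"input_ids": input_ids}
```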

Users will need to retokenize their data with datatrove. As for the config file, we keep it exactly the same, only renaming dataset_path to dataset_folder to emphasize that the tokenized documents are now in multiple files.
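
To illustrate the rename only (the surrounding keys and values are placeholders, not the actual nanotron config schema):

```python
# Illustrative only: everything besides the renamed key is an assumption.
dataset_args = {
    # before this PR: "dataset_path": "datasets/c4_tokenized"  (a single preprocessed dataset)
    "dataset_folder": "datasets/c4_tokenized",  # now a folder containing the datatrove output files
}
```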

We still need to update the documentation slightly, but first, we need to decide whether we provide a preprocessing script in this repository, add it to the datatrove repo, or directly redirect users to datatrove.

I used a local pipeline similar to the datatrove example (examples/tokenize_c4.py) to tokenize C4, using only JsonlReader and DocumentTokenizer.
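
A rough sketch of that kind of pipeline is shown below; paths, the tokenizer choice, and the number of tasks are placeholders, and argument names may vary slightly across datatrove versions:

```python
# Sketch of a local datatrove tokenization pipeline: read .jsonl documents and
# tokenize them into a folder that can later be consumed by the Nanosets.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.tokens import DocumentTokenizer

pipeline_executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/c4_jsonl"),  # folder with the raw .jsonl documents (placeholder)
        DocumentTokenizer(
            output_folder="datasets/c4_tokenized",   # folder later passed to the Nanosets
            tokenizer_name_or_path="gpt2",           # any HF tokenizer; placeholder choice
        ),
    ],
    tasks=4,  # number of local workers; placeholder value
)

if __name__ == "__main__":
    pipeline_executor.run()
```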

@TJ-Solergibert (Contributor, Author) commented Jun 1, 2024

Ready for review! In short:

  • Nanosets now support documents tokenized with datatrove's DocumentTokenizer, through DatatroveFolderDataset.
    • Added the datatrove[io,processing] dependency to the nanosets flavour
    • Refactored tools/preprocess_data.py to tokenize documents with datatrove
    • Updated docs/nanoset.md
  • There is one slight change to the config file: dataset_path --> dataset_folder
  • Added a custom Nanosets collator, as DatatroveFolderDataset already produces torch.LongTensor
  • Refactored all tests

The intention is to invite users to check out datatrove to develop their preprocessing workflows. Even so, I have included a very basic script to tokenize HF Datasets from the Hub or local jsonl files. It would be great to have @guipenedo's approval, mainly of tools/preprocess_data.py and docs/nanoset.md!

In my setup I am able to run the updated tests, but first it is necessary to merge the datatrove PR mentioned above and have an updated release of datatrove (or temporarily install it from source).

Edit: Last commit installs datatrove from source, tests passing in my setup!

@xrsrke
Member

xrsrke left a comment

Thanks for the contribution. LGTM!

@martinjaggi

@xrsrke do you think one could merge this? (It could be nice to have the larger-scale training abilities even without brrr.) For us it worked well with this PR.

@xrsrke merged all changes into huggingface:main in 6ad5994 on Jul 31, 2024