"datatrove" is missing from the examples folder #175

Closed
RonanKMcGovern opened this issue May 20, 2024 · 5 comments

Comments

@RonanKMcGovern

No description provided.

@RonanKMcGovern changed the title from "datatrove" is missing from the examples folder (although still mentioned on the REAd to "datatrove" is missing from the examples folder on May 20, 2024
@martinjaggi

Yes, this should be documented more clearly.

For datatrove, you can use the new nanosets to load large pretraining datasets, and tokenize them using datatrove.

@justHungryMan

Hi, could you provide an example of using datatrove with nanosets?

@martinjaggi

@TJ-Solergibert

@TJ-Solergibert
Contributor

TJ-Solergibert commented Jul 25, 2024

Hi @justHungryMan & @RonanKMcGovern! In #189 I changed the supported tokenization mechanism from the Nanoset tokenizer tool that I developed (which is in main) to one based on datatrove. I recommend looking at #189 for the full changes, but here is a short summary:

  • Nanosets now support documents tokenized with DocumentTokenizer from datatrove, through DatatroveFolderDataset.
    • Added the datatrove[io,processing] dependency to the nanosets flavour
    • Refactored tools/preprocess_data.py to tokenize documents with datatrove
    • Updated docs/nanoset.md
  • There is one slight change to the config file: dataset_path --> dataset_folder
  • 🚨 The last commit installs datatrove from source 🚨 (from the project folder, run pip install -e '.[nanosets]')

#189 should get merged soon, but in the meantime you can check swiss-ai/nanotron, where we have merged the same PR (swiss-ai#6).
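
For anyone updating an existing config after the dataset_path --> dataset_folder rename mentioned in the summary, a small helper along these lines might save some manual editing. This is only a sketch: the function name `rename_dataset_key` and the nested layout of `example_config` are hypothetical and are not taken from #189. It assumes the config has already been loaded into plain Python dicts and lists.

```python
OLD_KEY = "dataset_path"    # key name before the change
NEW_KEY = "dataset_folder"  # key name after the change


def rename_dataset_key(config):
    """Return a copy of `config` with OLD_KEY renamed to NEW_KEY at any depth.

    Hypothetical helper: works on already-parsed config data (dicts, lists,
    scalars) and leaves the input object untouched.
    """
    if isinstance(config, dict):
        if OLD_KEY in config and NEW_KEY in config:
            # Refuse to guess which value should win.
            raise ValueError(f"both {OLD_KEY!r} and {NEW_KEY!r} are present")
        return {
            (NEW_KEY if key == OLD_KEY else key): rename_dataset_key(value)
            for key, value in config.items()
        }
    if isinstance(config, list):
        return [rename_dataset_key(item) for item in config]
    return config


# Hypothetical config layout, used only to exercise the helper.
example_config = {
    "data": {"dataset": {"dataset_path": "datasets/tokenized"}},
}
migrated_config = rename_dataset_key(example_config)
```

The helper raises instead of silently overwriting when a mapping already contains both keys, since it cannot know which value is the intended one.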

@martinjaggi

It was merged

@xrsrke xrsrke closed this as completed Aug 2, 2024