How to look into the processed data? #266

Open · shizhediao opened this issue Aug 16, 2024 · 3 comments
shizhediao commented Aug 16, 2024

Hi,

After running tokenize_from_hf_to_s3.py, I would like to inspect the resulting data, but it is stored in a binary file (.ds). Is there a way to look into the data?

Thanks!

@RicardoDominguez

The following works for me:

import numpy as np

from datatrove.pipeline.tokens.merger import load_doc_ends, get_data_reader

def read_tokenized_data(data_file):
    # the .ds.index file stores the cumulative end offset of each document
    with open(f"{data_file}.index", 'rb') as f:
        doc_ends = load_doc_ends(f)

    # nb_bytes=2 because tokens are stored as 2-byte (uint16) ids
    reader = get_data_reader(open(data_file, 'rb'), doc_ends, nb_bytes=2)
    decode = lambda x: np.frombuffer(x, dtype=np.uint16).astype(int)
    return map(decode, reader)  # yields one array of token ids per document


from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# print the first 5 documents
data_file = 'test/000_test.ds'
for i, input_ids in enumerate(read_tokenized_data(data_file)):
    if i == 5:
        break
    print(len(input_ids))
    print(tokenizer.decode(input_ids))
    print('\n-------------------\n')
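Building on the snippet above, if you only want one specific document you don't have to iterate from the start. Assuming the .ds file is just a flat stream of fixed-width token ids (uint16 when nb_bytes=2) and the .index file holds each document's cumulative end offset in tokens, you can memory-map the file and slice directly. A minimal sketch (doc_idx is just an illustrative value):

import numpy as np
from datatrove.pipeline.tokens.merger import load_doc_ends

data_file = 'test/000_test.ds'
with open(f"{data_file}.index", 'rb') as f:
    doc_ends = load_doc_ends(f)  # cumulative document end offsets, in tokens

tokens = np.memmap(data_file, dtype=np.uint16, mode='r')  # assumed flat uint16 token stream

doc_idx = 3  # illustrative document index, must be < len(doc_ends)
start = 0 if doc_idx == 0 else doc_ends[doc_idx - 1]
input_ids = np.asarray(tokens[start:doc_ends[doc_idx]], dtype=int)
print(len(input_ids))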

RicardoDominguez commented Aug 29, 2024

Alternatively, you could use DatatroveFileDataset from datatrove.utils.dataset.

from datatrove.utils.dataset import DatatroveFileDataset

path = 'test/test_tokenized_00000_00000_shuffled.ds'

dataset = DatatroveFileDataset(
    file_path=path,
    seq_len=2048,   # length of the sequences returned by the dataset
    token_size=2,   # 2 bytes per token (uint16), must match how the data was tokenized
)

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

for batch in dataset:
    input_ids = batch['input_ids'].numpy()
    print(tokenizer.decode(input_ids))
    break
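If DatatroveFileDataset behaves like a standard map-style PyTorch dataset (my assumption here), you can also wrap it in a DataLoader to inspect several fixed-length sequences at once. A minimal sketch, reusing dataset and tokenizer from the snippet above:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=4)  # `dataset` from the snippet above

batch = next(iter(loader))
print(batch['input_ids'].shape)  # (batch_size, tokens per sequence)
for input_ids in batch['input_ids']:
    print(tokenizer.decode(input_ids.numpy())[:200])  # first 200 characters of each sequence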


shizhediao commented Aug 29, 2024

Thank you so much! I will give it a try.
