Nanosets

Nanotron incorporates Nanosets, a dataset implementation for working with documents tokenized with datatrove. Nanosets allow reading tokens from one or multiple datasets and even specifying the weight of each dataset when building batches.

Install

To use Nanosets, it's necessary to install Nanotron with the nanosets flavor.

pip install nanotron[nanosets]

This will install the following dependencies:

  • datatrove: To preprocess the datasets
  • numba: To compile helper functions in order to speed up the creation of Nanosets
  • transformers: For the tokenizers

Data pre-processing

To use this dataset, we first need to preprocess the data using datatrove's DocumentTokenizer pipeline. We invite you to take a look at datatrove, since it contains multiple features that allow, for example, filtering out documents based on specific rules/criteria, extracting text content from raw formats or scheduling the preprocessing on a Slurm cluster. We have also added a simple script capable of tokenizing datasets.

The preprocessing is done using the tools/preprocess_data.py script. The input format can either be a Hugging Face Dataset, a path to a .jsonl file or a path to a folder containing multiple .jsonl files. Below we show an example of processing a Hugging Face Dataset from the Hub with the Llama3 tokenizer.

python3 tools/preprocess_data.py \
       --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
       --output-folder datasets/emotion \
       --n-tasks 16 \
       hf \
       --dataset dair-ai/emotion

First, with --tokenizer-name-or-path we specify a tokenizer in the same way as we do when using AutoTokenizer.from_pretrained(...). Then we specify the --output-folder where we will store the tokenized documents and the number of workers with --n-tasks. Finally, we indicate the type of dataset (whether it's a Hugging Face Dataset ["hf"] or in jsonl ["jsonl"] format) and the dataset that we want to preprocess. Check the different settings with python3 tools/preprocess_data.py --help, python3 tools/preprocess_data.py hf --help & python3 tools/preprocess_data.py jsonl --help.

Every worker will store in --output-folder 3 different kinds of files:

  • *.ds Containing the tokenized documents
  • *.ds.index Containing the bounds of each tokenized document
  • *.ds.metadata Containing the number of tokens and tokenizer used
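These files are what datatrove's DatatroveFolderDataset (the class Nanosets build on, see Under the hood) reads back. A minimal sketch of loading the tokenized folder, assuming the constructor takes the folder path, a sequence length and the token byte size (argument names may differ across datatrove versions, so treat this purely as an illustration):

from datatrove.utils.dataset import DatatroveFolderDataset

# Assumptions: the folder is the --output-folder from the example above, and
# token_size=4 because the Llama3 vocabulary does not fit in 2-byte tokens.
dataset = DatatroveFolderDataset(
    folder_path="datasets/emotion",
    seq_len=1024,
    token_size=4,
)
print(len(dataset))    # number of samples that can be built from the folder
print(dataset[0])      # first sample of tokenized data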

Important

Remember to specify the type of dataset to process, e.g. python3 tools/preprocess_data.py --tokenizer-name-or-path gpt2 --n-tasks 16 jsonl --dataset raw_datasets/c4-es-json-files

Working with Nanosets

To work with Nanosets, we just need to configure 1 argument:

  1. dataset_folder: This argument specifies the file or files that will compose the Nanoset. There are 3 ways to specify it:
    1. If we specify a single path, we will create a Nanoset from a single dataset file.
    data_stages:
      - name: General purpose training (Single dataset)
        start_training_step: 1
        data:
          dataset:
            dataset_folder: datasets/SlimPajama-6B
          num_loading_workers: 0
          seed: 1234
    2. If we specify a list of paths, we will create a Nanoset from all the dataset files. In every epoch we will consume each and every sample from each dataset randomly.
    data_stages:
      - name: Second purpose training (> 1 dataset)
        start_training_step: 15
        data:
          dataset:
            dataset_folder:
            - datasets/SlimPajama-6B
            - datasets/testing_alpaca_small
          num_loading_workers: 0
          seed: 1234
    3. If we specify a dictionary with paths and weights, we will create a Nanoset from the dataset files where each epoch will have a number of samples from each dataset according to the specified weights.
    data_stages:
      - name: Third purpose training (Blended dataset)
        start_training_step: 25
        data:
          dataset:
            dataset_folder:
              datasets/SlimPajama-6B: 0.8
              datasets/testing_alpaca_small: 0.2
          num_loading_workers: 0
          seed: 1234
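All three forms describe the same thing: a list of dataset folders plus a weight per folder. A minimal sketch of how such a value can be normalized, purely as an illustration of the semantics above (not Nanotron's actual config parsing):

def normalize_dataset_folder(dataset_folder):
    # Single path: one dataset that receives all samples.
    if isinstance(dataset_folder, str):
        return [dataset_folder], [1.0]
    # List of paths: no explicit weights; every sample of every dataset is
    # consumed once per epoch, so the effective weights follow dataset sizes.
    if isinstance(dataset_folder, list):
        return list(dataset_folder), None
    # Dict of path -> weight: keep the order and renormalize the weights.
    folders, weights = zip(*dataset_folder.items())
    total = sum(weights)
    return list(folders), [w / total for w in weights]

print(normalize_dataset_folder({"datasets/SlimPajama-6B": 0.8,
                                "datasets/testing_alpaca_small": 0.2}))
# (['datasets/SlimPajama-6B', 'datasets/testing_alpaca_small'], [0.8, 0.2])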

Important

Remember to set the tokenizer.tokenizer_name_or_path in the config file to the tokenizer used to preprocess the documents and set the model.model_config.vocab_size accordingly.
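A quick way to check which vocab_size to use is to load the same tokenizer that was passed to tools/preprocess_data.py, for example with transformers (a small sketch using the Llama3 checkpoint from the example above):

from transformers import AutoTokenizer

# Same tokenizer that was used to preprocess the documents.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# len(tokenizer) also counts added special tokens, so model.model_config.vocab_size
# must be at least this large.
print(tokenizer.vocab_size, len(tokenizer))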

Finally, to use the Nanosets, launch the training with run_train.py.

torchrun --nproc-per-node 1 run_train.py --config examples/config_nanoset.yaml

Under the hood

Nanosets are responsible for building samples of sequence length + 1 tokens from the preprocessed dataset files. Although most of the extraction logic lies in DatatroveFolderDataset, Nanosets take care of the following:

  1. Creating dataset mixtures from different dataset folder paths
  2. Ensuring that in each epoch we consume each sample only once
  3. Ensuring that we never exhaust the DataLoader

Based on the dataset lengths, the dataset weights and the number of samples per epoch (defined as sum(dataset lengths)), we build the two indexes we need in order to extract samples from the Nanoset (build_nanoset_index_helper):

  • dataset index: Contains the index of the dataset from the list of dataset paths from which to extract the sample, respecting the established dataset weight.
Given:

D = [d0, d1, d2, d3]        # datasets
DL = [8, 2, 5, 5]           # dataset lengths
W = [0.1, 0.5, 0.3, 0.1]    # dataset weights
SPE = 20                    # number of samples per epoch

Then, for example:

dataset_index = [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
  • dataset sample index: Contains, for each position, the index of the sample to extract from the dataset selected by dataset_index, always < len(dataset).
dataset_index =         [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index =  [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]
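One way to build such indexes is a greedy blending loop: at each step, pick the dataset lagging furthest behind its target weight. A minimal pure-numpy sketch (the actual build_nanoset_index_helper is numba-compiled and may differ in details) that reproduces the example above:

import numpy as np

def build_index_sketch(n_samples, weights, dataset_lengths):
    # Pick, at each step, the dataset furthest below its target weight;
    # sample indices wrap around each dataset's length.
    weights = np.asarray(weights, dtype=np.float64)
    dataset_index = np.empty(n_samples, dtype=np.int64)
    dataset_sample_index = np.empty(n_samples, dtype=np.int64)
    current_samples = np.zeros(len(weights), dtype=np.int64)
    for i in range(n_samples):
        errors = weights * max(i, 1) - current_samples
        chosen = int(np.argmax(errors))
        dataset_index[i] = chosen
        dataset_sample_index[i] = current_samples[chosen] % dataset_lengths[chosen]
        current_samples[chosen] += 1
    return dataset_index, dataset_sample_index

di, dsi = build_index_sketch(20, [0.1, 0.5, 0.3, 0.1], [8, 2, 5, 5])
print(di.tolist())   # [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
print(dsi.tolist())  # [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]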

Then, we shuffle both indexes with the same permutation and concatenate them number-of-epochs times, where the number of epochs is defined as train split num samples / number of samples per epoch.

Given:

N = 70                      # train split num samples

dataset_index =         [1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1]
dataset_sample_index =  [0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1]

Shuffle dataset_index and dataset_sample_index:

dataset_index =         [1, 1, 0, 2, 3, 1, 3, 1, 2, 2, 1, 1, 0, 1, 1, 2, 1, 2, 2, 1]
dataset_sample_index =  [1, 0, 0, 4, 1, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 1, 3, 1, 1]

n_concatenations = (N // SPE) + 1 = (70 // 20) + 1 = 4
dataset_index = dataset_index concatenated 4 times
dataset_sample_index = dataset_sample_index concatenated 4 times

dataset_index = dataset_index[: N]
dataset_sample_index = dataset_sample_index[: N]
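A minimal sketch of this shuffle-and-concatenate step, using the indexes from the listing above (the permutation is random, so the shuffled values will differ from the listing):

import numpy as np

N = 70    # train split num samples
di = np.array([1, 2, 0, 1, 3, 1, 2, 1, 2, 1, 0, 1, 2, 1, 3, 1, 2, 1, 2, 1])
dsi = np.array([0, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1, 1, 3, 0, 1, 1, 4, 0, 0, 1])
SPE = len(di)    # samples per epoch, 20 in the example

# Shuffle both indexes with the same permutation so the pairs stay aligned.
rng = np.random.default_rng(seed=1234)
perm = rng.permutation(SPE)
di, dsi = di[perm], dsi[perm]

# Concatenate enough epochs to cover N samples, then truncate to exactly N.
n_concatenations = N // SPE + 1
dataset_index = np.concatenate([di] * n_concatenations)[:N]
dataset_sample_index = np.concatenate([dsi] * n_concatenations)[:N]
assert len(dataset_index) == N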

To query the Nanoset for the k-th sample we do the following:

  • Use the dataset_index to retrieve the corresponding dataset from D and the dataset_sample_index to retrieve the corresponding sample from that dataset.
sample = D[dataset_index[k]][dataset_sample_index[k]]
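Putting it all together, a small illustrative sketch of the lookup (this mirrors the expression above rather than Nanotron's exact implementation; D stands for the list of per-folder datasets built from dataset_folder, and the two index arrays are the ones built and shuffled as described):

class NanosetSketch:
    # Illustrative only: glues together the pieces described in this section.
    def __init__(self, datasets, dataset_index, dataset_sample_index):
        self.datasets = datasets                      # one per dataset folder
        self.dataset_index = dataset_index
        self.dataset_sample_index = dataset_sample_index

    def __len__(self):
        return len(self.dataset_index)                # train split num samples

    def __getitem__(self, k):
        dataset = self.datasets[self.dataset_index[k]]
        return dataset[self.dataset_sample_index[k]]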