```yaml
training_data:
  - name: "<corpus_name>=<split>:<weight>"
    source_prefix_text: "Beginning of source."  # concept added at the beginning of source
    source_suffix_text: "End of source."        # concept added at the end of source
    target_prefix_text: "Beginning of target."  # concept added at the beginning of target (supervised data only)
    target_suffix_text: "End of target."        # concept added at the end of target (supervised data only)
  - name: "<corpus2_name>=<split>:<weight2>"
```
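Conceptually, the prefix and suffix texts are wrapped around each document's sentence list as extra concepts before encoding. A minimal sketch of that idea, assuming the affixes are simply prepended/appended as sentences (the helper below is hypothetical, not the trainer's actual code):

```python
def add_affix_concepts(
    source_sentences: list[str],
    target_sentences: list[str] | None = None,
) -> tuple[list[str], list[str] | None]:
    """Illustrative only: wrap a document with the prefix/suffix concepts configured above."""
    source = ["Beginning of source."] + source_sentences + ["End of source."]
    target = None
    if target_sentences is not None:  # target affixes apply to supervised data only
        target = ["Beginning of target."] + target_sentences + ["End of target."]
    return source, target
```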
```yaml
data_loading_config:
  max_tokens: 7168           # Exclusive with batch_size
  batch_size: none           # Exclusive with max_tokens
  len_to_wrap_long_seq: 128  # Sequences longer than this will be wrapped.
  packing: true              # If true, documents in the batch will be packed.
```
The batch content can be defined in several ways: `max_tokens` / `len_to_wrap_long_seq` gives the approximate `batch_size`, and `batch_size` x `len_to_wrap_long_seq` gives the approximate `max_tokens`. Note that `len_to_wrap_long_seq` has to be smaller than the model's `max_seq_len` defined in the architecture (e.g. `two_tower_diffusion_lcm_1_6B`).
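As a back-of-the-envelope check with the values above (illustrative arithmetic only, not the actual batching code):

```python
max_tokens = 7168
len_to_wrap_long_seq = 128

# Wrapping caps every sequence at 128, so each batch holds roughly this many sequences:
approx_batch_size = max_tokens // len_to_wrap_long_seq   # 56

# Conversely, fixing batch_size instead of max_tokens:
batch_size = 56
approx_max_tokens = batch_size * len_to_wrap_long_seq    # 7168
```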
To filter out long samples without wrapping, you can add `filters` to each dataset config to filter based on the length of the document's list of sentences (`text_sentences`):

```yaml
- name: "<corpus_name>=<split>:<weight>"
  source_prefix_text: "Beginning of source."
  filters: 'pa.compute.less(pa.compute.list_value_length(pa.dataset.field("text_sentences")), 128)'
```
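To sanity-check such an expression outside of training, it can be evaluated against a toy pyarrow table; the table below is made up for illustration, only the filter expression itself comes from the config:

```python
import pyarrow as pa
import pyarrow.compute   # makes pa.compute available
import pyarrow.dataset   # makes pa.dataset available

# Toy table: one short document and one document with 200 sentences.
table = pa.table({
    "text_sentences": [
        ["First sentence.", "Second sentence."],
        ["A sentence."] * 200,
    ]
})

# Same expression as the YAML `filters` entry: keep documents with fewer than 128 sentences.
expr = pa.compute.less(pa.compute.list_value_length(pa.dataset.field("text_sentences")), 128)
filtered = pa.dataset.dataset(table).to_table(filter=expr)
print(filtered.num_rows)  # 1
```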
```yaml
checkpoint_every_n_steps: 2_000      # save a training checkpoint every N steps
keep_last_n_checkpoints: 2           # delete all but the last N non-consolidated checkpoints
save_model_every_n_steps: 10_000     # consolidate the model every N steps (valid if using FSDP)
preserve_consolidated_models: True   # preserve the consolidated checkpoints
```
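For illustration, `keep_last_n_checkpoints: 2` corresponds to a retention policy that removes all but the most recent N checkpoint directories. A hedged sketch of that policy (the `step_<N>` directory naming is an assumption, not necessarily the trainer's layout):

```python
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep_last_n: int = 2) -> None:
    """Illustrative only: delete all but the last N non-consolidated checkpoint directories."""
    step_dirs = sorted(
        (p for p in Path(ckpt_dir).glob("step_*") if p.is_dir()),
        key=lambda p: int(p.name.split("_")[-1]),
    )
    for old in step_dirs[:-keep_last_n]:
        shutil.rmtree(old)
```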