## Main ingredients of training recipes

### Training and validation data

```yaml
training_data:
  - name: "<corpus_name>=<split>:<weight>"
    source_prefix_text: "Beginning of source."  # concept added at the beginning of the source
    source_suffix_text: "End of source."        # concept added at the end of the source
    target_prefix_text: "Beginning of target."  # concept added at the beginning of the target (supervised data only)
    target_suffix_text: "End of target."        # concept added at the end of the target (supervised data only)

  - name: "<corpus2_name>=<split>:<weight2>"
```

### Data loading config

```yaml
data_loading_config:
  max_tokens: 7168          # exclusive with batch_size
  batch_size: null          # exclusive with max_tokens
  len_to_wrap_long_seq: 128 # sequences longer than this will be wrapped
  packing: true             # if true, documents in the batch will be packed
```

The batch content can be defined in either of two equivalent ways (see the arithmetic sketch below):

- `max_tokens` / `len_to_wrap_long_seq` approximates `batch_size`.
- `batch_size` × `len_to_wrap_long_seq` approximates `max_tokens`.
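A minimal arithmetic sketch with the values from the config above (illustrative only; the loader's exact packing may differ):

```python
max_tokens = 7168
len_to_wrap_long_seq = 128

# Setting max_tokens implies an approximate batch size:
approx_batch_size = max_tokens // len_to_wrap_long_seq
print(approx_batch_size)   # 56

# Conversely, fixing batch_size implies an approximate token budget:
batch_size = 56
approx_max_tokens = batch_size * len_to_wrap_long_seq
print(approx_max_tokens)   # 7168
```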

Note that `len_to_wrap_long_seq` must be smaller than the model's `max_seq_len` defined in the architecture (e.g. `two_tower_diffusion_lcm_1_6B`).

To filter out long samples instead of wrapping them, you can add a `filters` expression to each dataset config that tests the length of the document's list of sentences (`text_sentences`):

```yaml
  - name: "<corpus_name>=<split>:<weight>"
    source_prefix_text: "Beginning of source."
    filters: 'pa.compute.less(pa.compute.list_value_length(pa.dataset.field("text_sentences")), 128)'
```
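Assuming the `filters` string is evaluated into a pyarrow expression, the following self-contained toy example (made-up data, not part of the recipe) shows what that predicate selects:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Toy documents: each row holds a list of sentences.
table = pa.table({
    "text_sentences": [
        ["Short doc."],
        ["sentence"] * 200,   # 200 sentences: longer than the 128 cutoff
        ["a.", "b.", "c."],
    ]
})

# The same predicate as the `filters` string above.
keep_short = pc.less(pc.list_value_length(ds.field("text_sentences")), 128)

filtered = ds.dataset(table).to_table(filter=keep_short)
print(filtered.num_rows)  # 2 -- the 200-sentence document is dropped
```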

### Checkpointing config

```yaml
checkpoint_every_n_steps: 2_000     # save a checkpoint every N steps
keep_last_n_checkpoints: 2          # delete all but the last N non-consolidated checkpoints
save_model_every_n_steps: 10_000    # consolidate the model every N steps (only relevant when using FSDP)
preserve_consolidated_models: True  # preserve the consolidated checkpoints when rotating old ones
```
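A simplified simulation of the resulting schedule (illustrative only; this is not the trainer's actual checkpoint manager):

```python
checkpoint_every_n_steps = 2_000
keep_last_n_checkpoints = 2
save_model_every_n_steps = 10_000

non_consolidated = []  # steps whose sharded checkpoint is still on disk
consolidated = []      # steps with a consolidated model

for step in range(1, 20_001):
    if step % checkpoint_every_n_steps == 0:
        non_consolidated.append(step)
        # Keep only the last N non-consolidated checkpoints.
        non_consolidated = non_consolidated[-keep_last_n_checkpoints:]
    if step % save_model_every_n_steps == 0:
        # Preserved independently when preserve_consolidated_models is True.
        consolidated.append(step)

print(non_consolidated)  # [18000, 20000]
print(consolidated)      # [10000, 20000]
```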