```yaml
training_data:
  - name: "<corpus_name>=<split>:<weight>"
    source_prefix_text: "Beginning of source."  # concept added at the beginning of source
    source_suffix_text: "End of source."        # concept added at the end of source
    target_prefix_text: "Beginning of target."  # concept added at the beginning of target (supervised data only)
    target_suffix_text: "End of target."        # concept added at the end of target (supervised data only)
  - name: "<corpus2_name>=<split>:<weight2>"
```
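Conceptually, the prefix and suffix texts are wrapped around each document's sentence list as extra concepts before encoding. A minimal sketch of that idea, assuming the affixes are simply prepended/appended as sentences (the helper below is hypothetical, not the trainer's actual code):

```python
def add_affix_concepts(
    source_sentences: list[str],
    target_sentences: list[str] | None = None,
) -> tuple[list[str], list[str] | None]:
    """Illustrative only: wrap a document with the prefix/suffix concepts configured above."""
    source = ["Beginning of source."] + source_sentences + ["End of source."]
    target = None
    if target_sentences is not None:  # target affixes apply to supervised data only
        target = ["Beginning of target."] + target_sentences + ["End of target."]
    return source, target
```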
```yaml
data_loading_config:
  max_tokens: 7168           # Exclusive with batch_size
  batch_size: none           # Exclusive with max_tokens
  len_to_wrap_long_seq: 128  # Sequences longer than this will be wrapped.
  packing: true              # If true, documents in the batch will be packed.
```
The batch content can be defined in several ways: `max_tokens` / `len_to_wrap_long_seq` gives the approximate `batch_size`, and `batch_size` x `len_to_wrap_long_seq` gives the approximate `max_tokens`. Note that `len_to_wrap_long_seq` has to be smaller than the model's `max_seq_len` defined in the architecture (e.g. `two_tower_diffusion_lcm_1_6B`).
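As a back-of-the-envelope check with the values above (illustrative arithmetic only, not the actual batching code):

```python
max_tokens = 7168
len_to_wrap_long_seq = 128

# Wrapping caps every sequence at 128, so each batch holds roughly this many sequences:
approx_batch_size = max_tokens // len_to_wrap_long_seq   # 56

# Conversely, fixing batch_size instead of max_tokens:
batch_size = 56
approx_max_tokens = batch_size * len_to_wrap_long_seq    # 7168
```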
To filter out long samples without wrapping, you can add `filters` to each dataset config to filter based on the length of the document's list of sentences (`text_sentences`):

```yaml
- name: "<corpus_name>=<split>:<weight>"
  source_prefix_text: "Beginning of source."
  filters: 'pa.compute.less(pa.compute.list_value_length(pa.dataset.field("text_sentences")), 128)'
```
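To sanity-check such an expression outside of training, it can be evaluated against a toy pyarrow table; the table below is made up for illustration, only the filter expression itself comes from the config:

```python
import pyarrow as pa
import pyarrow.compute   # makes pa.compute available
import pyarrow.dataset   # makes pa.dataset available

# Toy table: one short document and one document with 200 sentences.
table = pa.table({
    "text_sentences": [
        ["First sentence.", "Second sentence."],
        ["A sentence."] * 200,
    ]
})

# Same expression as the YAML `filters` entry: keep documents with fewer than 128 sentences.
expr = pa.compute.less(pa.compute.list_value_length(pa.dataset.field("text_sentences")), 128)
filtered = pa.dataset.dataset(table).to_table(filter=expr)
print(filtered.num_rows)  # 1
```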
```yaml
checkpoint_every_n_steps: 2_000      # save a training checkpoint every N steps
keep_last_n_checkpoints: 2           # delete all but the last N non-consolidated checkpoints
save_model_every_n_steps: 10_000     # consolidate the model every N steps (valid if using FSDP)
preserve_consolidated_models: True   # preserve the consolidated checkpoints
```
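For illustration, `keep_last_n_checkpoints: 2` corresponds to a retention policy that removes all but the most recent N checkpoint directories. A hedged sketch of that policy (the `step_<N>` directory naming is an assumption, not necessarily the trainer's layout):

```python
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep_last_n: int = 2) -> None:
    """Illustrative only: delete all but the last N non-consolidated checkpoint directories."""
    step_dirs = sorted(
        (p for p in Path(ckpt_dir).glob("step_*") if p.is_dir()),
        key=lambda p: int(p.name.split("_")[-1]),
    )
    for old in step_dirs[:-keep_last_n]:
        shutil.rmtree(old)
```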