NathanHB committed Sep 18, 2024
1 parent 3aba2a1 commit af1ad13
Showing 3 changed files with 133 additions and 74 deletions.
2 changes: 2 additions & 0 deletions docs/source/installation.md
@@ -32,6 +32,8 @@ appropriate extras group.
| tensorboardX | To upload your results to tensorboard |
| vllm | To use vllm as backend for inference |
| s3 | To upload results to s3 |


## Hugging Face login

If you want to push your results to the Hugging Face Hub or evaluate your own
142 changes: 69 additions & 73 deletions docs/source/metric_list.md
@@ -1,78 +1,74 @@
# Metrics

## Metrics for multiple choice tasks
These metrics use the log-likelihood of the different possible targets; a minimal sketch of the two accuracy variants is given at the end of this section.
- `loglikelihood_acc` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_single_token`)
- `loglikelihood_acc_norm` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_norm_single_token`)
- `loglikelihood_acc_norm_nospace` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
- `loglikelihood_f1` (Harness): Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_f1_single_token`)
- `mcc` (Harness): Matthews correlation coefficient, a measure of agreement between statistical distributions.
- `recall_at_1` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_1_single_token`)
- `recall_at_2` (Harness): Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_2_single_token`)
- `mrr` (Harness): Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (`mrr_single_token`)
- `target_perplexity` (Harness): Perplexity of the different choices available.
- `acc_golds_likelihood` (Harness): A bit different: it checks whether the average logprob of a single target is above or below 0.5.
- `multi_f1_numeric`: Loglikelihood F1 score for multiple gold targets.

All these metrics also exist in a "single token" version (`loglikelihood_acc_single_token`, `loglikelihood_acc_norm_single_token`, `loglikelihood_f1_single_token`, `mcc_single_token`, `recall_at_1_single_token`, `recall_at_2_single_token` and `mrr_single_token`). When each multichoice option is a single token (e.g. "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using the single token version divides the time spent by the number of choices. Single token evals also include:
- `multi_f1_numeric` (Harness, for CB): computes the F1 score of all possible choices and averages it.
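
As an illustration of the two accuracy variants above, here is a minimal sketch that scores one instance from per-choice log-probabilities; the function name and inputs are illustrative, not lighteval's internal API:

```python
def loglikelihood_acc(
    choice_logprobs: list[float],
    choice_lengths: list[int],
    gold_index: int,
    normalize: bool = False,
) -> int:
    """Return 1 if the best-scoring choice is the gold one, else 0.

    With normalize=True the logprobs are divided by the choice length,
    which corresponds to the `loglikelihood_acc_norm` variant.
    """
    scores = [
        lp / length if normalize else lp
        for lp, length in zip(choice_logprobs, choice_lengths)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return int(best == gold_index)


# Choice 1 has the best raw logprob, so the prediction counts as correct.
print(loglikelihood_acc([-12.0, -3.5, -9.1, -7.8], [4, 3, 5, 4], gold_index=1))  # 1
```

The single token variants follow the same logic, but since every choice is a single next-token candidate, all choices can be scored from one forward pass over the context, which is where the speed-up comes from.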

## Metrics for perplexity and language modeling
These metrics use the log-likelihood of the prompt; a sketch of how they are derived from the summed log-probability follows the list below.
- `word_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of words of the sequence.
- `byte_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
- `bits_per_byte` (HELM): Average number of bits per byte according to model probabilities.
- `log_prob` (HELM): Predicted output's average log probability (input's log prob for language modeling).
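
As a rough sketch, assuming natural-log probabilities, whitespace-based word counts and UTF-8 byte counts (the exact weighting lighteval uses may differ in detail), these quantities can be derived from the summed log-probability of a sequence as follows:

```python
import math


def perplexity_metrics(sum_logprob: float, text: str) -> dict:
    """Derive weighted perplexities and bits-per-byte from a summed log-probability (in nats)."""
    n_words = len(text.split())           # weight by number of words
    n_bytes = len(text.encode("utf-8"))   # weight by number of bytes
    return {
        "word_perplexity": math.exp(-sum_logprob / n_words),
        "byte_perplexity": math.exp(-sum_logprob / n_bytes),
        "bits_per_byte": -sum_logprob / (n_bytes * math.log(2)),  # nats -> bits, per byte
    }


print(perplexity_metrics(-12.3, "the cat sat on the mat"))
```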

## Metrics for generative tasks
These metrics need the model to generate an output. They are therefore slower.
- Base:
- `perfect_exact_match` (Harness): Fraction of instances where the prediction matches the gold exactly.
- `exact_match` (HELM): Fraction of instances where the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
- `quasi_exact_match` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...). Other variations exist, with other normalizers, such as `quasi_exact_match_triviaqa`, which only normalizes the predictions after applying a strip to all sentences.
- `prefix_exact_match` (HELM): Fraction of instances where the beginning of the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
- `prefix_quasi_exact_match` (HELM): Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...)
- `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
- `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
- `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation
- `f1_score_macro`: Corpus level macro F1 score
- `f1_score_micro`: Corpus level micro F1 score
- `maj_at_5` and `maj_at_8`: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent one is the actual prediction (see the sketch after this list).
- Summarization:
- `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
- `rouge1` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 1-gram overlap.
- `rouge2` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 2-gram overlap.
- `rougeL` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
- `rougeLsum` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap, computed at the summary level (text split on newlines).
- `rouge_t5` (BigBench): Corpus level ROUGE score for all available ROUGE metrics
- `faithfulness` (HELM): Faithfulness scores based on the SummaC method of [Laban et al. (2022)](https://aclanthology.org/2022.tacl-1.10/).
- `extractiveness` (HELM): Reports the following, based on [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/):
- `summarization_coverage`: Extent to which the model-generated summaries are extractive fragments from the source document,
- `summarization_density`: Extent to which the model-generated summaries are extractive summaries based on the source document,
- `summarization_compression`: Extent to which the model-generated summaries are compressed relative to the source document.
- `bert_score` (HELM): Reports the average BERTScore precision, recall, and f1 score [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and gold summary.
- Translation:
- `bleu`: Corpus level BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) - uses the sacrebleu implementation.
- `bleu_1` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 1-gram overlap - uses the nltk implementation.
- `bleu_4` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 4-gram overlap - uses the nltk implementation.
- `chrf` (Harness): Character n-gram match F-score.
- `ter` (Harness): Translation edit/error rate.
- Copyright:
- `copyright` (HELM): Reports:
- `longest_common_prefix_length`: average length of longest common prefix between model generation and reference,
- `edit_distance`: average Levenshtein edit distance between model generation and reference,
- `edit_similarity`: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
- Math:
- `quasi_exact_match_math` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where LaTeX symbols, units, etc. are removed).
- `maj_at_4_math` (Lighteval): Majority choice evaluation, using the math normalisation for the predictions and gold.
- `quasi_exact_match_gsm8k` (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where LaTeX symbols, units, etc. are removed).
- `maj_at_8_gsm8k` (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold.
- LLM-as-Judge:
- `llm_judge_gpt3p5`: Can be used for any generative task; the model will be scored by a GPT-3.5 model using the OpenAI API.
- `llm_judge_llama_3_405b`: Can be used for any generative task; the model will be scored by a Llama 3 405B model using the OpenAI API.
- `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task; the model will be scored by a GPT-3.5 model using the OpenAI API. It is used for multi-turn tasks like MT-Bench.
- `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task; the model will be scored by a Llama 3 405B model using the OpenAI API. It is used for multi-turn tasks like MT-Bench.
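
As a minimal illustration of the exact-match and majority-vote families above, the following sketch uses a simplified normalizer; it is a stand-in for lighteval's actual normalization functions, and the helper names are illustrative:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Simplified normalization: lowercase, drop articles, punctuation and extra whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def quasi_exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))


def maj_at_n(predictions: list[str], gold: str) -> int:
    """Majority vote over n sampled generations, scored against the gold."""
    majority, _ = Counter(normalize(p) for p in predictions).most_common(1)[0]
    return int(majority == normalize(gold))


print(quasi_exact_match("The answer is 42.", "the answer is 42"))  # 1
print(maj_at_n(["42", "41", "42.", "forty two", "42"], "42"))      # 1
```
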
63 changes: 62 additions & 1 deletion docs/source/quicktour.md
@@ -60,6 +60,67 @@ accelerate launch --multi_gpu --num_processes=8 -m \
Here, `--override_batch_size` defines the batch size per device, so the effective
batch size will be `override_batch_size * num_gpus`.
### Model Arguments

The `--model_args` argument takes a string representing a list of model
arguments. The arguments allowed vary depending on the backend you use (vllm or
accelerate); an illustrative example follows each backend's list below.
#### Accelerate
- **pretrained** (str):
HuggingFace Hub model ID name or the path to a pre-trained
model to load. This is effectively the `pretrained_model_name_or_path`
argument of `from_pretrained` in the HuggingFace `transformers` API.
- **tokenizer** (Optional[str]): HuggingFace Hub tokenizer ID that will be
used for tokenization.
- **multichoice_continuations_start_space** (Optional[bool]): Whether to add a
space at the start of each continuation in multichoice generation.
For example, with the context "What is the capital of France?" and the choices "Paris" and "London",
the inputs are tokenized as "What is the capital of France? Paris" and "What is the capital of France? London".
`True` adds a space, `False` strips a space, `None` does nothing.
- **subfolder** (Optional[str]): The subfolder within the model repository.
- **revision** (str): The revision of the model.
- **max_gen_toks** (Optional[int]): The maximum number of tokens to generate.
- **max_length** (Optional[int]): The maximum length of the generated output.
- **add_special_tokens** (bool, optional, defaults to True): Whether to add special tokens to the input sequences.
If `None`, the default value will be set to `True` for seq2seq models (e.g. T5) and
`False` for causal models.
- **model_parallel** (Optional[bool], defaults to None):
Whether to use the `accelerate` library to split a large model across
multiple devices.
If `None`, the number of processes is compared with the number of GPUs:
model parallelism is used when there are fewer processes than GPUs, and not used otherwise.
- **dtype** (Union[str, torch.dtype], optional, defaults to None):
Converts the model weights to `dtype`, if specified. Strings get
converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
Use `dtype="auto"` to derive the type from the model's weights.
- **device** (Union[int, str]): device on which to run the model.
- **quantization_config** (Optional[BitsAndBytesConfig]): quantization
configuration for the model, manually provided to load a normally floating point
model at a quantized precision. Needed for 4-bit and 8-bit precision.
- **trust_remote_code** (bool): Whether to trust remote code during model
loading.
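
As a hedged illustration of how such a `--model_args` string for the accelerate backend is structured (the model name, revision and dtype are examples only, and the small parser merely demonstrates the comma-separated key=value format, not lighteval's internal handling):

```python
# Illustrative only: the model name, revision and dtype below are examples.
model_args = "pretrained=meta-llama/Llama-3.1-8B,revision=main,dtype=bfloat16,trust_remote_code=True"

# The string is a comma-separated list of key=value pairs; this toy parser
# only demonstrates the format, not lighteval's internal handling.
parsed = dict(pair.split("=", 1) for pair in model_args.split(","))
print(parsed)
# {'pretrained': 'meta-llama/Llama-3.1-8B', 'revision': 'main',
#  'dtype': 'bfloat16', 'trust_remote_code': 'True'}
```
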
#### VLLM
- **pretrained** (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
- **gpu_memory_utilisation** (float): The fraction of GPU memory to use.
- **batch_size** (int): The batch size used for inference.
- **revision** (str): The revision of the model.
- **dtype** (str, None): The data type to use for the model.
- **tensor_parallel_size** (int): The number of tensor parallel units to use.
- **data_parallel_size** (int): The number of data parallel units to use.
- **max_model_length** (int): The maximum sequence length of the model.
- **swap_space** (int): The CPU swap space size (GiB) per GPU.
- **seed** (int): The seed to use for the model.
- **trust_remote_code** (bool): Whether to trust remote code during model loading.
- **use_chat_template** (bool): Whether to use the chat template or not.
- **add_special_tokens** (bool): Whether to add special tokens to the input sequences.
- **multichoice_continuations_start_space** (bool): Whether to add a space at the start of each continuation in multichoice generation.
- **subfolder** (Optional[str]): The subfolder within the model repository.
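
And an illustrative `--model_args` string for the vllm backend, with example values only:

```python
# Example values only; keys correspond to the vllm parameters listed above.
vllm_model_args = (
    "pretrained=mistralai/Mistral-7B-Instruct-v0.2,"
    "dtype=bfloat16,"
    "tensor_parallel_size=2,"
    "gpu_memory_utilisation=0.8,"
    "max_model_length=4096"
)
print(vllm_model_args)
```
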
#### Pipeline parallelism
To evaluate a model using pipeline parallelism on 2 or more GPUs, run:
@@ -96,6 +157,6 @@ Nanotron models cannot be evaluated without torchrun.
The `nproc-per-node` argument should match the data, tensor and pipeline
parallelism configured in the `lighteval_config_template.yaml` file.
That is: `nproc-per-node = data_parallelism * tensor_parallelism *
pipeline_parallelism`.
