Sync with upstream changes #3

Merged 61 commits on Mar 12, 2024

Commits
bfbd032
Added seeds to `evaluator.simple_evaluate` signature (#1412)
Am1n3e Feb 12, 2024
620d6a1
Fix: task weighting by subtask size ; update Pooled Stderr formula sl…
haileyschoelkopf Feb 13, 2024
2d0a646
Refactor utilities into a separate model utils file. (#1429)
baberabb Feb 14, 2024
f3b7917
Update README.md (#1430)
davidbhoffmann Feb 15, 2024
a604f05
improve hf_hub activation (#1438)
michaelfeil Feb 18, 2024
19cbb29
Correct typo in task name (#1443)
larekrow Feb 19, 2024
89deeea
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zero…
thnkinbtfly Feb 19, 2024
8680e93
Add a new task HaeRae-Bench (#1445)
h-albert-lee Feb 20, 2024
45941c6
Group reqs by context (#1425)
baberabb Feb 20, 2024
5ab295c
Add a new task GPQA (the part without CoT) (#1434)
uanu2002 Feb 20, 2024
c26a6ac
Added KMMLU evaluation method and changed ReadMe (#1447)
h-albert-lee Feb 21, 2024
ba5cdf0
Add TemplateLM boilerplate LM class (#1279)
anjor Feb 22, 2024
00dc996
Log which subtasks were called with which groups (#1456)
haileyschoelkopf Feb 22, 2024
a72babb
PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440)
leocnj Feb 22, 2024
2683fbb
feat: Add Weights and Biases support (#1339)
ayulockin Feb 22, 2024
75ac1f4
Fixed generation args issue affection OpenAI completion model (#1458)
Am1n3e Feb 22, 2024
8371662
update parsing logic of mgsm following gsm8k (#1462)
thnkinbtfly Feb 23, 2024
eacb74e
Adding documentation for Weights and Biases CLI interface (#1466)
veekaybee Feb 23, 2024
f78e2da
Add environment and transformers version logging in results dump (#1464)
LSinev Feb 24, 2024
d27c0c0
Apply code autoformatting with Ruff to tasks/*.py an *__init__.py (#1…
LSinev Feb 26, 2024
c1145df
setting trust_remote_code (#1467)
veekaybee Feb 26, 2024
7de7b27
add arabic mmlu (#1402)
khalil-Hennara Feb 26, 2024
4c51111
Add Gemma support (Add flag to control BOS token usage) (#1465)
haileyschoelkopf Feb 26, 2024
f6befdb
Revert "setting trust_remote_code (#1467)" (#1474)
haileyschoelkopf Feb 26, 2024
1e6c927
Create a means for caching task registration and request building. Ad…
inf3rnus Feb 26, 2024
96d185f
Cont metrics (#1475)
lintangsutawika Feb 26, 2024
5ccd65d
Refactor `evaluater.evaluate` (#1441)
baberabb Feb 27, 2024
7cd004c
add multilingual mmlu eval (#1484)
jordane95 Feb 27, 2024
a08eb87
update name of val split in truthfulqa multilingual (#1488)
haileyschoelkopf Feb 27, 2024
cc771ec
Fix AttributeError in huggingface.py When 'model_type' is Missing (#1…
richwardle Feb 27, 2024
b177c82
fix duplicated kwargs in some model init (#1495)
lchu-ibm Feb 28, 2024
d272c19
Add multilingual truthfulqa targets (#1499)
jordane95 Mar 1, 2024
284dd80
always include EOS token in stopsequences if possible (#1480)
haileyschoelkopf Mar 1, 2024
27a3da9
Improve data-parallel request partitioning for VLLM (#1477)
haileyschoelkopf Mar 1, 2024
ae79b12
modify `WandbLogger` to accept arbitrary kwargs (#1491)
baberabb Mar 1, 2024
e5e35fc
Vllm update DP+TP (#1508)
baberabb Mar 3, 2024
9516792
Setting trust_remote_code to True for HuggingFace datasets compatibil…
veekaybee Mar 3, 2024
4eba9cf
Cleaning up unused unit tests (#1516)
veekaybee Mar 4, 2024
48476c4
French Bench (#1500)
ManuelFay Mar 4, 2024
4582391
Hotfix: fix TypeError in `--trust_remote_code` (#1517)
haileyschoelkopf Mar 4, 2024
292e581
Fix minor edge cases (#951 #1503) (#1520)
haileyschoelkopf Mar 4, 2024
8a875e9
Openllm benchmark (#1526)
baberabb Mar 5, 2024
01108ac
Add a new task GPQA (the part CoT and generative) (#1482)
uanu2002 Mar 5, 2024
c5acce0
Add EQ-Bench as per #1459 (#1511)
pbevan1 Mar 6, 2024
29b2b01
Add WMDP Multiple-choice (#1534)
justinphan3110 Mar 6, 2024
faee1ad
Adding new task : KorMedMCQA (#1530)
sean0042 Mar 6, 2024
525b8f5
Update docs on LM.loglikelihood_rolling abstract method (#1532)
haileyschoelkopf Mar 6, 2024
0270505
update printed num-fewshot ; prevent fewshots from erroneously being …
haileyschoelkopf Mar 6, 2024
4ee1b38
Cleanup and fixes (Task, Instance, and a little bit of *evaluate) (#1…
LSinev Mar 6, 2024
9e6e240
Update installation commands in openai_completions.py and contributin…
naem1023 Mar 6, 2024
c6f7a54
Merge remote-tracking branch 'upstream/main' into upstream
SpirinEgor Mar 7, 2024
e2cd983
Correct merge of utils.py file
SpirinEgor Mar 7, 2024
07777d1
checks with pre-commit hook
SpirinEgor Mar 7, 2024
38f5b30
Add checks for HF user in CI
SpirinEgor Mar 8, 2024
0b67a76
Fix typing to support python 3.8
SpirinEgor Mar 8, 2024
321d37b
Read secret with HF_TOKEN
SpirinEgor Mar 8, 2024
8051d95
Add compatibility for vLLM's new Logprob object (#1549)
Yard1 Mar 9, 2024
f518228
Fix incorrect `max_gen_toks` generation kwarg default in code2_text. …
cosmo3769 Mar 9, 2024
3bdf25e
Support jinja templating for task descriptions (#1553)
HishamAlyahya Mar 10, 2024
f77c5a9
Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…
SpirinEgor Mar 11, 2024
9ddef0e
Disable few shot in test for arc easy
SpirinEgor Mar 11, 2024
10 changes: 10 additions & 0 deletions .github/workflows/new_tasks.yml
@@ -60,13 +60,23 @@ jobs:
          # Install optional git dependencies
          # pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
          # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi

      - name: Check HF User
        if: steps.changed-tasks.outputs.tasks_any_modified == 'true' || steps.changed-tasks.outputs.api_any_modified == 'true'
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: huggingface-cli whoami

      - name: Test with pytest
        # if new tasks are added, run tests on them
        if: steps.changed-tasks.outputs.tasks_any_modified == 'true'
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: python -m pytest tests/test_tasks.py -s -vv
        # if api is modified, run tests on it
      - name: Test more tasks with pytest
        env:
          API: true
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        if: steps.changed-tasks.outputs.api_any_modified == 'true'
        run: python -m pytest tests/test_tasks.py -s -vv
5 changes: 5 additions & 0 deletions .gitignore
@@ -16,3 +16,8 @@ temp
# IPython
profile_default/
ipython_config.py
# don't track (the default location of) the cached requests
lm_eval/caching/.cache
# don't track files created by wandb
wandb
examples/wandb
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -2,7 +2,7 @@
exclude: ^tests/testdata/
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.1.0
    rev: v4.5.0
    hooks:
      - id: check-added-large-files
      - id: check-ast
@@ -29,7 +29,7 @@ repos:
        args: [--fix=lf]
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.1.8
    rev: v0.2.2
    hooks:
      # Run the linter.
      - id: ruff
@@ -38,7 +38,7 @@ repos:
      # Run the formatter.
      - id: ruff-format
  - repo: https://github.com/codespell-project/codespell
    rev: v2.1.0
    rev: v2.2.6
    hooks:
      - id: codespell
        exclude: >
39 changes: 39 additions & 0 deletions README.md
@@ -245,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github

## Visualizing Results

You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.

### Zeno

You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.

First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
@@ -284,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a

You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).

### Weights and Biases

With the [Weights and Biases](https://wandb.ai/site) (W&B) integration, you can spend more time extracting deeper insights from your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the W&B platform.

The integration provides functionality to:

- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task- and CLI-specific configs,
- and more out of the box, such as the command used to run the evaluation, GPU/CPU counts, timestamp, etc.

First, install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]`.

Authenticate your machine with your unique W&B token (visit https://wandb.ai/authorize to get one), then run `wandb login` in your terminal.

Run the eval harness as usual, adding the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as a comma-separated string.

```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```

In the stdout, you will find a link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb), along with an example of how to integrate the logger beyond the CLI.
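
As an illustration of how the comma-separated `--wandb_args` string maps onto `wandb.init` keyword arguments, here is a minimal sketch (not the harness's internal implementation; the helper name is hypothetical):

```python
import wandb


def init_run_from_arg_string(arg_string: str):
    """Parse a 'key1=val1,key2=val2' string into kwargs and start a W&B run."""
    kwargs = dict(pair.split("=", 1) for pair in arg_string.split(",") if pair)
    return wandb.init(**kwargs)


# Equivalent to passing: --wandb_args project=lm-eval-harness-integration,name=test-run
run = init_run_from_arg_string("project=lm-eval-harness-integration,name=test-run")
```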

## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
2 changes: 1 addition & 1 deletion docs/CONTRIBUTING.md
@@ -19,7 +19,7 @@ LM Evaluation Harness uses [ruff](https://github.com/astral-sh/ruff) for linting

You can install linters and dev tools via

```pip install lm_eval[dev]```
```pip install lm_eval[dev]``` or ```pip install -e ".[dev]"```

Then, run

7 changes: 2 additions & 5 deletions docs/decontamination.md
@@ -2,15 +2,14 @@

## Usage

Simply add a "--decontamination_ngrams_path" when running \__main\__.py. The provided directory should contain
The provided directory should contain
the ngram files and info.json produced in "Pile Ngram Generation" further down.

```bash
python -m lm_eval \
--model gpt2 \
--device 0 \
--tasks sciq \
--decontamination_ngrams_path path/containing/training/set/ngrams
--tasks sciq
```

## Background
@@ -70,5 +69,3 @@ python -m scripts/clean_training_data/compress_and_package \
-output path/to/final/directory \
-procs 8
```

Congratulations, the final directory can now be passed to lm-evaulation-harness with the "--decontamination_ngrams_path" argument.
43 changes: 23 additions & 20 deletions docs/interface.md
@@ -10,49 +10,52 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at th

This mode supports a number of command-line arguments, the details of which can be also be seen via running with `-h` or `--help`:

* `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.

* `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of what keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)
- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of what keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)

* `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.

* `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.
- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.

* `--gen_kwargs` : takes an arg string in same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all called `generate_until` (free-form or greedy generation task) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`.) These kwargs will be applied to all `generate_until` tasks called--we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
- `--gen_kwargs` : takes an arg string in same format as `--model_args` and creates a dictionary of keyword arguments. These will be passed to the models for all called `generate_until` (free-form or greedy generation task) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of what args are supported for each model type, reference the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`.) These kwargs will be applied to all `generate_until` tasks called--we do not currently support unique gen_kwargs or batch_size values per task in a single run of the library. To control these on a per-task level, set them in that task's YAML file.

* `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.

* `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.

* `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.
- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored if running multi-GPU or running a non-local model type.

* `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.

* `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
- `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.

* `--limit` : Accepts an integer, or a float between 0.0 and 1.0 . If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0 . If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.

* `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.

* `--decontamination_ngrams_path` : Deprecated, see (this commit)[https://github.com/EleutherAI/lm-evaluation-harness/commit/00209e10f6e27edf5d766145afaf894079b5fe10] or older for a working decontamination-checker tool.
- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under lm_eval/cache/.cache unless you specify a different path via the environment variable: `LM_HARNESS_CACHE_PATH`. e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.

* `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.
- `--check_integrity` : If this flag is used, the library tests for each task selected are run to confirm task integrity.

* `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.
- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, then prints the prompt and gold target string for the first document of each task.

* `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.
- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.

* `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval`` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing ` lm-eval`` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`

* `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.
- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. Each value is either an integer or 'None' to leave that seed unset. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`; numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42. A sketch of this behaviour appears after this list.

* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. See the full list [here](https://docs.wandb.ai/ref/python/init). E.g., `--wandb_args project=test-project,name=test-run`
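
To make the `--seed` semantics concrete, here is a minimal sketch of the documented behaviour (an illustration only, not the harness's own parsing code):

```python
import random

import numpy as np
import torch


def set_seeds_from_arg(seed_arg: str = "0,1234,1234") -> None:
    """Apply the documented `--seed` behaviour: up to three comma-separated
    values for python's random, numpy, and torch; 'None' skips that seed."""
    parts = seed_arg.split(",")
    if len(parts) == 1:  # a single integer seeds all three libraries
        parts = parts * 3
    py_seed, np_seed, torch_seed = parts
    if py_seed != "None":
        random.seed(int(py_seed))
    if np_seed != "None":
        np.random.seed(int(np_seed))
    if torch_seed != "None":
        torch.manual_seed(int(torch_seed))


set_seeds_from_arg("0,None,8")  # random.seed(0), numpy left untouched, torch.manual_seed(8)
```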

## External Library Usage

We also support using the library's external API for use within model training loops or other scripts.

`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.


`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:

```python
@@ -84,14 +87,14 @@ results = lm_eval.simple_evaluate( # call simple_evaluate
)
```
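
A minimal sketch of this pattern is shown below. The class and task names are placeholders, and only the `model`, `tasks`, and `num_fewshot` arguments of `simple_evaluate()` are assumed here; see the linked evaluator source for the full signature.

```python
import lm_eval
from lm_eval.api.model import LM


class MyCustomLM(LM):
    """Placeholder LM subclass; a real implementation would call your model
    as described in the Model Guide."""

    def loglikelihood(self, requests):
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError

    def generate_until(self, requests):
        raise NotImplementedError


lm_obj = MyCustomLM()

results = lm_eval.simple_evaluate(
    model=lm_obj,          # an instantiated LM subclass
    tasks=["hellaswag"],   # any registered task or group names
    num_fewshot=0,
)
```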


See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L35 for a full description of all arguments available. All keyword arguments to simple_evaluate share the same role as the command-line flags described previously.

Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling and simplification + abstraction provided by `simple_evaluate()`.

See https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/evaluator.py#L173 for more details.

As a brief example usage of `evaluate()`:

```python
import lm_eval

```
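
A correspondingly minimal sketch of the `evaluate()` path might look like the following; the `get_task_dict` helper and the argument names are assumptions, so consult the evaluator source linked above for the exact signature.

```python
import lm_eval
from lm_eval.tasks import get_task_dict

# `lm_obj` is an instantiated `lm_eval.api.model.LM` subclass,
# e.g. `MyCustomLM` from the simple_evaluate() sketch above
task_dict = get_task_dict(["hellaswag"])

results = lm_eval.evaluate(
    lm=lm_obj,
    task_dict=task_dict,
)
```
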
2 changes: 1 addition & 1 deletion docs/model_guide.md
@@ -66,7 +66,7 @@ All three request types take as input `requests` of type `list[Instance]` that h
- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.


To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` !
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` ! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple methods!
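
For instance, a minimal sketch of the HFLM-subclassing route might look like this (the class name and default checkpoint are placeholders, not part of the library):

```python
from lm_eval.models.huggingface import HFLM


class MyCustomHFLM(HFLM):
    """Hypothetical subclass that reuses HFLM's request handling and only
    customizes initialization, e.g. to pin a default checkpoint."""

    def __init__(self, pretrained="my-org/my-model", **kwargs):
        super().__init__(pretrained=pretrained, **kwargs)
```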

**Tip: be careful of indexing in loglikelihood!**
