Sync with upstream changes #3

SpirinEgor · 2024-03-07T16:21:07Z

No description provided.


…shot 0% -> 42%) (EleutherAI#1356) * update bbh, gsm8k, mmlu parsing logic and prompts * remove the formatting prompt (bbh) + minor update (mmlu) * update bbh, gsm8k, mmlu zeroshot, revert fewshots * update bbh, gsm8k, mmlu version, forward changes to gsm8k-cot * remove take_last, update to use docs parameters * add newline * ruff formatting * Update pyproject.toml * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


* add key lookup for same contexts * nit * appease pre-commit * nit * use `expand` (in-place view) rather than `repeat` * try mixed grouping * add docs. * nit * nit * nits * fix tests * Move greedy_tokens calculation out of cache loop * nit * nits * add test * nits * fix name conflict * fix name conflict * chunk tensor * move Collator * nits/docstring * fixup * fixup * group contexts only for decoders * pre-commit * fix `generate_until` test * fix `generate_until` test * Update lm_eval/models/huggingface.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add docs * nit * add docs * add docs * add 'logits_cache' arg * bugfix --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add new task GPQA_n_shot * add new task GPQA_zeroshot * correct GPQA_zeroshot filename * Add randomly shuffle choices * Correct missing parentheses * delete wrong tasks * Add README * Update lm_eval/tasks/gpqa/zeroshot/_gpqa_zeroshot_yaml * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/n_shot/utils.py * Update lm_eval/tasks/gpqa/README.md * placate linter * linter --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* update kmmlu default formatting * Update _default_kmmlu_yaml * Delete lm_eval/tasks/kmmlu/utils.py * new tasks implemented * add direct tasks * update direct evaluate * update direct eval * add cot sample * update cot * add cot * Update _cot_kmmlu_yaml * add kmmlu90 * Update and rename _cot_kmmlu.yaml to _cot_kmmlu_yaml * Create kmmlu90.yaml * Update _cot_kmmlu_yaml * add direct * Update _cot_kmmlu_yaml * Update and rename kmmlu90.yaml to kmmlu90_cot.yaml * Update kmmlu90_direct.yaml * add kmmlu hard * Update _cot_kmmlu_yaml * Update _cot_kmmlu_yaml * update cot * update cot * erase typo * Update _cot_kmmlu_yaml * update cot * Rename dataset to match k-mmlu-hard * removed kmmlu90 * fixed name 'kmmlu_cot' to 'kmmlu_hard_cot' and revised README * applied pre-commit before pull requests * rename datasets and add notes * Remove DS_Store cache * Update lm_eval/tasks/kmmlu/README.md Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Change citations and reflect reviews on version * Added kmmlu_hard and fixed other errors * fixing minor errors * remove duplicated * Rename files * try ".index" * minor fix * minor fix again * fix revert. * minor fix. thank for hailey --------- Co-authored-by: GUIJIN SON <spthsrbwls123@yonsei.ac.kr> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

@StellaAthena

* loglikelihood refactor using template lm * linter * fix whitespace in target + prompt for CoT gsm8k (EleutherAI#1275) * Make `parallelize=True` vs. `accelerate launch` distinction clearer in docs (EleutherAI#1261) * Make parallelize=True distinction clearer in documentation. * run linter * Allow parameter edits for registered tasks when listed in a benchmark (EleutherAI#1273) * benchmark yamls allow minor edits of already registered tasks * add documentation * removed print * Fix data-parallel evaluation with quantized models (EleutherAI#1270) * add WIP device_map overrides * update handling outside of accelerate launcher * change .to(device) log to debug level * run linter * Rework documentation for explaining local dataset (EleutherAI#1284) * rewor documentation for explaining local dataset * fix typo * Update new_task_guide.md * Re-add citation It looks like Google Scholar has [already noticed](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C9&authuser=2&q=%22A+framework+for+few-shot+language+model+evaluation%2C+12+2023%22&btnG=) the updated citation block so let's add it back in. * Update CITATION.bib (EleutherAI#1285) Bumping CITATION.bib to match re-adding the citation in readme. 
cc @StellaAthena * Update nq_open.yaml (EleutherAI#1289) * Update README.md with custom integration doc (EleutherAI#1298) * Update README.md * punctuation --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update nq_open.yaml (EleutherAI#1305) * Update nq_open.yaml change regex * Bump NQ version --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update task_guide.md (EleutherAI#1306) * Update pyproject.toml (EleutherAI#1312) * Fix polemo2_in.yaml config name (EleutherAI#1313) * Update pyproject.toml (EleutherAI#1314) * Fix group register (EleutherAI#1315) * tuple should be considered as well * set option to keep callable as callable * Update task_guide.md (EleutherAI#1316) * Update polemo2_in.yaml (EleutherAI#1318) * don't pass extra kwargs to mamba any more (EleutherAI#1328) * Fix Issue regarding stderr (EleutherAI#1327) * add fix fordeciding if stderr is N/A or not * process N/A * Add `local-completions` support using OpenAI interface (EleutherAI#1277) * Add `local-completions` support using OpenAI interface * Refactor oa_completion * Address tokenizer comments and change request chunks to batch size * Add warning message for tiktoken backend * fix formatting * fix whitespace * Update README.md --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fallback to classname when LM doesnt have config (EleutherAI#1334) * fix a trailing whitespace that breaks a lint job (EleutherAI#1335) * skip "benchmarks" in changed_tasks (EleutherAI#1336) * Update migrated HF dataset paths (EleutherAI#1332) * Update arc_easy.yaml * Update flan_cot.yaml * update HF dataset path * Update freeform.yaml * Update flan_cot.yaml --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai> * Don't use `get_task_dict()` in task registration / initialization (EleutherAI#1331) * don't use get_task_dict() as a helper, it will download the dataset! 
* pre-commit * Update README.md --------- Co-authored-by: lintangsutawika <lintang@eleuther.ai> * manage default (greedy) gen_kwargs in vllm (EleutherAI#1341) * manage default (greedy) gen_kwargs in vllm better * mirror HF `do_sample` * just need to set temp=0 for greedy * modified default gen_kwargs to work better with CLI; changed prompt_logprobs=1 (EleutherAI#1345) * update links to task_guide.md (EleutherAI#1348) * `Filter` docs not offset by `doc_id` (EleutherAI#1349) * get `doc` from instance * acceletate bugfix: get ground doc from instance * convert filter to `process_result` * get docs from instances in `FilterEnsemble` * rename * nit * better looping * fix typehint * Add FAQ on `lm_eval.tasks.initialize_tasks()` to README (EleutherAI#1330) * Update README.md * [!Tip] * Refix issue regarding stderr (EleutherAI#1357) * Add causalLM OpenVino models (EleutherAI#1290) * added intel optimum * added intel optimum in readme * modified intel optimum * modified intel optimum * modified intel optimum * modified install optimum * modified path of IR file * added openvino_device * added openvino_device2 * changed optimum-causal to openvino-causal * Update README.md * Update README.md * remove `lm_eval.base` import * update openvino-causal -> openvino ; pass device through super().__init__() * Update README.md * Add optimum to tests dependencies * apply pre-commit * fix so tests pass --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai> * Apply some best practices and guideline recommendations to code (EleutherAI#1363) * raise Exception, not a string Additional info https://peps.python.org/pep-0352/#exception-hierarchy-changes https://docs.python.org/3.8/tutorial/errors.html#raising-exceptions * Apply PEP8 recommendation to prefer isinstance "Object type comparisons should always use isinstance() instead of comparing types directly" https://peps.python.org/pep-0008/ * 
Remove dangerous default mutable values in arguments https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/dangerous-default-value.html * Format logging messages with fstring (not with format) Additional info https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/logging-format-interpolation.html There are also discussions about the speed of formatting while logging or some unintended code executions pylint-dev/pylint#2395 https://stackoverflow.com/a/54368109 but at least one format (fstring one) will be used throughout the project * Specify utf-8 encoding for `open` explicitly If not specified, it may be supposed differently in different environments, OSes, and Python versions. See https://peps.python.org/pep-0597/ https://docs.python.org/3.11/library/locale.html#locale.getencoding https://docs.python.org/3.10/library/os.html#utf8-mode https://pylint.readthedocs.io/en/stable/user_guide/messages/warning/unspecified-encoding.html Helps also if some code from English language tasks is taken as inspiration for tasks in non-English languages. 
* Use inline-ignoring comments to pass pre-commit instead of identity process https://flake8.pycqa.org/en/3.0.1/user/ignoring-errors.html#in-line-ignoring-errors https://www.flake8rules.com/rules/F841.html flake8 comments are supported by ruff: https://docs.astral.sh/ruff/linter/#error-suppression * serialize callable functions in config (EleutherAI#1367) * delay filter init; remove `*args` (EleutherAI#1369) * delay filter init; remove `*args` * bugfix * optimize * type hint * Fix unintuitive `--gen_kwargs` behavior (EleutherAI#1329) * don't override do_sample if no value for it is passed * Update gen_kwargs override condition * Update huggingface.py * Update huggingface.py * run linters * silence an erroneous warning * Publish to pypi (EleutherAI#1194) * publish to pypi * lint * Update publish.yml * minor * Make dependencies compatible with PyPI (EleutherAI#1378) * make deps not point to github urls * formatting * try making PyPI only run on tag pushes * Add support for RWKV models with World tokenizer (EleutherAI#1374) * Add support for RWKV models with World tokenizer The RWKV line of model with the World tokenizer, does not allow the padding token to be configured, and has its value preset as 0 This however fails all the "if set" checks, and would cause the tokenizer to crash. A tokenizer class name check was added, in addition to a model type check, as there exists RWKV models which uses the neox tokenizers * Update huggingface.py Genericized so that this supports any RWKVWorld tokenizer, and added a fall-back for if the HF implementation name changes. * Comply with formatting guidelines * fix format --------- Co-authored-by: Stella Biderman <stellabiderman@gmail.com> Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * add bypass metric (EleutherAI#1156) * add bypass metric * fixed `bypass` metric. 
* add task attributes if predict_only * add `predict_only` checks * add docs * added `overide_metric`, `override_config` to `Task` * nits * nit * changed --predict_only to generations; nits * nits * nits * change gen_kwargs warning * add note about `--predict_only` in README.md * added `predict_only` * move table to bottom * nit * change null aggregation to bypass (conflict) * bugfix; default `temp=0.0` * typo * loglikelihood refactor using template lm * lint * code review * neuron optimum * Mention TemplateLM in model_guide.md * Update lm_eval/api/model.py * fix linter * fix format * fix format * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: Lintang Sutawika <lintang@eleuther.ai> Co-authored-by: Stella Biderman <stellabiderman@gmail.com> Co-authored-by: Mark Saroufim <marksaroufim@meta.com> Co-authored-by: Hannibal046 <38466901+Hannibal046@users.noreply.github.com> Co-authored-by: Danielle Pintz <38207072+daniellepintz@users.noreply.github.com> Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Co-authored-by: kwrobel.eth <djstrong@gmail.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com> Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com> Co-authored-by: thnkinbtfly <70014488+thnkinbtfly@users.noreply.github.com> Co-authored-by: NoushNabi <33136068+NoushNabi@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai> Co-authored-by: LSinev <LSinev@users.noreply.github.com> Co-authored-by: Eugene Cheah <PicoCreator@users.noreply.github.com>
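Several of the guideline items in the EleutherAI#1363 commit message above (raising exceptions rather than strings, preferring `isinstance`, avoiding mutable default arguments, passing `encoding="utf-8"` to `open`) can be illustrated with a minimal sketch. The function names `append_item`, `describe`, and `save_and_load` are hypothetical and do not come from the harness; the snippet only shows the general Python patterns the commit refers to.

```python
import json
from pathlib import Path
from typing import List, Optional


def append_item(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    # A `bucket=[]` default would be created once and shared by every call;
    # using None as a sentinel gives each call its own fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


def describe(value: object) -> str:
    # isinstance() also matches subclasses, unlike `type(value) == dict`.
    if isinstance(value, dict):
        return "mapping"
    if isinstance(value, (list, tuple)):
        return "sequence"
    # Raise an Exception instance, never a bare string.
    raise TypeError(f"unsupported type: {type(value).__name__}")


def save_and_load(path: Path, payload: dict) -> dict:
    # An explicit encoding avoids depending on the platform's locale default.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

The explicit encoding matters most for tasks with non-English text, which is the case the commit message singles out.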


…EleutherAI#1440) * fix the issue EleutherAI#1391, wrong contexts in mgsm tasks * fix yaml issue for having two target_delimiter lines. For COT tasks, keep the one with a space (default) * regenerate all task yaml files - change naming so that file name will match with task name - task|file follows a consistent naming way, mgsm_(mode)_(lang) for three modes, i.e., direct, en_cot, and native_cot * English CoTs should have a space as target_delimiter * Update utils.py * Apply suggestions from code review --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add wandb as extra dependency * wandb metrics logging * refactor * log samples as tables * fix linter * refactor: put in a class * change dir * add panels * log eval as table * improve tables logging * improve reports logging * precommit run * ruff check * handle importing reports api gracefully * ruff * compare results * minor pre-commit fixes * build comparison report * ruff check * log results as artifacts * remove comparison script * update dependency * type annotate and docstring * add example * update readme * fix typo * teardown * handle outside wandb run * gracefully fail reports creation * precommit checks * add report url to summary * use wandb printer for better url stdout * fix ruff * handle N/A and groups * fix eval table * remove unused var * update wandb version req + disable reports stdout * remove reports feature to TODO * add label to multi-choice question data * log model predictions * lints * loglikelihood_rolling * log eval result for groups * log tables by group for better handling * precommit * choices column for multi-choice * graciously fail wandb * remove reports feature * track system metrics + total eval time + stdout --------- Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

…erAI#1458) * Fixed generation args issue affection openai completion model * Fixed hf unit test; removed pop attributes in OpenAi completion. * fix format * fix format --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


…utherAI#1464) * Save git_hash to results even if git is not available to call as subprocess * Store more info about environment and transformers version in results to help researchers track inconsistencies * moved added logging to logging_utils * moved get_git_commit_hash to logging_utils.py * moved add_env_info inside evaluator


EleutherAI#1372) * Create a means for caching task registration and request building. Add the ability to specify an args dict for simple_evaluate(). * Remove extra S in cache path in caching module Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Rename requests cache args, make model_args polymorphic so that a dict can also be accepted. * Update docs to reflect new caching behavior, add CLI args for requests caching. Create a function for deleting items in the cache. * Update documentation, fix minor bug with arg parsing for requests caching where an undefined variable was used. * Remove line from gitignore, add to cli for caching datasets. * Add hashing suffix to .pickles. Update test script typo. * Favor isinstance() over type() in evaluator.py * Add tests for caching, gets tests working, remove unneeded arg from build_all_requests(). * Update arg description to simple_evaluate. * Update pyproject.toml * Fix typehint * Remove the use of random() for creating default cache pickle hash. * Check that cache dir exists before clearing it in request cache tests. * Fix linting problems. * Fix additional formatting errors. * Remove trailing whitespace. * Add new line to the end of .gitignore. --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* add brier_score * process brier_score * brier score is working for N-sized class * fxied brier score * add TED to BigBench and Brier score to MMLU * format * Update metrics.py * Update task.py * Update generate_until_template_yaml * Delete lm_eval/tasks/bigbench/aux_metric.py * Update generate_until_template_yaml * Update _default_template_yaml * Update _generate_configs.py * Update _generate_configs.py * Update _generate_configs.py * fix (format?) * format? * format, once more --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
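For readers unfamiliar with the metric named in the commit above, here is a minimal sketch of the standard multi-class Brier score: the mean, over samples, of the squared distance between the predicted probability vector and the one-hot encoding of the gold label. This is the textbook definition and is an assumption about what "N-sized class" refers to; it is not taken from the harness's `metrics.py`, and the function name `brier_score` is used here only for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean squared distance between predicted distributions and one-hot labels."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one sample is required")
    total = 0.0
    for dist, gold in zip(probs, labels):
        if not 0 <= gold < len(dist):
            raise ValueError(f"label {gold} is out of range for {len(dist)} classes")
        # Sum squared error over every class, with target 1.0 for the gold class.
        total += sum((p - (1.0 if k == gold else 0.0)) ** 2 for k, p in enumerate(dist))
    return total / len(probs)
```

Under this definition a perfect prediction scores 0.0 and a fully confident wrong prediction scores 2.0, so lower is better.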

* change `all_gather` to `gather` * add TaskOutput utility class * Add FilterResults class and refactor task handling. * Rename `key` to `filter_key` for clarity * Add `print_writeout` function in utils.py * Add function to calculate limit size. * Add doc_iterator method to Task class * Refactor `doc_iterator` and cleanup in Task class * remove superfluous bits * change `all_gather` to `gather` * bugfix * bugfix * fix `gather` * Refactor `gather` loop * Refactor aggregate metrics calculation * Refactor and simplify aggregate metrics calculation Removed unused code * Simplify metrics calculation and remove unused code. * simplify the metrics calculation in `utils.py` and `evaluator.py`. * Fix group metric * change evaluate to hf_evaluate * change evaluate to hf_evaluate * add docs * add docs * nits * make isslice keyword only * nit * add todo * nit * nit * nit: swap order samples_metrics tuple * move instance sorting outside loop * nit * nit * Add __repr__ for ConfigurableTask * nit * nit * Revert "nit" This reverts commit dab8d99. * fix some logging * nit * fix `predict_only` bug. thanks to `@LSinev`! * change `print_tasks` to `prepare_print_tasks` * nits * move eval utils * move eval utils * nit * add comment * added tqdm descriptions * Update lm_eval/evaluator_utils.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * fix mgsm bug * nit * fix `build_all_requests` * pre-commit * add ceil to limit --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

…eutherAI#1489) * model_type attribute error Getting attribute error when using a model without a 'model_type' * fix w/ and w/out the 'model_type' specification * use getattr(), also fix other config.model_type reference * Update huggingface.py --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
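The `getattr()` approach mentioned in the commit above can be sketched as follows. The helper name `get_model_type` and the use of `SimpleNamespace` as a stand-in config object are assumptions made for illustration; the actual change lives in `huggingface.py` and may look different.

```python
from types import SimpleNamespace
from typing import Optional


def get_model_type(config: object) -> Optional[str]:
    # getattr with a default returns None instead of raising AttributeError
    # when the config object has no `model_type` attribute.
    return getattr(config, "model_type", None)


# Stand-in config objects (hypothetical), one with and one without the attribute.
with_type = SimpleNamespace(model_type="gpt2")
without_type = SimpleNamespace()
```

Callers can then branch on `None` rather than wrapping every `config.model_type` access in a `try`/`except`.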


* use `@ray.remote` with distributed vLLM * update versions * bugfix * unpin vllm * fix pre-commit * added version assertion error * Revert "added version assertion error" This reverts commit 8041e9b. * added version assertion for DP * expand DP note * add warning * nit * pin vllm * fix typos

…ity (EleutherAI#1487) * setting trust_remote_code * dataset list no notebooks * respect trust remote code * Address changes, move cli options and change datasets * fix task for tests * headqa * remove kobest * pin datasets and address comments * clean up space

* add french-bench * rename arc easy * linting * update datasets for no remote code exec * fix string delimiter * add info to readmr * trim trailing whitespace * add detailed groups * add info to readme * remove orangesum title from fbench main * Force PPL tasks to be 0-shot --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


* Add new tasks of GPQA * Add README * Remove unused functions * Remove unused functions * Linters * Add flexible match * update * Remove deplicate function * Linter * update * Update lm_eval/filters/extraction.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * register multi_choice_regex * Update * run precommit --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>


…eutherAI#1533) * Remove unused `decontamination_ngrams_path` and all mentions (still no alternative path provided) * Fix improper import of LM and usage of evaluator in one of scripts * update type hints in instance and task api * raising errors in task.py instead of asserts * Fix warnings from ruff * raising errors in __main__.py instead of asserts * raising errors in tasks/__init__.py instead of asserts * raising errors in evaluator.py instead of asserts * evaluator: update type hints and remove unused variables in code * Update lm_eval/__main__.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/__main__.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/api/task.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/api/task.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/api/task.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * Update lm_eval/evaluator.py Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> * pre-commit induced fixes --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

…g document and, update wandb_args description (EleutherAI#1536) * Update openai completions and docs/CONTRIBUTING.md * Update wandb args description * Update docs/interface.md --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add compatibility for vLLM's new Logprob object * Fix * Update lm_eval/models/vllm_causallms.py * fix format? * trailing whitespace --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>


* Support jinja templating for "description" * Update task_guide.md * Update lm_eval/api/task.py * fix format? * whitespace errors * fix whitespace * fix bad variable reference --------- Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com> Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>


Am1n3e and others added 30 commits February 12, 2024 15:30

Added seeds to evaluator.simple_evaluate signature (EleutherAI#1412)

bfbd032

* Added seeds to `evaluator.simple_evaluate` signature
* Added CLI argument
* Updated to add arg.

Fix: task weighting by subtask size ; update Pooled Stderr formula slightly (EleutherAI#1427)

620d6a1

* fix weight_by_size condition
* add tests, update stderr formula slightly
* apply pre-commit

Refactor utilities into a separate model utils file. (EleutherAI#1429)

2d0a646

Update README.md (EleutherAI#1430)

f3b7917

improve hf_hub activation (EleutherAI#1438)

a604f05

Correct typo in task name (EleutherAI#1443)

19cbb29

Add a new task HaeRae-Bench (EleutherAI#1445)

8680e93

* haerae_reimplementation
* edited Readme and add few_shot settings
* edited readme
* newlines at end of each files
* Modifying the README file
* applied pre-commit

Log which subtasks were called with which groups (EleutherAI#1456)

00dc996

* log group membership
* no stray prints
* Update evaluator.py

update parsing logic of mgsm following gsm8k (EleutherAI#1462)

8371662

Adding documentation for Weights and Biases CLI interface (EleutherAI#1466)

eacb74e

* interface docs
* fix link

Apply code autoformatting with Ruff to tasks/*.py an *__init__.py (EleutherAI#1469)

d27c0c0

setting trust_remote_code (EleutherAI#1467)

c1145df

add arabic mmlu (EleutherAI#1402)

7de7b27

* add arabic mmlu
* update the description
* add readme file

Add Gemma support (Add flag to control BOS token usage) (EleutherAI#1465)

4c51111

* add add_bos_token to HFLM
* add BOS token flag to other local model classes

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

Revert "setting trust_remote_code (EleutherAI#1467)" (EleutherAI#1474)

f6befdb

This reverts commit c1145df.

add multilingual mmlu eval (EleutherAI#1484)

7cd004c

update name of val split in truthfulqa multilingual (EleutherAI#1488)

a08eb87

baberabb and others added 27 commits March 1, 2024 11:09

modify WandbLogger to accept arbitrary kwargs (EleutherAI#1491)

ae79b12

* make `WandbLogger` init args optional
* nit
* nit
* nit
* move import warning to `WandbLogger`
* nit
* update docs
* nit

Cleaning up unused unit tests (EleutherAI#1516)

4eba9cf

Hotfix: fix TypeError in --trust_remote_code (EleutherAI#1517)

4582391

Fix minor edge cases (EleutherAI#951 EleutherAI#1503) (EleutherAI#1520)

292e581

* Fix padding
* Fix elif in model loading
* format

Openllm benchmark (EleutherAI#1526)

8a875e9

Add EQ-Bench as per EleutherAI#1459 (EleutherAI#1511)

c5acce0

* Start adding eq-bench
* Start adding to yaml and utils
* Get metric working
* Add README
* Handle cases where answer is not parseable
* Deal with unparseable answers and add percent_parseable metric
* Update README

Add WMDP Multiple-choice (EleutherAI#1534)

29b2b01

* init wmdp yaml file
* Add WMDP Multiple-choice
* fix linter issues
* Delete lm_eval/tasks/wmdp/_wmdp.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

Adding new task : KorMedMCQA (EleutherAI#1530)

faee1ad

Update docs on LM.loglikelihood_rolling abstract method (EleutherAI#1532)

525b8f5

update printed num-fewshot ; prevent fewshots from erroneously being used by cot which hardcodes fewshot prompt (EleutherAI#1502)

0270505

Merge remote-tracking branch 'upstream/main' into upstream

c6f7a54

Correct merge of utils.py file

e2cd983

checks with pre-commit hook

07777d1

Add checks for HF user in CI

38f5b30

Fix typing to support python 3.8

0b67a76

Read secret with HF_TOKEN

321d37b

Fix incorrect max_gen_toks generation kwarg default in code2_text. (EleutherAI#1551)

f518228

* update gen_kwargs in code2-text-go.yaml
* update gen_kwargs in rest code2-text

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-harness into upstream

f77c5a9

Disable few shot in test for arc easy

9ddef0e

Mogreine approved these changes Mar 12, 2024


SpirinEgor merged commit 88ceee1 into main Mar 12, 2024
5 of 6 checks passed

SpirinEgor deleted the upstream branch March 12, 2024 10:34
