Releases: huggingface/lighteval
v0.6.0
What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages by @hynky1999 in #329
- Add 3 COPA tasks supporting about 20 unique languages by @hynky1999 in #330
- Add Hellaswag tasks supporting about 36 unique languages by @hynky1999 in #332
  - mlmm_hellaswag
  - hellaswag_{tha/tur}
- Add reading comprehension (RC) tasks supporting about 130 unique languages/scripts by @hynky1999 in #333
- Add general knowledge (GK) tasks supporting about 35 unique languages/scripts by @hynky1999 in #338
  - meta_mmlu
  - mlmm_mmlu
  - rummlu
  - mmlu_ara_mcf
  - tur_leaderboard_mmlu
  - cmmlu
  - mmlu
  - ceval
  - mlmm_arc_challenge
  - alghafa_arc_easy
  - community_arc
  - community_truthfulqa
  - exams
  - m3exams
  - thai_exams
  - xcsqa
  - alghafa_piqa
  - mera_openbookqa
  - alghafa_openbookqa
  - alghafa_sciqa
  - mathlogic_qa
  - agieval
  - mera_worldtree
- Misc Tasks by @hynky1999 in #339
  - openai_mmlu_tasks
  - turkish_mmlu_tasks
  - lumi arc
  - hindi/swahili/arabic (from alghafa) arc
  - cmath
  - mgsm
  - xcodah
  - xstory
  - xwinograd + tr winograd
  - mlqa
  - mkqa
  - mintaka
  - mlqa_tasks
  - french triviaqa
  - chegeka
  - acva
  - french_boolq
  - hindi_boolq
Other Tasks
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Features
- Evaluate OpenAI models by @NathanHB in #359
- New Doc and README by @NathanHB in #327
- Refactor LLM-as-a-judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Add task templating by @hynky1999 in #335
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
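For context on the templating and translation-literals work above: prompt templates are built from small per-language connector strings ("Question", "Answer", ...) instead of hard-coded English, so one template renders natively in every covered language. A minimal sketch of the idea, with illustrative names rather than lighteval's actual API:

```python
from dataclasses import dataclass

# Illustrative stand-in for the "translation literals" idea: each language
# supplies the connector strings a prompt template needs, so the same
# template can render natively in any covered language.
@dataclass
class TranslationLiterals:
    question: str
    answer: str

LITERALS = {
    "en": TranslationLiterals(question="Question", answer="Answer"),
    "fr": TranslationLiterals(question="Question", answer="Réponse"),
}

def qa_prompt(lang: str, context: str, question: str) -> str:
    # One template, many languages: only the literals change.
    lit = LITERALS[lang]
    return f"{context}\n{lit.question}: {question}\n{lit.answer}:"
```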
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331 (see the sketch after this list)
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
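On the math-normalization fix above (#331): the general pattern is that answer extraction should degrade gracefully on malformed model output instead of raising mid-run. A hedged, generic sketch of that pattern (not lighteval's actual code):

```python
from fractions import Fraction

def normalize_math_answer(raw: str) -> str:
    """Best-effort normalization that falls back to the raw string."""
    text = raw.strip().rstrip(".")
    # Strip a LaTeX \boxed{...} wrapper if present.
    if "\\boxed{" in text:
        text = text.split("\\boxed{", 1)[1].rsplit("}", 1)[0]
    try:
        # Canonicalize simple numeric answers, e.g. "1/2" -> "0.5".
        return str(float(Fraction(text)))
    except (ValueError, ZeroDivisionError):
        # Invalid format: return the text unchanged instead of crashing.
        return text
```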
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Support for multilingual generative metrics (#293)
- Adds tasks templating (#335)
- Multilingual NLI Tasks (#329)
- Multilingual COPA tasks (#330)
- Multilingual Hellaswag tasks (#332)
- Multilingual Reading Comprehension tasks (#333)
- Multilingual General Knowledge tasks (#338)
- Selecting tasks using their superset (#308)
- Fix Tokenization + misc fixes (#354)
- Misc-multilingual tasks (#339)
- add iroko bench + nicer output on task search failure (#357)
- Translation literals (#356)
- selected tasks for multilingual evaluation (#371)
- Adds Baseline workflow + fixes (#363)
- @DeanChugall
- Serbian LLM Benchmark Task (#340)
- @NathanHB
  - Evaluate OpenAI models (#359)
  - New Doc and README (#327)
  - Refactor LLM as a judge (#337)
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's new
Features
- Tokenization-wise encoding by @hynky1999 in #287 (see the sketch below)
- Task config by @hynky1999 in #289
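For background on tokenization-wise encoding: tokenizing the context and the continuation separately can yield different tokens than tokenizing the concatenated string, which shifts the continuation boundary used for loglikelihood scoring. A sketch of the usual remedy, using the 🤗 transformers tokenizer API (illustrative, not lighteval's implementation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def encode_pair(context: str, continuation: str) -> tuple[list[int], list[int]]:
    """Encode context + continuation jointly, then split at the boundary.

    Encoding the full string once avoids the token merges/splits you get
    when the two pieces are tokenized independently.
    """
    whole = tokenizer(context + continuation, add_special_tokens=False)["input_ids"]
    ctx = tokenizer(context, add_special_tokens=False)["input_ids"]
    # Continuation tokens are whatever follows the context prefix; real
    # code must handle the case where the boundary falls inside a token.
    return whole[: len(ctx)], whole[len(ctx):]
```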
v0.4.0
What's new
Features
- Adds vllm as backend for insane speed up by @NathanHB in #274
- Add llm_as_judge in metrics (using either OpenAI or Transformers) by @NathanHB in #146
- Able to use config files for models by @clefourrier in #131
- List available tasks in the CLI: `lighteval tasks --list` by @DimbyTa in #142
- Use torch compile for speed up by @clefourrier in #248
- Add maj@k metric by @clefourrier in #158 (see the sketch after this list)
- Adds a dummy/random model for baseline init by @guipenedo in #220
- lighteval is now a CLI tool: `lighteval --args` by @NathanHB in #152
- We can now log info from the metrics (for example input and response from llm_as_judge) by @NathanHB in #157
- Configurable task versioning by @PhilipMay in #181
- Programmatic interface by @clefourrier in #269
- Probability Metric + New Normalization by @hynky1999 in #276
- Add widgets to the README by @clefourrier in #145
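On the maj@k metric above: a question counts as correct when the majority answer across k sampled generations matches the reference. A minimal, generic sketch of the metric (not lighteval's exact implementation):

```python
from collections import Counter

def maj_at_k(samples: list[str], reference: str) -> bool:
    """Majority voting over k generations: the most frequent answer wins.

    `samples` holds k normalized model answers to the same question.
    """
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == reference

# Example: 3 of 5 samples agree on "42", which matches the reference.
assert maj_at_k(["42", "41", "42", "42", "7"], "42")
```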
New tasks
- Add `Ger-RAG-eval` tasks by @PhilipMay in #149
- Add `aimo` custom eval by @NathanHB in #154
Fixes
- Bump nltk to 3.9.1 to fix security issue by @NathanHB in #137
- Fix max_length type when being passed in model args by @csarron in #138
- Fix nanotron models input size bug by @clefourrier in #156
- Fix MATH normalization by @lewtun in #162
- fix Prompt function names by @clefourrier in #168
- Fix prompt format german rag community task by @jphme in #171
- add 'cite as' section in readme by @NathanHB in #178
- Fix broken link to extended tasks in README by @alexrs in #182
- Mention HF_TOKEN in readme by @Wauplin in #194
- Download BERT scorer lazily by @sadra-barikbin in #190
- Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
- fix llm as judge warnings by @NathanHB in #173
- ADD GPT-4 as Judge by @philschmid in #206
- Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
- Avoid truncating the outputs based on string lengths by @anton-l in #201
- Now only uses functions for prompt definition by @clefourrier in #213
- Data split depending on eval params by @clefourrier in #169
- should fix most inference endpoints issues of version config by @clefourrier in #226
- Fix _init_max_length in base_model.py by @gucci-j in #185
- Make evaluator invariant of input request type order by @sadra-barikbin in #215
- Fixing issues with multichoice_continuations_start_space - was not parsed properly by @clefourrier in #232
- Fix IFEval metric by @lewtun in #259
- change priority when choosing model dtype by @NathanHB in #263
- Add grammar option to generation by @sadra-barikbin in #242
- make info loggers dataclass, so that their properties have expected lifetime by @hynky1999 in #280
- Remove expensive prediction run during test collection by @hynky1999 in #279
- Example Configs and Docs by @RohitMidha23 in #255
- Refactoring the few shot management by @clefourrier in #272
- Standalone nanotron config by @hynky1999 in #285
- Logging Revamp by @hynky1999 in #284
- bump nltk version by @NathanHB in #290
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @NathanHB
- Bump nltk to 3.9.1 to fix security issue (#137)
- Add llm as judge in metrics (#146)
- Nathan add logging to metrics (#157)
- add 'cite as' section in readme (#178)
- Fix citation section in readme (#180)
- adding aimo custom eval (#154)
- fix llm as judge warnings (#173)
- launch lighteval using `lighteval --args` (#152)
- adds llm as judge using transformers (#223)
- Fix missing json file (#264)
- change priority when choosing model dtype (#263)
- fix the location of tasks list in the readme (#267)
- updates ifeval repo (#268)
- fix nanotron (#283)
- add vllm backend (#274)
- bump nltk version (#290)
- @clefourrier
- Add config files for models (#131)
- Add fun widgets to the README (#145)
- Fix nanotron models input size bug (#156)
- no function we actually use should be named prompt_fn (#168)
- Add maj@k metric (#158)
- Homogeneize logging system (#150)
- Use only dataclasses for task init (#212)
- Now only uses functions for prompt definition (#213)
- Data split depending on eval params (#169)
- should fix most inference endpoints issues of version config (#226)
- Add metrics as functions (#214)
- Quantization related issues (#224)
- Update issue templates (#235)
- remove latex writer since we don't use it (#231)
- Removes default bert scorer init (#234)
- fix (#233)
- updated piqa (#222)
- uses torch compile if provided (#248)
- Fix inference endpoint config (#244)
- Expose samples via the CLI (#228)
- Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
- Programmatic interface + cleaner management of requests (#269)
- Small file reorg (only renames/moves) (#271)
- Refactoring the few shot management (#272)
- @PhilipMay
  - Configurable task versioning (#181)
  - Add Ger-RAG-eval tasks (#149)
- @shaltielshmid
  - Updated tgi_model and added parameters for endpoint_model (#208)
- @hynky1999
  - Probability Metric + New Normalization (#276)
  - make info loggers dataclass, so that their properties have expected lifetime (#280)
  - Remove expensive prediction run during test collection (#279)
  - Standalone nanotron config (#285)
  - Logging Revamp (#284)
v0.3.0
Release Note
This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced:
- Big Bench Hard: https://huggingface.co/papers/2210.09261
- AGIEval: https://huggingface.co/papers/2304.06364
- TinyBench
- MT Bench: https://huggingface.co/papers/2306.05685
- AlGhafa Benchmarking Suite: https://aclanthology.org/2023.arabicnlp-1.21/
MT-Bench marks the introduction of multi-turn prompting as well as the llm-as-a-judge metric.
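In broad strokes, an LLM-as-a-judge metric formats the candidate answer into a grading prompt for a stronger judge model and parses a score out of the judge's reply. A hedged, generic sketch of that flow (the prompt wording and the `judge` callable are placeholders, not lighteval's API):

```python
import re

JUDGE_TEMPLATE = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Assistant's answer: {answer}\n"
    "Rate the answer from 1 to 10 and reply as 'Rating: <n>'."
)

def judge_score(question: str, answer: str, judge) -> int | None:
    """Ask a judge model for a 1-10 rating.

    `judge` is any callable mapping a prompt string to a completion
    string (e.g. a wrapper around an OpenAI or Transformers model).
    """
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"Rating:\s*(\d+)", reply)
    return int(match.group(1)) if match else None
```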
New tasks
- Add BBH by @clefourrier in #7, @bilgehanertan in #126
- Add AGIEval by @clefourrier in #121
- Adding TinyBench by @clefourrier in #104
- Adding support for Arabic benchmarks : AlGhafa benchmarking suite by @alielfilali01 in #95
- Add mt-bench by @NathanHB in #75
Features
- Extended Tasks! by @clefourrier in #101, @lewtun in #108, @NathanHB in #122, #123
- Added support for launching inference endpoint with different model dtypes by @shaltielshmid in #124
Documentation
- Adding LICENSE by @clefourrier in #86, @NathanHB in #89
- Make it clearer in the README that the leaderboard uses the harness by @clefourrier in #94
Small patches
- Update huggingface-hub for compatibility with datasets 2.18 by @clefourrier in #84
- Tidy up dependency groups by @lewtun in #81
- bump git python by @NathanHB in #90
- Sets a max length for the MATH task by @clefourrier in #83
- Fix parallel data processing bug by @clefourrier in #92
- Change the eos condition for GSM8K by @clefourrier in #85
- Fixing rolling loglikelihood management by @clefourrier in #78
- Fixes input length management for generative evals by @clefourrier in #103
- Reorder addition of instruction in chat template by @clefourrier in #111
- Ensure chat models terminate generation with EOS token by @lewtun in #115
- Fix push details to hub by @NathanHB in #98
- Small fixes to InferenceEndpointModel by @shaltielshmid in #112
- Fix import typo autogptq by @clefourrier in #116
- Fixed the loglikelihood method in inference endpoints models by @clefourrier in #119
- Fix TextGenerationResponse import from hfh by @Wauplin in #129
- Do not use deprecated list_files_info by @Wauplin in #133
- Update test workflow name to 'Tests' by @Wauplin in #134
New Contributors
- @shaltielshmid made their first contribution in #112
- @bilgehanertan made their first contribution in #126
- @Wauplin made their first contribution in #129
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Release Note
This release focuses on customization and personalization: it's now possible to define custom metrics, not just custom tasks; see the README for the full mechanism.
It also includes small fixes to improve stability and new tasks. We chose to split community tasks out of the main library source to make maintenance easier.
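Conceptually, a custom metric boils down to a scoring function plus a little registration metadata: how to score one sample, and how to aggregate scores over the corpus. A simplified, hypothetical sketch (the names and dict keys are illustrative; see the README for the real mechanism):

```python
def exact_match_ignoring_case(prediction: str, reference: str) -> float:
    """Toy sample-level metric: 1.0 on a case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())

# A custom metric then only needs a name, the per-sample scoring function,
# and a corpus-level aggregation (here: the mean of the sample scores).
CUSTOM_METRIC = {
    "name": "exact_match_ignoring_case",
    "sample_level_fn": exact_match_ignoring_case,
    "corpus_level_fn": lambda scores: sum(scores) / len(scores),
}
```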
Better community task handling
- New mechanism for evaluation contributions by @clefourrier in #47
- Adding the custom metrics system by @clefourrier in #65
New tasks
- Add GPQA by @clefourrier in #42
- Adding support for Arabic benchmarks : AceGPT benchmarking suite by @alielfilali01 in #44
- IFEval by @clefourrier in #48
Features
- Add an automatic system to compute average for tasks with subtasks by @clefourrier in #41
Small patches
- Typos: #27, #28, #30, #29, #34
- Better README: #26, #37, #55
- Patch fix to match with config update/simplification in nanotron by @thomwolf in #35
- bump transformers to 4.38 by @NathanHB in #46
- Small fix to be able to use extensions of nanotron configs by @thomwolf in #58
- Remove the eos token override in the Default Config Task by @clefourrier in #54
- Update leaderboard task set by @lewtun in #60
- Fixes wikitext prompts + some patches on tg models by @clefourrier in #64
- Fix unset generation size by @clefourrier in #76
- Update ruff by @clefourrier in #71
- Relax sentencepiece version by @lewtun in #74
- Better chat template system by @clefourrier in #38
✨ Community Contributions
- @ledrui made their first contribution in #26
- @alielfilali01 made their first contribution in #44
- @lewtun made their first contribution in #55
Full Changelog: v0.1.1...v0.2.0
v0.1.1
v0.1.0
Init
LightEval 🌤️
A lightweight LLM evaluation suite
Context
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
We're releasing it with the community in the spirit of building in the open.
Note that it is still very early days, so don't expect 100% stability ^^'
In case of problems or questions, feel free to open an issue!
Full Changelog: https://github.com/huggingface/lighteval/commits/v0.1