Releases: huggingface/lighteval
v0.6.0
What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages by @hynky1999 in #329
- Add 3 COPA tasks supporting about 20 unique languages by @hynky1999 in #330
- Add Hellaswag tasks supporting about 36 unique languages by @hynky1999 in #332
  - mlmm_hellaswag
  - hellaswag_{tha/tur}
- Add reading comprehension (RC) tasks supporting about 130 unique languages/scripts by @hynky1999 in #333
- Add general knowledge (GK) tasks supporting about 35 unique languages/scripts by @hynky1999 in #338
  - meta_mmlu
  - mlmm_mmlu
  - rummlu
  - mmlu_ara_mcf
  - tur_leaderboard_mmlu
  - cmmlu
  - mmlu
  - ceval
  - mlmm_arc_challenge
  - alghafa_arc_easy
  - community_arc
  - community_truthfulqa
  - exams
  - m3exams
  - thai_exams
  - xcsqa
  - alghafa_piqa
  - mera_openbookqa
  - alghafa_openbookqa
  - alghafa_sciqa
  - mathlogic_qa
  - agieval
  - mera_worldtree
- Misc Tasks by @hynky1999 in #339
  - openai_mmlu_tasks
  - turkish_mmlu_tasks
  - lumi arc
  - hindi/swahili/arabic (from alghafa) arc
  - cmath
  - mgsm
  - xcodah
  - xstory
  - xwinograd + tr winograd
  - mlqa
  - mkqa
  - mintaka
  - mlqa_tasks
  - french triviaqa
  - chegeka
  - acva
  - french_boolq
  - hindi_boolq
Other Tasks
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Features
- Evaluate OpenAI models by @NathanHB in #359
- New Doc and README by @NathanHB in #327
- Refactor LLM-as-a-judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Add task templating by @hynky1999 in #335
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
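For context on the templating and translation-literals work above: prompt templates are built from small per-language connector strings ("Question", "Answer", ...) instead of hard-coded English, so one template renders natively in every covered language. A minimal sketch of the idea, with illustrative names rather than lighteval's actual API:

```python
from dataclasses import dataclass

# Illustrative stand-in for the "translation literals" idea: each language
# supplies the connector strings a prompt template needs, so the same
# template can render natively in any covered language.
@dataclass
class TranslationLiterals:
    question: str
    answer: str

LITERALS = {
    "en": TranslationLiterals(question="Question", answer="Answer"),
    "fr": TranslationLiterals(question="Question", answer="Réponse"),
}

def qa_prompt(lang: str, context: str, question: str) -> str:
    # One template, many languages: only the literals change.
    lit = LITERALS[lang]
    return f"{context}\n{lit.question}: {question}\n{lit.answer}:"
```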
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331 (see the sketch after this list)
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
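On the math-normalization fix above (#331): the general pattern is that answer extraction should degrade gracefully on malformed model output instead of raising mid-run. A hedged, generic sketch of that pattern (not lighteval's actual code):

```python
from fractions import Fraction

def normalize_math_answer(raw: str) -> str:
    """Best-effort normalization that falls back to the raw string."""
    text = raw.strip().rstrip(".")
    # Strip a LaTeX \boxed{...} wrapper if present.
    if "\\boxed{" in text:
        text = text.split("\\boxed{", 1)[1].rsplit("}", 1)[0]
    try:
        # Canonicalize simple numeric answers, e.g. "1/2" -> "0.5".
        return str(float(Fraction(text)))
    except (ValueError, ZeroDivisionError):
        # Invalid format: return the text unchanged instead of crashing.
        return text
```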
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Support for multilingual generative metrics (#293)
- Adds tasks templating (#335)
- Multilingual NLI Tasks (#329)
- Multilingual COPA tasks (#330)
- Multilingual Hellaswag tasks (#332)
- Multilingual Reading Comprehension tasks (#333)
- Multilingual General Knowledge tasks (#338)
- Selecting tasks using their superset (#308)
- Fix Tokenization + misc fixes (#354)
- Misc-multilingual tasks (#339)
- add iroko bench + nicer output on task search failure (#357)
- Translation literals (#356)
- selected tasks for multilingual evaluation (#371)
- Adds Baseline workflow + fixes (#363)
- @DeanChugall
- Serbian LLM Benchmark Task (#340)
- @NathanHB
  - Evaluate OpenAI models (#359)
  - New Doc and README (#327)
  - Refactor LLM as a judge (#337)
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's new
Features
- Tokenization-wise encoding by @hynky1999 in #287 (see the sketch below)
- Task config by @hynky1999 in #289
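For background on tokenization-wise encoding: tokenizing the context and the continuation separately can yield different tokens than tokenizing the concatenated string, which shifts the continuation boundary used for loglikelihood scoring. A sketch of the usual remedy, using the 🤗 transformers tokenizer API (illustrative, not lighteval's implementation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def encode_pair(context: str, continuation: str) -> tuple[list[int], list[int]]:
    """Encode context + continuation jointly, then split at the boundary.

    Encoding the full string once avoids the token merges/splits you get
    when the two pieces are tokenized independently.
    """
    whole = tokenizer(context + continuation, add_special_tokens=False)["input_ids"]
    ctx = tokenizer(context, add_special_tokens=False)["input_ids"]
    # Continuation tokens are whatever follows the context prefix; real
    # code must handle the case where the boundary falls inside a token.
    return whole[: len(ctx)], whole[len(ctx):]
```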
v0.4.0
What's new
Features
- Adds vllm as backend for insane speed up by @NathanHB in #274
- Add llm_as_judge in metrics (using either OpenAI or Transformers) by @NathanHB in #146
- Able to use config files for models by @clefourrier in #131
- List available tasks in the CLI: `lighteval tasks --list` by @DimbyTa in #142
- Use torch compile for speed up by @clefourrier in #248
- Add maj@k metric by @clefourrier in #158 (see the sketch after this list)
- Adds a dummy/random model for baseline init by @guipenedo in #220
- lighteval is now a CLI tool: `lighteval --args` by @NathanHB in #152
- We can now log info from the metrics (for example input and response from llm_as_judge) by @NathanHB in #157
- Configurable task versioning by @PhilipMay in #181
- Programmatic interface by @clefourrier in #269
- Probability Metric + New Normalization by @hynky1999 in #276
- Add widgets to the README by @clefourrier in #145
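On the maj@k metric above: a question counts as correct when the majority answer across k sampled generations matches the reference. A minimal, generic sketch of the metric (not lighteval's exact implementation):

```python
from collections import Counter

def maj_at_k(samples: list[str], reference: str) -> bool:
    """Majority voting over k generations: the most frequent answer wins.

    `samples` holds k normalized model answers to the same question.
    """
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return majority_answer == reference

# Example: 3 of 5 samples agree on "42", which matches the reference.
assert maj_at_k(["42", "41", "42", "42", "7"], "42")
```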
New tasks
- Add `Ger-RAG-eval` tasks by @PhilipMay in #149
- Add `aimo` custom eval by @NathanHB in #154
Fixes
- Bump nltk to 3.9.1 to fix security issue by @NathanHB in #137
- Fix max_length type when being passed in model args by @csarron in #138
- Fix nanotron models input size bug by @clefourrier in #156
- Fix MATH normalization by @lewtun in #162
- fix Prompt function names by @clefourrier in #168
- Fix prompt format german rag community task by @jphme in #171
- add 'cite as' section in readme by @NathanHB in #178
- Fix broken link to extended tasks in README by @alexrs in #182
- Mention HF_TOKEN in readme by @Wauplin in #194
- Download BERT scorer lazily by @sadra-barikbin in #190
- Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
- fix llm as judge warnings by @NathanHB in #173
- ADD GPT-4 as Judge by @philschmid in #206
- Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
- Avoid truncating the outputs based on string lengths by @anton-l in #201
- Now only uses functions for prompt definition by @clefourrier in #213
- Data split depending on eval params by @clefourrier in #169
- should fix most inference endpoints issues of version config by @clefourrier in #226
- Fix _init_max_length in base_model.py by @gucci-j in #185
- Make evaluator invariant of input request type order by @sadra-barikbin in #215
- Fixing issues with multichoice_continuations_start_space - was not parsed properly by @clefourrier in #232
- Fix IFEval metric by @lewtun in #259
- change priority when choosing model dtype by @NathanHB in #263
- Add grammar option to generation by @sadra-barikbin in #242
- make info loggers dataclass, so that their properties have expected lifetime by @hynky1999 in #280
- Remove expensive prediction run during test collection by @hynky1999 in #279
- Example Configs and Docs by @RohitMidha23 in #255
- Refactoring the few shot management by @clefourrier in #272
- Standalone nanotron config by @hynky1999 in #285
- Logging Revamp by @hynky1999 in #284
- bump nltk version by @NathanHB in #290
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @NathanHB
- Bump nltk to 3.9.1 to fix security issue (#137)
- Add llm as judge in metrics (#146)
- Nathan add logging to metrics (#157)
- add 'cite as' section in readme (#178)
- Fix citation section in readme (#180)
- adding aimo custom eval (#154)
- fix llm as judge warnings (#173)
- launch lighteval using `lighteval --args` (#152)
- adds llm as judge using transformers (#223)
- Fix missing json file (#264)
- change priority when choosing model dtype (#263)
- fix the location of tasks list in the readme (#267)
- updates ifeval repo (#268)
- fix nanotron (#283)
- add vllm backend (#274)
- bump nltk version (#290)
- @clefourrier
- Add config files for models (#131)
- Add fun widgets to the README (#145)
- Fix nanotron models input size bug (#156)
- no function we actually use should be named prompt_fn (#168)
- Add maj@k metric (#158)
- Homogeneize logging system (#150)
- Use only dataclasses for task init (#212)
- Now only uses functions for prompt definition (#213)
- Data split depending on eval params (#169)
- should fix most inference endpoints issues of version config (#226)
- Add metrics as functions (#214)
- Quantization related issues (#224)
- Update issue templates (#235)
- remove latex writer since we don't use it (#231)
- Removes default bert scorer init (#234)
- fix (#233)
- updated piqa (#222)
- uses torch compile if provided (#248)
- Fix inference endpoint config (#244)
- Expose samples via the CLI (#228)
- Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
- Programmatic interface + cleaner management of requests (#269)
- Small file reorg (only renames/moves) (#271)
- Refactoring the few shot management (#272)
- @PhilipMay
  - Configurable task versioning (#181)
  - Add Ger-RAG-eval tasks (#149)
- @shaltielshmid
  - Updated tgi_model and added parameters for endpoint_model (#208)
- @hynky1999
  - Probability Metric + New Normalization (#276)
  - make info loggers dataclass, so that their properties have expected lifetime (#280)
  - Remove expensive prediction run during test collection (#279)
  - Standalone nanotron config (#285)
  - Logging Revamp (#284)
v0.3.0
Release Note
This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced:
- Big Bench Hard: https://huggingface.co/papers/2210.09261
- AGIEval: https://huggingface.co/papers/2304.06364
- TinyBench
- MT Bench: https://huggingface.co/papers/2306.05685
- AlGhafa Benchmarking Suite: https://aclanthology.org/2023.arabicnlp-1.21/
MT-Bench marks the introduction of multi-turn prompting as well as the llm-as-a-judge metric.
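In broad strokes, an LLM-as-a-judge metric formats the candidate answer into a grading prompt for a stronger judge model and parses a score out of the judge's reply. A hedged, generic sketch of that flow (the prompt wording and the `judge` callable are placeholders, not lighteval's API):

```python
import re

JUDGE_TEMPLATE = (
    "You are grading an assistant's answer.\n"
    "Question: {question}\n"
    "Assistant's answer: {answer}\n"
    "Rate the answer from 1 to 10 and reply as 'Rating: <n>'."
)

def judge_score(question: str, answer: str, judge) -> int | None:
    """Ask a judge model for a 1-10 rating.

    `judge` is any callable mapping a prompt string to a completion
    string (e.g. a wrapper around an OpenAI or Transformers model).
    """
    reply = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"Rating:\s*(\d+)", reply)
    return int(match.group(1)) if match else None
```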
New tasks
- Add BBH by @clefourrier in #7, @bilgehanertan in #126
- Add AGIEval by @clefourrier in #121
- Adding TinyBench by @clefourrier in #104
- Adding support for Arabic benchmarks : AlGhafa benchmarking suite by @alielfilali01 in #95
- Add mt-bench by @NathanHB in #75
Features
- Extended Tasks! by @clefourrier in #101, @lewtun in #108, @NathanHB in #122, #123
- Added support for launching inference endpoint with different model dtypes by @shaltielshmid in #124
Documentation
- Adding LICENSE by @clefourrier in #86, @NathanHB in #89
- Make it clearer in the README that the leaderboard uses the harness by @clefourrier in #94
Small patches
- Update huggingface-hub for compatibility with datasets 2.18 by @clefourrier in #84
- Tidy up dependency groups by @lewtun in #81
- bump git python by @NathanHB in #90
- Sets a max length for the MATH task by @clefourrier in #83
- Fix parallel data processing bug by @clefourrier in #92
- Change the eos condition for GSM8K by @clefourrier in #85
- Fixing rolling loglikelihood management by @clefourrier in #78
- Fixes input length management for generative evals by @clefourrier in #103
- Reorder addition of instruction in chat template by @clefourrier in #111
- Ensure chat models terminate generation with EOS token by @lewtun in #115
- Fix push details to hub by @NathanHB in #98
- Small fixes to InferenceEndpointModel by @shaltielshmid in #112
- Fix import typo autogptq by @clefourrier in #116
- Fixed the loglikelihood method in inference endpoints models by @clefourrier in #119
- Fix TextGenerationResponse import from hfh by @Wauplin in #129
- Do not use deprecated list_files_info by @Wauplin in #133
- Update test workflow name to 'Tests' by @Wauplin in #134
New Contributors
- @shaltielshmid made their first contribution in #112
- @bilgehanertan made their first contribution in #126
- @Wauplin made their first contribution in #129
Full Changelog: v0.2.0...v0.3.0
v0.2.0
Release Note
This release focuses on customization and personalization: it's now possible to define custom metrics, not just custom tasks; see the README for the full mechanism.
It also includes small fixes to improve stability and new tasks. We chose to split community tasks out of the main library source to make maintenance easier.
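Conceptually, a custom metric boils down to a scoring function plus a little registration metadata: how to score one sample, and how to aggregate scores over the corpus. A simplified, hypothetical sketch (the names and dict keys are illustrative; see the README for the real mechanism):

```python
def exact_match_ignoring_case(prediction: str, reference: str) -> float:
    """Toy sample-level metric: 1.0 on a case-insensitive exact match."""
    return float(prediction.strip().lower() == reference.strip().lower())

# A custom metric then only needs a name, the per-sample scoring function,
# and a corpus-level aggregation (here: the mean of the sample scores).
CUSTOM_METRIC = {
    "name": "exact_match_ignoring_case",
    "sample_level_fn": exact_match_ignoring_case,
    "corpus_level_fn": lambda scores: sum(scores) / len(scores),
}
```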
Better community task handling
- New mechanism for evaluation contributions by @clefourrier in #47
- Adding the custom metrics system by @clefourrier in #65
New tasks
- Add GPQA by @clefourrier in #42
- Adding support for Arabic benchmarks : AceGPT benchmarking suite by @alielfilali01 in #44
- IFEval by @clefourrier in #48
Features
- Add an automatic system to compute average for tasks with subtasks by @clefourrier in #41
Small patches
- Typos: #27, #28, #30, #29, #34
- Better README: #26, #37, #55
- Patch fix to match with config update/simplification in nanotron by @thomwolf in #35
- bump transformers to 4.38 by @NathanHB in #46
- Small fix to be able to use extensions of nanotron configs by @thomwolf in #58
- Remove the eos token override in the Default Config Task by @clefourrier in #54
- Update leaderboard task set by @lewtun in #60
- Fixes wikitext prompts + some patches on tg models by @clefourrier in #64
- Fix unset generation size by @clefourrier in #76
- Update ruff by @clefourrier in #71
- Relax sentencepiece version by @lewtun in #74
- Better chat template system by @clefourrier in #38
✨ Community Contributions
- @ledrui made their first contribution in #26
- @alielfilali01 made their first contribution in #44
- @lewtun made their first contribution in #55
Full Changelog: v0.1.1...v0.2.0
v0.1.1
v0.1.0
Init
LightEval 🌤️
A lightweight LLM evaluation suite
Context
LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
We're releasing it with the community in the spirit of building in the open.
Note that it is still very early days, so don't expect 100% stability ^^'
In case of problems or questions, feel free to open an issue!
Full Changelog: https://github.com/huggingface/lighteval/commits/v0.1