What's new

Features

Add vllm as a backend for a large speed-up by @NathanHB in #274
Add llm_as_judge in metrics (using either OpenAI or Transformers) by @NathanHB in #146
Able to use config files for models by @clefourrier in #131
List available tasks from the CLI with lighteval tasks --list by @DimbyTa in #142
Use torch compile for speed up by @clefourrier in #248
Add maj@k metric by @clefourrier in #158
Adds a dummy/random model for baseline init by @guipenedo in #220
lighteval is now a cli tool: lighteval --args by @NathanHB in #152
We can now log info from the metrics (for example input and response from llm_as_judge) by @NathanHB in #157
Configurable task versioning by @PhilipMay in #181
Programmatic interface by @clefourrier in #269
Probability Metric + New Normalization by @hynky1999 in #276
Add widgets to the README by @clefourrier in #145
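
For readers unfamiliar with the maj@k metric listed above, the general idea is to sample k generations for a prompt, take the most frequent answer, and score that answer against the reference. The snippet below is only an illustrative sketch of that idea, not lighteval's implementation: the function name maj_at_k, the whitespace stripping, and the exact-match comparison are all assumptions made for the example.

```python
from collections import Counter


def maj_at_k(samples: list[str], gold: str, k: int) -> int:
    """Hypothetical helper: 1 if the majority answer among the first k samples matches gold, else 0."""
    if k <= 0 or not samples:
        raise ValueError("need k > 0 and at least one sample")
    # Count each (whitespace-stripped) answer among the first k samples.
    votes = Counter(s.strip() for s in samples[:k])
    # most_common(1) returns the answer with the highest count;
    # ties go to the answer that appeared first.
    majority, _ = votes.most_common(1)[0]
    # Exact string match is an assumption; real metrics usually normalize answers first.
    return int(majority == gold.strip())
```

A call such as maj_at_k(["4", "5", "4"], "4", 3) scores the majority answer "4" against the reference. In practice answers are typically normalized before voting, which this sketch leaves out.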

New tasks

Add Ger-RAG-evaltasks. by @PhilipMay in #149
adding aimo custom eval by @NathanHB in #154

Fixes

Bump nltk to 3.9.1 to fix security issue by @NathanHB in #137
Fix max_length type when being passed in model args by @csarron in #138
Fix nanotron models input size bug by @clefourrier in #156
Fix MATH normalization by @lewtun in #162
fix Prompt function names by @clefourrier in #168
Fix prompt format german rag community task by @jphme in #171
add 'cite as' section in readme by @NathanHB in #178
Fix broken link to extended tasks in README by @alexrs in #182
Mention HF_TOKEN in readme by @Wauplin in #194
Download BERT scorer lazily by @sadra-barikbin in #190
Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
fix llm as judge warnings by @NathanHB in #173
ADD GPT-4 as Judge by @philschmid in #206
Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
Avoid truncating the outputs based on string lengths by @anton-l in #201
Now only uses functions for prompt definition by @clefourrier in #213
Data split depending on eval params by @clefourrier in #169
should fix most inference endpoints issues of version config by @clefourrier in #226
Fix _init_max_length in base_model.py by @gucci-j in #185
Make evaluator invariant of input request type order by @sadra-barikbin in #215
Fixing issues with multichoice_continuations_start_space - was not parsed properly by @clefourrier in #232
Fix IFEval metric by @lewtun in #259
change priority when choosing model dtype by @NathanHB in #263
Add grammar option to generation by @sadra-barikbin in #242
make info loggers dataclass, so that their properties have expected lifetime by @hynky1999 in #280
Remove expensive prediction run during test collection by @hynky1999 in #279
Example Configs and Docs by @RohitMidha23 in #255
Refactoring the few shot management by @clefourrier in #272
Standalone nanotron config by @hynky1999 in #285
Logging Revamp by @hynky1999 in #284
bump nltk version by @NathanHB in #290

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@NathanHB
- commit (#137)
- Add llm as judge in metrics (#146)
- Nathan add logging to metrics (#157)
- add 'cite as' section in readme (#178)
- Fix citation section in readme (#180)
- adding aimo custom eval (#154)
- fix llm as judge warnings (#173)
- launch lighteval using lighteval --args (#152)
- adds llm as judge using transformers (#223)
- Fix missing json file (#264)
- change priority when choosing model dtype (#263)
- fix the location of tasks list in the readme (#267)
- updates ifeval repo (#268)
- fix nanotron (#283)
- add vllm backend (#274)
- bump nltk version (#290)
@clefourrier
- Add config files for models (#131)
- Add fun widgets to the README (#145)
- Fix nanotron models input size bug (#156)
- no function we actually use should be named prompt_fn (#168)
- Add maj@k metric (#158)
- Homogenize logging system (#150)
- Use only dataclasses for task init (#212)
- Now only uses functions for prompt definition (#213)
- Data split depending on eval params (#169)
- should fix most inference endpoints issues of version config (#226)
- Add metrics as functions (#214)
- Quantization related issues (#224)
- Update issue templates (#235)
- remove latex writer since we don't use it (#231)
- Removes default bert scorer init (#234)
- fix (#233)
- updated piqa (#222)
- uses torch compile if provided (#248)
- Fix inference endpoint config (#244)
- Expose samples via the CLI (#228)
- Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
- Programmatic interface + cleaner management of requests (#269)
- Small file reorg (only renames/moves) (#271)
- Refactoring the few shot management (#272)
@PhilipMay
- Add Ger-RAG-evaltasks. (#149)
- Add version config option. (#181)
@shaltielshmid
- Added Namespace parameter for InferenceEndpoints, added option for passing model config directly (#147)
- Updated tgi_model and added parameters for endpoint_model (#208)
@hynky1999
- make info loggers dataclass, so that their properties have expected lifetime (#280)
- Remove expensive prediction run during test collection (#279)
- Probability Metric + New Normalization (#276)
- Standalone nanotron config (#285)
- Logging Revamp (#284)

v0.4.0

What's new

Features

New tasks

Fixes

Significant community contributions

Contributors