v0.4.0
What's new
Features
- Adds vllm as a backend for a major speedup by @NathanHB in #274
- Add llm_as_judge in metrics (using either OpenAI or Transformers) by @NathanHB in #146
- Ability to use config files for models by @clefourrier in #131
- List available tasks in the CLI with `lighteval tasks --list` by @DimbyTa in #142
- Use torch compile for a speedup by @clefourrier in #248
- Add maj@k metric by @clefourrier in #158
- Adds a dummy/random model for baseline init by @guipenedo in #220
- lighteval is now a CLI tool: `lighteval --args` by @NathanHB in #152 (see the usage sketch after this list)
- We can now log info from the metrics (for example, input and response from llm_as_judge) by @NathanHB in #157
- Configurable task versioning by @PhilipMay in #181
- Programmatic interface by @clefourrier in #269
- Probability Metric + New Normalization by @hynky1999 in #276
- Add widgets to the README by @clefourrier in #145
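For orientation, here is a minimal sketch of the new CLI entry point. The `lighteval tasks --list` command comes directly from the notes above; the `accelerate` subcommand and its flag names are assumptions based on the README at the time of this release, so treat them as illustrative and check `lighteval --help` for the exact interface.

```bash
# List all available tasks (command taken from the notes above)
lighteval tasks --list

# Run an evaluation with the accelerate backend.
# Subcommand and flag names are assumptions; verify with `lighteval --help`.
lighteval accelerate \
    --model_args "pretrained=gpt2" \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir ./evals/
```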
New tasks
- Add `Ger-RAG-eval` tasks by @PhilipMay in #149
- Add `aimo` custom eval by @NathanHB in #154
Fixes
- Bump nltk to 3.9.1 to fix a security issue by @NathanHB in #137
- Fix max_length type when being passed in model args by @csarron in #138
- Fix nanotron models input size bug by @clefourrier in #156
- Fix MATH normalization by @lewtun in #162
- Fix prompt function names by @clefourrier in #168
- Fix prompt format german rag community task by @jphme in #171
- Add 'cite as' section in readme by @NathanHB in #178
- Fix broken link to extended tasks in README by @alexrs in #182
- Mention HF_TOKEN in readme by @Wauplin in #194
- Download BERT scorer lazily by @sadra-barikbin in #190
- Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
- Fix llm as judge warnings by @NathanHB in #173
- Add GPT-4 as judge by @philschmid in #206
- Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
- Avoid truncating the outputs based on string lengths by @anton-l in #201
- Now only uses functions for prompt definition by @clefourrier in #213
- Data split depending on eval params by @clefourrier in #169
- Fix most inference endpoint issues related to version config by @clefourrier in #226
- Fix _init_max_length in base_model.py by @gucci-j in #185
- Make evaluator invariant of input request type order by @sadra-barikbin in #215
- Fix issues with multichoice_continuations_start_space not being parsed properly by @clefourrier in #232
- Fix IFEval metric by @lewtun in #259
- Change priority when choosing model dtype by @NathanHB in #263
- Add grammar option to generation by @sadra-barikbin in #242
- Make info loggers dataclasses so that their properties have the expected lifetime by @hynky1999 in #280
- Remove expensive prediction run during test collection by @hynky1999 in #279
- Example Configs and Docs by @RohitMidha23 in #255
- Refactoring the few shot management by @clefourrier in #272
- Standalone nanotron config by @hynky1999 in #285
- Logging Revamp by @hynky1999 in #284
- Bump nltk version by @NathanHB in #290
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @NathanHB
  - Bump nltk to 3.9.1 to fix security issue (#137)
  - Add llm as judge in metrics (#146)
  - Add logging to metrics (#157)
  - Add 'cite as' section in readme (#178)
  - Fix citation section in readme (#180)
  - Add aimo custom eval (#154)
  - Fix llm as judge warnings (#173)
  - Launch lighteval using `lighteval --args` (#152)
  - Add llm as judge using transformers (#223)
  - Fix missing json file (#264)
  - Change priority when choosing model dtype (#263)
  - Fix the location of the tasks list in the readme (#267)
  - Update ifeval repo (#268)
  - Fix nanotron (#283)
  - Add vllm backend (#274)
  - Bump nltk version (#290)
- @clefourrier
  - Add config files for models (#131)
  - Add fun widgets to the README (#145)
  - Fix nanotron models input size bug (#156)
  - No function we actually use should be named prompt_fn (#168)
  - Add maj@k metric (#158)
  - Homogenize logging system (#150)
  - Use only dataclasses for task init (#212)
  - Now only uses functions for prompt definition (#213)
  - Data split depending on eval params (#169)
  - Fix most inference endpoint issues related to version config (#226)
  - Add metrics as functions (#214)
  - Quantization related issues (#224)
  - Update issue templates (#235)
  - Remove latex writer since we don't use it (#231)
  - Remove default bert scorer init (#234)
  - Fix (#233)
  - Updated piqa (#222)
  - Use torch compile if provided (#248)
  - Fix inference endpoint config (#244)
  - Expose samples via the CLI (#228)
  - Fix issues with multichoice_continuations_start_space not being parsed properly (#232)
  - Programmatic interface + cleaner management of requests (#269)
  - Small file reorg (only renames/moves) (#271)
  - Refactoring the few shot management (#272)
- @PhilipMay
  - Add Ger-RAG-eval tasks (#149)
  - Configurable task versioning (#181)
- @shaltielshmid
  - Updated tgi_model and added parameters for endpoint_model (#208)
- @hynky1999
  - Probability Metric + New Normalization (#276)
  - Remove expensive prediction run during test collection (#279)
  - Make info loggers dataclasses so that their properties have the expected lifetime (#280)
  - Logging Revamp (#284)
  - Standalone nanotron config (#285)