Release Note

This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
It also adds new tasks:

MT-Bench marks the introduction of multi-turn prompting as well as the LLM-as-a-judge metric.

Add BBH by @clefourrier in #7, @bilgehanertan in #126
Add AGIEval by @clefourrier in #121
Adding TinyBench by @clefourrier in #104
Adding support for Arabic benchmarks: AlGhafa benchmarking suite by @alielfilali01 in #95
Add mt-bench by @NathanHB in #75

Extended Tasks! by @clefourrier in #101, @lewtun in #108, @NathanHB in #122, #123
Added support for launching inference endpoint with different model dtypes by @shaltielshmid in #124

Adding LICENSE by @clefourrier in #86, @NathanHB in #89
Make it clearer in the README that the leaderboard uses the harness by @clefourrier in #94

Update huggingface-hub for compatibility with datasets 2.18 by @clefourrier in #84
Tidy up dependency groups by @lewtun in #81
bump git python by @NathanHB in #90
Sets a max length for the MATH task by @clefourrier in #83
Fix parallel data processing bug by @clefourrier in #92
Change the eos condition for GSM8K by @clefourrier in #85
Fixing rolling loglikelihood management by @clefourrier in #78
Fixes input length management for generative evals by @clefourrier in #103
Reorder addition of instruction in chat template by @clefourrier in #111
Ensure chat models terminate generation with EOS token by @lewtun in #115
Fix push details to hub by @NathanHB in #98
Small fixes to InferenceEndpointModel by @shaltielshmid in #112
Fix import typo autogptq by @clefourrier in #116
Fixed the loglikelihood method in inference endpoints models by @clefourrier in #119
Fix TextGenerationResponse import from hfh by @Wauplin in #129
Do not use deprecated list_files_info by @Wauplin in #133
Update test workflow name to 'Tests' by @Wauplin in #134

New Contributors

Full Changelog: v0.2.0...v0.3.0