llms-benchmarking

CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.

benchmark reasoning vision-and-language multimodal-deep-learning human-annotation foundation-models large-language-models llms vision-language-model multimodal-large-language-models evaluation-llms llms-benchmarking

Updated Aug 6, 2024
Jupyter Notebook

epfl-dlab / cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

ai competitive-programming agents competitive-programming-contests competitive-coding llms llms-reasoning llms-benchmarking aiflows

Updated Feb 12, 2024
Python

amazon-science / llm-code-preference

Training and Benchmarking LLMs for Code Preference.

code-generation llm-training llm-evaluation llms-benchmarking

Updated Nov 15, 2024
Python

declare-lab / resta

Restore safety in fine-tuned language models through task arithmetic

alignment safety alignment-algorithm llm llms llm-safety llms-benchmarking llm-safety-benchmark

Updated Mar 28, 2024
Python

Laoyu84 / 4onebench

A minimalist benchmarking tool designed to test the routine-generation capabilities of LLMs.

agents large-language-models llms-benchmarking

Updated Nov 28, 2024
Python

multinear / multinear

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Dec 3, 2024
Svelte

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators".

nlp evaluation bias bias-detection llm llms llm-evaluation llms-benchmarking llm-as-judge llm-as-a-judge llm-as-evaluator

Updated Feb 16, 2024
Jupyter Notebook

Paulescu / text-embedding-evaluation

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️

machine-learning embeddings llms llms-benchmarking

Updated Apr 19, 2024
Python

logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

leaderboard llm chain-of-thought gen-ai llms-reasoning llms-benchmarking

Updated Oct 6, 2024
Jupyter Notebook

nachoDRT / MERIT-Dataset

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area in which we are actively working. The repository is actively maintained, and new features are added continuously.

biases synthetic-dataset-generation layoutlm synthetic-dataset layoutxlm token-classification layoutlmv3 layoutlmv2 llms-benchmarking

Updated Sep 6, 2024
Python

lechmazur / nyt-connections

A benchmark that evaluates LLMs using 436 NYT Connections puzzles.

testing benchmark evaluation puzzles reasoning llm llms-benchmarking gpt-4o

Updated Nov 5, 2024
Python

SuperBruceJia / Awesome-Mixture-of-Experts

Awesome Mixture of Experts (MoE): A Curated List of Mixture of Experts (MoE) and Mixture of Multimodal Experts (MoME)

Updated Sep 25, 2024

cosmaadrian / romath

Official repository for "RoMath: A Mathematical Reasoning Benchmark in 🇷🇴 Romanian 🇷🇴"

mathematics dataset romanian llms-benchmarking

Updated Sep 23, 2024
Python

Here are 44 public repositories matching this topic...

ChemFoundationModels / ChemLLMBench

lerogo / MMGenBench

bboylyg / BackdoorLLM

parea-ai / parea-sdk-py

JonathanChavezTamales / LLMStats

lamalab-org / chem-bench

FSoft-AI4Code / XMainframe

RaptorMai / CompBench

epfl-dlab / cc_flows

amazon-science / llm-code-preference

declare-lab / resta

Laoyu84 / 4onebench

multinear / multinear

minnesotanlp / cobbler

Paulescu / text-embedding-evaluation

logikon-ai / cot-eval

nachoDRT / MERIT-Dataset

lechmazur / nyt-connections

SuperBruceJia / Awesome-Mixture-of-Experts

cosmaadrian / romath
