Evaluate your LLM's response with Prometheus and GPT4 💯
CodeUltraFeedback: aligning large language models to coding preferences
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
Repository for the survey on bias and fairness in information retrieval (IR) with LLMs.
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
Timo: Towards Better Temporal Reasoning for Language Models (COLM 2024)
A comprehensive study of the LLM-as-a-judge paradigm in a controlled setup that reveals new results about its strengths and weaknesses.
A set of examples demonstrating how to evaluate Generative AI augmented systems using traditional information retrieval and LLM-As-A-Judge validation techniques
Notebooks for evaluating LLM-based applications using the LLM-as-a-judge pattern (see the sketch after this list).
Use Groq for evaluations
Antibodies for LLM hallucinations (combining LLM-as-a-judge, NLI, and reward models)
Explore techniques to use small models as jailbreaking judges
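Several of the repositories above implement variations of the LLM-as-a-judge pattern: a strong model scores another model's output against a rubric. Below is a minimal sketch of that pattern, assuming an OpenAI-compatible Python client; the judge model name, rubric, and score parsing are illustrative assumptions, not taken from any listed repository.

```python
# Minimal sketch of the LLM-as-a-judge pattern. Assumes an OpenAI-compatible
# client; the "gpt-4o" judge model, rubric, and integer-score parsing are
# illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the ASSISTANT ANSWER to the
USER QUESTION on a 1-5 scale for correctness and helpfulness.
Reply with only the integer score.

USER QUESTION:
{question}

ASSISTANT ANSWER:
{answer}
"""

def judge(question: str, answer: str) -> int:
    """Ask the judge model to score a candidate answer; returns 1-5."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any strong chat model can act as the judge
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic scoring
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    score = judge("What is 2 + 2?", "2 + 2 equals 4.")
    print(f"Judge score: {score}")
```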