A novel benchmark for probing the visual reasoning capabilities of large language models.
- Clone the repository:

  ```bash
  git clone https://github.com/AarushSah/Set_Eval.git
  cd Set_Eval
  ```
- Install the requirements:

  ```bash
  pip install -r requirements.txt
  ```
- Add your environment variables to the .env file (a sketch of the expected keys is shown after these steps):

  ```bash
  cp .env.example .env
  ```
- Run the evaluation for Claude 3.5 Sonnet:

  ```bash
  inspect eval evaluation.py --model anthropic/claude-3-5-sonnet-20240620
  ```
- Run the evaluation for GPT-4o:

  ```bash
  inspect eval evaluation.py --model openai/gpt-4o
  ```
- View the results:

  ```bash
  inspect view
  ```
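
The evaluations call the Anthropic and OpenAI APIs through Inspect, which reads provider keys from the environment. A minimal sketch of what the .env file likely needs, assuming .env.example uses the standard ANTHROPIC_API_KEY and OPENAI_API_KEY variable names:

```bash
# Hypothetical .env contents — check .env.example for the exact variable names.
ANTHROPIC_API_KEY=your-anthropic-key   # used for the Claude 3.5 Sonnet run
OPENAI_API_KEY=your-openai-key         # used for the GPT-4o run
```

With these set, the `inspect eval` commands above should pick up the keys automatically.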
Special thanks to Zack Witten for the idea!