Set_Eval

A novel benchmark for probing the visual reasoning capabilities of large language models.

Quickstart:

  1. Clone the repository
    git clone https://github.com/AarushSah/Set_Eval.git
  2. Install the requirements
    pip install -r requirements.txt
  3. Create a .env file from the example and add your API keys (see the sketch below this list)
    cp .env.example .env
  4. Run the evaluation with Claude 3.5 Sonnet:
    inspect eval evaluation.py --model anthropic/claude-3-5-sonnet-20240620
  5. Run the evaluation for GPT-4o:
    inspect eval evaluation.py --model openai/gpt-4o
  6. View the results:
    inspect view
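
The exact variable names are defined in .env.example. As a rough sketch, assuming the standard Anthropic and OpenAI key variables used by Inspect (ANTHROPIC_API_KEY and OPENAI_API_KEY), the .env file would look something like this:

    # Placeholder values - check .env.example for the exact variable names
    ANTHROPIC_API_KEY=sk-ant-your-key-here
    OPENAI_API_KEY=sk-your-key-here

The final step, inspect view, opens Inspect's local log viewer in your browser so you can browse the evaluation results.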

Acknowledgements:

Special thanks to Zack Witten for the idea!
