Code for Paper: Autonomous Evaluation and Refinement of Digital Agents [COLM 2024]

Berkeley-NLP/Agent-Eval-Refine

Autonomous Evaluation and Refinement of Digital Agents

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

UC Berkeley, University of Michigan

COLM 2024 / MAR Workshop CVPR 2024 Best Paper

Overview

In this study, we design and use evaluation models to both evaluate and autonomously refine the performance of digital agents that browse the web or control mobile devices.

The evaluator models and evaluation code are provided in the ./agent_eval/ folder. You can use these models, either open-weight or GPT-4V-based, to evaluate the performance of digital agents. Please refer to the Evaluation section for more details.

The refinement code and the iOS/Android emulator code are provided in the ./exps/ folder. It contains examples for running and improving a variety of agents on WebArena, Android, and iOS. Notably, it includes:

  • A Reflexion + GPT-4 agent that achieves 20.2% on WebArena, the current state of the art.
  • A refined CogAgent model that achieves a 75% relative improvement in success rate on iOS.
  • Python bindings for the iOS and Android emulators to facilitate refinement and end-to-end evaluation of digital agents.

Please refer to the Refinement section for more details.
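
The 75% figure above is a relative improvement, which is easy to confuse with a gain in percentage points. The sketch below shows the difference. The baseline and refined success rates in it are hypothetical values chosen for illustration, not numbers reported in the paper.

```python
def absolute_improvement(before: float, after: float) -> float:
    """Gain in percentage points; success rates are fractions in [0, 1]."""
    return (after - before) * 100.0


def relative_improvement(before: float, after: float) -> float:
    """Gain expressed as a percentage of the starting success rate."""
    if before <= 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Hypothetical success rates, NOT results from the paper.
baseline, refined = 0.20, 0.35
print(f"absolute: {absolute_improvement(baseline, refined):.1f} points")
print(f"relative: {relative_improvement(baseline, refined):.1f}%")
```

With these illustrative values, a 15-point absolute gain corresponds to a 75% relative gain, because the gain is measured against the 20% baseline.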

We release all models, agent trajectories, and datasets on the Hugging Face Hub.

News

  • July 10, 2024: The paper was accepted at COLM 2024. We also won the best paper award at the MAR Workshop at CVPR 2024.
  • June 14, 2024: We released DigiRL. Our 2B VLM, when post-trained with an autonomous evaluator (reward model), improves its success rate on Android device-control tasks from 17% to 67%.

Evaluation

Setup

First, install the agent_eval package:

cd agent_eval
pip install -e .

If you want to run inference with the captioner model, you also need to downgrade the transformers package to an older version:

pip install transformers==4.32.0
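
Before starting the captioner, it may help to confirm that the pinned version is the one actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `describe_pin` are illustrative and not part of agent_eval.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

PINNED_TRANSFORMERS = "4.32.0"  # the version named in the setup step above


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def describe_pin(package: str, expected: str) -> str:
    """Summarize whether `package` is installed at exactly `expected`."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found != expected:
        return f"{package} {found} found, expected {expected}"
    return f"{package} {expected} OK"


print(describe_pin("transformers", PINNED_TRANSFORMERS))
```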

Evaluate Agent Trajectories

You can evaluate agent trajectories with the scripts listed below. You can download all agent trajectories used in the paper from this link.

Open the scripts listed below and edit their configuration, set up the OpenAI API key (for GPT-4) or the Anyscale API key (for Mixtral), and then run the command that matches your domain:

cd ./agent_eval/agent_eval/scripts
# Select the right command according to the domain
python run_eval_web.py # for evaluating webarena agents
python run_eval_android.py # for evaluating android agents
python annotate_ios_dense.py # for providing dense annotations to iOS agents, later used as rewards in filtered-bc
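
If you evaluate several domains regularly, a small dispatcher can select the right script for you. The sketch below is one possible wrapper, not something shipped with the repository. The script names and directory come from the commands above, while the domain keys (`web`, `android`, `ios`) and the function names are assumptions made for illustration. It also assumes the scripts take no command-line arguments, since their configuration is edited in the files themselves.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Directory and script names are taken from the commands above.
SCRIPTS_DIR = Path("agent_eval/agent_eval/scripts")

# The domain keys on the left are illustrative labels, not repository names.
SCRIPT_FOR_DOMAIN = {
    "web": "run_eval_web.py",
    "android": "run_eval_android.py",
    "ios": "annotate_ios_dense.py",
}


def build_command(domain: str) -> List[str]:
    """Return the argv list that runs the script for `domain`."""
    try:
        script = SCRIPT_FOR_DOMAIN[domain]
    except KeyError:
        known = ", ".join(sorted(SCRIPT_FOR_DOMAIN))
        raise ValueError(f"unknown domain {domain!r}; expected one of: {known}")
    return [sys.executable, script]


def run_domain(domain: str, scripts_dir: Path = SCRIPTS_DIR) -> int:
    """Run the script from inside the scripts directory; return its exit code."""
    completed = subprocess.run(build_command(domain), cwd=scripts_dir, check=False)
    return completed.returncode
```

Calling `run_domain("web")` from the repository root would then be equivalent to changing into the scripts directory and running `python run_eval_web.py`.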

Inspect/Annotate Agent Trajectories

We define a shared UnifiedTrajectory format for storing agent trajectories; see ./agent_eval/agent_eval/domains/unified.py for its definition. To convert raw agent trajectories to UnifiedTrajectory, use the corresponding notebooks in the ./agent_eval/agent_eval/domains/ folder.
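
To illustrate the idea of a shared format, here is a hypothetical sketch of what converting a raw log into a unified record can look like. It is not the real UnifiedTrajectory: every class, field, and raw-log key below is an assumption made for illustration, and unified.py remains the authoritative definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchStep:
    screenshot_path: str  # hypothetical field: screen seen before acting
    action: str           # hypothetical field: action taken at that screen


@dataclass
class SketchTrajectory:
    task: str
    steps: List[SketchStep] = field(default_factory=list)
    success: Optional[bool] = None  # left empty until an evaluator or human labels it


def from_raw(raw: Dict[str, Any]) -> SketchTrajectory:
    """Convert a hypothetical raw log with 'goal', 'frames' and 'actions' keys."""
    frames, actions = raw["frames"], raw["actions"]
    if len(frames) != len(actions):
        raise ValueError("each frame needs exactly one action")
    steps = [SketchStep(f, a) for f, a in zip(frames, actions)]
    return SketchTrajectory(task=raw["goal"], steps=steps)


raw_log = {
    "goal": "Open the settings page",
    "frames": ["0.png", "1.png"],
    "actions": ["tap(menu)", "tap(settings)"],
}
trajectory = from_raw(raw_log)
print(trajectory.task, len(trajectory.steps))
```

Keeping one record type per trajectory means the same evaluator and annotation tools can be applied to every domain once a per-domain converter exists.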

You can inspect or provide human annotations to the agent trajectories by running the following command:

python -m agent_eval.eval.annotate_app --dataset <path-to-dataset> --log_name <log-name>

Captioner

The captioner VLM is used in the modular evaluator to produce dense descriptions of the screenshots, which are then fed into an LM that reasons about the agent's behavior. We provide a demo, the model weights, and the training data on the Hugging Face Hub.
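
The two-stage flow described above (caption each screenshot, then let an LM judge the trajectory from text) can be sketched as follows. This is a simplified illustration, not the evaluator in agent_eval: the prompt wording, the function names, and the 'success'/'failure' answer convention are all assumptions, and the captioner and LM are passed in as plain callables so the example runs without any model.

```python
from typing import Callable, Sequence

Captioner = Callable[[str], str]      # screenshot path -> text description
LanguageModel = Callable[[str], str]  # prompt -> free-text answer


def build_prompt(task: str, captions: Sequence[str], actions: Sequence[str]) -> str:
    """Assemble a text-only description of the trajectory for the LM."""
    lines = [f"Task: {task}", "Trajectory:"]
    for i, (caption, action) in enumerate(zip(captions, actions), start=1):
        lines.append(f"Step {i}: screen = {caption}; action = {action}")
    lines.append("Did the agent complete the task? Answer 'success' or 'failure'.")
    return "\n".join(lines)


def evaluate(
    task: str,
    screenshots: Sequence[str],
    actions: Sequence[str],
    captioner: Captioner,
    lm: LanguageModel,
) -> bool:
    """Caption every screenshot, then ask the LM for a verdict."""
    captions = [captioner(path) for path in screenshots]
    answer = lm(build_prompt(task, captions, actions))
    return answer.strip().lower().startswith("success")


# Stub models so the sketch runs offline; real models would replace these.
fake_captioner = lambda path: f"a settings screen ({path})"
fake_lm = lambda prompt: "Success: the settings page is open."
print(evaluate("Open settings", ["0.png"], ["tap(settings)"], fake_captioner, fake_lm))
```

Separating the captioner from the reasoning model is what makes the evaluator modular: either component can be swapped without touching the other.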

You can start the captioner server by running the following command:

python -m agent_eval.captioner.captioner_server --port <PORT_NUMBER>

The ./agent_eval/agent_eval/captioner folder also includes:

  • annotate_screenshots.py, code to annotate the screenshots with GPT-4V
  • gen_captions.sh, script to annotate a large number of screenshots with captions

Refinement

You can download all agent trajectories used in the experiment from this link.

Reflexion Agent on WebArena

Filtered-BC Refinement on iOS

  • The tasks we used are listed in exps/ios_exp/train_tasks.txt and exps/ios_exp/eval_tasks.txt
  • Please refer to exps/ios_exp/README.md for more details on how to reproduce the results.
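
In general terms, filtered behavior cloning keeps only the trajectories that an evaluator scores highly and then fine-tunes the agent on them with an ordinary supervised objective. The sketch below shows just the filtering and data-preparation steps. It is an illustration under assumptions, not the code in exps/ios_exp: the dictionary layout, the `score` callable, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Trajectory = Dict[str, object]  # hypothetical layout: {"steps": [(obs, action), ...]}


def filter_for_bc(
    trajectories: Iterable[Trajectory],
    score: Callable[[Trajectory], float],
    threshold: float = 0.5,  # illustrative cutoff, not a value from the paper
) -> List[Trajectory]:
    """Keep only trajectories whose evaluator score reaches the threshold."""
    return [t for t in trajectories if score(t) >= threshold]


def to_training_pairs(trajectories: Iterable[Trajectory]) -> List[Tuple[str, str]]:
    """Flatten kept trajectories into (observation, action) supervised pairs."""
    pairs: List[Tuple[str, str]] = []
    for trajectory in trajectories:
        for observation, action in trajectory["steps"]:
            pairs.append((observation, action))
    return pairs


rollouts = [
    {"steps": [("home screen", "tap(settings)")], "reward": 1.0},
    {"steps": [("home screen", "tap(camera)")], "reward": 0.0},
]
kept = filter_for_bc(rollouts, score=lambda t: t["reward"])
print(to_training_pairs(kept))
```

The resulting pairs would then be passed to a standard fine-tuning loop, which is outside the scope of this sketch.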

Running Agents on Android

  • The tasks we used are listed in exps/android_exp/assets/instructions.txt
  • Please refer to exps/android_exp/README.md for more details on how to reproduce the results.

Filtered-BC Refinement on Android

  • This part of the experiment shares its codebase with DigiRL.

Citation

Please consider citing our paper if you find this project helpful for your research:

@misc{pan2024autonomous,
      title={Autonomous Evaluation and Refinement of Digital Agents}, 
      author={Jiayi Pan and Yichi Zhang and Nicholas Tomlin and Yifei Zhou and Sergey Levine and Alane Suhr},
      year={2024},
      eprint={2404.06474},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}