🖋 Authors: Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, Bill Yuchen Lin
We introduce 🪄Lumos, Language Agents with Unified Data Formats, Modular Design, and Open-Source LLMs. Lumos unifies a suite of complex interactive tasks and achieves competitive performance with GPT-4/3.5-based and larger open-source agents.
- 🧩 Modular Architecture:
  - 🧩 Lumos consists of planning, grounding, and execution modules built on LLAMA-2-7B/13B and off-the-shelf APIs (see the sketch after this list).
  - 🤗 Lumos utilizes a unified data format that encompasses multiple task types, enabling the agent framework to conveniently support a range of interactive tasks.
- 🌍 Diverse Training Data:
  - 🌍 Lumos is trained with ~56K diverse, high-quality subgoal/action annotations, converted with GPT-4 from the ground-truth reasoning steps in existing benchmarks.
  - ⚒️ Lumos data can be instrumental for future research on developing open-source agents for complex interactive tasks.
- 🚀 Competitive Performance:
  - 🚀 Lumos is comparable to, or even beats, GPT-series agents on the web and complex QA tasks Mind2Web and HotpotQA, and larger open agents on math and multimodal tasks.
  - 🚀 Lumos exceeds contemporaneous agents fine-tuned with in-domain HotpotQA, Mind2Web, and ScienceQA annotations, such as FireAct, AgentLM, and AutoAct.
  - 🚀 Lumos performs better than open agent baseline formulations, including chain-of-thoughts and integrated agent training.
  - 🚀 Lumos surpasses larger open LLM agents and domain-specific agents on the unseen tasks WebShop and InterCode_SQL.
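To make the modular design concrete, here is a minimal, illustrative sketch of the iterative formulation (the function and variable names are placeholders, not the repo's actual API): the planning module proposes the next subgoal, the grounding module turns it into executable actions, and the execution module runs those actions with off-the-shelf tools.

```python
# Illustrative sketch of the Lumos-style iterative loop; names are placeholders,
# not the actual repo API. plan_model / ground_model stand in for the fine-tuned
# LLAMA-2 planning/grounding modules; execute_action stands in for external tools.
def run_agent(task, plan_model, ground_model, execute_action, max_steps=10):
    history = []  # (subgoal, actions, results) produced so far
    for _ in range(max_steps):
        # Planning module: decide the next subgoal given the task and prior results.
        subgoal = plan_model(task, history)
        if subgoal is None:  # planner indicates the task is solved
            break
        # Grounding module: translate the subgoal into low-level executable actions.
        actions = ground_model(task, subgoal, history)
        # Execution module: run each action with off-the-shelf APIs/tools.
        results = [execute_action(action) for action in actions]
        history.append((subgoal, actions, results))
    return history
```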
If you find this work relevant to your research, please feel free to cite it!
@article{yin2023lumos,
  title={{Agent Lumos: Unified and Modular Training for Open-Source Language Agents}},
  author={Yin, Da and Brahman, Faeze and Ravichander, Abhilasha and Chandu, Khyathi and Chang, Kai-Wei and Choi, Yejin and Lin, Bill Yuchen},
  journal={arXiv preprint arXiv:2311.05657},
  year={2023}
}
- [2024, Mar 18] We release the latest Lumos version:
  - 📑 Lumos paper that covers new multimodal tasks and 13B-scale model experiments
  - 🤗 Lumos demo that illustrates Lumos planning and grounding processes
- [2023, Nov 8] We release the key components for training and evaluating Lumos:
  - 💻 Lumos code for annotation generation, training and evaluation
  - 🤗 Lumos checkpoints with 7B model size
  - 🤗 Lumos training annotations and their raw data
./setup.sh
Please make sure that the cudatoolkit version in `setup.sh` aligns with your local CUDA version.
We collect all the training annotations, raw data, and prompt-converted annotations in a single Google Drive folder. It can be downloaded by running:
cd data
python -c "import gdown; gdown.download_folder('https://drive.google.com/drive/folders/1ASFhOkhezgewVxR01dQg-8KUVR8IdBlY?usp=sharing', quiet=True)"
We also provide generated annotations for planning and grounding modules in 🤗 Huggingface Datasets.
Dataset Names | 🤗 Huggingface Links |
---|---|
lumos_complex_qa_iterative | Planning, Grounding |
lumos_complex_qa_onetime | Planning, Grounding |
lumos_web_agent_iterative | Planning, Grounding |
lumos_multimodal_iterative | Planning, Grounding |
lumos_maths_iterative | Planning, Grounding |
lumos_maths_onetime | Planning, Grounding |
lumos_unified_iterative | Planning, Grounding |
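These annotation sets can also be pulled programmatically with 🤗 Datasets; a minimal sketch is below (the `ai2lumos/...` repo ID is an assumption on our part, so please double-check it against the links in the table above):

```python
# Minimal sketch: load one planning-module annotation set from the 🤗 Hub.
# The repo ID below is an assumption -- verify it against the table above.
from datasets import load_dataset

plan_data = load_dataset("ai2lumos/lumos_complex_qa_plan_iterative", split="train")
print(plan_data[0])  # inspect one subgoal/action annotation example
```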
./train.sh [MODULE] [FORMULATION]
`[MODULE]` can be either `plan` or `ground`. `[FORMULATION]` can be either `iterative` or `onetime`.
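For example, `./train.sh plan iterative` should fine-tune the planning module with the iterative formulation.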
You can adjust the fine-tuning hyperparameters and the specific task you want to fine-tune on in the training scripts, such as `finetune_llama2_plan_iterative.sh` under `scripts/train`.
We also provide the fine-tuned planning and grounding module checkpoints in 🤗 Huggingface.
Model Names | 🤗 Huggingface Links |
---|---|
lumos_complex_qa_iterative | Planning, Grounding |
lumos_complex_qa_iterative-13B | Planning, Grounding |
lumos_complex_qa_onetime | Planning, Grounding |
lumos_web_agent_iterative | Planning, Grounding |
lumos_web_agent_iterative-13B | Planning, Grounding |
lumos_maths_iterative | Planning, Grounding |
lumos_maths_onetime | Planning, Grounding |
lumos_maths_onetime-13B | Planning, Grounding |
lumos_unified_iterative | Planning, Grounding |
lumos_unified_iterative-13B | Planning, Grounding |
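The checkpoints can be loaded like any other causal LM with 🤗 Transformers; the sketch below shows the general pattern (the repo ID and prompt are placeholders rather than the exact names/templates Lumos uses; see the table above and the demo for the real prompt format):

```python
# Hypothetical usage sketch: the repo ID and the prompt below are placeholders,
# not the exact Lumos names/templates -- check the table above and the demo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai2lumos/lumos_web_agent_plan_iterative"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Please provide a reasonable subgoal-based plan to solve the given task.\nTask: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```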
Evaluation scripts for different datasets are under `scripts/eval`. For example, you can evaluate Lumos on HotpotQA by running:
./scripts/eval/hotpotqa.sh
We provide the code for generating training annotations based on raw existing benchmarks from scratch.
Before generating annotations, we first need to download the existing benchmarks that provide ground-truth intermediate reasoning steps. The raw data can be downloaded via this Google Drive folder.
python -m data.prompt_convertion \
--domain DOMAIN \
--data_fn DATA_FN \
--convert_all
`domain` covers maths, complex QA, web agent, and multimodal. `data_fn` is the path where the raw benchmarks are stored.
For multimodal task annotation generation, please download the COCO 2017 train images into `data/train/multimodal/raw_data` and unzip them.
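For example, a maths conversion run might look like `python -m data.prompt_convertion --domain maths --data_fn data/train/maths/raw_data --convert_all`; the exact `--domain` string and the `data_fn` path here are guesses, so check `data/prompt_convertion` and the layout of the downloaded folder for the accepted values.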
We greatly thank the Tulu team for providing awesome code for fine-tuning LLAMA-2. We also sincerely appreciate the contributors of zeno-build, Mind2Web, and WebShop for providing fast GPT prompting, HTML preprocessing, and evaluation docker environments.