Init
Xueqing Wu committed Jun 20, 2024
0 parents commit 45aeff8
Showing 90 changed files with 13,270 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .gitignore
.idea/
__pycache__/
checkpoints/
3 changes: 3 additions & 0 deletions .gitmodules
[submodule "viper/GLIP"]
path = viper/GLIP
url = https://github.com/sachit-menon/GLIP.git
138 changes: 138 additions & 0 deletions README.md
# VDebugger

This repo is for **VDebugger: Harnessing Execution Feedback for Debugging Visual Programs**

[Paper](), [Website](https://shirley-wu.github.io/vdebugger/index.html)

The training data and models are available on Hugging Face: https://huggingface.co/VDebugger

## Outlines

- [Environment Setup](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#environment-setup)
- [Dataset Setup](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#dataset-setup)
- [Generation and Execution of Visual Programs](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#generation-and-execution-of-visual-programs)
- [Inference of VDebugger](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#inference-of-vdebugger)

## Environment Setup

This code is partially adapted from [ViperGPT](https://github.com/cvlab-columbia/viper). We sincerely thank the authors for their great work!

To set up the environment:
1. Clone recursively:
```bash
git clone --recurse-submodules https://github.com/shirley-wu/vdebugger.git
```
2. Install PyTorch according to your own environment. We installed `torch==2.1.2` with CUDA 12.1.
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Set up the ViperGPT environment:
```bash
cd viper
bash download_models.sh
export PATH=/usr/local/cuda/bin:$PATH
cd GLIP
python setup.py clean --all build develop --user
```
5. If you need to use OpenAI APIs, write your API key into `viper/api.key`

## Dataset Setup

Please follow the guidelines below to download each dataset:
1. GQA: https://cs.stanford.edu/people/dorarad/gqa/download.html. The file structure should look as follows:
```
gqa/
├── questions
│ ├── readme.txt
│ ├── {val, test, testdev, challenge}_{all, balanced}_questions.json
│ ├── submission_all_questions.json
│ ├── train_balanced_questions.json
│ ├── train_all_questions/
└── images
└── *.jpg
```
2. TallyQA: https://github.com/manoja328/TallyQA_dataset. The file structure should look as follows:
```
tallyqa/
├── {test, train}.json
└── {train2014, val2014, VG_100K, VG_100K_2}/
└── *.jpg
```
3. NLVRv2: https://github.com/lil-lab/nlvr/tree/master/nlvr2. The file structure should look as follows:
```
nlvr2/
├── balanced_{dev, test1, test2, train}.jsonl
└── {dev, test1, test2, train}/
└── *.png
```
4. RefCOCO*: https://github.com/lichengunc/refer. The file structure should look as follows:
```
refer/
├── refcoco
│ ├── instances.json
│ ├── refs(google).p
│ └── refs(unc).p
├── refcoco+
│ ├── instances.json
│ └── refs(unc).p
├── refcocog
│ ├── instances.json
│ ├── refs(google).p
│ └── refs(umd).p
└── {train2014, train2017, val2014, val2017}/
└── *.jpg
```
5. COVR: https://covr-dataset.github.io/. The file structure should look as follows:
```
covr/
├── {train, val, test}.jsonl
├── gqa_images
│ └── *.jpg
└── imSitu_images
└── {adjusting, ...}/
└── *.jpg
```
6. RSVG: https://github.com/ZhanYang-nwpu/RSVG-pytorch. The file structure should look as follows:
```
rsvg/
├── {train, val, test}.txt
├── Annotations/
│ └── *.xml
└── JPEGImages/
└── *.jpg
```
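After downloading, a quick script can confirm that the expected files are in place. The snippet below is a minimal sketch, not part of this repo: `DATA_DIR` and the listed paths are illustrative (a few entries from the GQA tree above) and should be adapted to your own directory and dataset.

```python
import os

# Minimal layout check (illustrative): list a few paths from the GQA tree
# above and report any that are missing under DATA_DIR.
DATA_DIR = "gqa"  # adjust to your data directory

EXPECTED = [
    "questions/train_balanced_questions.json",
    "questions/submission_all_questions.json",
    "images",
]

def missing_paths(root, expected):
    """Return the expected paths that do not exist under root."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]

missing = missing_paths(DATA_DIR, EXPECTED)
print("missing:", missing or "none -- layout looks complete")
```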

## Generation and Execution of Visual Programs

Go to `viper/` for this step. We recommend first generating and then executing the visual programs in two separate steps. Taking the GQA dataset as an example:
1. Generate programs:
```bash
CONFIG_NAMES=generate/gqa python main_batch_generate.py
```
This script will load the configuration under `config/generate/gqa.yaml`. Please remember to change YOUR_DATA_DIR to your data directory. The generated code will be saved in a CSV under the `code` field.
2. Execute and evaluate programs:
```bash
CONFIG_NAMES=execute/gqa python main_batch_execute.py
```
This script will load the configuration under `config/execute/gqa.yaml`. Please also remember to update YOUR_DATA_DIR, and set the `cached_codex_path:` field to the CSV produced in step 1. The accuracy / IoU will be computed.
3. If you want to obtain execution feedback:
```bash
CONFIG_NAMES=execute/gqa python main_batch_trace.py A_RANDOM_STAMP
```
You can use the same configuration as in step 2. If you want to run multiple `main_batch_trace.py` processes at the same time, please use a different `A_RANDOM_STAMP` for each process. The execution feedback will be saved in a CSV under the `traced` field.
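The CSVs produced above are ordinary pandas-readable files. As a minimal sketch (the sample row below is made up for illustration; in practice you would call `pd.read_csv` on the file your run produced), the `code` and `traced` fields can be inspected like this:

```python
import io
import pandas as pd

# Illustrative stand-in for the CSV written by main_batch_trace.py;
# only the 'code' and 'traced' field names come from the README,
# the row contents are invented.
sample = io.StringIO(
    "query,code,traced\n"
    '"Is the cat black?","def execute_command(image): ...","-> step 1: ..."\n'
)
df = pd.read_csv(sample)

assert {"code", "traced"} <= set(df.columns)
print(df.loc[0, "code"])    # the generated visual program
print(df.loc[0, "traced"])  # its step-by-step execution feedback
```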

## Inference of VDebugger

To run inference with VDebugger, you first need to generate and execute visual programs, obtaining a CSV file that contains the `traced` field. Taking the GQA dataset and VDebugger/VDebugger-{critic, refiner}-generalist-13B as an example:
```bash
# Step 1: infer critic
python infer_critic.py VDebugger/VDebugger-critic-generalist-13B --input YOUR_CSV_CONTAINING_TRACED_FIELD --dataset gqa # output file will be written to critic-infer.csv
# Step 2: infer refiner
python infer_refine.py critic-infer.csv VDebugger/VDebugger-refiner-generalist-13B # output file will be written to critic-refine-infer.csv
```
Then you can execute the programs in `critic-refine-infer.csv` as in step 2 of [Generation and Execution of Visual Programs](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#generation-and-execution-of-visual-programs).
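Once the refined programs have been re-executed, you may want to quantify the improvement. The sketch below is purely illustrative: the column names (`answer`, `gold`) and the toy rows are assumptions, not this repo's actual schema, so substitute the fields your result CSVs actually contain.

```python
import pandas as pd

# Toy before/after results (invented data, assumed column names):
# each row pairs a predicted answer with the gold answer.
before = pd.DataFrame({"answer": ["yes", "cat", "blue"],
                       "gold":   ["yes", "dog", "blue"]})
after = pd.DataFrame({"answer": ["yes", "dog", "blue"],
                      "gold":   ["yes", "dog", "blue"]})

acc_before = (before["answer"] == before["gold"]).mean()
acc_after = (after["answer"] == after["gold"]).mean()
print(f"accuracy before debugging: {acc_before:.2f}")
print(f"accuracy after debugging:  {acc_after:.2f}")
```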

## Training of VDebugger

If you want to reproduce our training of VDebugger, please use `vdebugger/training_scripts/train_{critic, refiner}.sh`. You will need to install `deepspeed==0.14.0`.
242 changes: 242 additions & 0 deletions docs/index.html
<!DOCTYPE html>
<html>
<head>
<script src="https://kit.fontawesome.com/f8ddf9854a.js" crossorigin="anonymous"></script>
<meta charset="utf-8">
<meta name="description"
content="VDebugger: Harnessing Execution Feedback for Debugging Visual Programs">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>VDebugger</title>

<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">

<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/icon.svg">

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/explorer-index.js"></script>
<script src="./static/js/question_card.js"></script>
</head>
<body>

<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title is-bold">
<img src="./static/images/icon.jpg" style="width:1em;vertical-align: middle" alt="Logo"/>
<span style="vertical-align: middle">VDebugger</span>
</h1>
<h2 class="subtitle is-4 publication-subtitle">
Harnessing Execution Feedback for Debugging Visual Programs
</h2>
<div class="is-size-5 publication-authors">
<span class="author-block">
<a href="https://shirley-wu.github.io/">Xueqing Wu</a>,</span>
<span class="author-block">
<a href="https://rafa-zy.github.io/">Zongyu Lin</a>,</span>
<span class="author-block">
<a href="https://www.linkedin.com/in/songyan-silas-zhao/">Songyan Zhao</a>,</span>
<span class="author-block">
<a href="https://scholar.google.com/citations?user=Q5aezXQAAAAJ/">Te-Lin Wu</a>,</span>
<span class="author-block">
<a href="https://lupantech.github.io/">Pan Lu</a>,</span>
<span class="author-block">
<a href="https://vnpeng.net/">Nanyun Peng</a>,</span>
<span class="author-block">
<a href="http://web.cs.ucla.edu/~kwchang/">Kai-Wei Chang</a></span>
</div>

<div class="is-size-5 publication-authors">
<span class="author-block">University of California Los Angeles</span>
</div>

<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href=""
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://github.com/shirley-wu/vdebugger/"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://huggingface.co/VDebugger"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<p style="font-size:18px">🤗</p>
</span>
<span>Models and Data</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>

<section class="hero teaser">
<div class="container is-max-desktop">
<div class="content has-text-centered">
<img src="static/images/teaser.jpg" alt="Overview of VDebugger." width="100%"/>
</div>
<!-- </div> -->
</div>
</section>

<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<!-- <h2 class="title is-3">Introduction</h2>-->

<div class="content has-text-justified">
<p>
<i>Visual programs</i> are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems.
</p>
<p>However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning.
</p>
<p>To address this, we introduce <b>VDebugger</b>, a novel <i>critic-refiner framework</i> trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique.
</p>
<p>Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task.</p>
</div>
</div>
</div>
</div>
</section>

<section class="hero teaser">
<div class="container is-max-desktop">
<div class="content has-text-centered">
<img src="static/images/comparison.jpg" alt="Comparison against existing work." width="100%"/>
<p>Comparison against existing work.</p>
</div>
<!-- </div> -->
</div>
</section>

<section class="section">
<div class="container">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3" id="examples">Results</h2>
<div id="results-carousel" class="carousel results-carousel">
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/result_1.jpg" width="95%"/>
<p><b>Main results.</b><br/>The two baselines, SelfDebug and LDB, slightly hurt the performance, while our VDebugger consistently improves the performance on every dataset by up to 3.2% accuracy.</p>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/result_2.jpg" width="80%"/>
<p><b>Ablation study.</b><br/>The critic consistently achieves high accuracy, but the refiner success rate is less reliable.
The execution feedback consistently brings benefits to critic accuracy and the final performance, but the benefits to refiner performance are minimal.
<span style="color:red">This shows that the remaining challenges mainly lie in correcting the program after the errors are identified.</span></p>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/result_3.jpg" width="50%"/>
<p><b>Generalization to unseen LLMs:</b> VDebugger can debug visual programs generated by larger LLMs, including CodeLlama-70b, DeepSeek-Coder-33B and GPT-3.5.</p>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/result_4.jpg" width="40%"/>
<p><b>Generalization to unseen tasks:</b> when trained on all six datasets, the generalist VDebugger can generalize to two unseen tasks:
(1) RSVG, visual grounding for remote sensing images, and (2) COVR, an unseen task format requiring question answering based on a variable number of images.</p>
</div>
</div>
</div>
</div>
</div>
</div>
</section>

<section class="section">
<div class="container">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3" id="q">Qualitative Analysis</h2>
<div id="qq" class="carousel results-carousel">
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/error_breakdown_v.jpg" width="40%"/>
<p><b>Sources of errors.</b><br/>Program errors significantly affect the final performance. VDebugger consistently reduces program errors on all datasets, and can also help recover from foundation VLM errors, especially on RefCOCOg.</p>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/case_study.jpg" width="95%"/>
<p><b>Example where VDebugger fixes program error.</b></p>
</div>
</div>
<div class="box m-5">
<div class="content has-text-centered">
<img src="static/images/case_study2.jpg" width="95%"/>
<p><b>Example where VDebugger recovers from foundation model error.</b><br/>The question answering model yields the incorrect
answer "vanity" in the original program. By detecting this error, VDebugger invokes the foundation VLMs in an alternative way
and thus obtains the correct answer.</p>
</div>
</div>
</div>
</div>
</div>
</div>
</section>

<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>
TODO
</code></pre>
</div>
</section>

<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is adapted from <a href="https://nerfies.github.io/">Nerfies</a> and <a href="https://mathvista.github.io/">MathVista</a>, licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</div>
</footer>

</body>
</html>