Commit 45aeff8 (0 parents) by Xueqing Wu, committed Jun 20, 2024. Showing 90 changed files with 13,270 additions and 0 deletions.
**.gitignore**
```
.idea/
__pycache__/
checkpoints/
```
**.gitmodules**
```
[submodule "viper/GLIP"]
	path = viper/GLIP
	url = https://github.com/sachit-menon/GLIP.git
```
# VDebugger

This repo is for **VDebugger: Harnessing Execution Feedback for Debugging Visual Programs**

[Paper](), [Website](https://shirley-wu.github.io/vdebugger/index.html)

The training data and models are uploaded to Hugging Face: https://huggingface.co/VDebugger

## Outlines

- [Environment Setup](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#environment-setup)
- [Dataset Setup](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#dataset-setup)
- [Generation and Execution of Visual Programs](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#generation-and-execution-of-visual-programs)
- [Inference of VDebugger](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#inference-of-vdebugger)

## Environment Setup

This code is partially adapted from [ViperGPT](https://github.com/cvlab-columbia/viper). We sincerely thank the authors for their great work!

To set up the environment, you should:
1. Clone recursively:
```bash
git clone --recurse-submodules https://github.com/cvlab-columbia/viper.git
```
2. Install PyTorch based on your own environment. We installed `torch==2.1.2` with CUDA 12.1.
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Set up the ViperGPT environment:
```bash
cd viper
bash download_models.sh
export PATH=/usr/local/cuda/bin:$PATH
cd GLIP
python setup.py clean --all build develop --user
```
5. If you need to use OpenAI APIs, write your API key into `viper/qpi.key`.

## Dataset Setup

Please follow the guidelines below to download each dataset:
1. GQA: https://cs.stanford.edu/people/dorarad/gqa/download.html. The file structure should look as follows:
```
gqa/
├── questions
│   ├── readme.txt
│   ├── {val, test, testdev, challenge}_{all, balanced}_questions.json
│   ├── submission_all_questions.json
│   ├── train_balanced_questions.json
│   └── train_all_questions/
└── images
    └── *.jpg
```
2. TallyQA: https://github.com/manoja328/TallyQA_dataset. The file structure should look as follows:
```
tallyqa/
├── {test, train}.json
└── {train2014, val2014, VG_100K, VG_100K_2}/
    └── *.jpg
```
3. NLVRv2: https://github.com/lil-lab/nlvr/tree/master/nlvr2. The file structure should look as follows:
```
nlvr2/
├── balanced_{dev, test1, test2, train}.jsonl
└── {dev, test1, test2, train}/
    └── *.png
```
4. RefCOCO*: https://github.com/lichengunc/refer. The file structure should look as follows:
```
refer/
├── refcoco
│   ├── instances.json
│   ├── refs(google).p
│   └── refs(unc).p
├── refcoco+
│   ├── instances.json
│   └── refs(unc).p
├── refcocog
│   ├── instances.json
│   ├── refs(google).p
│   └── refs(umd).p
└── {train2014, train2017, val2014, val2017}/
    └── *.jpg
```
5. COVR: https://covr-dataset.github.io/. The file structure should look as follows:
```
covr/
├── {train, val, test}.jsonl
├── gqa_images
│   └── *.jpg
└── imSitu_images
    └── {adjusting, ...}/
        └── *.jpg
```
6. RSVG: https://github.com/ZhanYang-nwpu/RSVG-pytorch. The file structure should look as follows:
```
rsvg/
├── {train, val, test}.txt
├── Annotations/
│   └── *.xml
└── JPEGImages/
    └── *.jpg
```
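As a quick sanity check that a dataset is laid out as expected, you can load one of its annotation files in Python. The sketch below uses a hypothetical helper name (`load_gqa_questions` is not part of this repo) and assumes the GQA layout shown above, where each question file is a JSON dict mapping question id to a question record:

```python
import json
import os


def load_gqa_questions(gqa_root: str, split: str = "train_balanced"):
    """Load one GQA question file, assuming the directory layout above.

    Hypothetical helper for illustration only. Returns a dict mapping
    question id -> question record.
    """
    path = os.path.join(gqa_root, "questions", f"{split}_questions.json")
    with open(path) as f:
        return json.load(f)
```

For example, `load_gqa_questions("gqa")` would return the full training split; iterating over its values gives individual question records.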

## Generation and Execution of Visual Programs

Go to `viper/` for this step. We recommend first generating and then executing the visual programs in two separate steps. Take the GQA dataset as an example:
1. Generate programs:
```bash
CONFIG_NAMES=generate/gqa python main_batch_generate.py
```
This script loads the configuration under `config/generate/gqa.yaml`. Please remember to change YOUR_DATA_DIR to your data directory. The generated code will be saved in a CSV under the `code` field.
2. Execute and evaluate programs:
```bash
CONFIG_NAMES=execute/gqa python main_batch_execute.py
```
This script loads the configuration under `config/execute/gqa.yaml`. Please also remember to update YOUR_DATA_DIR, and change the `cached_codex_path:` field to the CSV produced in step 1. The accuracy / IoU will be computed.
3. If you want to obtain execution feedback:
```bash
CONFIG_NAMES=execute/gqa python main_batch_trace.py A_RANDOM_STAMP
```
You can use the same configuration as in step 2. If you want to run multiple `main_batch_trace.py` processes at the same time, please use a different `A_RANDOM_STAMP` for each process. The execution feedback will be saved in a CSV under the `traced` field.
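The CSVs produced by the steps above can be inspected with the standard library. A minimal sketch (the `code` and `traced` field names come from the steps above; the helper name and any other columns are assumptions):

```python
import csv


def read_field(csv_path: str, field: str):
    """Collect one column (e.g. 'code' or 'traced') from a results CSV."""
    with open(csv_path, newline="") as f:
        return [row[field] for row in csv.DictReader(f)]
```

For example, `read_field("results.csv", "code")` returns the generated programs, one string per example.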

## Inference of VDebugger

For inference with VDebugger, you must first generate and execute visual programs to obtain a CSV file containing the `traced` field. Take the GQA dataset and VDebugger/VDebugger-{critic, refiner}-generalist-13B as an example:
```bash
# Step 1: infer critic
python infer_critic.py VDebugger/VDebugger-critic-generalist-13B --input YOUR_CSV_CONTAINING_TRACED_FIELD --dataset gqa  # output file will be written to critic-infer.csv
# Step 2: infer refiner
python infer_refine.py critic-infer.csv VDebugger/VDebugger-refiner-generalist-13B  # output file will be written to critic-refine-infer.csv
```
Then you can execute the programs in `critic-refine-infer.csv` as in step 2 of [Generation and Execution of Visual Programs](https://github.com/shirley-wu/vdebugger/tree/main?tab=readme-ov-file#generation-and-execution-of-visual-programs).
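Since the critic requires an input CSV carrying the `traced` field, it can help to verify that before launching step 1. A minimal sketch (the helper name is hypothetical, not part of the repo):

```python
import csv


def has_traced_field(csv_path: str) -> bool:
    """Check that a CSV (e.g. one produced by main_batch_trace.py)
    exposes a 'traced' column in its header row."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), [])
    return "traced" in header
```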

## Training of VDebugger

If you want to reproduce our training of VDebugger, please use `vdebugger/training_scripts/train_{critic, refiner}.sh`. You will need to install `deepspeed==0.14.0`.
<!DOCTYPE html>
<html>
<head>
  <script src="https://kit.fontawesome.com/f8ddf9854a.js" crossorigin="anonymous"></script>
  <meta charset="utf-8">
  <meta name="description"
        content="VDebugger: Harnessing Execution Feedback for Debugging Visual Programs">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>VDebugger</title>

  <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
        rel="stylesheet">

  <link rel="stylesheet" href="./static/css/bulma.min.css">
  <link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
  <link rel="stylesheet" href="./static/css/bulma-slider.min.css">
  <link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
  <link rel="stylesheet"
        href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
  <link rel="stylesheet" href="./static/css/index.css">
  <link rel="icon" href="./static/images/icon.svg">

  <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
  <script defer src="./static/js/fontawesome.all.min.js"></script>
  <script src="./static/js/bulma-carousel.min.js"></script>
  <script src="./static/js/bulma-slider.min.js"></script>
  <script src="./static/js/explorer-index.js"></script>
  <script src="./static/js/question_card.js"></script>
</head>
<body>

<section class="hero">
  <div class="hero-body">
    <div class="container is-max-desktop">
      <div class="columns is-centered">
        <div class="column has-text-centered">
          <h1 class="title is-1 publication-title is-bold">
            <img src="./static/images/icon.jpg" style="width:1em;vertical-align: middle" alt="Logo"/>
            <span style="vertical-align: middle">VDebugger</span>
          </h1>
          <h2 class="subtitle is-4 publication-subtitle">
            Harnessing Execution Feedback for Debugging Visual Programs
          </h2>
          <div class="is-size-5 publication-authors">
            <span class="author-block">
              <a href="https://shirley-wu.github.io/">Xueqing Wu</a>,</span>
            <span class="author-block">
              <a href="https://rafa-zy.github.io/">Zongyu Lin</a>,</span>
            <span class="author-block">
              <a href="https://www.linkedin.com/in/songyan-silas-zhao/">Songyan Zhao</a>,</span>
            <span class="author-block">
              <a href="https://scholar.google.com/citations?user=Q5aezXQAAAAJ/">Te-Lin Wu</a>,</span>
            <span class="author-block">
              <a href="https://lupantech.github.io/">Pan Lu</a>,</span>
            <span class="author-block">
              <a href="https://vnpeng.net/">Nanyun Peng</a>,</span>
            <span class="author-block">
              <a href="http://web.cs.ucla.edu/~kwchang/">Kai-Wei Chang</a></span>
          </div>

          <div class="is-size-5 publication-authors">
            <span class="author-block">University of California, Los Angeles</span>
          </div>

          <div class="column has-text-centered">
            <div class="publication-links">
              <!-- PDF Link. -->
              <span class="link-block">
                <a href=""
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fas fa-file-pdf"></i>
                  </span>
                  <span>Paper</span>
                </a>
              </span>
              <span class="link-block">
                <a href="https://github.com/shirley-wu/vdebugger/"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <i class="fab fa-github"></i>
                  </span>
                  <span>Code</span>
                </a>
              </span>
              <span class="link-block">
                <a href="https://huggingface.co/VDebugger"
                   class="external-link button is-normal is-rounded is-dark">
                  <span class="icon">
                    <p style="font-size:18px">🤗</p>
                  </span>
                  <span>Models and Data</span>
                </a>
              </span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="content has-text-centered">
      <img src="static/images/teaser.jpg" alt="Overview of VDebugger." width="100%"/>
    </div>
  </div>
</section>

<section class="section">
  <div class="container is-max-desktop">
    <div class="columns is-centered">
      <div class="column has-text-centered">
        <!-- <h2 class="title is-3">Introduction</h2>-->

        <div class="content has-text-justified">
          <p>
            <i>Visual programs</i> are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems.
          </p>
          <p>However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning.
          </p>
          <p>To address this, we introduce <b>VDebugger</b>, a novel <i>critic-refiner framework</i> trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique.
          </p>
          <p>Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task.</p>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="hero teaser">
  <div class="container is-max-desktop">
    <div class="content has-text-centered">
      <img src="static/images/comparison.jpg" alt="Comparison against existing work." width="100%"/>
      <p>Comparison against existing work.</p>
    </div>
  </div>
</section>

<section class="section">
  <div class="container">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3" id="examples">Results</h2>
        <div id="results-carousel" class="carousel results-carousel">
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/result_1.jpg" width="95%"/>
              <p><b>Main results.</b><br/>The two baselines, SelfDebug and LDB, slightly hurt performance, while our VDebugger consistently improves performance on every dataset, by up to 3.2% accuracy.</p>
            </div>
          </div>
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/result_2.jpg" width="80%"/>
              <p><b>Ablation study.</b><br/>The critic consistently achieves high accuracy, but the refiner success rate is less reliable.
                The execution feedback consistently benefits critic accuracy and the final performance, but its benefits to refiner performance are minimal.
                <span style="color:red">This shows that the remaining challenges mainly lie in correcting the program after the errors are identified.</span></p>
            </div>
          </div>
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/result_3.jpg" width="50%"/>
              <p><b>Generalization to unseen LLMs:</b> VDebugger can debug visual programs generated by larger LLMs, including CodeLlama-70b, DeepSeek-Coder-33B and GPT-3.5.</p>
            </div>
          </div>
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/result_4.jpg" width="40%"/>
              <p><b>Generalization to unseen tasks:</b> when trained on all six datasets, the generalist VDebugger can generalize to two unseen tasks:
                (1) RSVG, visual grounding for remote sensing images, and (2) COVR, an unseen task format requiring question answering based on a variable number of images.</p>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section">
  <div class="container">
    <div class="columns is-centered has-text-centered">
      <div class="column is-four-fifths">
        <h2 class="title is-3" id="q">Qualitative Analysis</h2>
        <div id="qq" class="carousel results-carousel">
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/error_breakdown_v.jpg" width="40%"/>
              <p><b>Sources of errors.</b><br/>Program errors significantly affect the end performance. VDebugger consistently reduces program errors on all datasets, and can also help recover from foundation VLM errors, especially on RefCOCOg.</p>
            </div>
          </div>
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/case_study.jpg" width="95%"/>
              <p><b>Example where VDebugger fixes a program error.</b></p>
            </div>
          </div>
          <div class="box m-5">
            <div class="content has-text-centered">
              <img src="static/images/case_study2.jpg" width="95%"/>
              <p><b>Example where VDebugger recovers from a foundation model error.</b><br/>The question answering model yields the incorrect
                answer "vanity" in the original program. By detecting this error, VDebugger invokes the foundation VLMs in an alternative way
                and thus obtains the correct answer.</p>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>
</section>

<section class="section" id="BibTeX">
  <div class="container is-max-desktop content">
    <h2 class="title">BibTeX</h2>
    <pre><code>
TODO
    </code></pre>
  </div>
</section>

<footer class="footer">
  <div class="container">
    <div class="columns is-centered">
      <div class="column is-8">
        <div class="content">
          <p>
            This website is adapted from <a href="https://nerfies.github.io/">Nerfies</a> and <a href="https://mathvista.github.io/">MathVista</a>, licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative
            Commons Attribution-ShareAlike 4.0 International License</a>.
          </p>
        </div>
      </div>
    </div>
  </div>
</footer>

</body>
</html>