feat: Add a wrapper for solve #2

Closed · wants to merge 43 commits
Commits — changes from all 43 commits
04fad95
feat: Add a wrapper to the `solve` command
dustinbyrne Jun 1, 2024
f65005b
ci: Add an action to run `solve` over SWE bench
dustinbyrne Jun 3, 2024
5da05dc
fix: Write issue descriptions into individual directories
kgilpin Jun 3, 2024
5b90c73
fixup! Checkout base commit, add `--keep` argument
dustinbyrne Jun 3, 2024
dae867a
drop: Run only requests tests
kgilpin Jun 3, 2024
f05235a
feat: Capture solver logs
kgilpin Jun 3, 2024
484c84c
feat: Support linting
kgilpin Jun 3, 2024
e883e41
fixup
kgilpin Jun 3, 2024
2f2eecf
Run with a single process pool, use swebench env management
dustinbyrne Jun 3, 2024
c7c42b8
fixup! Run with a single process pool, use swebench env management
dustinbyrne Jun 3, 2024
12e1386
Merge branch 'feat/limited-run-requests-only' into feat/solve
dustinbyrne Jun 3, 2024
5045fcb
filter sympy
dustinbyrne Jun 3, 2024
ca6a13c
include conda path
dustinbyrne Jun 3, 2024
74c37cd
validate conda path exists
dustinbyrne Jun 3, 2024
f25b78e
update appmap-js
dustinbyrne Jun 3, 2024
534fc08
drop: ssh debug
dustinbyrne Jun 4, 2024
c7e03e5
Revert "drop: ssh debug"
dustinbyrne Jun 4, 2024
9c0bfdf
Add reporting script
dustinbyrne Jun 4, 2024
b1ba0d0
generate and output csv to artifacts
dustinbyrne Jun 4, 2024
804c978
skip problematic task envs
dustinbyrne Jun 4, 2024
d95df5c
feat: Bring solver Python code over from appmap-js
kgilpin Jun 4, 2024
18edaaa
drop: Run 'requests' (smaller)
kgilpin Jun 4, 2024
510901e
ci: Use 8 workers
kgilpin Jun 4, 2024
7682d0c
fixup! feat: Bring solver Python code over from appmap-js
kgilpin Jun 4, 2024
0a4fcf0
feat: Dedicated log files for lint and diff
kgilpin Jun 4, 2024
f68311e
feat: Ignore more lint codes
kgilpin Jun 4, 2024
eaa6a25
ci: Use 6 workers
kgilpin Jun 4, 2024
2a34c26
drop: Filter astropy
kgilpin Jun 4, 2024
0de490e
catch testbed initialization errors
dustinbyrne Jun 4, 2024
923bf3c
feat: Log commands into their own file
kgilpin Jun 4, 2024
12edc9e
re-daemonize pool
dustinbyrne Jun 4, 2024
fb616d0
feat: More lint ignores
kgilpin Jun 4, 2024
8396761
refactor: Use relative imports
kgilpin Jun 4, 2024
927f73d
Merge pull request #9 from getappmap/feat/log-commands
kgilpin Jun 4, 2024
57ad377
chore: Navie-driven workflow improvements
dividedmind Jun 4, 2024
4fae628
fix: Don't try to create the same conda env many times in parallel
dividedmind Jun 4, 2024
bbd3035
ci: Parametrize workflow_dispatch on the dataset
dividedmind Jun 4, 2024
6e1c1ca
ci: Use correct dir for conda cache
dividedmind Jun 4, 2024
c7457df
ci: Build appmap-js even with cache hit
dividedmind Jun 4, 2024
df66c85
feat: --retry until a patch is applied
kgilpin Jun 4, 2024
9dea467
ci: Add retries parameter
kgilpin Jun 4, 2024
3407b27
ci: Only run on labeled PRs
dustinbyrne Jun 4, 2024
820e05d
fixup! ci: Only run on labeled PRs
dustinbyrne Jun 4, 2024
Files changed
139 changes: 139 additions & 0 deletions .github/workflows/solve.yml
@@ -0,0 +1,139 @@
on:
  workflow_dispatch:
    inputs:
      filter:
        description: "Instance filter"
        required: true
        default: marshmallow
      dataset:
        description: "Dataset name"
        required: true
        default: princeton-nlp/SWE-bench_Lite
      split:
        description: "Dataset split"
        required: true
        default: dev
      retries:
        description: "Number of retries to perform on each instance until a patch is found"
        required: false
        default: "3"

  pull_request:

jobs:
  solve:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'evaluate') || github.event_name == 'workflow_dispatch' }}
    runs-on: swe-bench-ubuntu-latest
    defaults:
      run:
        shell: bash -leo pipefail {0}
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          submodules: true

      - name: Set up Python
        uses: actions/setup-python@v4

      # Cache the conda environment
      - name: Cache conda environment
        id: cache-conda
        uses: actions/cache@v3
        with:
          path: /usr/share/miniconda/envs/swe-bench
          key: conda-${{ runner.os }}-${{ hashFiles('environment.yml') }}

      # Create conda env if cache miss happens
      - name: Create conda env
        if: steps.cache-conda.outputs.cache-hit != 'true'
        run: |
          conda init bash
          conda env create -f environment.yml
          pip install flake8 black

      # Cache the appmap-js build
      - name: Cache appmap-js build
        uses: actions/cache@v3
        id: cache-appmap-js
        with:
          path: |
            submodules/appmap-js/node_modules
            submodules/appmap-js/packages/*/built
          key: appmap-js-${{ runner.os }}-${{ hashFiles('submodules/appmap-js/package.json') }}

      - name: Build submodules
        # TODO: figure out why it doesn't work with cache
        # if: steps.cache-appmap-js.outputs.cache-hit != 'true'
        env:
          PUPPETEER_SKIP_DOWNLOAD: true
        run: |
          cd submodules/appmap-js
          git checkout -- .
          yarn
          yarn build
          chmod +x packages/cli/built/cli.js

      - name: Run benchmark
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          SWE_DATASET: ${{ inputs.dataset }}
          SWE_SPLIT: ${{ inputs.split }}
          SWE_FILTER: ${{ inputs.filter }}
          SWE_RETRIES: ${{ inputs.retries }}
        run: |
          source /usr/share/miniconda/etc/profile.d/conda.sh
          conda activate swe-bench
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          python appmap/solve.py \
            --instances ${SWE_DATASET:-princeton-nlp/SWE-bench_Lite} \
            --split ${SWE_SPLIT:-dev} \
            --filter ${SWE_FILTER:-marshmallow} \
            --retries ${SWE_RETRIES:-3} \
            --appmap_command $(pwd)/submodules/appmap-js/packages/cli/built/cli.js \
            --lint_command "flake8 --extend-ignore=BLK100,W293,E201,E202,E303,E501,E128,E231,C408,F401,C402,E402,C416,E261,E302,D" \
            --temp_dir ${{ runner.temp }} \
            --num_workers 6 \
            --path_conda $(conda info --base) \
            --verbose

      - name: Run evaluation
        env:
          SWE_DATASET: ${{ inputs.dataset }}
        run: |
          mkdir -p logs
          source /usr/share/miniconda/etc/profile.d/conda.sh
          conda activate swe-bench
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          python swebench/harness/run_evaluation.py \
            --predictions_path predictions.jsonl \
            --swe_bench_tasks ${SWE_DATASET:-princeton-nlp/SWE-bench_Lite} \
            --log_dir logs \
            --testbed ${{ runner.temp }} \
            --skip_existing \
            --timeout 900 \
            --verbose \
            --num_processes 8 \
            --path_conda $(conda info --base)

      - name: Generate AppMap report
        env:
          SWE_DATASET: ${{ inputs.dataset }}
          SWE_SPLIT: ${{ inputs.split }}
        run: |
          source /usr/share/miniconda/etc/profile.d/conda.sh
          conda activate swe-bench
          export PYTHONPATH=$PYTHONPATH:$(pwd)
          conda info
          python appmap/report.py \
            --instances ${SWE_DATASET:-princeton-nlp/SWE-bench_Lite} \
            --split ${SWE_SPLIT:-dev}

      - name: Archive predictions and logs
        uses: actions/upload-artifact@v4
        with:
          name: results
          path: |
            logs/
            predictions.jsonl
            results.csv
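
The three steps above are glued together by predictions.jsonl: the benchmark step produces it, run_evaluation.py consumes it, and report.py summarizes the evaluation logs. A minimal sketch of one record, assuming the standard SWE-bench prediction keys (instance_id, model_name_or_path, model_patch); the instance id and patch below are illustrative, not taken from this PR:

# Hypothetical sketch of a predictions.jsonl record, assuming the standard
# SWE-bench prediction keys; the id and patch content are illustrative.
import json

prediction = {
    "instance_id": "marshmallow-code__marshmallow-1343",  # example id
    "model_name_or_path": "navie",
    "model_patch": "diff --git a/src/example.py b/src/example.py\n...",
}

# Each prediction is one JSON object per line.
with open("predictions.jsonl", "a") as f:
    f.write(json.dumps(prediction) + "\n")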
6 changes: 6 additions & 0 deletions .gitignore
@@ -174,3 +174,9 @@ analysis/evaluation/*.csv
 analysis/evaluation/*.pdf
 data/repos/copies
 notebooks/
+*.csv
+appmap.sh
+work
+appmap/datasets
+logs

4 changes: 4 additions & 0 deletions .gitmodules
@@ -0,0 +1,4 @@
[submodule "submodules/appmap-js"]
	path = submodules/appmap-js
	url = https://github.com/getappmap/appmap-js
	branch = feat/apply-command
Empty file added appmap/__init__.py
Empty file.
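
One small but load-bearing detail: the empty appmap/__init__.py makes appmap importable as a package once the workflow adds the repo root to PYTHONPATH — e.g. (illustrative):

# Resolves only because appmap/__init__.py exists and the repo root is on
# PYTHONPATH, as exported by the workflow's run steps.
from appmap.data import load_data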
17 changes: 17 additions & 0 deletions appmap/data.py
@@ -0,0 +1,17 @@
from datasets import Dataset, load_dataset, load_from_disk
from pathlib import Path

datasets_dir = Path(__file__).parent / "datasets"


def load_data(dataset_name, split) -> Dataset:
    dataset_dir = datasets_dir / dataset_name.replace("/", "__")
    if dataset_dir.exists():
        dataset = load_from_disk(str(dataset_dir))
    else:
        dataset = load_dataset(dataset_name)
        dataset_dir.mkdir(parents=True)
        dataset.save_to_disk(str(dataset_dir))

    return dataset[split]
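
A usage sketch (not part of the diff; dataset name taken from the workflow defaults): the first call downloads the dataset from the Hugging Face Hub and saves it under appmap/datasets, and subsequent calls read the saved copy from disk.

# Illustrative only: exercise load_data's disk cache.
from appmap.data import load_data

dev = load_data("princeton-nlp/SWE-bench_Lite", "dev")  # downloads, then caches
dev = load_data("princeton-nlp/SWE-bench_Lite", "dev")  # now served from disk
print(len(dev))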
10 changes: 5 additions & 5 deletions appmap/make_appmaps.py
@@ -1,12 +1,12 @@
 import argparse, glob, itertools, os, tarfile, subprocess
 
 from multiprocessing import Pool, cpu_count
-from swebench.harness.constants import MAP_REPO_TO_TEST_FRAMEWORK, PatchType
+from swebench.harness.constants import MAP_REPO_TO_TEST_FRAMEWORK
 from swebench.harness.context_manager import (
     TaskEnvContextManager,
     TestbedContextManager,
 )
-from swebench.harness.utils import get_instances, split_instances, DotDict
+from swebench.harness.utils import split_instances, DotDict
 from swebench.metrics.getters import get_eval_refs
@@ -36,9 +36,9 @@ def validate_args(args):
 
     # If value is provided, check that it is valid
     if args.timeout is not None and args.timeout < 0:
-        raise ValueError(f"Timeout must be a positive integer")
+        raise ValueError("Timeout must be a positive integer")
     if args.num_workers is not None and args.num_workers < 1:
-        raise ValueError(f"Number of workers must be a positive integer")
+        raise ValueError("Number of workers must be a positive integer")
 
     if not os.path.exists(appmap_bin):
         raise ValueError(f"Could not find appmap binary at {args.appmap_bin}")
@@ -252,7 +252,7 @@ def main(args):
     "--num_workers", type=int, default=None, help="(Optional) Number of workers"
 )
 parser.add_argument(
-    "--appmap-bin",
+    "--appmap_bin",
     type=str,
     help="path to appmap binary",
     default="~/.appmap/bin/appmap",
1 change: 0 additions & 1 deletion appmap/navie_issue.py
@@ -9,7 +9,6 @@
 from datasets import DatasetDict, load_dataset, load_from_disk
 
 from swebench.harness.utils import clone_to
-from swebench.metrics.getters import get_eval_refs
 from subprocess import PIPE, Popen
 import json
 from filelock import FileLock
96 changes: 96 additions & 0 deletions appmap/report.py
@@ -0,0 +1,96 @@
import argparse
import csv
import os

from swebench import get_model_report
from appmap.data import load_data


def main(predictions, instances, log_dir, model, split, save_results, verbose, output):
    report = get_model_report(
        model=model,
        predictions_path=os.path.abspath(predictions),
        swe_bench_tasks=instances,
        log_dir=os.path.join(log_dir, model),
        verbose=verbose,
    )

    for k, v in report.items():
        print(f"{k}: {len(v)}")

    if save_results:
        dataset = load_data(instances, split)
        write_csv_report(
            report,
            dataset,
            split,
            output,
        )


def write_csv_report(report_map, dataset, split, output_csv_path):
    # Prepare CSV headers
    headers = ["instance_id", "split"] + [
        key for key in report_map.keys() if key != "no_generation"
    ]

    all_preds = set()
    for ids in report_map.values():
        all_preds.update(ids)

    # Write to CSV
    with open(output_csv_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=headers)
        writer.writeheader()
        for instance in dataset.to_list():
            if instance["instance_id"] not in all_preds:
                continue
            row = {"instance_id": instance["instance_id"], "split": split}
            for category in headers[len(row):]:
                row[category] = instance["instance_id"] in report_map.get(category, [])
            writer.writerow(row)

    print(f"Wrote {len(all_preds)} predictions to {output_csv_path}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--predictions",
        type=str,
        default="predictions.jsonl",
        help="Path to predictions file",
    )
    parser.add_argument(
        "--instances",
        type=str,
        help="huggingface name of task instances dataset",
        default="princeton-nlp/SWE-bench_Lite",
    )
    parser.add_argument(
        "--log_dir", type=str, help="Path to log directory", default="logs"
    )
    parser.add_argument(
        "--model",
        type=str,
        default="navie",
        help="Name of folder containing model evaluation results (e.g. '20240402_sweagent_gpt4')",
    )
    parser.add_argument(
        "--split",
        type=str,
        default="test",
        help="Name of split to get evaluation results for (should be parent folder, e.g. 'test', 'dev')",
        choices=["test", "dev"],
    )
    parser.add_argument(
        "--save_results", default=True, action="store_true", help="Save results to file"
    )
    parser.add_argument(
        "--verbose", action="store_true", help="Show intermediate messages"
    )
    parser.add_argument(
        "--output", type=str, default="results.csv", help="Path to output file"
    )
    args = parser.parse_args()
    main(**vars(args))
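
For readers of results.csv: each row is one predicted instance, with a boolean column per report category (and "no_generation" excluded). A hedged sketch, assuming category names like those swebench's get_model_report returns; the instance id is illustrative:

# Hypothetical shape of get_model_report output; the actual category names
# come from swebench (e.g. "generated", "applied", "resolved").
report_map = {
    "generated": ["marshmallow-code__marshmallow-1343"],
    "applied": ["marshmallow-code__marshmallow-1343"],
    "resolved": [],
}
# write_csv_report would then emit, in results.csv:
#
#   instance_id,split,generated,applied,resolved
#   marshmallow-code__marshmallow-1343,dev,True,True,False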