rerun benchmarks after fix and update results
gmickel committed Aug 16, 2024
1 parent 22442a9 commit cc5c791
Showing 5 changed files with 2,511 additions and 11 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -431,9 +431,9 @@ CodeWhisper's performance has been evaluated across different models using the E

| Model | Tests Passed | Time (s) | Cost ($) | Command |
| -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
-| claude-3-5-sonnet-20240620 | 80.27% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
+| claude-3-5-sonnet-20240620 | 80.26% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
| gpt-4o-2024-08-06 | 81.51% | 986.68 | 1.6800 | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |
-| deepseek-coder | 76.89% | 5850.58 | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder` |
+| deepseek-coder | 76.98% | 5850.58 | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder` |

\*The cost calculation was not working properly for this benchmark run.

4 changes: 2 additions & 2 deletions benchmark/README.md
@@ -13,9 +13,9 @@ CodeWhisper's performance has been evaluated across different models using the E

| Model | Tests Passed | Time (s) | Cost ($) | Command |
| -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
-| claude-3-5-sonnet-20240620 | 80.27% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
+| claude-3-5-sonnet-20240620 | 80.26% | 1619.49 | 3.4000 | `./benchmark/run_benchmark.sh --workers 5 --no-plan` |
| gpt-4o-2024-08-06 | 81.51% | 986.68 | 1.6800 | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |
-| deepseek-coder | 76.89% | 5850.58 | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder` |
+| deepseek-coder | 76.98% | 5850.58 | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder` |

\*The cost calculation was not working properly for this benchmark run.

@@ -5,7 +5,9 @@
- **Total time:** 1619.49 seconds
- **Total cost:** $3.4000
- **Passed exercises:** 104/133 (78.20%)
-- **Total tests passed:** 22601/28155 (80.27%)
+- **Total tests passed:** 3595/4479 (80.26%)
+
+## Detailed Results

### 2. acronym

@@ -17,10 +19,6 @@
- **Failed tests:**
- test_underscore_emphasis

-# CodeWhisper Benchmark Report
-
-## Detailed Results
-
### 1. accumulate

- **Time taken:** 9.86 seconds
@@ -701,7 +699,7 @@
- **Cost:** $0.0400
- **Mode used:** diff
- **Model used:** claude-3-5-sonnet-20240620
-- **Tests passed:** 19544/24447 (79.94%)
+- **Tests passed:** 58/73 (79.45%)
- **Failed tests:**
- test_can_find_path_from_nodes_other_than_x
- test_can_find_path_not_involving_root
@@ -1176,4 +1174,3 @@
- **Failed tests:**
- test_resident_who_drinks_water
- test_resident_who_owns_zebra
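
As a sanity check on the corrected figures in this commit, the percentages can be recomputed from the passed/total counts shown in the report hunks. The snippet below is a minimal sketch, not code from the CodeWhisper repository: `pass_rate` is a hypothetical helper name, and it assumes the report formats each rate as a percentage with two decimal places.

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a passed/total pair in the report's style (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: the report shows percentages rounded to two decimals.
    return f"{passed}/{total} ({passed / total * 100:.2f}%)"


if __name__ == "__main__":
    print(pass_rate(3595, 4479))  # 3595/4479 (80.26%)
    print(pass_rate(104, 133))    # 104/133 (78.20%)
    print(pass_rate(58, 73))      # 58/73 (79.45%)
```

All three results agree with the updated values in the diff.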
