rerun benchmarks after fix and update results

gmickel · Aug 16, 2024 · cc5c791
1 parent 22442a9

Showing 5 changed files with 2,511 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -431,9 +431,9 @@ CodeWhisper's performance has been evaluated across different models using the E
 
 | Model                      | Tests Passed | Time (s) | Cost ($) | Command                                                                        |
 | -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
-| claude-3-5-sonnet-20240620 | 80.27%       | 1619.49  | 3.4000   | `./benchmark/run_benchmark.sh --workers 5 --no-plan`                           |
+| claude-3-5-sonnet-20240620 | 80.26%       | 1619.49  | 3.4000   | `./benchmark/run_benchmark.sh --workers 5 --no-plan`                           |
 | gpt-4o-2024-08-06          | 81.51%       | 986.68   | 1.6800   | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |
-| deepseek-coder             | 76.89%       | 5850.58  | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder`    |
+| deepseek-coder             | 76.98%       | 5850.58  | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder`    |
 
 \*The cost calculation was not working properly for this benchmark run.
 

diff --git a/benchmark/README.md b/benchmark/README.md
@@ -13,9 +13,9 @@ CodeWhisper's performance has been evaluated across different models using the E
 
 | Model                      | Tests Passed | Time (s) | Cost ($) | Command                                                                        |
 | -------------------------- | ------------ | -------- | -------- | ------------------------------------------------------------------------------ |
-| claude-3-5-sonnet-20240620 | 80.27%       | 1619.49  | 3.4000   | `./benchmark/run_benchmark.sh --workers 5 --no-plan`                           |
+| claude-3-5-sonnet-20240620 | 80.26%       | 1619.49  | 3.4000   | `./benchmark/run_benchmark.sh --workers 5 --no-plan`                           |
 | gpt-4o-2024-08-06          | 81.51%       | 986.68   | 1.6800   | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model gpt-4o-2024-08-06` |
-| deepseek-coder             | 76.89%       | 5850.58  | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder`    |
+| deepseek-coder             | 76.98%       | 5850.58  | 0.0000\* | `./benchmark/run_benchmark.sh --workers 5 --no-plan --model deepseek-coder`    |
 
 \*The cost calculation was not working properly for this benchmark run.
 

diff --git a/benchmark/reports/benchmark_report_claude_sonnet_diff_reference.md b/benchmark/reports/benchmark_report_claude_sonnet_diff_reference.md
@@ -5,7 +5,9 @@
 - **Total time:** 1619.49 seconds
 - **Total cost:** $3.4000
 - **Passed exercises:** 104/133 (78.20%)
-- **Total tests passed:** 22601/28155 (80.27%)
+- **Total tests passed:** 3595/4479 (80.26%)
+
+## Detailed Results
 
 ### 2. acronym
 
@@ -17,10 +19,6 @@
 - **Failed tests:**
   - test_underscore_emphasis
 
-# CodeWhisper Benchmark Report
-
-## Detailed Results
-
 ### 1. accumulate
 
 - **Time taken:** 9.86 seconds
@@ -701,7 +699,7 @@
 - **Cost:** $0.0400
 - **Mode used:** diff
 - **Model used:** claude-3-5-sonnet-20240620
-- **Tests passed:** 19544/24447 (79.94%)
+- **Tests passed:** 58/73 (79.45%)
 - **Failed tests:**
   - test_can_find_path_from_nodes_other_than_x
   - test_can_find_path_not_involving_root
@@ -1176,4 +1174,3 @@
 - **Failed tests:**
   - test_resident_who_drinks_water
   - test_resident_who_owns_zebra
-
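
The corrected figures in the Claude report can be cross-checked from the raw counts shown in the diff. The following is a minimal sketch, assuming the reported percentage is simply passed divided by total, rounded to two decimals; `pass_rate` is a hypothetical helper for illustration and is not part of the benchmark scripts.

```python
def pass_rate(passed: int, total: int) -> str:
    """Format a passed/total pair the way the report lines appear to.

    Assumption: the percentage is passed / total * 100, shown with
    two decimal places. This is inferred from the numbers in the diff,
    not taken from the benchmark code itself.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{passed}/{total} ({passed / total * 100:.2f}%)"


if __name__ == "__main__":
    # Counts copied from the updated report lines in this commit.
    print("Passed exercises:  ", pass_rate(104, 133))
    print("Total tests passed:", pass_rate(3595, 4479))
    print("Single exercise:   ", pass_rate(58, 73))
```

Under that assumption, the updated counts reproduce the percentages in the new report lines (78.20%, 80.26%, and 79.45%), which is consistent with the fix having corrected the test counts rather than the percentage formula.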