Summary: The biggest movement is 4% on HF inference and 9% on TIMM inference. Note that this is max-autotune mode, so we are more tolerant of compilation-time increases. We could improve compilation time by limiting:

```
# How many of the top triton kernels to benchmark epilogues on
max_epilogue_benchmarked_choices = 3
```

There is an hf_Whisper failure, which you can reproduce on main without this stack with:

```
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper
```

Turning off epilogue fusion fixes the accuracy. I bisected the failure to a single epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.

Inference:

<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">

Training:

<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">

X-link: pytorch/pytorch#124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825

Reviewed By: jeanschmidt

Differential Revision: D56379580

Pulled By: eellison

fbshipit-source-id: 8e11d1636a2f48bb8c8d0380dad3a2ac76294422