
Add CUDA Memory Leak error bisection #2205

Closed. xuzhao9 wants to merge 14 commits.
Conversation

xuzhao9 (Contributor) commented Mar 22, 2024

Support CUDA memory leak detection in bisection.

Test plan:
https://github.com/pytorch/benchmark/actions/runs/8405455144

Start hash: 90fdee15be285a6a54587f44e9f3dcfed8e0efd0
End hash: cfaed59ce73871658958bb3d0c08d820c2595e62
Userbenchmark: test_bench
Userbenchmark args: -m sam_fast -d cuda -t eval --memleak
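The `--memleak` flag reports a boolean `memleak` metric per model/test combination (visible in the result digest below). As a rough illustration only, not TorchBench's actual implementation, a leak check can compare allocator usage before and after repeated eval passes; the `snapshot` callable here is a stand-in for something like `torch.cuda.memory_allocated()`, and `FakeAlloc` is a hypothetical toy so the sketch runs without a GPU:

```python
def detect_memleak(run_once, snapshot, warmup=2, iters=5, tolerance=0):
    """Return True if allocator usage grows across repeated runs.

    run_once -- callable executing one eval pass
    snapshot -- callable returning current allocated bytes
                (stand-in for torch.cuda.memory_allocated())
    """
    for _ in range(warmup):          # let caches/memory pools stabilize
        run_once()
    baseline = snapshot()
    for _ in range(iters):
        run_once()
    return snapshot() - baseline > tolerance

# Toy demo with a fake allocator counter instead of a real device:
class FakeAlloc:
    def __init__(self, leak_per_call):
        self.used = 0
        self.leak = leak_per_call
    def run(self):
        self.used += self.leak       # leaked bytes accumulate per call
    def allocated(self):
        return self.used

leaky = FakeAlloc(leak_per_call=1024)
clean = FakeAlloc(leak_per_call=0)
print(detect_memleak(leaky.run, leaky.allocated))  # True
print(detect_memleak(clean.run, clean.allocated))  # False
```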

@xuzhao9 xuzhao9 changed the title Simple fix on extended configs Add CUDA Memory Leak error bisection Mar 23, 2024
xuzhao9 (Contributor, Author) commented Mar 24, 2024

Result:

{
  "target_repo": "pytorch",
  "start": "6b5259e50704aede43c87fed33f64224f9047087",
  "start_version": "2.4.0.dev20240323+cu121",
  "end": "cc0cadaf4c76c91d19474a1a512c9bc31e2c8602",
  "end_version": "2.4.0.dev20240323+cu121",
  "result": [
    {
      "commit1": "91ead3eae4c",
      "commit1_time": "2024-03-21 01:56:42 +0000",
      "commit1_digest": {
        "name": "test_bench",
        "environ": {
          "pytorch_git_version": "91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2",
          "pytorch_version": "2.4.0a0+git91ead3e",
          "device": "NVIDIA A100-SXM4-40GB",
          "git_commit_hash": "91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2"
        },
        "metrics": {
          "model=sam_fast, test=eval, device=cuda, bs=None, extra_args=['--memleak'], metric=memleak": "False"
        }
      },
      "commit2": "e9dcda5cba9",
      "commit2_time": "2024-03-21 01:57:08 +0000",
      "commit2_digest": {
        "name": "test_bench",
        "environ": {
          "pytorch_git_version": "e9dcda5cba92884be6432cf65a777b8ed708e3d6",
          "pytorch_version": "2.4.0a0+gite9dcda5",
          "device": "NVIDIA A100-SXM4-40GB",
          "git_commit_hash": "e9dcda5cba92884be6432cf65a777b8ed708e3d6"
        },
        "metrics": {
          "model=sam_fast, test=eval, device=cuda, bs=None, extra_args=['--memleak'], metric=memleak": "True"
        }
      }
    }
  ]
}
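The digest above shows the `memleak` metric flipping from `"False"` at commit `91ead3eae4c` to `"True"` at the adjacent commit `e9dcda5cba9`, which is exactly the transition a bisector looks for. The core search can be sketched as a binary search over the commit range (the helper names here are hypothetical; the real bisector runs the userbenchmark at each probed commit):

```python
def bisect_first_bad(commits, has_memleak):
    """Binary-search the first commit where has_memleak(commit) is True.

    Assumes commits[0] is good (no leak) and commits[-1] is bad.
    Returns (last_good, first_bad) as adjacent commits in the list.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_memleak(commits[mid]):
            hi = mid                 # leak already present at mid
        else:
            lo = mid                 # still clean at mid
    return commits[lo], commits[hi]

# Toy run: pretend the leak was introduced at commit "e9dcda5".
history = ["6b5259e", "91ead3e", "e9dcda5", "cc0cada"]
bad_since = history.index("e9dcda5")
good, bad = bisect_first_bad(history,
                             lambda c: history.index(c) >= bad_since)
print(good, bad)  # 91ead3e e9dcda5
```

Like `git bisect`, this assumes the metric flips once and stays flipped across the range, so O(log n) benchmark runs suffice to pin down the offending commit pair.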

xuzhao9 (Contributor, Author) commented Mar 24, 2024

@atalman This is an example of how to use TorchBench bisection to automatically diagnose CUDA memleak errors.

facebook-github-bot (Contributor) commented:

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented:

@xuzhao9 merged this pull request in c4098d2.

@xuzhao9 xuzhao9 deleted the xz9/fix-a100-bisect branch March 28, 2024 00:34