
Add CUDA Memory Leak error bisection #2205

Closed. xuzhao9 wants to merge 14 commits.
Conversation

xuzhao9 (Contributor) commented Mar 22, 2024

Support CUDA memory leak detection in bisection.

Test plan:
https://github.com/pytorch/benchmark/actions/runs/8405455144

Start hash: 90fdee15be285a6a54587f44e9f3dcfed8e0efd0
End hash: cfaed59ce73871658958bb3d0c08d820c2595e62
Userbenchmark: test_bench
Userbenchmark args: -m sam_fast -d cuda -t eval --memleak
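The `--memleak` flag reports a boolean `memleak` metric per model/test combination (visible in the result digest below). As a rough illustration only, not TorchBench's actual implementation, a leak check can compare allocator usage before and after repeated eval passes; the `snapshot` callable here is a stand-in for something like `torch.cuda.memory_allocated()`, and `FakeAlloc` is a hypothetical toy so the sketch runs without a GPU:

```python
def detect_memleak(run_once, snapshot, warmup=2, iters=5, tolerance=0):
    """Return True if allocator usage grows across repeated runs.

    run_once -- callable executing one eval pass
    snapshot -- callable returning current allocated bytes
                (stand-in for torch.cuda.memory_allocated())
    """
    for _ in range(warmup):          # let caches/memory pools stabilize
        run_once()
    baseline = snapshot()
    for _ in range(iters):
        run_once()
    return snapshot() - baseline > tolerance

# Toy demo with a fake allocator counter instead of a real device:
class FakeAlloc:
    def __init__(self, leak_per_call):
        self.used = 0
        self.leak = leak_per_call
    def run(self):
        self.used += self.leak       # leaked bytes accumulate per call
    def allocated(self):
        return self.used

leaky = FakeAlloc(leak_per_call=1024)
clean = FakeAlloc(leak_per_call=0)
print(detect_memleak(leaky.run, leaky.allocated))  # True
print(detect_memleak(clean.run, clean.allocated))  # False
```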

@xuzhao9 xuzhao9 changed the title Simple fix on extended configs Add CUDA Memory Leak error bisection Mar 23, 2024
xuzhao9 (Contributor, Author) commented Mar 24, 2024

Result:

{
  "target_repo": "pytorch",
  "start": "6b5259e50704aede43c87fed33f64224f9047087",
  "start_version": "2.4.0.dev20240323+cu121",
  "end": "cc0cadaf4c76c91d19474a1a512c9bc31e2c8602",
  "end_version": "2.4.0.dev20240323+cu121",
  "result": [
    {
      "commit1": "91ead3eae4c",
      "commit1_time": "2024-03-21 01:56:42 +0000",
      "commit1_digest": {
        "name": "test_bench",
        "environ": {
          "pytorch_git_version": "91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2",
          "pytorch_version": "2.4.0a0+git91ead3e",
          "device": "NVIDIA A100-SXM4-40GB",
          "git_commit_hash": "91ead3eae4cd6cbf50fe7a7b4a2f9f35302bc9b2"
        },
        "metrics": {
          "model=sam_fast, test=eval, device=cuda, bs=None, extra_args=['--memleak'], metric=memleak": "False"
        }
      },
      "commit2": "e9dcda5cba9",
      "commit2_time": "2024-03-21 01:57:08 +0000",
      "commit2_digest": {
        "name": "test_bench",
        "environ": {
          "pytorch_git_version": "e9dcda5cba92884be6432cf65a777b8ed708e3d6",
          "pytorch_version": "2.4.0a0+gite9dcda5",
          "device": "NVIDIA A100-SXM4-40GB",
          "git_commit_hash": "e9dcda5cba92884be6432cf65a777b8ed708e3d6"
        },
        "metrics": {
          "model=sam_fast, test=eval, device=cuda, bs=None, extra_args=['--memleak'], metric=memleak": "True"
        }
      }
    }
  ]
}
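The digest above shows the `memleak` metric flipping from `"False"` at commit `91ead3eae4c` to `"True"` at the adjacent commit `e9dcda5cba9`, which is exactly the transition a bisector looks for. The core search can be sketched as a binary search over the commit range (the helper names here are hypothetical; the real bisector runs the userbenchmark at each probed commit):

```python
def bisect_first_bad(commits, has_memleak):
    """Binary-search the first commit where has_memleak(commit) is True.

    Assumes commits[0] is good (no leak) and commits[-1] is bad.
    Returns (last_good, first_bad) as adjacent commits in the list.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_memleak(commits[mid]):
            hi = mid                 # leak already present at mid
        else:
            lo = mid                 # still clean at mid
    return commits[lo], commits[hi]

# Toy run: pretend the leak was introduced at commit "e9dcda5".
history = ["6b5259e", "91ead3e", "e9dcda5", "cc0cada"]
bad_since = history.index("e9dcda5")
good, bad = bisect_first_bad(history,
                             lambda c: history.index(c) >= bad_since)
print(good, bad)  # 91ead3e e9dcda5
```

Like `git bisect`, this assumes the metric flips once and stays flipped across the range, so O(log n) benchmark runs suffice to pin down the offending commit pair.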

xuzhao9 (Contributor, Author) commented Mar 24, 2024

@atalman This is an example of how to use TorchBench bisection to automatically diagnose CUDA memleak errors.

facebook-github-bot (Contributor) commented:

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented:

@xuzhao9 merged this pull request in c4098d2.

@xuzhao9 xuzhao9 deleted the xz9/fix-a100-bisect branch March 28, 2024 00:34