GPU resources not being freed during long pytest suite #10296

AngledLuffa · 2022-09-14T03:53:06Z

I have a pytest suite which takes roughly 10 minutes and allocates many objects on the GPU using pytorch. In some cases, they are apparently freed after a test finishes, but in others the objects persist. The result is that by the end of the test suite, so many pytorch objects are still on the GPU that even simple operations run out of memory.

Is there a good solution for this? I was able to reuse some objects by turning them into fixtures, saving enough memory to get the test suite to run, but even when the objects are module or class scoped fixtures they seem to stick around on the GPU quite often.

I also tried

del big_object
torch.cuda.empty_cache()

but that didn't free anything as far as I could tell.
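One possible explanation, offered as an assumption and not as a diagnosis of the code in question: `del` only removes a name. If the object takes part in a reference cycle, its reference count never reaches zero, so it stays alive until Python's cycle collector runs, and `torch.cuda.empty_cache()` can only release memory whose tensors are already dead. The sketch below uses only the standard library; `Model` and `Block` are hypothetical stand-ins for an object that holds GPU tensors.

```python
import gc
import weakref


class Block:
    """Hypothetical stand-in for a layer that holds GPU tensors."""

    def __init__(self):
        self.parent = None


class Model:
    """Hypothetical stand-in for a large model whose child points back at it."""

    def __init__(self):
        self.block = Block()
        self.block.parent = self  # reference cycle: Model -> Block -> Model


def survives_del():
    """Return (alive_after_del, alive_after_collect) for a cyclic object."""
    gc.disable()  # keep the automatic collector from running mid-demo
    try:
        model = Model()
        probe = weakref.ref(model.block)  # observes the Block without owning it
        del model  # the cycle keeps both objects alive
        alive_after_del = probe() is not None
        gc.collect()  # the cycle collector finds and frees the pair
        alive_after_collect = probe() is not None
    finally:
        gc.enable()
    return alive_after_del, alive_after_collect


print(survives_del())  # (True, False)
```

If something like this is happening, calling `gc.collect()` before `torch.cuda.empty_cache()` would be the order to try.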

Is this an interaction between pytorch and pytest, or are the objects themselves not being deleted?

For reference, the test suite is here

Thanks in advance

nicoddemus (Maintainer) · 2022-09-14T10:16:52Z

but even when the objects are module or class scoped fixtures they seem to stick around on the GPU quite often

Yeah the objects returned by the fixtures are cached by pytest while they are needed.

Is this an interaction between pytorch and pytest, or are the objects themselves not being deleted?

pytest does not interact explicitly with PyTorch, so my guess is that the objects are being cached in long-lived fixtures (session/module/class) or by your own code.

AngledLuffa Sep 14, 2022
Author

Is there a good way to suggest to pytest that a module or class scoped fixture is no longer needed? If I do this:

stanfordnlp/stanza@1f60beb

The large item that is created as a module scoped fixture is apparently not cleaned up by the time the rest of the test suite is run, at least according to nvidia-smi. If I put the test inside a class and make the fixture class scoped, that doesn't help either. In this case, the fixture is only used once, so there was no actual benefit to making it a fixture. However, there are other times when I create an expensive item which I use several times, and it uses up the GPU on our test machine to have several large fixtures kept alive between tests.

This is with pytest version 7.0.1

Perhaps there is some weird garbage collection interaction? But that doesn't quite seem right either, because I can see that the object containing all of the GPU objects has its __del__() method called at the end of the module or class's tests.

Perhaps this is a limitation of pytorch things being contained in other things not being cleaned up properly? If I google search for that, I do find a few threads on manually deleting things to make sure the tensors are freed:

https://discuss.pytorch.org/t/deleting-tensors-in-a-list-class-or-tuple-does-not-delete-the-original-tensor/45743

This suggests a call to empty_cache() or running the items in a no_grad context, but those ideas don't make a difference if I put them after a yield variant of the fixture:

https://discuss.pytorch.org/t/how-to-delete-a-tensor-in-gpu-to-free-up-memory/48879/12

It's possible this has nothing to do with pytest, but it's weird that making the large object into a fixture causes the memory to grow out of control. Can you think of anything that would cause that to happen, or any way to ensure that the items are freed correctly?

nicoddemus (Maintainer) · 2022-09-15T10:56:35Z

The large item that is created as a module scoped fixture is apparently not cleaned up by the time the rest of the test suite is run, at least according to nvidia-smi.

Strange, because the fixture is cleaned up when the last test in a module executes (regardless of whether that test uses the fixture). Example:

# content of test_1.py
import pytest

@pytest.fixture(scope="module")
def fix():
    print("fix setup")
    yield
    print("fix teardown")

def test_a(fix):
    pass

def test_b():
    pass


# content of test_2.py
def test_c():
    pass

def test_d():
    pass

λ pytest -s --no-header
======================== test session starts ========================
collected 4 items

test_1.py fix setup
..fix teardown

test_2.py ..

========================= 4 passed in 0.02s =========================

I'm not familiar with your code, but don't you have to explicitly call a close() or destroy() method to clean up resources?

It's possible this has nothing to do with pytest, but it's weird that making the large object into a fixture causes the memory to grow out of control.

As far as I can tell, if pytest is correctly calling the "teardown" portion of the fixture, then pytest is doing the right thing; the fixture should clean up after itself at that point.

Sorry if I can't be of more help than that. 😕

AngledLuffa Sep 15, 2022
Author

I had seen that the things were being freed at the proper time by overriding their __del__() method, but for whatever reason the GPU usage gets quite bloated anyway. Alright, I was hoping there would be some already known trick for dealing with this situation. I will try to come up with a large, simple demonstration which either 1) shows me what needs to be done to fix the issue on our end or 2) suggests an improvement to either pytest or pytorch's handling of large objects. Thanks for your time!

AngledLuffa Sep 21, 2022
Author

I'm starting to wonder if the problem is just fragmentation. Still, if anyone else comes up with an insight on how to make pytorch work better as part of a long pytest suite, I'd appreciate an update.
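One way to test the fragmentation idea is to compare what PyTorch has handed out to live tensors with what its caching allocator is holding. nvidia-smi reports memory held by the process, which includes the allocator's cached blocks, so it can stay high after tensors are gone. The helper below is a sketch: `gpu_memory_report` is a hypothetical name, and it returns `None` where PyTorch or CUDA is unavailable.

```python
def gpu_memory_report(device=0):
    """Return allocated/reserved byte counts for a CUDA device, or None without CUDA."""
    try:
        import torch
    except ImportError:  # lets the sketch run where PyTorch is not installed
        return None
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(device)  # bytes used by live tensors
    reserved = torch.cuda.memory_reserved(device)  # bytes held by the caching allocator
    return {
        "allocated": allocated,
        "reserved": reserved,
        "cached_unused": reserved - allocated,
    }


print(gpu_memory_report())
```

Printed from a fixture's teardown, a large `allocated` value would suggest tensors are still referenced somewhere, while a small `allocated` with a large `reserved` would point toward caching or fragmentation.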

RonnyPfannschmidt Sep 21, 2022
Maintainer

btw, an important detail to consider is that __del__ is always the wrong tool for resource cleanup - it should in fact be a warning, if not an error, if it still finds resources owned
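A minimal sketch of that advice, using only the standard library: cleanup happens in an explicit `close()`, and `__del__` only warns if the owner forgot to call it. `FakeGpuModel` and `loaded_model` are hypothetical names, and the `bytearray` stands in for GPU tensors.

```python
import warnings
from contextlib import contextmanager


class FakeGpuModel:
    """Hypothetical stand-in for an object that owns GPU memory."""

    def __init__(self):
        self.buffers = [bytearray(1024)]  # pretend these are CUDA tensors
        self.closed = False

    def close(self):
        """Drop owned buffers explicitly; safe to call more than once."""
        self.buffers.clear()
        self.closed = True

    def __del__(self):
        # Not a cleanup path: only flag that close() was never called.
        if not self.closed:
            warnings.warn("FakeGpuModel was never closed", ResourceWarning)


@contextmanager
def loaded_model():
    """Yield a model and guarantee close() runs afterwards."""
    model = FakeGpuModel()
    try:
        yield model
    finally:
        model.close()  # runs even if the body raises


with loaded_model() as m:
    assert not m.closed
print(m.closed)  # True
```

The same try/yield/finally shape carries over to a pytest yield fixture, where the code after `yield` is the teardown.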

stevenmanton · 2024-10-23T21:47:22Z

@AngledLuffa I think this discussion might answer your question of why the models persist on the GPU despite the scope and garbage collection: #10387

AngledLuffa (Author) · 2024-10-24T06:57:22Z

Thanks for following up! I will give that a try next time we run out of GPU. We had upgraded our test machine for unrelated reasons, and that was after I had condensed the tests enough to fit in the previous GPU, so this might not be an issue until the tests get bigger again.

AngledLuffa (Author) · 2024-10-28T20:29:50Z

Again, thanks. I ran into this when adding a new feature & a test for that feature, and gc.collect() got the tests running again.

stanfordnlp/stanza@c5cb489
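For readers who want to apply the same idea across a whole suite, here is a sketch of a `conftest.py`. It is not the code from the commit above; `free_gpu_memory` and `release_gpu_after_module` are hypothetical names, and the PyTorch import is guarded so the file also loads on machines without it.

```python
# content of conftest.py (sketch)
import gc

import pytest

try:
    import torch
except ImportError:  # lets the sketch load where PyTorch is not installed
    torch = None


def free_gpu_memory():
    """Collect unreachable cycles, then hand cached CUDA blocks back to the driver."""
    unreachable = gc.collect()  # number of unreachable objects found
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return unreachable


@pytest.fixture(scope="module", autouse=True)
def release_gpu_after_module():
    """Run the cleanup once after the last test of every module."""
    yield
    free_gpu_memory()
```

Because autouse fixtures are set up first within their scope, this teardown should run after the module's other fixtures have been finalized, so the objects they cached are already unreferenced by pytest when `gc.collect()` is called. The order matters: `empty_cache()` can only release blocks whose tensors have already been freed.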
