[WIN] Parallel tests execution fails because of locked files #2777
Comments
Maybe try to implement a pytest session-level fixture with lock files, in which the launcher will be built only once? |
It is better to fix Triton itself than to fix the unit tests, because this error may happen in a real scenario if a user attempts to run the same kernel from multiple threads or processes. |
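As a rough illustration of the "build only once" idea, here is a minimal sketch that uses only the standard library. The helper name `build_once`, the marker file names, and the `build` callback are all hypothetical and not part of Triton or pytest; the sketch assumes every worker can see the same shared directory.

```python
import os
import time
from pathlib import Path


def build_once(shared_dir, build, timeout=60.0, poll=0.05):
    """Run build() in exactly one process; other callers wait for its result.

    All file names below are hypothetical, chosen for illustration only.
    """
    shared_dir = Path(shared_dir)
    claim = shared_dir / "launcher.claim"    # created by the winning worker
    result = shared_dir / "launcher.result"  # holds the string build() returned
    done = shared_dir / "launcher.done"      # created after result is closed

    try:
        # O_CREAT | O_EXCL fails if the file already exists, so exactly one
        # worker wins the claim.
        fd = os.open(claim, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another worker is building (or has built); wait for the marker.
        deadline = time.monotonic() + timeout
        while not done.exists():
            if time.monotonic() > deadline:
                raise TimeoutError("launcher build did not finish in time")
            time.sleep(poll)
        return result.read_text()

    os.close(fd)
    value = build()
    result.write_text(value)  # fully written and closed before publishing
    # Publish with a separate marker instead of renaming the result file,
    # so waiters never open a file that is in the middle of a rename.
    os.close(os.open(done, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
    return value
```

In a session-scoped fixture under pytest-xdist, the shared directory could be `tmp_path_factory.getbasetemp().parent`, which is common to all workers of one run. If the winning worker crashes before creating the marker, the others time out rather than hang.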
OK, it looks like the problem is with how os.replace is implemented on Windows. It is not atomic and consists of two system calls, SetRenameInformationFile and CloseFile. If some other process tries to open the file between these two calls (for reading or writing, it doesn't matter), it gets a SHARING VIOLATION error. I wrote two small tests that demonstrate this behavior.

The first program repeatedly replaces test.txt:

    import os

    count = 0
    while True:
        # Write a fresh file, then rename it over the shared target.
        name = f"{count}.txt"
        with open(name, "w") as f:
            f.write("test\n")
        try:
            os.replace(name, "test.txt")
        except PermissionError:
            print(f"Failed with {name}")
        count += 1

The second program repeatedly reads test.txt:

    from pathlib import Path

    count = 0
    while True:
        try:
            line = Path("test.txt").read_text()
        except PermissionError:
            print(f"Failed {count}")
        count += 1

When they run together, the second program produces the errors that we see in the Triton cache. |
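One way to tolerate the short window described above is to retry the failing call. The following is only a sketch of that idea, not what Triton does: the helper names `retry_on_permission_error` and `replace_with_retry` are hypothetical, and the attempt count and delays are arbitrary assumptions.

```python
import os
import time


def retry_on_permission_error(func, *args, attempts=5, delay=0.01):
    """Call func(*args), retrying when it raises PermissionError.

    Hypothetical helper; attempts and delay are illustrative defaults.
    """
    for attempt in range(attempts):
        try:
            return func(*args)
        except PermissionError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            # Exponential backoff: delay, 2*delay, 4*delay, ...
            time.sleep(delay * (2 ** attempt))


def replace_with_retry(src, dst):
    """os.replace wrapped in the retry loop above."""
    return retry_on_permission_error(os.replace, src, dst)
```

The same wrapper can be applied on the reader side, for example `retry_on_permission_error(Path("test.txt").read_text)`. Retrying narrows the failure window but does not remove the underlying race.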
Also keep in mind that the VS Code IDE often opens and locks files because it monitors the filesystem for changes. So if you run the tests above, make sure that VS Code (process name Code.exe) is not running. |
If you remove the DLL from the equation (maybe by moving the directory?) do you have this issue with the cached IR and generated device code? |
The problem happens with all cached files, *.json, *.llir, *.spv, *.ttir, *.ttgir, etc. Any of them can potentially trigger this exception. This is an example of test failures when running them on 16 workers:
|
Can we add a lock for |
Yes, file locking is one possible approach that we can implement. The problem is that adding file locking in just this one place is not enough. We also need to guard read access to cached files with locks, because the exception currently happens when another process tries to open the file for reading. There is more than one read location, and we need to find all of them. |
Interestingly, there are no such errors in this run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12355194841/job/34478248043. The only PermissionError I noticed is #3019, which is simple to fix. |
These exceptions are pretty rare: they happen about 2-4 times per 10k tests in the core test suite when I run them on 16 workers. You used only 2 workers and were probably just lucky this time. |
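To make the point about guarding both sides concrete, here is a minimal sketch of a lock-file mutex shared by a writer and a reader. The names `path_lock`, `locked_write`, and `locked_read` are hypothetical and do not exist in Triton. The sketch assumes every read and write location goes through these helpers, and that a crashed process leaving a stale `.lock` file behind is handled elsewhere.

```python
import contextlib
import os
import time
from pathlib import Path


@contextlib.contextmanager
def path_lock(path, timeout=10.0, poll=0.005):
    """Cross-process mutex based on exclusive creation of '<path>.lock'."""
    lock = f"{path}.lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Fails with FileExistsError while another process holds the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except (FileExistsError, PermissionError):
            # PermissionError is also retried on the assumption that Windows
            # may report it while the previous lock file is being deleted.
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock)


def locked_write(path, text):
    """Write to a temporary file, then rename it into place under the lock."""
    tmp = f"{path}.{os.getpid()}.tmp"
    Path(tmp).write_text(text)
    with path_lock(path):
        os.replace(tmp, path)


def locked_read(path):
    """Read under the same lock, so a read never overlaps a rename."""
    with path_lock(path):
        return Path(path).read_text()
```

An exclusive lock like this also serializes readers against each other, which is simple but may cost throughput; a shared/exclusive lock would avoid that at the price of more platform-specific code.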
Right, I am re-running with 16 workers, trying to reproduce. |
Reproduced in https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12361948103.
|
Here you are getting an error when two processes contend on |
No PermissionErrors in the latest run: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/12383767088/job/34567109930 Potentially the following combination of changes helped: |
Describe the bug
Trying to execute multiple unit tests in parallel with xdist -n X on Windows leads to failures. Most likely this happens because one worker compiles and starts executing a kernel using a launcher DLL (with a .pyd extension on Windows) from the ~/.triton/cache folder while another worker tries to compile the same kernel and write a launcher DLL into the same folder. On Windows, a DLL that is loaded into a process is locked and cannot be modified, so the second worker that tries to write a .pyd file gets an IO error and fails. This doesn't happen on Linux because Linux doesn't lock files that are open by running processes.
Any ideas on how to reliably solve this are welcome.
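One possible direction, sketched here under a strong assumption: if two workers that compile the same kernel produce equivalent launcher files, then a worker whose write is rejected can simply use the file that is already in place. The function `publish` and its injectable `replace` parameter are hypothetical and exist only for this illustration.

```python
import os
import uuid
from pathlib import Path


def publish(data: bytes, dest, replace=os.replace):
    """Write data to dest, or keep an existing dest if it cannot be replaced.

    Assumes (not verified here) that any file already at dest is equivalent
    to data, as it would be if the path is derived from a hash of the inputs.
    """
    dest = Path(dest)
    # A unique temporary name in the same directory avoids collisions
    # between workers that write at the same time.
    tmp = dest.with_name(f"{dest.name}.{uuid.uuid4().hex}.tmp")
    tmp.write_bytes(data)
    try:
        replace(tmp, dest)
    except PermissionError:
        tmp.unlink()  # discard our copy
        if not dest.exists():
            raise  # nothing usable was published, so report the error
    return dest
```

With this approach the second worker never needs to modify a DLL that the first worker has loaded. It does not help readers that hit a sharing violation while a rename is in progress.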
Environment details
Triton on any GPU running on Windows.