Replace memmap usage with direct file I/O #82
Conversation
Previously, the library made extensive use of memory-mapping to copy the contents of inputs & outputs to & from binary files for compatibility with the SNAPHU executable. Memory-mapping is convenient for copying 2-D blocks of data, ensuring that we don't run out of memory while copying entire large datasets. However, there seems to be no interface in NumPy or in Python's `mmap` library to ensure that the memory-mapped file is safely closed (at least prior to Python 3.13 -- see python/cpython#78502). This may cause issues when unwrapping a large series of interferograms in a single process, potentially causing the number of open file descriptors to exceed the system's resource limits.

In this update, we replace usage of `numpy.memmap` with direct file I/O. To avoid undue complexity (as well as potential performance issues due to buffered file access), we now read/write contiguous batches of data that span all columns of the input/output 2-D datasets, rather than operating on 2-D sub-blocks. This new approach gives full control over when files are opened and closed, so no resources are leaked.
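As a rough sketch of the new approach (illustrative only -- the function name `copy_from_file` and the `batch_rows` parameter are assumptions, not the library's actual API), reading a flat binary file in batches of whole rows with plain file I/O might look like this:

```python
import numpy as np

def copy_from_file(filename, dataset, batch_rows=1024):
    """Read a flat binary file into ``dataset`` in batches of whole rows.

    Hypothetical sketch, not the library's actual code. ``dataset`` is any
    array-like object that supports slice assignment along its leading axis.
    """
    n_rows = dataset.shape[0]
    row_size = int(np.prod(dataset.shape[1:], dtype=np.int64))
    with open(filename, "rb") as f:  # descriptor is closed deterministically
        for start in range(0, n_rows, batch_rows):
            stop = min(start + batch_rows, n_rows)
            # Each batch spans all columns (all trailing axes), so every
            # read is a single contiguous run of the file.
            count = (stop - start) * row_size
            batch = np.fromfile(f, dtype=dataset.dtype, count=count)
            dataset[start:stop] = batch.reshape((stop - start,) + dataset.shape[1:])
```

Because the file is opened in a `with` block, the descriptor is released as soon as the copy finishes -- precisely the control that the `numpy.memmap` interface did not reliably provide.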
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   93.44%   93.98%   +0.54%
==========================================
  Files           9        9
  Lines         473      449      -24
==========================================
- Hits          442      422      -20
+ Misses         31       27       -4
```

☔ View full report in Codecov by Sentry.
Add a wrapper for `tempfile.mkstemp` that ensures that open files are properly closed. Replace usage of `tempfile.mkstemp` with the wrapper function.
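A minimal sketch of what such a wrapper could look like (the name `create_unique_file` is hypothetical; the reviewed snippet further down shows the underlying pattern):

```python
import os
from pathlib import Path
from tempfile import mkstemp

def create_unique_file(dir_, prefix="", suffix=""):
    # Hypothetical wrapper: atomically create a uniquely named file via
    # tempfile.mkstemp, then immediately close the OS-level file descriptor
    # so it can't leak. Callers re-open the file by name when needed.
    file, filename = mkstemp(dir=dir_, prefix=prefix, suffix=suffix)
    os.close(file)
    return Path(filename)
```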
```python
shape = (slice_.stop - slice_.start,) + dataset.shape[1:]
size = np.prod(shape)
```
This is assuming 2-D only? (Which seems fine, but I wanted to check you didn't picture otherwise.)
No, I believe it should work for datasets of any rank >= 1. The unit tests include some 1-D cases. I didn't explicitly test a 3-D (or greater) dataset but it should work. Nothing in this logic inherently assumes 2-D AFAICT.
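For example (a quick illustrative check, not a test from the PR), the shape/size logic from the snippet above handles 1-D through 3-D arrays unchanged:

```python
import numpy as np

slice_ = slice(10, 20)  # a batch of rows along the leading axis
for dataset in (np.zeros(100), np.zeros((100, 8)), np.zeros((100, 8, 3))):
    shape = (slice_.stop - slice_.start,) + dataset.shape[1:]
    size = np.prod(shape)
    print(shape, size)
# (10,) 10
# (10, 8) 80
# (10, 8, 3) 240
```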
```python
file, filename = mkstemp(dir=dir_, prefix=prefix, suffix=suffix)
os.close(file)
```
https://stackoverflow.com/questions/27680807/mkstemp-is-it-safe-to-close-descriptor-and-reopen-it-again

Just want to check: this problem doesn't apply to us here, does it? I think it would be the C version of mkstemp, but I haven't used that over TemporaryDirectory.
This isn't an issue for us because we're not using `mkstemp` for security reasons in an adversarial environment. We're just using `mkstemp` to ensure that any new scratch files are guaranteed to have unique names, so that multiple snaphu-py processes running in parallel with the same scratch directory don't stomp on each other. That property is still guaranteed even if we close the file descriptor after creating the file, so long as all snaphu-py processes always use `mkstemp`.
Note that, prior to this PR, we were already ignoring the returned file descriptor and already re-opening the file in a subprocess.
Of course, if a user deletes or overwrites some scratch files created by snaphu-py while it's running, then we'd have a problem. But there's no way for me to protect against that kind of malicious misuse, even if I keep the file descriptor open.
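To illustrate the property being relied on here (a standalone demo, not project code): the uniqueness guarantee comes from `mkstemp` atomically creating the file on disk with `O_CREAT | O_EXCL`, not from holding the returned descriptor open.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    names = []
    for _ in range(3):
        # Close each descriptor immediately; subsequent calls still see the
        # file on disk and are forced to pick a different name.
        fd, name = tempfile.mkstemp(dir=scratch, prefix="snaphu_")
        os.close(fd)
        names.append(name)
    assert len(set(names)) == 3  # all names are distinct
```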