
Replace memmap usage with direct file I/O #82

Merged
merged 3 commits into isce-framework:main from remove-memmap on Sep 16, 2024

Conversation

Member

@gmgunter gmgunter commented Sep 13, 2024

Previously, the library made extensive use of memory-mapping to copy the contents of inputs & outputs to & from binary files for compatibility with the SNAPHU executable. Memory-mapping is convenient for copying 2-D blocks of data, since it ensures that we don't run out of available memory while copying entire large datasets.

However, there seems to be no interface in NumPy or in Python's mmap library to ensure that the memory-mapped file is safely closed (at least prior to Python 3.13 -- see python/cpython#78502). This may cause issues when unwrapping a large series of interferograms in a single process, potentially leading to the number of open file descriptors exceeding the system's resource limits.

In this update, we replace usage of numpy.memmap with direct file I/O. In order to avoid undue complexity (as well as potential performance issues due to buffered file access), we now read/write contiguous batches of data that span all columns of the input/output datasets, rather than operating on 2-D sub-blocks of data. This new approach gives full control over when files are opened and closed, avoiding leaked resources.
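
For illustration, a minimal sketch of this batched read/write pattern, assuming row-major datasets; the helper names here are hypothetical, not the actual snaphu-py implementation:

```python
import numpy as np

def write_batches(dataset, path, batch_rows=1024):
    # Write `dataset` to a raw binary file, slicing along axis 0 only so
    # that each batch spans all columns and at most `batch_rows` rows are
    # held in memory at a time.
    with open(path, "wb") as f:
        for start in range(0, dataset.shape[0], batch_rows):
            batch = np.ascontiguousarray(dataset[start : start + batch_rows])
            f.write(batch.tobytes())

def read_batches(path, shape, dtype, batch_rows=1024):
    # Read a raw binary file back into an in-memory array, one contiguous
    # batch of rows at a time.
    out = np.empty(shape, dtype=dtype)
    row_nbytes = out.itemsize * int(np.prod(shape[1:], dtype=np.intp))
    with open(path, "rb") as f:
        for start in range(0, shape[0], batch_rows):
            stop = min(start + batch_rows, shape[0])
            buf = f.read((stop - start) * row_nbytes)
            out[start:stop] = np.frombuffer(buf, dtype=dtype).reshape(
                (stop - start, *shape[1:])
            )
    return out
```

Because each `open()` sits inside a `with` block, the file is deterministically closed as soon as the copy finishes, unlike a numpy.memmap whose underlying descriptor lifetime is harder to control.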

TODO: unit tests

@gmgunter gmgunter marked this pull request as draft September 13, 2024 09:16

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.98%. Comparing base (b21dd68) to head (4198e28).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/snaphu/_unwrap.py | 87.50% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   93.44%   93.98%   +0.54%     
==========================================
  Files           9        9              
  Lines         473      449      -24     
==========================================
- Hits          442      422      -20     
+ Misses         31       27       -4     


@gmgunter gmgunter marked this pull request as ready for review September 15, 2024 05:09
Add a wrapper for `tempfile.mkstemp` that ensures that open files are
properly closed. Replace usage of `tempfile.mkstemp` with the wrapper
function.
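
A minimal sketch of what such a wrapper might look like (the name `scratch_file` is assumed here for illustration; the PR's actual wrapper may differ):

```python
import os
from pathlib import Path
from tempfile import mkstemp

def scratch_file(dir=None, prefix="tmp", suffix=""):
    # Create a uniquely named file, then immediately close the returned
    # file descriptor so it cannot be leaked; only the pathname is kept.
    # The file itself persists on disk and can be reopened later.
    fd, filename = mkstemp(dir=dir, prefix=prefix, suffix=suffix)
    os.close(fd)
    return Path(filename)
```
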
@gmgunter gmgunter merged commit 621cfe5 into isce-framework:main Sep 16, 2024
13 checks passed
@gmgunter gmgunter deleted the remove-memmap branch September 16, 2024 16:07
Comment on lines +199 to +200
shape = (slice_.stop - slice_.start,) + dataset.shape[1:]
size = np.prod(shape)
Contributor

This is assuming 2d only? (Which seems fine, but I wanted to check you didn't picture otherwise)

Member Author

@gmgunter gmgunter commented Sep 16, 2024

No, I believe it should work for datasets of any rank >= 1. The unit tests include some 1-D cases. I didn't explicitly test a 3-D (or greater) dataset but it should work. Nothing in this logic inherently assumes 2-D AFAICT.
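
A quick illustration of this point, mirroring the two lines under review with hypothetical shapes:

```python
import numpy as np

# Only axis 0 is sliced; all trailing dimensions are carried along via
# `dataset.shape[1:]`, so the same two lines work for any rank >= 1.
slice_ = slice(10, 20)
for full_shape in [(100,), (100, 8), (100, 8, 3)]:
    dataset = np.zeros(full_shape)
    shape = (slice_.stop - slice_.start,) + dataset.shape[1:]
    size = np.prod(shape)
    print(shape, size)  # (10,) 10, then (10, 8) 80, then (10, 8, 3) 240
```
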

Comment on lines +54 to +55
file, filename = mkstemp(dir=dir_, prefix=prefix, suffix=suffix)
os.close(file)
Contributor

Just want to check that the problem described in https://stackoverflow.com/questions/27680807/mkstemp-is-it-safe-to-close-descriptor-and-reopen-it-again doesn't apply to us here. I think that concern is about the C version of mkstemp, but I haven't used it over TemporaryDirectory.

Member Author

This isn't an issue for us because we're not using mkstemp for security reasons in an adversarial environment.

We're just using mkstemp to ensure that any new scratch files are guaranteed to have unique names so that multiple snaphu-py processes running in parallel with the same scratch directory don't stomp on each other. That property is still guaranteed even if we close the file descriptor after creating the file, so long as all snaphu-py processes always use mkstemp.

Note that, prior to this PR, we were already ignoring the returned file descriptor and already re-opening the file in a subprocess.

Of course, if a user deletes or overwrites some scratch files created by snaphu-py while it's running, then we'd have a problem. But there's no way for me to protect against that kind of malicious misuse, even if I keep the file descriptor open.
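
A small demonstration of that uniqueness guarantee (the `snaphu_` prefix and `.f4` suffix are just illustrative):

```python
import os
from tempfile import mkstemp

# Names returned by mkstemp are unique within the directory even after
# the descriptors are closed, so parallel processes sharing a scratch
# directory won't collide as long as they all create files via mkstemp.
fd1, path1 = mkstemp(prefix="snaphu_", suffix=".f4")
fd2, path2 = mkstemp(prefix="snaphu_", suffix=".f4")
os.close(fd1)
os.close(fd2)
assert path1 != path2
os.remove(path1)
os.remove(path2)
```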
