
Replace memmap usage with direct file I/O #82

Merged
merged 3 commits into isce-framework:main from remove-memmap on Sep 16, 2024

Conversation

Member

@gmgunter gmgunter commented Sep 13, 2024

Previously, the library made extensive use of memory-mapping to copy the contents of inputs & outputs to & from binary files for compatibility with the SNAPHU executable. Memory-mapping is convenient for copying 2-D blocks of data, since it ensures that we don't run out of available memory while copying entire large datasets.

However, there seems to be no interface in NumPy or in Python's mmap library to ensure that the memory-mapped file is safely closed (at least prior to Python 3.13 -- see python/cpython#78502). This may cause issues when unwrapping a large series of interferograms in a single process, potentially leading to the number of open file descriptors exceeding the system's resource limits.

In this update, we replace usage of numpy.memmap with direct file I/O. In order to avoid undue complexity (as well as potential performance issues due to buffered file access), we now read/write contiguous batches of data that span all columns of the input/output datasets, rather than operating on 2-D sub-blocks of data. This new approach gives full control over when files are opened and closed, avoiding leaked resources.
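
For illustration, a minimal sketch of this batched read/write pattern, assuming row-major datasets; the helper names here are hypothetical, not the actual snaphu-py implementation:

```python
import numpy as np

def write_batches(dataset, path, batch_rows=1024):
    # Write `dataset` to a raw binary file, slicing along axis 0 only so
    # that each batch spans all columns and at most `batch_rows` rows are
    # held in memory at a time.
    with open(path, "wb") as f:
        for start in range(0, dataset.shape[0], batch_rows):
            batch = np.ascontiguousarray(dataset[start : start + batch_rows])
            f.write(batch.tobytes())

def read_batches(path, shape, dtype, batch_rows=1024):
    # Read a raw binary file back into an in-memory array, one contiguous
    # batch of rows at a time.
    out = np.empty(shape, dtype=dtype)
    row_nbytes = out.itemsize * int(np.prod(shape[1:], dtype=np.intp))
    with open(path, "rb") as f:
        for start in range(0, shape[0], batch_rows):
            stop = min(start + batch_rows, shape[0])
            buf = f.read((stop - start) * row_nbytes)
            out[start:stop] = np.frombuffer(buf, dtype=dtype).reshape(
                (stop - start, *shape[1:])
            )
    return out
```

Because each `open()` sits inside a `with` block, the file is deterministically closed as soon as the copy finishes, unlike a numpy.memmap whose underlying descriptor lifetime is harder to control.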

TODO: unit tests

@gmgunter gmgunter marked this pull request as draft September 13, 2024 09:16

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 93.98%. Comparing base (b21dd68) to head (4198e28).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/snaphu/_unwrap.py | 87.50% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #82      +/-   ##
==========================================
+ Coverage   93.44%   93.98%   +0.54%     
==========================================
  Files           9        9              
  Lines         473      449      -24     
==========================================
- Hits          442      422      -20     
+ Misses         31       27       -4     


@gmgunter gmgunter marked this pull request as ready for review September 15, 2024 05:09
Add a wrapper for `tempfile.mkstemp` that ensures that open files are
properly closed. Replace usage of `tempfile.mkstemp` with the wrapper
function.
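
A minimal sketch of what such a wrapper might look like (the name `scratch_file` is assumed here for illustration; the PR's actual wrapper may differ):

```python
import os
from pathlib import Path
from tempfile import mkstemp

def scratch_file(dir=None, prefix="tmp", suffix=""):
    # Create a uniquely named file, then immediately close the returned
    # file descriptor so it cannot be leaked; only the pathname is kept.
    # The file itself persists on disk and can be reopened later.
    fd, filename = mkstemp(dir=dir, prefix=prefix, suffix=suffix)
    os.close(fd)
    return Path(filename)
```
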
@gmgunter gmgunter merged commit 621cfe5 into isce-framework:main Sep 16, 2024
13 checks passed
@gmgunter gmgunter deleted the remove-memmap branch September 16, 2024 16:07
Comment on lines +199 to +200
shape = (slice_.stop - slice_.start,) + dataset.shape[1:]
size = np.prod(shape)
Contributor

This is assuming 2d only? (Which seems fine, but I wanted to check you didn't picture otherwise)

Member Author

@gmgunter gmgunter commented Sep 16, 2024

No, I believe it should work for datasets of any rank >= 1. The unit tests include some 1-D cases. I didn't explicitly test a 3-D (or greater) dataset but it should work. Nothing in this logic inherently assumes 2-D AFAICT.
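
A quick illustration of this point, mirroring the two lines under review with hypothetical shapes:

```python
import numpy as np

# Only axis 0 is sliced; all trailing dimensions are carried along via
# `dataset.shape[1:]`, so the same two lines work for any rank >= 1.
slice_ = slice(10, 20)
for full_shape in [(100,), (100, 8), (100, 8, 3)]:
    dataset = np.zeros(full_shape)
    shape = (slice_.stop - slice_.start,) + dataset.shape[1:]
    size = np.prod(shape)
    print(shape, size)  # (10,) 10, then (10, 8) 80, then (10, 8, 3) 240
```
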

Comment on lines +54 to +55
file, filename = mkstemp(dir=dir_, prefix=prefix, suffix=suffix)
os.close(file)
Contributor

Just want to check that the problem described in https://stackoverflow.com/questions/27680807/mkstemp-is-it-safe-to-close-descriptor-and-reopen-it-again doesn't apply to us here. I think that concern is about the C version of mkstemp, but I haven't used it over TemporaryDirectory.

Member Author

This isn't an issue for us because we're not using mkstemp for security reasons in an adversarial environment.

We're just using mkstemp to ensure that any new scratch files are guaranteed to have unique names so that multiple snaphu-py processes running in parallel with the same scratch directory don't stomp on each other. That property is still guaranteed even if we close the file descriptor after creating the file, so long as all snaphu-py processes always use mkstemp.

Note that, prior to this PR, we were already ignoring the returned file descriptor and already re-opening the file in a subprocess.

Of course, if a user deletes or overwrites some scratch files created by snaphu-py while it's running, then we'd have a problem. But there's no way for me to protect against that kind of malicious misuse, even if I keep the file descriptor open.
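
A small demonstration of that uniqueness guarantee (the `snaphu_` prefix and `.f4` suffix are just illustrative):

```python
import os
from tempfile import mkstemp

# Names returned by mkstemp are unique within the directory even after
# the descriptors are closed, so parallel processes sharing a scratch
# directory won't collide as long as they all create files via mkstemp.
fd1, path1 = mkstemp(prefix="snaphu_", suffix=".f4")
fd2, path2 = mkstemp(prefix="snaphu_", suffix=".f4")
os.close(fd1)
os.close(fd2)
assert path1 != path2
os.remove(path1)
os.remove(path2)
```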
