Primitives for reading a file in parallel #113
Merged
In our benchmarking, we've repeatedly found that reading large files with the Basis Library IO functions (e.g. `TextIO.inputAll`) is unreasonably slow. The obvious alternative would be to use `Posix.IO` and `Posix.FileSys`, and indeed these are fast sequentially, but unfortunately they provide no opportunity for parallelism.

There is a standard trick to work around the lack of parallelism with POSIX: just `mmap` files into memory and then read from them directly. This is fast sequentially and highly parallel. So, I implemented it in MPL.

Library Spec
This patch provides a structure `MPL.File` with this signature:

A file of type `MPL.File.t` can be created by passing a path string to `openFile`, which mmaps the file into memory (as private and read-only). Later, you must explicitly close the file with `closeFile` to free this memory; otherwise, there will be a space leak.

The functions
`readChars` and `readWord8s` take a file, an offset, and an output buffer, and copy enough bytes from the file (starting at the given offset) to fill the output buffer. These can be safely called in parallel, even on overlapping regions of the file, but the output buffers must be disjoint to avoid data races. The one-byte versions `readChar` and `readWord8` just return the byte at the given offset.

Except for the
`unsafe` functions, attempting to read outside the range of the file raises the exception `Subscript`, and any operation on a closed file raises `Closed`. Note that closing a file is not safe for concurrency (e.g., if you close a file while concurrently reading from it, this could crash). This is something that needs to be fixed in the future.

Intended Use
Here is a function defined in terms of `MPL.File` that reads a file in parallel. It behaves as though it were purely functional.

Example usage:
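For readers unfamiliar with the mmap trick itself, here is a rough Python sketch of the same idea. It is not the `MPL.File` implementation and does not use its API. It maps a file as private and read-only, copies chunks into disjoint slices of one output buffer from several threads, bounds-checks each read, and releases the mapping explicitly at the end. The names `read_range` and `read_file_parallel`, the chunk size, and the worker count are all hypothetical, and the `MAP_PRIVATE`/`PROT_READ` flags assume a Unix system. Python threads will not necessarily give a real speedup here; the point is only the structure of the reads.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(mapped, offset, out):
    """Fill `out` with bytes from `mapped`, starting at `offset`."""
    n = len(out)
    if offset < 0 or offset + n > len(mapped):
        # Plays the role of the Subscript exception described above.
        raise IndexError("read outside the range of the file")
    out[:] = mapped[offset:offset + n]


def read_file_parallel(path, chunk_size=1 << 20, workers=4):
    """Return the contents of `path`, copied chunk by chunk from a mapping."""
    size = os.path.getsize(path)
    if size == 0:
        return b""  # an empty file cannot be mapped
    result = bytearray(size)
    view = memoryview(result)
    with open(path, "rb") as f:
        # Private, read-only mapping (these flags are Unix-only).
        mapped = mmap.mmap(f.fileno(), 0,
                           flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Each task writes to its own disjoint slice of `result`,
            # so the copies do not race with one another.
            tasks = [
                pool.submit(read_range, mapped, off,
                            view[off:min(off + chunk_size, size)])
                for off in range(0, size, chunk_size)
            ]
            for task in tasks:
                task.result()  # re-raise any error from a worker
    finally:
        # Explicit close, analogous to closeFile; all readers are done by now.
        mapped.close()
    return bytes(result)
```

Note that the mapping is closed only after every worker has finished, which sidesteps the close-while-reading hazard mentioned above rather than solving it.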