Primitives for reading a file in parallel #113

Merged
shwestrick merged 5 commits into MPLLang:master from the fast-file-io branch
Feb 23, 2020

shwestrick
Collaborator

In our benchmarking, we've repeatedly run into the same issue: reading large files with the Basis Library IO functions (e.g. TextIO.inputAll) is unreasonably slow. The obvious alternative is to use Posix.IO and Posix.FileSys, and these are indeed fast sequentially, but unfortunately they provide no opportunity for parallelism.

There is a standard trick to work around the lack of parallelism with POSIX: just mmap files into memory and then read from them directly. This is fast sequentially and highly parallel. So, I implemented it in MPL.

Library Spec

This patch provides a structure MPL.File with this signature:

signature MPL_FILE =
sig
  type t

  exception Closed

  val openFile: string -> t
  val closeFile: t -> unit
  val size: t -> int

  val readChar: t -> int -> char
  val readWord8: t -> int -> Word8.word
  val unsafeReadChar: t -> int -> char
  val unsafeReadWord8: t -> int -> Word8.word

  val readChars: t -> int -> char ArraySlice.slice -> unit
  val readWord8s: t -> int -> Word8.word ArraySlice.slice -> unit
end

A file of type MPL.File.t can be created by passing a path string to openFile, which mmaps the file into memory (as private and read-only). Later, you must explicitly close the file with closeFile to free up this memory. Otherwise, there will be a space leak.

The functions readChars and readWord8s take a file, an offset, and an output buffer, and copy enough bytes from the file (starting at the given offset) to fill up the output buffer. These can be safely called in parallel, even on overlapping regions of the file, but the output buffers must be disjoint to avoid data races. The one-byte versions readChar and readWord8 just return the byte at the given offset.
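As a minimal sketch of the disjoint-buffer rule described above, the following copies a file into a single array using two parallel calls to readWord8s, each writing into its own slice of the result. The name readHalves is hypothetical, and the sketch assumes ForkJoin.par has the type (unit -> 'a) * (unit -> 'b) -> 'a * 'b; empty files are not handled specially.

```sml
(* Hypothetical helper, not part of MPL.File.
 * Assumes ForkJoin.par : (unit -> 'a) * (unit -> 'b) -> 'a * 'b. *)
fun readHalves (path: string): Word8.word array =
  let
    val file = MPL.File.openFile path
    val n = MPL.File.size file
    val mid = n div 2
    val result: Word8.word array = ForkJoin.alloc n

    (* Two slices of the same array that do not overlap:
     * [0, mid) and [mid, n). *)
    val left = ArraySlice.slice (result, 0, SOME mid)
    val right = ArraySlice.slice (result, mid, SOME (n - mid))

    (* Each task fills its own slice, starting at the matching file offset. *)
    val _ =
      ForkJoin.par
        (fn () => MPL.File.readWord8s file 0 left,
         fn () => MPL.File.readWord8s file mid right)
  in
    (* Close only after both reads have finished. *)
    MPL.File.closeFile file;
    result
  end
```

Because the two slices cover different index ranges of result, the two reads never write to the same location, even though both read from the same file.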

Except for the unsafe functions, attempting to read outside the range of the file raises the exception Subscript, and any operation on a closed file raises Closed. Note that closing a file is not safe under concurrency: if you close a file while another task is still reading from it, the program could crash. This needs to be fixed in the future.
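To illustrate the checked behavior, here is a small sketch of a wrapper that turns both exceptions into an option. The name tryReadChar is hypothetical and not part of this patch; it relies only on readChar, Closed, and the Basis Library's top-level Subscript exception.

```sml
(* Hypothetical helper, not part of MPL.File. *)
fun tryReadChar (file: MPL.File.t) (i: int): char option =
  SOME (MPL.File.readChar file i)
  handle Subscript => NONE          (* offset i is outside the file *)
       | MPL.File.Closed => NONE    (* file has already been closed *)
```

This does not make closing safe under concurrency; it only covers reads that happen after closeFile has returned.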

Intended Use

Here is a function defined in terms of MPL.File that reads a file in parallel. It behaves as though it were purely functional.

fun inputAll (path: string): char array =
  let
    val file = MPL.File.openFile path
    val n = MPL.File.size file
    val result = ForkJoin.alloc n
    val k = 10000 (* block size, for granularity control *)
    val b = 1 + (n-1) div k (* number of blocks *)
  in
    ForkJoin.parfor 1 (0, b) (fn i =>
      let
        val lo = i*k
        val hi = Int.min (lo+k, n)
        val slice = ArraySlice.slice (result, lo, SOME (hi-lo))
      in
        MPL.File.readChars file lo slice
      end);
    MPL.File.closeFile file;
    result
  end

Example usage:

val fileNames = ["./hello/world", "../../i/hope/you/are", "../having/a/nice/day"]
val contents: char array list = map inputAll fileNames

@shwestrick
Collaborator Author

This patch also begins the process of implementing a dedicated MPL library, as mentioned in #103.

You can load this structure by including this in your .mlb:

$(SML_LIB)/basis/mpl.mlb
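For illustration, a complete .mlb file might look like the sketch below. The file name main.sml is hypothetical, and the fork-join.mlb line is an assumption about where ForkJoin lives in your MPL installation; only the mpl.mlb line comes from this patch.

```
$(SML_LIB)/basis/basis.mlb
$(SML_LIB)/basis/fork-join.mlb
$(SML_LIB)/basis/mpl.mlb
main.sml
```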

@shwestrick shwestrick merged commit facf548 into MPLLang:master Feb 23, 2020
@shwestrick shwestrick deleted the fast-file-io branch February 24, 2020 02:32