-
Notifications
You must be signed in to change notification settings - Fork 76
Physical layer interface
In this wiki, I'll document the interface that Uproot expects from Source
classes, such as HTTPSource and XRootDSource. This document only describes reading because only reading has been implemented (Nov 2020).
When a user opens a file with uproot4.open, the URL scheme determines which Source subclass to use:
- non-URL or
"file://"
defaults to MemmapSource, but MultithreadedFileSource is also a good option, -
"http://"
and"https://"
default to HTTPSource, -
"root://"
defaults to XRootDSource, - a non-string, non-Path object with
read
andseek
methods is handled by ObjectSource.
The uproot4.open returns a ReadOnlyDirectory, which has a file property pointing to a ReadOnlyFile, which has a source property pointing to the actual Source. The Source may be stateful, with open file handles and associated threads. When any object derived from uproot4.open exits a with
statement (through its __exit__
) or is explicitly closed, __exit__
calls are propagated all the way down to the Source, so that it can close or shutdown whatever it needs to.
The job of a Source is to deliver Chunk objects on demand. A Chunk represents physical bytes of a file, uninterpreted and (if directly from a Source), possibly compressed. The data in a Chunk might not have been read yet, but they have been requested. A Chunk is defined by:
- a pointer back to the Source,
-
start and stop, the
inclusive:exclusive
byte positions in the file, - a future, which adheres to a subset of the Python 3 Future protocol. (It only has to have a result method.)
The rest of Uproot interfaces with Chunk objects through get and remainder to get the raw data from the file (through the future) as a numpy.ndarray
of dtype numpy.uint8
. The act of requesting data from a Chunk blocks until its future actually delivers.
The interpretation of those bytes is out of scope for the physical layer: the physical layer only needs to deliver bytes (in futures) on demand.
The two chunk-delivering methods that a Source must implement are
-
chunk (singular), which takes a
start:stop
interval and returns one Chunk, and -
chunks (plural), which takes a list of
(int, int)
pairs and anotifications
queue.Queue.
The first method, chunk (singular), doesn't need much explanation. The Chunk it returns may be synchronous (its future is a trivial NoFuture, recently renamed as TrivialFuture
) or asynchronous, with some background thread delivering its value. This method is used to extract items from a ReadOnlyDirectory, so each __getitem__
is a single request-response cycle. (Perhaps values, items, itervalues, iteritems should be updated to use chunks (plural), to efficiently read all histograms from a directory, for instance, but that would be a future improvement.)
The second method, chunks (plural), takes a list of intervals to read and returns a list of Chunk objects, which, again, may be synchronous or asynchronous. Depending on implementation, this interface may get a large set of discontiguous byte intervals in one request (as HTTPSource and XRootDSource do), or spawn many concurrent requests (as MultithreadedHTTPSource and MultithreadedXRootDSource do), or something else. This method is called by array-fetching functions (TBranch.array, HasBranches.arrays, HasBranches.iterate, uproot4.iterate, uproot4.concatenate, and uproot4.lazy) to batch requests for every TBasket
needed to generate arrays.
The possibly asynchronous results are provided in the return value, the list of Chunk objects, but remember that reading a value from a Chunk means waiting for the source to respond. In principle, that could happen in any order, so reading the Chunk objects from first to last could be the wrong order: if they're filled from last to first, the downstream code would end up waiting when it could be decompressing and interpreting arrays while the remote source streams them. For this reason, they are also returned by adding them to a supplied notifications
queue.Queue. Both the return value and the notifications
queue.Queue get the same Chunk output, but the notifications
get them after they are ready to be read. Calling get on the notifications
guarantees an optimal order. Uproot only uses the notifications
output; the return value of this method is ignored.
In principle, the interface can be simplified. ("Further simplified," since it used to be more complex, with more options.) In principle, the chunks (plural) method could return nothing and only fill the notifications
queue.Queue, and it could fill it with numpy.ndarray
objects, rather than Chunk objects, since they're known to be ready for reading if they appear on the notifications
. The reason I didn't remove this structure is to allow for flexibility in the future. To be future-proof, a Source must return Chunk objects in both the return value and the notifications
.
The Source abstract base class also defines
- num_bytes, the number of bytes in the file, which isn't currently used by Uproot. For now, an implementation can get by without implementing this, though it would be good to add for the sake of future-proofing it. One should definitely avoid an extra round trip at startup to get this information, though an extra round trip on request is acceptable because we don't yet know how it would be used.
- num_requests, num_requested_chunks, and num_requested_bytes are performance counters, not explicitly used by the Uproot codebase, but we expect to use them to fine-tune performance in the future. They're easy to implement; see existing Source implementations for an indication of how they're defined.
- close (causes the file handles to be released and any threads or other resources to be shut down) and closed (boolean indicator).
-
__enter__
and__exit__
, where__exit__
must call close.
Everything needed to be a compliant Source is described above. The uproot4.source module defines some more classes and functions that help a Source do its job.
The first of these is Resource, whose name is unfortunately close to "Source" (at least one issue was based on that mistake). This holds a file handle, a TCP connection, or whatever stateful object represents an "open file." Multithreaded readers hold one Resource for each thread, so that the statefulness doesn't interfere with multithreading.
The MultithreadedSource implements as much of the infrastructure of MultithreadedFileSource, MultithreadedHTTPSource, and MultithreadedXRootDSource as can be independent of backend. It uses a ResourceThreadPoolExecutor to manage those threads (launch them when the source is created; shut them down when close or __exit__
is called). This object differs from a Python ThreadPoolExecutor in that each each ResourceWorker thread is bound to a Resource. The ResourceThreadPoolExecutor.submit interface differs from Executor.submit in that the task
is called with the Resource as its only argument, rather than externally supplied *args
and **kwargs
. This way, attempts to fetch data can be made to use the open file/connection associated with the thread that they're running in and we don't have to lock stateful objects. This is why Source implementations have getter
functions to make get
functions—because all of the arguments other than the choice of Resource need to be bound up in a closure.
I hope it's clear why we have our own ResourceThreadPoolExecutor implementation instead of subclassing Python's ThreadPoolExecutor. The management of thread-bound Resource objects affects the ResourceWorker implementation, which is not part of Python's public API.
The uproot4.ThreadPoolExecutor and uproot4.source.futures.Future are more generic; they're subsets of Python's ThreadPoolExecutor and Future, representing the subset that Uproot actually uses (the decompression_executor
and interpretation_executor
parameters of many functions). Since they are subsets, a standard Executor could be used in their place, and since they only apply to decompression and interpretation, they're not part of the physical layer. They're in the uproot4.source.futures module because they have such a similar function.