Guide to the codebase
This page is to help developers get a sense of where to find things in the Uproot codebase.
The purpose of Uproot is to provide an array-oriented, pure Python way of reading and writing ROOT files. It doesn't address any part of ROOT other than I/O, and it provides basic, low-level I/O. It's not an environment/user interface: if someone wants an immersive experience, they can write packages on top of Uproot, such as uproot-browser. However, we do want to streamline the process of navigating through files with Uproot, to the point of being "not annoying."
Although the style of programming is almost entirely imperative, not array-oriented like Awkward Array, there's a wide range in "depth" of code editing. Some changes would be "surface level," making a function more friendly/ergonomic to use, while others are "deep," manipulating low-level byte streams.
All of the source code is in src/uproot. The tests are (roughly) numbered by PR or issue number, and the version number is controlled by src/uproot/version.py (not by pyproject.toml, even though this is a hatchling-based project). If there is no version.py or it has a placeholder version, the information in this paragraph may be out-of-date. (Please update!)
Within src/uproot, all of the files and directories are for reading except writing, sink, and serialization.py, which are for writing. A few are shared, such as models, _util.py, compression.py, and streamers.py.
Everything is for conventional ROOT I/O, not RNTuple, except for models/RNTuple.py with some shared utilities in compression.py and const.py.
So almost all of the code, per line and per file, is for reading conventional ROOT I/O.
A ROOT file consists of
- the TFile header, which starts at byte 0 of the file and has one of two fixed sizes (one uses 32-bit pointers and the other uses 64-bit pointers);
- the TStreamerInfo, which describes how to deserialize most classes—which byte means what—reading this is optional;
- the root (no pun) TDirectory, which describes where to find named objects, including other TDirectories;
- the named objects themselves, which can each be read separately, but must each be read in their entirety;
- a few non-TDirectory classes (TTree and RNTuple are the only ones I know of) point to data beyond themselves;
- chunks of data associated with a TTree (TBasket) or RNTuple (RBlob);
- the TFree list, which keeps track of which parts of the ROOT file are unoccupied by real data. This can be completely ignored when reading a ROOT file.
None of the objects listed above except the TFile header has a fixed location in the file. To know the byte location of any object, one must find it by following a chain from the TFile header to the root TDirectory to any subdirectory to the object and maybe to a TBasket if the object is a TTree.
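As a rough illustration of the fixed-size header at byte 0, here is a minimal sketch that parses it with Python's struct module. The field names and byte layouts are assumptions taken from the TFile header table in ROOT's class documentation, as is the rule that a version of 1000000 or more marks the 64-bit layout; parse_file_header is a hypothetical helper, not an Uproot function, and the header bytes below are synthetic.

```python
import struct

# Assumed big-endian layouts: magic, fVersion, fBEGIN, fEND, fSeekFree,
# fNbytesFree, nfree, fNbytesName, fUnits, fCompress, fSeekInfo,
# fNbytesInfo, UUID version, UUID.
HEADER_SMALL = struct.Struct(">4siiiiiiiBiiiH16s")  # 32-bit seek positions
HEADER_BIG = struct.Struct(">4siiqqiiiBiqiH16s")    # 64-bit seek positions

FIELDS = ("magic", "fVersion", "fBEGIN", "fEND", "fSeekFree", "fNbytesFree",
          "nfree", "fNbytesName", "fUnits", "fCompress", "fSeekInfo",
          "fNbytesInfo", "uuid_version", "uuid")


def parse_file_header(raw: bytes) -> dict:
    """Parse the header found at byte 0 of a ROOT file (hypothetical helper)."""
    magic, version = struct.unpack_from(">4si", raw, 0)
    if magic != b"root":
        raise ValueError("not a ROOT file")
    # Assumption: versions >= 1000000 use the layout with 64-bit seeks.
    layout = HEADER_BIG if version >= 1000000 else HEADER_SMALL
    return dict(zip(FIELDS, layout.unpack_from(raw, 0)))


# Synthetic headers for demonstration only; all numbers are made up.
small = HEADER_SMALL.pack(b"root", 63200, 100, 5000, 4900, 60, 1, 58, 4,
                          101, 1200, 800, 1, b"\x00" * 16)
big = HEADER_BIG.pack(b"root", 1063200, 100, 2**33, 2**33 - 100, 60, 1, 58,
                      8, 101, 1200, 800, 1, b"\x00" * 16)

small_header = parse_file_header(small)
big_header = parse_file_header(big)
```

From here, fSeekInfo would point at the TStreamerInfo record and fBEGIN at the first data record, which is where the chain of seek positions described above starts.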
RNTuple has its own system of navigation, starting at a ROOT::Experimental::RNTuple instance, which is a conventional ROOT I/O object that can live in a TDirectory like any TTree or histogram, but its headers, footers, column metadata, etc., are all new, custom objects, exposed to the conventional ROOT I/O system as generic RBlobs.
TBaskets and RBlobs can't be (or at least, aren't in practice) stored in a TDirectory.
If multiple objects have the same name in the same TDirectory, they're distinguished by a cycle number. It's common for ROOT to gradually update an object (such as TTree) by writing updates in the same directory with different cycle numbers, keeping only the most recent two. (Uproot updates TTrees in place with a single cycle number.)
Addressing and reading/writing data in a ROOT file is like reading/writing data in RAM, but instead of pointers, we have seek positions. Most conventional ROOT I/O seek positions are 32-bit, but there are modes in which objects can point to other objects with 64-bit seek positions when necessary. A ROOT file can (and often does) have a mixture of 32-bit and 64-bit seek positions.
Also like addressing data in RAM, space has to be allocated and freed when objects are created or deleted (when writing). Deleting an object creates a gap that is not filled by moving everything else in the file (which can be many GB), and new objects should take advantage of this space if they'll fit, rather than always allocating at the end of the file. This is why the file maintains a TFree list, just like malloc and free in libc. This can be ignored while reading, but keep in mind that any part of a ROOT file might be uninitialized junk, just like RAM.
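The bookkeeping behind a free list can be sketched in a few lines. This is a generic first-fit allocator over (start, stop) byte ranges, written only to illustrate the idea; allocate and release are hypothetical names, and neither ROOT's nor Uproot's actual allocation strategy is reproduced here.

```python
def allocate(free_segments, end_of_file, num_bytes):
    """Find room for num_bytes; return (position, free_segments, end_of_file).

    free_segments is a list of (start, stop) gaps with stop exclusive.
    The first gap that is large enough is used; otherwise the object
    is appended at the end of the file.
    """
    for i, (start, stop) in enumerate(free_segments):
        if stop - start >= num_bytes:
            leftover = [(start + num_bytes, stop)] if stop - start > num_bytes else []
            updated = free_segments[:i] + leftover + free_segments[i + 1:]
            return start, updated, end_of_file
    return end_of_file, list(free_segments), end_of_file + num_bytes


def release(free_segments, start, stop):
    """Mark a deleted object's bytes as free, merging touching gaps."""
    merged = []
    for seg in sorted(free_segments + [(start, stop)]):
        if merged and seg[0] <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], seg[1]))
        else:
            merged.append(seg)
    return merged


# Delete an object occupying bytes 100..200 of a 1000-byte file.
free = release([], 100, 200)
# A 60-byte object fits in the gap; a 50-byte one then has to go at the end.
pos_a, free, eof = allocate(free, 1000, 60)
pos_b, free, eof = allocate(free, eof, 50)
```

Nothing is moved when an object is deleted; only the list of gaps changes, which is why stale bytes can remain inside the gaps.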
A TDirectory consists of an array of TKeys, which specify the name, cycle number, title, class name, compressed size, uncompressed size, and seek position of the object. At the seek position, there's another TKey with nearly all the same fields, to characterize the object if you didn't find it from a TDirectory (such as TBaskets and RBlobs). TDirectory and TKey are never compressed; the data they point to may be.
Any C++ class instance that ROOT's reflection understands (i.e. anything compiled by Cling) can be written to the file and read back later. What actually gets written are the data members of the C++ class, public and private, and none of the code. Class definitions change, and a ROOT file may be written with one version of a class (with members x and y, say) and read by a process in which the class has different members (x, y, and z). Thus, each class needs a numerical version (a single, increasing integer) and the ROOT file should have TStreamerInfo records for all the versions of all the classes it contains.
ROOT files don't always have TStreamerInfo records for all the classes they contain. Some very basic classes, defined before TStreamerInfo "dictionary" generation was automated, have TStreamerInfo records that don't seem to match the actual serialization or none at all. Also, the classes needed to read the TStreamerInfo can't be derived from TStreamerInfo itself. (This includes TStreamerInfo, all of the subclasses of TStreamerElement, TList, and TString.) Most often, files lacking TStreamerInfo records that are absolutely necessary for deserializing the objects were produced by hadd. (This comes up repeatedly in issues: there's nothing we can do if we don't have the TStreamerInfo.)
C++ ROOT has a large number of built-in classes. If a ROOT file contains objects of the same class names and versions that were compiled into that version of ROOT, ROOT can use its built-in streamer knowledge. Uproot has a smaller set of built-in streamer knowledge, consisting of histograms and TTrees from the past 10 years (beginning 5 years before the Uproot project started and staying up to date as new ROOT versions come out).
It also sometimes happens that users compile non-release versions of ROOT (from GitHub or nightlies) and the C++ class name-version combinations in these ROOT executables have different TStreamerInfo from the same name-version combinations in released versions of ROOT. Uproot needs to be flexible with the assumptions it makes about how data are serialized. In practice, this means that Uproot makes up to two attempts to read each object: first using its built-in streamer knowledge (so that it doesn't need to read a file's TStreamerInfo) and, if that fails, it reads the file's TStreamerInfo and attempts to read the object again.
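The two-attempt strategy can be sketched as a try/except around a cheap reader, with an expensive fallback that is only invoked on failure. Every name below (DeserializationError, read_object, builtin_reader, load_file_streamers) is hypothetical and the byte layouts are invented for the demonstration; this is not how Uproot's internals are spelled.

```python
import struct


class DeserializationError(Exception):
    """Raised when bytes don't match the layout a reader assumes."""


def read_object(raw, builtin_reader, load_file_streamers):
    """Try built-in knowledge first; consult the file's streamers on failure."""
    try:
        return builtin_reader(raw)
    except DeserializationError:
        reader = load_file_streamers()  # only paid for when needed
        return reader(raw)


def builtin_reader(raw):
    # Built-in assumption: the class has two 4-byte members, x and y.
    if len(raw) != 8:
        raise DeserializationError("expected exactly two 4-byte members")
    x, y = struct.unpack(">ii", raw)
    return {"x": x, "y": y}


streamer_loads = []  # records how often the fallback was triggered


def load_file_streamers():
    streamer_loads.append(True)

    def reader(raw):
        # Layout as described by the (pretend) file: x, y, and z.
        x, y, z = struct.unpack(">iii", raw)
        return {"x": x, "y": y, "z": z}

    return reader


first = read_object(struct.pack(">ii", 1, 2), builtin_reader, load_file_streamers)
loads_after_first = len(streamer_loads)
second = read_object(struct.pack(">iii", 1, 2, 3), builtin_reader, load_file_streamers)
```

The first object never touches the streamers; the second one fails the built-in assumption and succeeds on the retry.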
In principle, the serialization format of C++ class instances in TTrees is the same as the serialization format of the same class elsewhere, in a TDirectory, for instance. Some optimizations complicate that story, however.
- Most often, objects in TTrees are "split" into their constituent fields, with one TBranch per field. This is why a TTree's TBranches can contain child TBranches, to keep track of which TBranches came from the same class. Even though this changes how the data are laid out, we like split objects because (1) numerical data can be read much more quickly than if they had been embedded in class instances, which would have to be iterated over in Python, (2) if part of a class is unreadable for some reason, it's likely that the parts a user cares about are in numerical fields, which are readable as separate TBranches, and (3) if a user is only interested in a few members of a class, they don't have to read the other members at all. This last reason was the motivation for splitting in the first place. (RNTuple is based on splitting at all levels, everywhere, like Awkward Arrays.)
- Normally, class instances are preceded by a 4-byte integer specifying the number of serialized bytes and a 2-byte class version. This applies not only to the top-level class named in the TDirectory (such as TH1F), but also its constituent superclasses (such as TH1, TNamed, TObject, ...) and members (such as TAxis, TObjArrays of TFunctions, ...). High bits in the 4-byte integer can specify that the class version will be skipped (saving 2 bytes per nested object), and some TBranches specify that all headers will be skipped (saving 6 bytes per object). We don't know where all of the indicators of whether headers are skipped or not are located, which is the source of a few issues.
- TTree data has an additional mode called "memberwise splitting," which is indicated in the high bits of the 4-byte header. Memberwise splitting is like TBranch splitting but at a smaller level of granularity: instead of all members x of a TTree's classes being in TBranch parent_branch.x and all members y of that class being in TBranch parent_branch.y, a memberwise-split TBranch has all x contiguous for list items within an entry/event, followed by all y within the same entry/event. They are not split between TBranches and they are not split between entries (which usually correspond to events in HEP). Uproot has not implemented reading of memberwise-split data, except in one experimental case. We can, however, identify when memberwise splitting is used and raise an error.
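To make the difference in byte order concrete, the sketch below serializes one entry containing a list of (x, y) items in both orders. It deliberately omits all headers, counts, and version numbers that a real TBasket would contain, and the function names are hypothetical; it only shows the ordering of the member values.

```python
import struct


def serialize_objectwise(items):
    """Each item's members stay together: x0 y0 x1 y1 ..."""
    return b"".join(struct.pack(">ff", x, y) for x, y in items)


def serialize_memberwise(items):
    """Within one entry, all x come first, then all y: x0 x1 ... y0 y1 ..."""
    xs = b"".join(struct.pack(">f", x) for x, _ in items)
    ys = b"".join(struct.pack(">f", y) for _, y in items)
    return xs + ys


def deserialize_memberwise(raw, num_items):
    """Undo serialize_memberwise for a known number of items."""
    values = struct.unpack(f">{2 * num_items}f", raw)
    return list(zip(values[:num_items], values[num_items:]))


# One entry/event holding three items.
entry = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
objectwise = struct.unpack(">6f", serialize_objectwise(entry))
memberwise = struct.unpack(">6f", serialize_memberwise(entry))
roundtrip = deserialize_memberwise(serialize_memberwise(entry), len(entry))
```

A reader that assumes the object-wise order would pair x with the wrong values on memberwise data, which is why detecting the mode and raising an error is preferable to guessing.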
Whole objects—that is, each entire object with all its superclasses and members—addressed in a TDirectory can be compressed. Compression is identified by the compressed size being smaller than its uncompressed size. (Otherwise, we assume that it is not compressed.) In a TTree, compression is only applied at the level of a whole TBasket, which can contain many objects. A compressed object is a sequence of independently compressed blocks, each with a header (compression algorithm, compressed size, uncompressed size, and a checksum in the case of LZ4) and the compressed data. It's a sequence because the compressed data size can be larger than the largest expressible compressed size, which is a 3-byte number.
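The block framing can be sketched with zlib from the standard library. The 9-byte header used here (a 2-byte algorithm tag, one method byte, then 3-byte little-endian compressed and uncompressed sizes) is my understanding of the ZLIB case and should be treated as an assumption; the LZ4 checksum and the other algorithms are left out, and compress_blocks and decompress_blocks are hypothetical helpers.

```python
import zlib

MAX_BLOCK = 0xFFFFFF  # largest size expressible in 3 bytes


def compress_blocks(data, max_block=MAX_BLOCK, level=1):
    """Split data into chunks and frame each as header + zlib payload."""
    out = []
    for start in range(0, len(data), max_block):
        chunk = data[start:start + max_block]
        payload = zlib.compress(chunk, level)
        if len(payload) > MAX_BLOCK:
            raise ValueError("compressed block does not fit in 3 bytes")
        header = (b"ZL" + bytes([8])               # tag and method byte (assumed)
                  + len(payload).to_bytes(3, "little")
                  + len(chunk).to_bytes(3, "little"))
        out.append(header + payload)
    return b"".join(out)


def decompress_blocks(raw):
    """Walk the sequence of blocks, decompressing each independently."""
    out = []
    pos = 0
    while pos < len(raw):
        if raw[pos:pos + 2] != b"ZL":
            raise ValueError("this sketch only handles ZLIB blocks")
        compressed_size = int.from_bytes(raw[pos + 3:pos + 6], "little")
        uncompressed_size = int.from_bytes(raw[pos + 6:pos + 9], "little")
        payload = raw[pos + 9:pos + 9 + compressed_size]
        block = zlib.decompress(payload)
        if len(block) != uncompressed_size:
            raise ValueError("block size does not match its header")
        out.append(block)
        pos += 9 + compressed_size
    return b"".join(out)


# A small max_block forces several blocks without needing 16 MB of data.
data = bytes(range(256)) * 40
framed = compress_blocks(data, max_block=4096)
restored = decompress_blocks(framed)
```

Because every block carries its own algorithm tag, a reader can dispatch per block rather than trusting a file-level setting.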
The actual compression algorithm and compression level used may be entirely different from the fCompress specified in the TFile header, the TTree, and the TBranch that the data belongs to. For instance, TStreamerInfo blocks are often ZLIB compressed, even if the TFile specifies LZ4.
As stated above, RNTuple is entirely different. After navigating to the ROOT::Experimental::RNTuple object (also called an "anchor"), a newly designed layout takes over, which has very little in common with the old ROOT I/O (one exception: compressed objects have the same format). This new format has a specification, so many of the problems we have finding information (e.g. about whether headers exist or not) wouldn't even come up. RNTuple is functionally equivalent to an Awkward/Apache Arrow/Feather dataset on disk: fully split into columns, with metadata to find the columns and specify their data types.
Uproot is not only an independent implementation of ROOT I/O, but also Python, rather than C++, so we make some different decisions from ROOT.
First of all, we assume that a ROOT file does not change while we're reading it and that no other process changes the file while we're writing it. We assume that users treat ROOT files as fixed artifacts, copying from an input file to an output file if need be, rather than using a file as a shared filesystem. Although Uproot has an "update" mode that can add or delete objects from an existing ROOT file, it is not thread-safe: multiple Python threads cannot write to the same file. Also, when writing objects to a file, Uproot uses a different allocation strategy than ROOT (it always keeps the TFree list at the end of the file), but as long as it maintains a correct TFree list, it's compatible.
Uproot does not run Cling or any C++, so class methods are either reimplemented in Python or are not available at all. A C++ user who creates custom classes with custom methods has to load a shared library/DLL to use those methods; there's no equivalent in Uproot. Moreover, Uproot is not a look-alike of C++ ROOT: it implements different methods than ROOT because Python has different needs.
Perhaps the biggest difference is in TTree-reading: ROOT is designed around iterating over TTree data, producing C++ class instances on demand and sometimes reusing a preallocated instance to avoid memory management in the loop, but Uproot is designed around filling arrays—NumPy, Awkward, or Pandas—for other libraries to perform computations on. Some TTrees are so large that the TBranches of interest can't be fully loaded into RAM, and for this case, uproot.iterate loads contiguous sets of entries/events in each loop iteration, but this is an elaboration of the primary access method, which is about eagerly loading data into memory.
Accordingly, normal access methods in ROOT hide the splitting of classes into sub-TBranches, so that the split-level is an optimization detail. Uproot always exposes each TBranch as an individual object to read—in this sense, Uproot is more low-level, since the way that you'd read a split TTree is different from the way you'd read an unsplit TTree.
The equivalent of a C++ class in Uproot is a Model. Model instances generally aren't created with a constructor, but are read directly from a ROOT file (with the Model.read classmethod). Rather than mapping C++ classes onto Python classes directly (mapping C++ members to Python attributes and the C++ class hierarchy onto the Python class hierarchy), Uproot's Models are representations of the C++ class as data:
- C++ member data are in a Model._members dict
- C++ object superclasses are in a Model._bases list
so getting an inherited member from some model means checking the local dict, then recursively searching the members of the Model instances in the Model._bases list. There are Python methods for doing these searches, but C++ data are "held at arm's length" from Python itself.
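The lookup described above can be sketched with a toy class. ToyModel is a hypothetical stand-in that borrows only the _members and _bases attribute names from the text; the order in which bases are searched is an arbitrary choice here, and the real Model class has much more machinery.

```python
class ToyModel:
    """Toy stand-in for a Model: C++ members and bases held as plain data."""

    def __init__(self, members, bases=()):
        self._members = dict(members)  # this class's own C++ data members
        self._bases = list(bases)      # instances representing superclasses

    def member(self, name):
        """Check the local dict, then recursively search the bases."""
        if name in self._members:
            return self._members[name]
        for base in self._bases:
            try:
                return base.member(name)
            except KeyError:
                continue
        raise KeyError(name)


# A histogram-like object: TH1 derives from TNamed, which derives from TObject.
tobject = ToyModel({"fUniqueID": 0, "fBits": 0})
tnamed = ToyModel({"fName": "hist", "fTitle": "an example"}, bases=[tobject])
th1 = ToyModel({"fEntries": 100.0}, bases=[tnamed])

name = th1.member("fName")  # found one level up, in the TNamed part
bits = th1.member("fBits")  # found two levels up, in the TObject part
```

Note that th1 has no Python attribute called fName: the C++ data never become Python attributes, which is the "arm's length" separation mentioned above.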
(Historical note: before Uproot 4, C++ classes and Python classes were directly mapped, but it was harder to maintain because "Uproot doing its own work" got mixed with "data acting like ROOT data.")
The Model class name encodes the C++ class name-version pair through classname_encode/classname_decode functions. C++ class names include namespaces (::) and template instantiations (<, >, and ,), which can't be included in a Python class name. These characters, as well as any underscore in the C++ name, are converted into hexadecimal codes surrounded by underscores. The whole name is prefixed with Model_ and suffixed with _v#, where # is the class version number, so it's impossible to confuse a C++ class name for a Python Model name, even if the C++ name doesn't use any illegal characters.
Here's an example:
>>> import uproot
>>> cpp_name = "std::sorted_map<int,std::string>"
>>> cpp_version = 6
>>> model_name = uproot.model.classname_encode(cpp_name, cpp_version)
>>> model_name
'Model_std_3a3a_sorted_5f_map_3c_int_2c_std_3a3a_string_3e__v6'
>>> uproot.model.classname_decode(model_name)
('std::sorted_map<int,std::string>', 6)
You'll see a lot of _3a3a_ (for ::) and _3c_ ... _3e_ (for < ... >) in Model class names. Note that translating the underscores into _5f_ (between sorted and map) ensures that the transformation is always reversible, and it's not possible to confuse any _v# suffixes that users put at the ends of their class names with ours.
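For readers who want to see why the scheme is reversible, here is a standalone re-implementation of the rules described above. The functions toy_classname_encode and toy_classname_decode are hypothetical names for this sketch, not Uproot's own code; the sketch relies on the observation that every underscore in an encoded body belongs to a hex code, so underscores always come in pairs.

```python
import re

_NOT_ALPHANUMERIC = re.compile(r"[^a-zA-Z0-9]+")
_HEX_CODE = re.compile(r"_((?:[0-9a-f]{2})+)_")
_VERSION_SUFFIX = re.compile(r"_v(\d+)$")


def toy_classname_encode(cpp_name, version=None):
    """Replace each run of non-alphanumeric characters with _<hex>_."""
    body = _NOT_ALPHANUMERIC.sub(
        lambda m: "_" + m.group(0).encode().hex() + "_", cpp_name
    )
    suffix = "" if version is None else f"_v{version}"
    return "Model_" + body + suffix


def toy_classname_decode(model_name):
    """Invert toy_classname_encode, returning (cpp_name, version)."""
    if not model_name.startswith("Model_"):
        raise ValueError("not a Model name")
    body = model_name[len("Model_"):]
    version = None
    match = _VERSION_SUFFIX.search(body)
    # A genuine suffix starts after an even number of underscores; an odd
    # count means this underscore is the closing half of a hex code.
    if match and body[:match.start()].count("_") % 2 == 0:
        version = int(match.group(1))
        body = body[:match.start()]
    cpp_name = _HEX_CODE.sub(lambda m: bytes.fromhex(m.group(1)).decode(), body)
    return cpp_name, version


encoded = toy_classname_encode("std::sorted_map<int,std::string>", 6)
decoded = toy_classname_decode(encoded)
# A user's class that happens to end in _v2, stored without a version.
tricky = toy_classname_decode(toy_classname_encode("TH1_v2"))
```

The last case shows the point made above: a user's own _v2 ending is encoded as _5f_v2, so it is not mistaken for a version suffix.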
Models for most C++ classes that exist are generated on the fly. When a deserialization routine encounters a class that isn't in the global uproot.classes dict or the relevant file's ReadOnlyFile._custom_classes dict, Uproot reads the file's ReadOnlyFile.streamers (if it hasn't already) and uses the TStreamerInfo to generate Python code for the class and evaluate it. Then it has a new Model with which to Model.read the object. The Model class definitions are put into a Python module called uproot.dynamic, which is empty when Uproot is first imported. (It's not necessary for dynamically generated classes to be in a module in Python; this is for possible user convenience.)
Models can be versionless (no _v# suffix in the name) or versioned; all dynamically generated Models are versioned. Models are predefined in the uproot.models module for the most commonly used classes (histograms and TTrees), for ROOT classes that don't seem to agree with their TStreamerInfo or don't have TStreamerInfo (basic classes like TObject, TRef, TDatime, ...), and for classes needed to read TStreamerInfo itself (TStreamerInfo, all the TStreamerElement subclasses, and TList). Each of these classes lives in a submodule of uproot.models. Since they are hand-written, many of them are versionless. If there are any version-specific differences in deserialization, they may be handled with if-then-else clauses in their read classmethod or read_members method. (See, for instance, the version-dependent branch in Model_TStreamerInfo.)
The uproot.classes dict is the global collection of Model classes built so far. This dict maps C++ class names (strings) to versionless Model class definitions or to a DispatchByVersion class object (see src/uproot/model.py), which keeps a known_versions dict to map version numbers (integers) to versioned Model class definitions. Some of these DispatchByVersion class objects have been built by hand, such as those in src/uproot/models/TAtt.py.
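The two kinds of entries in such a registry can be sketched with toy classes. Everything below (toy_classes, resolve, and the empty class bodies) is hypothetical and only mirrors the shape described above: a name maps either to a versionless class or to a dispatcher holding a known_versions dict.

```python
class Model_TObject:
    """Versionless toy Model: one definition serves every class version."""


class Model_TAttLine_v1:
    """Versioned toy Model for class version 1."""


class Model_TAttLine_v2:
    """Versioned toy Model for class version 2."""


class ToyDispatch_TAttLine:
    """Toy dispatcher: maps version numbers to versioned Model classes."""
    known_versions = {1: Model_TAttLine_v1, 2: Model_TAttLine_v2}


# Toy registry keyed by C++ class name.
toy_classes = {
    "TObject": Model_TObject,
    "TAttLine": ToyDispatch_TAttLine,
}


def resolve(classname, version):
    """Return the Model class to use for a C++ class name and version."""
    entry = toy_classes[classname]
    known = getattr(entry, "known_versions", None)
    if known is None:
        return entry       # versionless: the version number is not needed
    return known[version]  # KeyError means this version hasn't been built


line_model = resolve("TAttLine", 2)
object_model = resolve("TObject", 1)
```

In this sketch an unseen version simply raises KeyError; in the real library that is the point at which a new versioned Model would be generated from TStreamerInfo, as described earlier.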
Sometimes, we can't actually build a Model class object, for a variety of reasons. At the very least, we build a Model whose name starts with Unknown_ (rather than Model_) and put it in the uproot.unknown_classes dict. In some cases, these unknown objects can be skipped over, allowing subsequent data to be deserialized. In other cases, they can't, and deserializing an unknown Model instance raises an error.
Very few Models can be serialized, mostly just those that support Uproot's writing of histograms and TTrees. All of the serializable Models have hand-written serialization methods—generic serialization from TStreamerInfo has not been implemented.
However, those that can be serialized can be converted into PyROOT objects, and all PyROOT objects can be converted into Uproot Models (if the Model class is in uproot.classes), using the functions in src/uproot/pyroot.py. These translations go through ROOT's TMessage system (serializing and deserializing in memory).
C++ classes are loaded into Python Models without any of their class methods. For classes that we do not recognize—because the set of classes in ROOT is vast or because the set of classes users can define is infinite—
Three classes representing ROOT data are not Models: ReadOnlyFile, ReadOnlyDirectory, and ReadOnlyKey (all defined in src/uproot/reading.py).