Guide to the codebase

Jim Pivarski edited this page Jan 14, 2023 · 26 revisions

This page is to help developers get a sense of where to find things in the Uproot codebase.

Overview

The purpose of Uproot is to provide an array-oriented, pure Python way of reading and writing ROOT files. It doesn't address any part of ROOT other than I/O, and it provides basic, low-level I/O. It's not an environment/user interface: if someone wants an immersive experience, they can write packages on top of Uproot, such as uproot-browser. However, we do want to streamline the process of navigating through files with Uproot, to the point of being "not annoying."

Although the style of programming is almost entirely imperative, not array-oriented like Awkward Array, there's a wide range in "depth" of code editing. Some changes would be "surface level," making a function more friendly/ergonomic to use, while others are "deep," manipulating low-level byte streams.

Structure

All of the source code is in src/uproot. The tests are (roughly) numbered by PR or issue number, and the version number is controlled by src/uproot/version.py (not by pyproject.toml, even though this is a hatchling-based project). If there is no version.py or it has a placeholder version, the information in this paragraph may be out-of-date. (Please update!)

Within src/uproot, all of the files and directories are for reading, with three exceptions that are for writing: writing, sink, and serialization.py. A few are shared between reading and writing, such as models, _util.py, compression.py, and streamers.py.

Everything is for conventional ROOT I/O, not RNTuple, except for models/RNTuple.py with some shared utilities in compression.py and const.py.

So almost all of the code, per line and per file, is for reading conventional ROOT I/O.

How conventional ROOT I/O works

A ROOT file consists of

  • the TFile header, which starts at byte 0 of the file and has one of two fixed sizes (one uses 32-bit pointers and the other uses 64-bit pointers);
  • the TStreamerInfo, which describes how to deserialize most classes (which byte means what) and is optional to read;
  • the root (no pun) TDirectory, which describes where to find named objects, including other TDirectories;
  • the named objects themselves, which can each be read separately, but must each be read in their entirety;
  • chunks of data associated with a TTree (TBasket) or RNTuple (RBlob), which are the only non-TDirectory classes I know of that point to data beyond themselves;
  • the TFree list, which keeps track of which parts of the ROOT file are unoccupied by real data. This can be completely ignored when reading a ROOT file.
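As a rough illustration of the two header sizes in the first bullet, here is a minimal sketch that parses a synthetic TFile header with only the standard library. The field names, their order, and the rule that a version of 1000000 or more marks the 64-bit layout are assumptions based on my understanding of the ROOT file format, not something stated on this page; check them against the ROOT documentation before relying on them.

```python
import struct

# Assumed big-endian layouts. The "big" variant widens fEND, fSeekFree,
# and fSeekInfo from 32-bit to 64-bit integers.
SMALL_HEADER = struct.Struct(">4siiiiiiiBiiiH16s")
BIG_HEADER = struct.Struct(">4siiqqiiiBiqiH16s")

# Assumed field names, in the order they appear in both layouts.
FIELD_NAMES = (
    "magic", "fVersion", "fBEGIN", "fEND", "fSeekFree", "fNbytesFree",
    "nfree", "fNbytesName", "fUnits", "fCompress", "fSeekInfo",
    "fNbytesInfo", "uuid_version", "uuid",
)


def parse_header(data):
    """Parse the header at byte 0, choosing the layout from the version."""
    if data[:4] != b"root":
        raise ValueError("not a ROOT file: missing 'root' magic bytes")
    (version,) = struct.unpack_from(">i", data, 4)
    # Assumption: versions of 1000000 and above mark the 64-bit layout.
    layout = BIG_HEADER if version >= 1000000 else SMALL_HEADER
    return dict(zip(FIELD_NAMES, layout.unpack_from(data, 0)))


# Synthetic headers with made-up values (not taken from any real file).
small = SMALL_HEADER.pack(
    b"root", 62800, 100, 5000, 4900, 60, 1, 58, 4, 101, 1200, 300, 1, bytes(16)
)
big = BIG_HEADER.pack(
    b"root", 1062800, 100, 5_000_000_000, 4_999_999_000, 60, 1, 58, 8, 101,
    1200, 300, 1, bytes(16)
)

small_fields = parse_header(small)
big_fields = parse_header(big)
```

Only the header can be parsed this way, from a known position; everything else has to be located through seek positions such as fSeekInfo and fSeekFree.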

When we read a ROOT file with Uproot, we assume that it is in read-only mode and that other processes are not changing the file while we read it. C++ ROOT does not make this assumption, which makes it considerably more complicated.

When we write a ROOT file with Uproot, we assume that a single-threaded Uproot process is the only writer of that file. Again, C++ ROOT is more general, and thus more complex.

None of the objects listed above except the TFile header has a fixed location in the file. To know the byte location of any object, one must find it by following a chain from the TFile header to the root TDirectory to any subdirectory to the object and maybe to a TBasket if the object is a TTree.
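The chain-following idea can be sketched with a toy file format. The record layouts below are invented for illustration and are much simpler than ROOT's real TKey and TDirectory serialization; the only point is that nothing but the header has a fixed position, so every lookup starts at byte 0 and hops from seek position to seek position.

```python
import struct

# Toy layout (hypothetical, NOT ROOT's real format):
#   header    at byte 0: 4-byte magic, 4-byte seek of the root directory
#   directory: 2-byte entry count, then per entry:
#              1-byte is-directory flag, 1-byte name length, name, 4-byte seek
#   object:    2-byte payload length, payload


def write_object(buf, payload):
    pos = len(buf)
    buf.extend(struct.pack(">H", len(payload)) + payload)
    return pos


def write_directory(buf, entries):
    pos = len(buf)
    buf.extend(struct.pack(">H", len(entries)))
    for name, is_dir, seek in entries:
        raw = name.encode()
        buf.extend(struct.pack(">BB", int(is_dir), len(raw)) + raw)
        buf.extend(struct.pack(">I", seek))
    return pos


def read_directory(data, pos):
    (count,) = struct.unpack_from(">H", data, pos)
    pos += 2
    entries = {}
    for _ in range(count):
        is_dir, name_len = struct.unpack_from(">BB", data, pos)
        pos += 2
        name = data[pos:pos + name_len].decode()
        pos += name_len
        (seek,) = struct.unpack_from(">I", data, pos)
        pos += 4
        entries[name] = (bool(is_dir), seek)
    return entries


def read_object(data, pos):
    (length,) = struct.unpack_from(">H", data, pos)
    return bytes(data[pos + 2:pos + 2 + length])


def lookup(data, path):
    """Follow header -> root directory -> subdirectories -> object."""
    magic, seek = struct.unpack_from(">4sI", data, 0)
    if magic != b"toy!":
        raise ValueError("not a toy file")
    parts = path.split("/")
    for i, part in enumerate(parts):
        is_dir, seek = read_directory(data, seek)[part]
        # Intermediate parts must be directories; the last must be an object.
        if is_dir != (i < len(parts) - 1):
            raise KeyError(path)
    return read_object(data, seek)


# Build a toy file: children are written first so their seeks are known.
buf = bytearray(8)  # reserve space for the header
hist = write_object(buf, b"histogram bytes")
sub = write_directory(buf, [("hist", False, hist)])
tree = write_object(buf, b"tree metadata")
root = write_directory(buf, [("subdir", True, sub), ("tree", False, tree)])
struct.pack_into(">4sI", buf, 0, b"toy!", root)
toy_file = bytes(buf)

hist_payload = lookup(toy_file, "subdir/hist")
```

In this toy file the root directory happens to be written last, which is a reminder that a directory's position says nothing about where its contents live.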

RNTuple has its own system of navigation, starting at a ROOT::Experimental::RNTuple instance. That instance is a conventional ROOT I/O object that can live in a TDirectory like any TTree or histogram, but the RNTuple's headers, footers, column metadata, etc., are all new, custom objects, exposed to the conventional ROOT I/O system as generic RBlobs.

TBaskets and RBlobs can't be (or at least, aren't in practice) stored in a TDirectory.

Addressing and reading/writing data in a ROOT file is like reading/writing data in RAM, but instead of pointers, we have seek positions. Most conventional ROOT I/O seek positions are 32-bit, but there are modes in which objects can point to other objects with 64-bit seek positions when necessary. A ROOT file can (and often does) have a mixture of 32-bit and 64-bit seek positions.
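To see why both widths are needed, here is a small sketch that packs seek positions as big-endian integers, assuming they are signed. The helper names and the explicit is_big flag are hypothetical; in a real ROOT file the width is signaled by the surrounding record, not by a flag like this one.

```python
import struct

# Largest position a signed 32-bit integer can hold (just under 2 GiB).
MAX_32BIT_SEEK = 2**31 - 1


def pack_seek(position):
    """Return (is_big, raw_bytes), using 64 bits only when necessary."""
    if position <= MAX_32BIT_SEEK:
        return False, struct.pack(">i", position)
    return True, struct.pack(">q", position)


def unpack_seek(is_big, raw, offset=0):
    """Read a seek position back, given which width was used."""
    fmt = ">q" if is_big else ">i"
    return struct.unpack_from(fmt, raw, offset)[0]


near = pack_seek(100)            # fits in 4 bytes
far = pack_seek(3 * 1024**3)     # 3 GiB needs 8 bytes

# A 3 GiB position cannot be squeezed into 32 bits.
try:
    struct.pack(">i", 3 * 1024**3)
    overflowed = False
except struct.error:
    overflowed = True
```

An object written early in a large file can therefore carry 32-bit seek positions while objects written past the 2 GiB mark need 64-bit ones, which is how a single file ends up with a mixture.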

Also like addressing data in RAM, space has to be allocated and freed when objects are created or deleted (when writing). Deleting an object creates a gap that is not filled by moving everything else in the file (which can be many GB), and new objects should take advantage of this space if they'll fit, rather than always allocating at the end of the file. This is why the file maintains a TFree list, just like malloc and free in libc. The algorithm that Uproot uses when writing and the algorithm that C++ ROOT uses are different, but they each maintain a correct TFree list, so files written by either remain compatible with the other. The TFree list can be ignored while reading, but keep in mind that any part of a ROOT file might be uninitialized junk, just like RAM.
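The bookkeeping can be sketched as a first-fit allocator over a list of gaps. This is neither Uproot's algorithm nor C++ ROOT's (the class and method names are hypothetical); it only shows the kind of state a free list has to keep consistent.

```python
class FreeList:
    """Toy first-fit allocator over byte ranges in a file."""

    def __init__(self, end):
        self.gaps = []   # sorted (start, stop) ranges, stop exclusive
        self.end = end   # current end of the file

    def allocate(self, nbytes):
        # Reuse the first gap that is large enough.
        for i, (start, stop) in enumerate(self.gaps):
            if stop - start >= nbytes:
                if stop - start == nbytes:
                    del self.gaps[i]
                else:
                    self.gaps[i] = (start + nbytes, stop)
                return start
        # Otherwise grow the file.
        start = self.end
        self.end += nbytes
        return start

    def release(self, start, nbytes):
        # Record the gap, then merge touching or overlapping neighbors.
        self.gaps.append((start, start + nbytes))
        self.gaps.sort()
        merged = []
        for a, b in self.gaps:
            if merged and merged[-1][1] >= a:
                merged[-1] = (merged[-1][0], max(merged[-1][1], b))
            else:
                merged.append((a, b))
        self.gaps = merged


free = FreeList(end=100)          # pretend bytes 0..99 hold the header
a = free.allocate(50)             # placed at the end of the file
b = free.allocate(30)
c = free.allocate(20)
free.release(a, 50)               # deleting the first object leaves a gap
d = free.allocate(40)             # fits in the gap, so the file does not grow
e = free.allocate(20)             # too big for what is left of the gap
free.release(b, 30)               # merges with the leftover 10 bytes
```

The deleted bytes are never overwritten by release, which is why a region listed as free may still contain stale data.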

Any C++ class instance that ROOT's reflection understands (i.e. anything compiled by Cling) can be written to the file and read back later. What actually gets written are the data members of the C++ class—public and private—and none of the code. Class definitions change, and a ROOT file may be written with one version of a class (with members x and y, say) and read by a process in which the class has different members (x, y, and z). Thus, each class needs a numerical version—a single, increasing integer—and the ROOT file should have TStreamerInfo records for all the versions of all the classes it contains.
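The role of the version number can be illustrated with a toy streamer table. The Point class, the table, and the byte layout (a 2-byte version followed by the members) are all hypothetical and far simpler than real TStreamerInfo; the sketch only shows that a reader needs a member list for the exact version that was written.

```python
import struct

# Hypothetical streamer table: (class name, version) -> ordered members,
# each with a struct format code ("d" is a 64-bit float).
STREAMERS = {
    ("Point", 1): [("x", "d"), ("y", "d")],
    ("Point", 2): [("x", "d"), ("y", "d"), ("z", "d")],
}


def serialize(classname, version, values):
    """Write the version, then the data members in streamer order."""
    members = STREAMERS[classname, version]
    fmt = ">h" + "".join(code for _, code in members)
    return struct.pack(fmt, version, *(values[name] for name, _ in members))


def deserialize(classname, raw):
    """Read the version first, then use it to pick the member layout."""
    (version,) = struct.unpack_from(">h", raw, 0)
    members = STREAMERS.get((classname, version))
    if members is None:
        # Without a streamer for this version, the bytes can't be interpreted.
        raise KeyError(f"no streamer for {classname} version {version}")
    fmt = ">" + "".join(code for _, code in members)
    fields = struct.unpack_from(fmt, raw, 2)
    return version, dict(zip((name for name, _ in members), fields))


old_bytes = serialize("Point", 1, {"x": 1.0, "y": 2.0})
new_bytes = serialize("Point", 2, {"x": 1.0, "y": 2.0, "z": 3.0})

old_point = deserialize("Point", old_bytes)
new_point = deserialize("Point", new_bytes)
```

Note that only data members appear in the bytes: nothing about the class's methods is written, so the reader supplies all of the behavior.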

ROOT files don't always have TStreamerInfo records for all the classes they contain. Some very basic classes, defined before TStreamerInfo "dictionary" generation was automated, either have TStreamerInfo records that don't seem to match the actual serialization or have none at all. Also, the classes needed to read the TStreamerInfo can't be derived from TStreamerInfo itself. (This includes TStreamerInfo, all of the subclasses of TStreamerElement, TList, and TString.) Most often, files that lack TStreamerInfo records absolutely necessary for deserializing their objects were produced by hadd. (This comes up repeatedly in issues: there's nothing we can do if we don't have the TStreamerInfo.)
