
Guide to the codebase

Jim Pivarski edited this page Jan 14, 2023 · 26 revisions

This page is to help developers get a sense of where to find things in the Uproot codebase.

Overview

The purpose of Uproot is to provide an array-oriented, pure Python way of reading and writing ROOT files. It doesn't address any part of ROOT other than I/O, and it provides basic, low-level I/O. It's not an environment/user interface: if someone wants an immersive experience, they can write packages on top of Uproot, such as uproot-browser. However, we do want to streamline the process of navigating through files with Uproot, to the point of being "not annoying."

The codebase itself is written in an almost entirely imperative style, not an array-oriented one like Awkward Array. Even so, changes vary widely in "depth": some are "surface level," making a function friendlier or more ergonomic to use, while others are "deep," manipulating low-level byte streams.

Structure

All of the source code is in src/uproot. The tests are (roughly) numbered by PR or issue number, and the version number is controlled by src/uproot/version.py (not by pyproject.toml, even though this is a hatchling-based project). If there is no version.py or it has a placeholder version, the information in this paragraph may be out-of-date. (Please update!)

Within src/uproot, every file and directory is for reading, except for writing, sink, and serialization.py, which are for writing. A few are shared between reading and writing, such as models, _util.py, compression.py, and streamers.py.

Everything is for conventional ROOT I/O, not RNTuple, except for models/RNTuple.py with some shared utilities in compression.py and const.py.

So almost all of the code, per line and per file, is for reading conventional ROOT I/O.

How conventional ROOT I/O works

A ROOT file consists of

  • the TFile header, which starts at byte 0 of the file and has one of two fixed sizes (one uses 32-bit pointers and the other uses 64-bit pointers);
  • the TStreamerInfo, which describes how to deserialize most classes (which byte means what); reading this is optional;
  • the root (no pun) TDirectory, which describes where to find named objects, including other TDirectories;
  • the named objects themselves, which can each be read separately, but must each be read in their entirety; a few non-TDirectory classes (TTree and RNTuple are the only ones I know of) also point to data beyond themselves;
  • chunks of data associated with a TTree (TBasket) or RNTuple (RBlob);
  • the TFree list, which keeps track of which parts of the ROOT file are unoccupied by real data. This is necessary because a ROOT file is like a little filesystem that can get fragmented (deleting an object does not mean that all subsequent objects are copy-shifted to fill the gap). This can be completely ignored when reading a ROOT file.
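To make the first item concrete, here is a minimal sketch of parsing the TFile header with Python's struct module. The field names and their order are recalled from ROOT's TFile documentation, not taken from this page, so treat them as assumptions and check them against the ROOT reference before relying on them. The example bytes are synthetic, not from a real file.

```python
import struct

# ASSUMPTION: field layout as recalled from ROOT's TFile documentation.
# Big-endian, no padding. The "big" layout widens fEND, fSeekFree, and
# fSeekInfo from 32-bit to 64-bit integers.
SMALL = struct.Struct(">4siiiiiiiBiiiH16s")
BIG = struct.Struct(">4siiqqiiiBiqiH16s")

FIELDS = (
    "magic", "fVersion", "fBEGIN", "fEND", "fSeekFree", "fNbytesFree",
    "nfree", "fNbytesName", "fUnits", "fCompress", "fSeekInfo",
    "fNbytesInfo", "fUUID_version", "fUUID",
)


def parse_header(data):
    """Return the header fields of `data` as a dict, choosing the layout by version."""
    magic, version = struct.unpack_from(">4si", data, 0)
    if magic != b"root":
        raise ValueError("not a ROOT file: missing 'root' magic bytes")
    # ASSUMPTION: a version number of 1000000 or more marks the 64-bit layout.
    layout = BIG if version >= 1000000 else SMALL
    return dict(zip(FIELDS, layout.unpack_from(data, 0)))


# Synthetic headers, built only to exercise the parser.
small_bytes = SMALL.pack(b"root", 62400, 100, 5000, 4900, 60, 1, 58,
                         4, 101, 1200, 400, 1, bytes(16))
big_bytes = BIG.pack(b"root", 1062400, 100, 2**33, 2**33 - 200, 60, 1, 58,
                     8, 101, 1200, 400, 1, bytes(16))

small_header = parse_header(small_bytes)
big_header = parse_header(big_bytes)
```

Only the first eight bytes are needed to decide which of the two layouts applies; everything else follows from that choice.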

When we read a ROOT file with Uproot, we assume that it is in read-only mode and that other processes are not changing the file while we read it. C++ ROOT does not make this assumption, so its reading logic is considerably more complicated.

When we write a ROOT file with Uproot, we assume that a single-threaded Uproot process is the only writer of that file. Again, C++ ROOT is more general, and thus more complex.
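The "little filesystem" picture from the TFree item above matters mostly on the writing side. The toy model below shows why deleting an object leaves a gap and how a single writer might reuse it. It is only an illustration: the function names are hypothetical, and the first-fit policy is an assumption made for the example, not a description of what ROOT or Uproot actually does.

```python
def delete(free, start, stop):
    """Mark bytes [start, stop) as free, merging touching or overlapping segments."""
    merged = []
    for a, b in sorted(free + [(start, stop)]):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged


def allocate(free, nbytes, end_of_file):
    """First-fit: reuse the first gap that is large enough, else append to the file.

    Returns (position, new_free_list, new_end_of_file).
    """
    for i, (a, b) in enumerate(free):
        if b - a >= nbytes:
            rest = [(a + nbytes, b)] if b - a > nbytes else []
            return a, free[:i] + rest + free[i + 1:], end_of_file
    return end_of_file, free, end_of_file + nbytes


# Deleting two neighboring objects leaves one merged gap; nothing is shifted.
free = delete([], 100, 200)
free = delete(free, 200, 250)

# A 120-byte object fits in the gap; a later 50-byte object does not.
pos1, free, eof = allocate(free, 120, 1000)
pos2, free, eof = allocate(free, 50, eof)
```

With a single-threaded writer, this bookkeeping needs no locking, which is part of what the single-writer assumption buys.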
