TIMBER (Tree Interface for Making Binned Events with RDataFrame) is an easy-to-use and fast python analysis framework used to quickly process CMS data sets. Default arguments assume the use of the NanoAOD format but any ROOT TTree can be processed.
Despite the fact that Python 2.7 reached end-of-life on January 1st, 2020, it is still the dominant version used by CMS. If you need CMSSW (ex. for JME modules), Python 2.7 is recommended. Otherwise, please take this opportunity to start using Python 3! Remember to make sure your ROOT version has been built with Python 3 compatibility. For information on how to do this, see this explanation. Though this does not always work consistently when CMSSW code is needed.
Working in a virtual environment is also recommended. Below are the commands for using virtualenv but
you're obviously free to use your favorite tool for the job (you can install virtualenv for Python 3 with
pip install virtualenv
(pip3
for Python 3)).
python -m virtualenv timber-env
source timber-env/bin/activate
git clone https://github.com/lcorcodilos/TIMBER.git
cd TIMBER
source setup.sh
Some C++ modules also have the boost library as a dependency.
The internet has plenty instructions on how to install boost. The standard apt-get
(Ubuntu) and brew
(macOS) package managers support install as well.
TIMBER's speed comes from the use of
ROOT's RDataFrame.
RDataFrame offers "multi-threading and other low-level optimisations" which make analysis level
processing faster and more efficient than traditional python for
loops. However,
RDataFrame derives its speed from its C++ back-end and while an RDataFrame object can be instantiated
and manipulated in python, any actions on it are written in C++ (even if you're using python).
Using RDataFrame means a fundamental re-thinking of how we treat a block of data or simulation. Instead of looping over the events or entries of a TTree (or other data format), the TTree is converted into a table called the "data frame". A user then books a number of "lazy" actions on the data frame such as filtering out events or calculating new values. These actions aren't performed though until the data frame needs to be evaluated (ex. you ask to plot a histogram from it).
In this way, there are no more for
loops and instead just actions on the data frame table that
transform it into a final table of values that the analyzer cares about.
Each row of the table is a separate event and each column is a different variable in the event (a branch in TTree terms). Columns can be single values or vectors (specifically ROOT::VecOps:RVec).
Since each row is an event, vectors are necessary for the case of multiple of the same physics object in an event - for example, multiple electrons.
NOTE NanoAOD orders these vectors in \f$p_T\f$ of the objects. So if you'd like the \f$\eta\f$ of the leading electron, it is stored as Electron_eta[0]
This can make accessing values tricky! For example, if there's one electron in an event and the analyzer asks for Electron_eta[1]
, the computer will return a seg fault. These are the types of problems that TIMBER attempts to solve (even if it's just by users sharing their experiences).
TIMBER is meant to keep both the processing fast via RDataFrame and the analyzer fast via python scripting.
To maintain python's appeal in HEP as a quick scripting language, TIMBER handles interfacing with RDataFrame so the analyzer can focus on writing their analysis.
TIMBER automates opening one or many ROOT files, calculating the number of events generated (provided the ROOT files are NanoAOD simulation), loading in C++ scripts for use while looping over the data frame, and grouping actions for easy manipulation.
In addition, TIMBER treats each step in the RDataFrame processing as a "node" and keeps track of these nodes as a larger tree.
Each action (or group of actions) performed on a node produces another node and nodes store information about their parents or children. This makes it possible to write tools like Nminus1()
which takes as input a node and a group of cuts to apply and returns N new nodes, each with every cut but one applied.
Finally, the RDataFrame for each node is always kept easily accessible so that any of the native RDataFrame tools are at the user's fingertips.
TIMBER includes a repository of common algorithms used frequently in CMS
which access scale factors, calculate pileup weights, and more. These are all written
in C++ for use in Cut
and Define
arguments and are provided so that users have a common tool box to share.
Additionally, the AnalysisModules folder welcomes additions of custom C++ modules on a
per-analysis basis so that the code can be properly archived for future reference and for sharing
with other analyzers.