This document corresponds to version 0.9 of the NUISANCE HEPData Conventions.
This document aims to provide a set of conventions on top of the established HEPData format specification that allow NUISANCE to automatically construct predictions for datasets from simulated vectors of neutrino interactions. These conventions should not limit the type of measurements that can be expressed and are meant to cover the majority of existing published data.
We want to implement the minimum set of specific cases that cover the majority of existing and envisioned measurement types, without the expectation that every single measurement will fit in one of our pre-defined types. For other types of measurements, many of the conventions in this document can still be followed to enable a succinct custom NUISANCE implementation by extending one of these types or composing existing utilities. See What To Do If My Measurement Doesn't Fit?.
- Checklist
- HEPData Records
- Building a Submission
- Tools and Utilities
- What To Do If My Measurement Doesn't Fit?
Below is an at-a-glance checklist for producing compliant HEPData records for most measurements. See the rest of the document for details.
- [✅] Dependent Variables that correspond to a cross section measurement must have a Qualifier with key `variable_type`. For most measurements, the most appropriate value is `cross_section_measurement`.
- [✅] For each Independent Variable, named `var`, in a table, a Qualifier with the key `var:projectfunc` must exist on each measurement Dependent Variable, and the value must be a valid Resource Reference to the name of a projection function in a snippet file. See Projection and Selection Snippets.
- [✅] Projection functions should be named according to the convention: `<Experiment>_<MeasurementSpecifier>_<Year>_<ProjectionName>_<INSPIREHEPId>`. This avoids symbol naming collisions when loading many records simultaneously.
- [✅] Each measurement Dependent Variable must include at least one `probe_flux` qualifier. See Probe Flux.
- [✅] Each measurement Dependent Variable must include at least one `target` qualifier. See Target.
- [✅] Each measurement Dependent Variable should include one `cross_section_units` qualifier. See Cross Section Units.
- [✅] Measurements that include a covariance estimate must include an `errors` qualifier. The value must be a valid Resource Reference to an errors table. See Errors.
- [✅] Measurements presented in some smeared space must include a `smearing` qualifier. The value must be a valid Resource Reference to a smearing table. See Smearing.
The top level data structure for a HEPData submission is called a Record. It can be referenced by a unique Id number. This document will not unnecessarily detail the HEPData format, as it is authoritatively documented elsewhere. Records are described by one or more YAML files and can also contain other files in a range of formats as Additional Resources.
Table: A HEPData Table broadly corresponds to a set of binned or unbinned axis definitions (Independent Variables) and the corresponding values over those axes (Dependent Variables). Dependent Variables are used to store measurements, predictions, errors, and smearing matrices.
Binned Independent Variables: The HEPData table format allows for fully generic hyper-rectangular bins in any number of dimensions. This is generic enough for any measurement that the authors are aware of. If your measurement makes use of non-hyper-rectangular bins, see What To Do If My Measurement Doesn't Fit? for ideas.
Qualifiers: HEPData Qualifiers are Key-Value pairs attached to Dependent Variables as metadata. These conventions describe a number of Qualifiers that may or must be present for a table to be compliant and automatically consumable by NUISANCE.
Additional Resources: Additional files can be added at either the record or the table level.
We define a format for record references as below, where `[]` denote optional components, `<>` denote the reference data itself, and `=` specifies a default value:

```
[<type=hepdata>:][<id>][/<resource[:<qualifier>]>]
```
All parts of the reference are optional. In the absence of a reference `id`, the reference is considered an intra-record reference, referring to a resource contained within the same HEPData record. In the absence of a `type`, the reference is considered an inter-record reference, referring to a resource contained within another HEPData record. In the absence of a `resource` component, the reference is considered a generic link to another record and not a pointer to a specific resource of that record.
The `type` of a reference is freeform, but apart from the special (and default) `hepdata` type, a generic referred resource will not be automatically retrievable. As HEPData uses INSPIREHEP ids as a foreign key for its records, the `inspirehep` type can be used to link to the HEPData record corresponding to a specific INSPIREHEP record. Other useful types might include: `arxiv`, `zenodo`, `doi`, among others. To refer to a specific resource, such as a flux prediction or covariance matrix, the `resource` component should be used. The `qualifier` sub-component is resource specific and is included to enable referring to sub-resources.
Some specific examples with explanations are given below:
| Example Reference | Comments |
|---|---|
| `MyCrossSection` | Refers to a table named `MyCrossSection` in the current HEPData record. |
| `12345/MyCrossSection` | Refers to a table named `MyCrossSection` in HEPData record `12345`. |
| `inspirehep:123/MyCrossSection` | Refers to a table named `MyCrossSection` in the HEPData record with INSPIREHEP id `123`. |
| `hepdata-sandbox:678910/MyCrossSection` | Refers to a table named `MyCrossSection` in HEPData Sandbox record `678910`. |
| `12345/MyCrossSection:Bkg` | Refers specifically to the `Bkg` Dependent Variable of table `MyCrossSection` in HEPData record `12345`. |
| `12345/moredata.yaml` | Refers to an Additional Resource file named `moredata.yaml` in HEPData record `12345`. |
| `12345/flux.root:flux_numu` | Refers to a specific object (in this case a histogram) named `flux_numu` in the Additional Resource file `flux.root` in HEPData record `12345`. |
| `12345/analysis.cxx:SelF` | Refers to a specific object (in this case a function) named `SelF` in the Additional Resource file `analysis.cxx` in HEPData record `12345`. |
HEPData Sandbox: Because the HEPData REST API differentiates between public and sandboxed records, a separate reference type, `hepdata-sandbox`, is defined to enable access to records that are in the sandbox. Public records should never link to sandboxed records, but sandboxed records may link to either other sandboxed records or public records.
- Measurement Qualifiers

| Qualifier Key | Required | Example Value |
|---|---|---|
| `variable_type` | Yes | `cross_section_measurement` |
| `measurement_type` | No | `flux_averaged_differential_cross_section` |
| `selectfunc` | Yes | `ana.cxx:MicroBooNE_CC0Pi_2019_Selection_123456` |
| `<var>:projectfunc` | Yes | `ana.cxx:MicroBooNE_CC0Pi_2019_DeltaPT_123456`, `ana.cxx:MicroBooNE_CCInc_2017_PMu_4321` |
| `<var>:prettyname` | No | `$p_{\mu}$` |
| `prettyname` | No | `$\mathrm{d}\sigma/\mathrm{d}p_{\mu}$` |
| `cross_section_units` | No | `pb\|per_target_nucleon\|per_first_bin_width` |
| `target` | Yes | `CH`, `C[12],H[1]`, `1000180400` |
| `probe_flux` | Yes | `123456/MicroBooNE_CC0Pi_2019_flux_numu`, `flux_numu[1],flux_nue[0.9]` |
| `test_statistic` | No | `chi2` |
| `errors` | No | `123456/MicroBooNE_CC0Pi_2019_DeltaPT_covar` |
| `smearing` | No | `123456/MicroBooNE_CC0Pi_2019_DeltaPT_smearing` |
- Additional Measurement Qualifiers for Composite Measurements

| Qualifier Key | Required | Example Value |
|---|---|---|
| `selectfunc[1...]` | No | `ana.cxx:MicroBooNE_CC0Pi_2019_Selection_123456` |
| `<var>:projectfunc[1...]` | No | `ana.cxx:MicroBooNE_CC0Pi_2019_DeltaPT_123456` |
| `<var>:prettyname[1...]` | No | `$p_{\mu}\ [\mathrm{GeV}/c]$` |
| `target[1...]` | No | `CH`, `C[12],H[1]`, `1000180400` |
| `probe_flux[1...]` | No | `123456/MicroBooNE_CC0Pi_2019_flux_numu`, `flux_numu[1],flux_nue[0.9]` |
| `sub_measurements` | No | `MicroBooNE_CC0Pi_2019_DeltaPTx,MicroBooNE_CC0Pi_2019_DeltaPTy` |
- Flux table qualifiers

| Qualifier Key | Required | Example Value |
|---|---|---|
| `variable_type` | Yes | `probe_flux` |
| `probe_particle` | Yes | `numu`, `-14` |
| `bin_content_type` | Yes | `count`, `count_density` |
- Error table qualifiers

| Qualifier Key | Required | Example Value |
|---|---|---|
| `variable_type` | Yes | `error_table` |
| `error_type` | Yes | `covariance`, `correlation`, `universes` |
- Smearing table qualifiers

| Qualifier Key | Required | Example Value |
|---|---|---|
| `variable_type` | Yes | `smearing_table` |
| `smearing_type` | Yes | `smearing_matrix` |
| `truth_binning` | No | `smear_bins` |
- Prediction table qualifiers

| Qualifier Key | Required | Example Value |
|---|---|---|
| `variable_type` | Yes | `cross_section_prediction` |
| `for_measurement` | No | `cross_section` |
| `expected_test_statistic` | No | `12.34` |
| `pre_smeared` | No | `true` |
| `label` | No | `CC Total`, `CC0$\pi$` |
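As a concrete illustration, the sketch below shows how such Qualifiers are stored in a HEPData data-table yaml file. The variable, function, and flux names (and all values) are purely illustrative, not taken from a real record.

```yaml
dependent_variables:
- header: {name: cross_section, units: "1e-38 cm2/GeV/Nucleon"}
  qualifiers:
  - {name: variable_type, value: cross_section_measurement}
  - {name: selectfunc, value: "analysis.cxx:MyExpt_CC0Pi_2025_Select_999999"}
  - {name: "p_mu:projectfunc", value: "analysis.cxx:MyExpt_CC0Pi_2025_PMu_999999"}
  - {name: target, value: CH}
  - {name: probe_flux, value: flux_numu}
  values:
  - value: 1.234
    errors:
    - {symerror: 0.1, label: total}
```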
The `variable_type` qualifier is used to explicitly mark any Dependent Variable that follows this convention. Dependent Variables without the `variable_type` qualifier will generally be ignored by NUISANCE. Values other than those in the table below will generally cause a processing error.
| Value | Comments |
|---|---|
| `cross_section_measurement` | Use this for measurements where a single table, probe flux, and set of selection and projection functions can be used to predict the measurement. This covers most measurements. |
| `composite_cross_section_measurement` | Use this for more complicated measurements that require multiple selections, multiple separate targets, or multiple fluxes. Also use this to construct a 'meta-measurement' that composites multiple other sub-measurements. For more details, see Composite Measurements. |
| `probe_flux` | Use this to mark a variable as describing the flux of a probe particle. |
| `error_table` | Use this to mark a variable as describing the error of another measurement. While Dependent Variables can contain bin-by-bin independent errors, a separate table is required to describe a matrix of correlated errors for a measurement. For more details, see Errors. |
| `smearing_table` | Use this to mark a variable as describing the input to a smearing process from generator 'true' space to the measurement space. For more details, see Smearing. |
| `cross_section_prediction` | Use this to mark a variable as containing a prediction for a `cross_section_measurement`-type dependent variable. |
For each HEPData Table, one `measurement_type` Qualifier may exist to signal how a prediction should be transformed before a comparison can be made. If none exists, the measurement is assumed to be of the default type, `flux_averaged_differential_cross_section`.
| Value | Comments |
|---|---|
| `flux_averaged_differential_cross_section` | The most common type of published neutrino-scattering measurement. A prediction is made by selecting and projecting simulated neutrino interactions from one or more neutrino species drawn from a neutrino energy spectrum, or flux shape. |
| `event_rate` | Some historic data are presented as simple event rates. In this case, only the shape of simulated predictions can be compared to the data. |
| `ratio` | Ratios must often be treated specially as, in general, multiple simulation runs are required to make a full prediction, using either different targets or different probe spectra, or both. Ratio measurements must always use the `composite_cross_section_measurement` variable type. |
| `total_cross_section` | Some historic measurements attempt to unfold the neutrino flux shape and neutrino energy reconstruction effects from observations to make direct measurements of the total scattering cross section as a function of neutrino energy. While this approach has fallen out of favor due to issues with model dependence, there are a number of data sets that are important to preserve. |
This list may be extended in future versions.
To make a prediction for a measurement from an MCEG event vector, the corresponding signal definition and observable projections must be defined. We use ProSelecta to codify and execute the selection and projection operators on an event vector. Each Dependent Variable that corresponds to a measurement must contain one `selectfunc` Qualifier and one `<var>:projectfunc` Qualifier per corresponding Independent Variable. The values of these qualifiers should be valid Resource References to ProSelecta functions defined in an Analysis Snippet.
Selection functions return an integer, which can be used with a composite measurement to select subsamples with a single selection function. For more details, see Composite Measurements. For most selection operators, either a 1 (signal) or a 0 (not signal) should be returned.
For the majority of published data, a measurement will take the form of a scattering cross section. There are a number of historic conventions for the units and additional target material scale factors used in published cross section measurements. To avoid clumsy parsing of the HEPData units variable metadata, the explicit units for the measured cross section should also be declared in a fixed form in the `cross_section_units` Qualifier. The value of the qualifier takes the form of `|`-separated 'flags', at most one from each grouping shown in the table below.
For consistency, we follow the NuHepMC reserved keywords for the Unit and the Target scales defined in G.C.4, but add the `PerTargetNeutron` and `PerTargetProton` options to support existing measurements.
| Value | Comments |
|---|---|
| **Unit** | |
| `cm2` | A common unit for cross-section measurements that requires us to carry around a power-of-ten scale factor that is dangerously close to the minimum value representable by single-precision IEEE 754 floating point numbers. |
| `1e-38 cm2` | Tries to avoid the 1E-38 factor by including it in the unit definition. |
| `pb` | 1 pb = 1E-36 cm2 |
| `nb` | 1 nb = 1E-33 cm2 |
| **Target Scaling** | |
| `PerTarget` | Include no explicit additional scaling on the total rate for calculated neutrino--target interactions. This will often correspond to a cross section per elemental nucleus, such as carbon. It also covers simple elemental combinations commonly used for hydrocarbon targets, such as CH, CH2, or CH4. It can also be used for more complex molecular targets. |
| `PerTargetNucleon` | Existing neutrino-scattering simulations often report the cross section per target nucleon, as the neutrino--nucleon interaction is considered the dominant fundamental process. |
| `PerTargetNeutron` | Some data report the cross section per 'active' nucleon, which for neutrino CCQE interactions with nuclei consists only of the bound neutrons. |
| `PerTargetProton` | As for `PerTargetNeutron`, but for processes that can only occur on target protons. |
| **Density Scaling** | |
| `per_bin_width` | The bin values are divided by the bin width (the hypervolume of the bin over all Independent Variables), as for a differential cross section. |
The assumed, or default, value for this qualifier, following the majority of published data, is `cross_section_units=1E-38 cm2|PerTarget|per_bin_width`.
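As a worked illustration of the Unit flags, the sketch below converts a prediction calculated in cm2 into a declared unit; the scale factors follow directly from the table above and the variable names are illustrative.

```python
# scale factors for converting a cross section in cm2 to each Unit flag
unit_scale = {
    "cm2": 1.0,
    "1e-38 cm2": 1e38,  # 1 unit = 1e-38 cm2
    "pb": 1e36,         # 1 pb = 1e-36 cm2
    "nb": 1e33,         # 1 nb = 1e-33 cm2
}

xsec_cm2 = 1.2e-38                               # an illustrative prediction in cm2
declared = "1e-38 cm2"                           # first flag of cross_section_units
xsec_declared = xsec_cm2 * unit_scale[declared]  # -> 1.2
```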
Target materials can be specified by a fully complete syntax or a useful shorthand. Annotated examples are given below.
| Example Value | Comments |
|---|---|
| `C` | A carbon target |
| `1000060120` | A carbon-12 target |
| `CH` | A hydrocarbon target with an average carbon:hydrogen nuclear ratio of 1:1 |
| `1000060120,1000010010` | A hydrocarbon target with a nuclear ratio of 1:1 |
| `1000060120[1],1000010010[1]` | A hydrocarbon target with an average carbon:hydrogen mass ratio of 1:1 (equivalent to a nuclear ratio of about 1:12) |
| `CH2` | A hydrocarbon target with an average carbon:hydrogen nuclear ratio of 1:2 |
| `Ar` | An argon-40 target |
| `1000180400` | An argon-40 target |
| `1000180390` | An argon-39 target |
For composite measurements, where multiple, separate targets are needed, additional target specifiers with an indexed key, e.g. `target[1]`, can be specified. For more details, see Composite Measurements.
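A sketch of decomposing the explicit comma-separated target syntax into (species, weight) pairs is shown below; the `parse_target` helper is purely illustrative (it is not part of NUISANCE), and elemental-formula shorthands such as `CH2` would need additional handling.

```python
import re

def parse_target(spec):
    # split a specifier like "1000060120[1],1000010010[1]" into components,
    # defaulting the bracketed weight to 1 when it is omitted
    components = []
    for comp in spec.split(","):
        m = re.fullmatch(r"([^\[\]]+)(?:\[([\d.]+)\])?", comp.strip())
        components.append((m.group(1), float(m.group(2) or 1)))
    return components

print(parse_target("1000060120[1],1000010010[1]"))
# [('1000060120', 1.0), ('1000010010', 1.0)]
```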
Sometimes a naive chi2 test statistic is not the most appropriate for a given data set, or the errors are encoded with a non-Gaussian PDF. The values currently handled are listed below.
| Value | Comments |
|---|---|
| `chi2` | |
| `shape_only_chi2` | |
| `shape_plus_norm_chi2` | |
| `poisson_pdf` | |
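For reference, the default `chi2` test statistic follows the conventional covariance-weighted definition (a standard definition, stated here for clarity rather than taken from these conventions), for data $d$, prediction $p$, and covariance $C$:

$$\chi^2 = \sum_{ij}\left(d_i - p_i\right)\left(C^{-1}\right)_{ij}\left(d_j - p_j\right)$$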
It is very useful if data releases are packaged with an example prediction and the expected test statistic value, so that a degree of automated validation can take place. See Predictions for more information.
The remaining measurement Qualifiers all comprise references to other tables that contain related or required information.
| Qualifier Key | Usage |
|---|---|
| `probe_flux` | One or more comma-separated Resource References to Probe Flux tables. If multiple references are specified, the cross section from each flux will be weighted by the ratio of the integrals of the referenced flux tables by default. Relative weights for the different components can be forced similarly to the Target specifiers. For example, `flux_numu[1],flux_nue[2]` would produce a combined event rate for the two flux components with the contribution from `flux_nue` scaled up by 2. |
| `errors` | A simple, single Resource Reference to a table describing the correlated errors for this measurement. For more details, see Errors. |
| `smearing` | A simple, single Resource Reference to a table describing the smearing procedure for this measurement. For more details, see Smearing. |
The `<var>:prettyname` Qualifier can be used to improve the labelling of automatically constructed comparison figures. The structure of the value is fairly freeform, but the expected use is for small LaTeX snippets that plotting software would be able to render. Note that the pretty name should generally not include the units, as they can be retrieved from the Independent Variable header on the relevant table. Similarly, the `prettyname` Qualifier can be used to provide a pretty name for the Dependent Variable, with a similar caveat about the units.
Composite measurements are both relatively rare and difficult to solve the 'automation' problem for in general. Instead, we aim to provide some useful coverage for composite measurements that we have encountered, and hopefully leave enough space in the specification for new, powerful composite measurements to fit in the future.
Tables with `variable_type=composite_cross_section_measurement` can specify multiple `selectfunc`, `<var>:projectfunc`, `target`, and `probe_flux` Qualifiers by postfixing the relevant keys with an index, for example, `selectfunc[1]`. These indexes must be sequential; however, `[0]` may be omitted as it is considered implicit in the key `selectfunc`.
The first step of parsing a `composite_cross_section_measurement` is a check that the number of components is valid. Generally, if only a single instance of one of these Qualifiers is given, it is assumed that it can be used for all components of the composite measurement. If multiple instances are given for any Qualifier, then other Qualifiers that also have multiple instances must have the same number. You cannot have a measurement with `selectfunc`, `selectfunc[1]`, `target`, `target[1]`, and `target[2]`.
Signal Selection Function: If a single selection function is given for a measurement, it is assumed that the integer returned corresponds to the 1-indexed component measurement that the event should be sifted into: if the selection function returns a `1`, the event will be included in the first component measurement; if it returns a `2`, it will be included in the second component measurement; if it returns a `0`, the event is considered not-signal for all components. If you need events to be included in multiple components of the simultaneous measurement, then you must provide a `selectfunc[i]` key for each component; these keys can point to the same ProSelecta function. A sketch of the sifting semantics is shown below.
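The sketch uses an illustrative stand-in selection function and event vector; only the return-value semantics mirror the convention.

```python
def selectfunc(ev):
    # stand-in selection: 0 = not signal, i > 0 = 1-indexed component
    return ev % 3

components = {1: [], 2: []}
for ev in range(10):              # stand-in event vector
    r = selectfunc(ev)
    if r > 0:                     # 0 means not-signal for all components
        components[r].append(ev)  # sift into the r-th component measurement
```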
Sub-Measurements: The `sub_measurements` Qualifier should be used to refer to other measurement tables that this composite measurement combines. This is useful for providing uncertainty estimates for ratios or combinations of other published measurements.
For a ratio measurement, we want to define exactly 2 component measurements. The first corresponds to the numerator, and the second to the denominator of the ratio.
| Qualifier Key | Value |
|---|---|
| `variable_type` | `composite_cross_section_measurement` |
| `measurement_type` | `ratio` |
| `selectfunc` | `MINERvA_CCIncRatio_2016_Select_12345` |
| `Q2:projectfunc` | `MINERvA_CCIncRatio_2016_Q2_12345` |
| `target` | `Pb` |
| `target[1]` | `CH` |
| `probe_flux` | `543/MINERVA_Flux_2012:flux_numu` |
| `errors` | `MINERvA_CCIncRatio_2016_covar_12345` |
A relatively recent trend, which we hope to see continue, is the publication of 'joint' measurements that provide correlated error estimates for multiple new, or previous, measurements. To achieve this, we could rebuild the separate datasets into a single new table with an Independent Variable corresponding to the target material and specify the target as `target=C,O`. However, it is better to re-use the existing data where possible and only provide the minimal additional information we need: in this case, a covariance matrix covering both sub-measurements.
| Qualifier Key | Value |
|---|---|
| `variable_type` | `composite_cross_section_measurement` |
| `measurement_type` | `flux_averaged_differential_cross_section` |
| `sub_measurements` | `T2K_CC0Pi_C_2017_12345,T2K_CC0Pi_O_2019_12345` |
| `errors` | `T2K_CC0Pi_JointCO_2019_covar_12345` |
A probe flux Dependent Variable must have two Qualifiers: `probe_particle`, which specifies the probe particle, and `bin_content_type`, which specifies how to interpret the Dependent Variable values. Probe particles can either be specified by a PDG MC particle code or by a human-readable shorthand. The shorthands defined are: `nue[bar]`, `numu[bar]`, `e-`, `e+`, `pi-`, `pi+`, `p`. As the two main neutrino beam groups present flux distributions with different conventions, and correspondingly, the different neutrino MCEGs assume flux histograms using different units conventions, we have decided to require explicitness. The two valid values for the `bin_content_type` Qualifier, shown below, should be used to specify whether the flux table should be considered a standard histogram in the flux of probes (`count`) or a PDF-like object (`count_density`), for which we often use histograms in HEP.
| Value | Comments |
|---|---|
| `count` | The bin value corresponds directly to the flux of probes in the units specified on the Dependent Variable header. This will usually have units like `neutrinos /cm^2 /POT`. |
| `count_density` | The bin value corresponds directly to the flux density of probes in the units specified on the Dependent Variable header. This will usually have units like `neutrinos /cm^2 /50 MeV /POT`. |
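As a sketch of the distinction, and assuming a density declared per unit of the energy axis, a `count_density` table can be converted to `count` form by multiplying by the bin widths; the values here are illustrative.

```python
import numpy as np

edges = np.array([0.0, 0.5, 1.0, 2.0])  # illustrative e_nu bin edges in GeV
density = np.array([1.0, 2.0, 0.5])     # illustrative count_density values
count = density * np.diff(edges)        # count = density * bin width
```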
Covariance and correlation matrices should be provided as tables with two Independent Variables corresponding to the global bin number of the measurement(s) that they cover. It is important to take care when providing covariances for multidimensional measurements that the mapping of, for example, the 3-dimensional bin `(i,j,k)` to a global index `g` is done consistently between the data table and the covariance matrix. These conventions assert that the ordering of bins on the data table should exactly match the ordering of the global index in the corresponding error matrix.
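A sketch of one consistent mapping, assuming row-major (C-order) flattening is used for both the data table and the error matrix; the bin counts are illustrative.

```python
import numpy as np

ni, nj, nk = 2, 3, 4  # illustrative bin counts in each dimension

def global_index(i, j, k):
    # row-major flattening: the last index varies fastest
    return (i * nj + j) * nk + k

assert global_index(1, 2, 3) == np.ravel_multi_index((1, 2, 3), (ni, nj, nk))
```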
The only Qualifier currently specified for error matrices is `error_type`; the possible values are shown below.
| Value | Comments |
|---|---|
| `covariance` | The absolute covariance. The error on the corresponding bin can be found by taking the square root of the value. |
| `inverse_covariance` | A pre-inverted absolute covariance. Sometimes error estimation can produce ill-conditioned or difficult-to-invert matrices, and it can be useful for measurements to supply pre-inverted matrices for use in defining test statistics. |
| `fractional_covariance` | A fractional covariance. The fractional error on the corresponding bin can be found by taking the square root of the value. |
| `correlation` | A correlation matrix. If a correlation matrix is provided, it will usually be converted back to a covariance matrix by assuming the errors provided on the data tables correspond to the standard error on each bin. In this case, the error component `total` will be used. If no `total` error component can be found on the corresponding table, an exception should be thrown. |
| `universes` | A table with one dependent variable per statistical/systematic universe. |
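As a sketch of the correlation-to-covariance conversion described above, assuming the `total` error component holds the per-bin standard errors; all values are illustrative.

```python
import numpy as np

total_err = np.array([0.1, 0.2, 0.15])  # illustrative per-bin 'total' errors
corr = np.eye(3)                        # illustrative correlation matrix
cov = corr * np.outer(total_err, total_err)
```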
While Error tables should use the same format as data tables, the yaml file for the table can instead be included as an additional resource on the record or on the corresponding measurement table. The file size limits differ for tables and additional resources, and covariance matrices of many-binned measurements can be quite large as uncompressed yaml files.
Publishing measurements in smeared (non-true) spaces along with smearing procedures is relatively rare, but is becoming more common thanks to the desirable statistical properties of techniques like Wiener SVD unfolding. The only valid value for the `smearing_type` Qualifier is `smearing_type=smearing_matrix`. The Independent Variables should be defined as in Errors.

If a qualifier keyed `truth_binning` exists and its value is a resource reference to a table with independent variables, the binning scheme defined by those variables will be used to bin the unsmeared prediction prior to smearing to the comparison space. This allows for different true/smeared binning schemes and non-rectangular smearing matrices, where appropriate. A `variable_type=cross_section_prediction` type dependent variable with `pre_smeared=false` can be useful for defining the true binning and facilitating automated testing of the smearing procedure. In the absence of this qualifier, the binning scheme defined on the original `variable_type=[composite_]cross_section_measurement` table will be used, as for measurements without a smearing component. A sketch of applying a smearing matrix is shown below.
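This is a minimal sketch, assuming the smearing matrix maps true-space bins (columns) onto smeared-space bins (rows); the matrix and prediction values are illustrative.

```python
import numpy as np

smearing = np.array([[0.9, 0.1, 0.0],   # illustrative smearing_matrix
                     [0.1, 0.8, 0.1],
                     [0.0, 0.1, 0.9]])
true_pred = np.array([1.0, 2.0, 3.0])   # prediction binned in true space
smeared_pred = smearing @ true_pred     # compare this to the released data
```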
While Smearing tables should use the same format as data tables, the yaml file for the table can instead be included as an additional resource on the record or on the corresponding measurement table. The file size limits differ for tables and additional resources, and smearing matrices for many-binned measurements can be quite large as uncompressed yaml files.
While it is very difficult to fully test something as complicated as a cross section measurement, including analyser-checked predictions, with associated pre-calculated test statistic values, can be very useful for users of the data.
One or more model predictions may be included with a data release. Often it is simplest to include the predictions as secondary dependent variables on the relevant table, to avoid duplicating the definition of the independent variables. The following Qualifiers can be used to decorate a dependent variable as a prediction.
| Qualifier Key | Comments |
|---|---|
| `for_measurement` | If this Qualifier is not present, then the prediction will be assumed to be for the first dependent variable of the `[composite_]cross_section_measurement` type on the same table. |
| `expected_test_statistic` | A precalculated numerical value of the expected test statistic between this prediction and the measurement. Can be used for a limited degree of automatic test statistic validation. |
| `pre_smeared` | Can be `true` or `false`. If the measurement that this prediction is for contains a smearing step, this marks whether the smearing has already been applied to this prediction. It is often a better test to provide an unsmeared prediction, to enable automated testing to also test the smearing procedures. Both `pre_smeared=true` and `pre_smeared=false` predictions can be included for a single measurement. |
| `label` | A free-form text label that can be used to succinctly describe a prediction; it is often useful to include sub-components of a full measurement prediction, and the label can be used to differentiate them. May be included as legend titles in automated comparison plots. |
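A sketch of decorating a prediction dependent variable with these qualifiers, using hepdata_lib as elsewhere in this document; the variable name, values, and label are illustrative.

```python
from hepdata_lib import Variable

Prediction = Variable("cross_section_genie-prediction",
                      is_independent=False, is_binned=False)
Prediction.values = [1.0, 2.0, 3.0]  # illustrative predicted values
Prediction.add_qualifier("variable_type", "cross_section_prediction")
Prediction.add_qualifier("for_measurement", "cross_section")
Prediction.add_qualifier("expected_test_statistic", 12.34)
Prediction.add_qualifier("pre_smeared", "false")
Prediction.add_qualifier("label", "GENIE CC Total")
```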
Analysis Snippets are relatively short C++ source files that contain implementations of event selection and projection operators. These should be written to work with the ProSelecta environment. Generally they are included as a single Additional Resource file per record, which contains all of the functions used by that record.
To avoid problems that would be encountered from sequentially loading multiple analysis snippets containing identically named functions, we provide a naming convention for selection and projection functions that should be followed where possible.
| Function Type | Naming Convention | Example |
|---|---|---|
| Selection | `<Experiment>_<MeasurementSpecifier>_<Year>_Select_<INSPIREHEPId>` | `int MicroBooNE_CC0Pi_2019_Selection_123456(HepMC3 const &)` |
| Projection | `<Experiment>_<MeasurementSpecifier>_<Year>_<ProjectionName>_<INSPIREHEPId>` | `double MicroBooNE_CC0Pi_2019_DeltaPT_123456(HepMC3 const &)` |
This section contains some useful python examples that you should be able to easily adapt to build a compliant HEPData submission for your data release. It will illustrate building a data release for the T2K on/off axis CC0Pi cross section measurement (PRD.108.112009).
We will make extensive use of the useful hepdata_lib python module and recommend you do the same. If you have feature requests for the library, please contact us, the authors, or just submit a clear Issue on GitHub.
We recommend starting your `ToHepData.py` script with imports and at least the INSPIRE id for your paper.
```python
#!/usr/bin/env python3
import os, csv
import yaml, ROOT
import requests
from hepdata_lib import Submission, Table, Variable, Uncertainty, RootFileReader

ref = "PRD.108.112009"
INSPIRE_id = 2646102
```
In many instances you can make the script download and untar an existing version of the data release for you. This is only helpful if the HEPData release you are building is not the first release of this data.
if not os.path.exists("onoffaxis_data_release/analysis_flux.root"):
if not os.path.exists("onoffaxis_data_release.tar.gz"):
# make the initial request
req = requests.get("https://zenodo.org/records/7768255/files/onoffaxis_data_release.tar.gz?download=1", params={"download": 1})
# check the response code
if req.status_code != requests.codes.ok:
raise RuntimeError("Failed to download data release from: https://zenodo.org/records/7768255/files/onoffaxis_data_release.tar.gz?download=1")
# write the response to disk
with open("onoffaxis_data_release.tar.gz", 'wb') as fd:
for chunk in req.iter_content(chunk_size=128):
fd.write(chunk)
# ask a system shell to untar it for you
os.system("mkdir -p onoffaxis_data_release && tar -zxvf onoffaxis_data_release.tar.gz -C onoffaxis_data_release")
```python
# instantiate a submission object
submission = Submission()

# the details of these functions will be detailed below
# -> construct all of the tables constituting the data release
submission.add_table(nd280_analysis())
submission.add_table(ingrid_analysis())

cov, ana = joint_analysis()
submission.add_table(cov)
submission.add_table(ana)

submission.add_table(build_flux_table("ingrid_flux_fine_postfit", "flux-onaxis-postfit-fine"))
submission.add_table(build_flux_table("nd280_flux_fine_postfit", "flux-offaxis-postfit-fine"))

# add this script to the data release so the steps involved in its creation are transparent
submission.add_additional_resource(description="Python conversion script used to build this submission. Part of NUISANCE.",
                                   location="ToHepData.py",
                                   copy_file=True)

# add the ProSelecta snippet file to the data release
submission.add_additional_resource(description="Selection and projection function examples. Can be executed in the ProSelecta environment v1.0.",
                                   location="analysis.cxx",
                                   copy_file=True,
                                   file_type="ProSelecta")

# add some useful links to the data release
submission.add_link(description="official data release", location="https://doi.org/10.5281/zenodo.7768255")
submission.add_link(description="Use with NUISANCE3", location="https://github.com/NUISANCEMC/nuisance3")
submission.add_link(description="Adheres to the NUISANCE HEPData Conventions", location="https://github.com/NUISANCEMC/HEPData/tree/main")

# build the submission files, ready for upload
submission.create_files(f"submission-{INSPIRE_id}", remove_old=True)
```
We're going to look at parsing the file `nd280_analysis_binning.csv` from the data release in question. It starts like this:
```
bin,low_angle,high_angle,low_momentum,high_momentum
1, -1.00, 0.20, 0, 30000
2, 0.20, 0.60, 0, 300
3, 0.20, 0.60, 300, 400
4, 0.20, 0.60, 400, 500
5, 0.20, 0.60, 500, 600
6, 0.20, 0.60, 600, 30000
...
```
We can parse the required bin edge information using the standard library `csv` module:
```python
import csv

cos_thetamu_bins = []
pmu_bins = []

with open("onoffaxis_data_release/nd280_analysis_binning.csv", newline='') as csvfile:
    # by default this class uses the first row as the field names
    csvreader = csv.DictReader(csvfile)
    for row in csvreader:
        cos_thetamu_bins.append((float(row["low_angle"]), float(row["high_angle"])))
        pmu_bins.append((float(row["low_momentum"])/1E3, float(row["high_momentum"])/1E3))
```
Or we can use the ubiquitous `numpy` module:
```python
import numpy as np

nd280_analysis_binning = np.genfromtxt("onoffaxis_data_release/nd280_analysis_binning.csv",
                                       delimiter=",", skip_header=1)

# take the 2nd and 3rd columns as the cos theta bin edges
cos_thetamu_bins = nd280_analysis_binning[:,1:3]
# and the 4th and 5th columns as the momentum bin edges, converted to GeV/c
pmu_bins = nd280_analysis_binning[:,3:5] / 1E3
```
We then use this to read the binning, data values, and errors from the data release into the relevant `hepdata_lib` objects, as below.
```python
def nd280_analysis():
    nd280_analysis_binning = np.genfromtxt("onoffaxis_data_release/nd280_analysis_binning.csv",
                                           delimiter=",", skip_header=1)

    CosThetaVar = Variable("cos_theta_mu", is_independent=True, is_binned=True, units="")
    CosThetaVar.values = nd280_analysis_binning[:,1:3]

    # momentum bin edges are converted from MeV/c to GeV/c below
    PVar = Variable("p_mu", is_independent=True, is_binned=True, units=r"$\mathrm{GeV}/c$")
    PVar.values = nd280_analysis_binning[:,3:5] / 1E3

    xsec_data_mc = np.genfromtxt("onoffaxis_data_release/xsec_data_mc.csv",
                                 delimiter=",", skip_header=1)
    # the first 58 rows are nd280-only
    nd280_data_mc = xsec_data_mc[:58,...]

    CrossSection = Variable("cross_section", is_independent=False, is_binned=False,
                            units=r"$\mathrm{cm}^{2} c/\mathrm{MeV} /\mathrm{Nucleon}$")
    CrossSection.values = nd280_data_mc[:,1]

    # qualify the variable type and measurement type
    CrossSection.add_qualifier("variable_type", "cross_section_measurement")
    CrossSection.add_qualifier("measurement_type", "flux_averaged_differential_cross_section")

    # add the selection and projection ProSelecta function reference qualifiers
    CrossSection.add_qualifier("selectfunc", "analysis.cxx:T2K_CC0Pi_onoffaxis_nu_SelectSignal")
    CrossSection.add_qualifier("cos_theta_mu:projectfunc", "analysis.cxx:T2K_CC0Pi_onoffaxis_nu_Project_CosThetaMu")
    CrossSection.add_qualifier("cos_theta_mu:prettyname", r"$\cos(\theta_{\mu})$")
    CrossSection.add_qualifier("p_mu:projectfunc", "analysis.cxx:T2K_CC0Pi_onoffaxis_nu_Project_PMu")
    CrossSection.add_qualifier("p_mu:prettyname", r"$p_{\mu}$")
    CrossSection.add_qualifier("prettyname", r"$d^2\sigma/dp_{\mu}d\cos(\theta_{\mu})$")

    # add the target specifier and probe_flux reference qualifiers
    CrossSection.add_qualifier("target", "CH")
    CrossSection.add_qualifier("probe_flux", "flux-offaxis-postfit-fine")

    # if the publication includes predictions, it is often useful to also include them
    CrossSectionNEUT = Variable("cross_section_neut-prediction", is_independent=False,
                                is_binned=False, units=r"$\mathrm{cm}^{2} c/\mathrm{MeV} /\mathrm{Nucleon}$")
    CrossSectionNEUT.values = nd280_data_mc[:,2]
    CrossSectionNEUT.add_qualifier("variable_type", "cross_section_prediction")

    cov_matrix = np.genfromtxt("onoffaxis_data_release/cov_matrix.csv",
                               delimiter=",")
    nd280_cov_matrix = cov_matrix[:58,:58]

    TotalUncertainty = Uncertainty("total", is_symmetric=True)
    TotalUncertainty.values = np.sqrt(np.diagonal(nd280_cov_matrix))
    CrossSection.add_uncertainty(TotalUncertainty)

    xsTable = Table("cross_section-offaxis")
    xsTable.description = """Extracted ND280 cross section as a function of muon momentum in angle bins compared to the nominal NEUT MC prediction. Note that the final bin extending to 30 GeV/c has been omitted for clarity."""
    xsTable.location = "FIG. 21. in the publication"

    xsTable.add_variable(PVar)
    xsTable.add_variable(CosThetaVar)
    xsTable.add_variable(CrossSection)
    xsTable.add_variable(CrossSectionNEUT)

    xsTable.add_image("fig21.png")

    xsTable.keywords["observables"] = ["D2SIG/DP/DCOSTHETA"]
    xsTable.keywords["reactions"] = ["NUMU C --> MU- P"]
    xsTable.keywords["phrases"] = ["Neutrino CC0Pi", "Cross Section"]

    return xsTable
```
The `ingrid_analysis` method is very similar. Below is a method that builds a `composite_cross_section_measurement` table providing the covariance matrix that spans both the `nd280_analysis` and `ingrid_analysis` data.
```python
def joint_analysis():
    cov_matrix = np.genfromtxt("onoffaxis_data_release/cov_matrix.csv",
                               delimiter=",")

    # all bin definitions are built as a single array, each bin then has an extent or value in
    # every relevant independent variable, or dimension. Loop i (rows) outer so that the
    # (bin_i, bin_j) pairing matches np.ravel's row-major flattening below.
    allbins = []
    for i in np.arange(cov_matrix.shape[0]):
        for j in np.arange(cov_matrix.shape[0]):
            allbins.append((i, j))
    allbins = np.array(allbins)

    bin_i = Variable("bin_i", is_independent=True, is_binned=False, units="")
    bin_i.values = allbins[:,0]
    bin_j = Variable("bin_j", is_independent=True, is_binned=False, units="")
    bin_j.values = allbins[:,1]

    Covariance = Variable("covariance", is_independent=False, is_binned=False,
                          units=r"$(\mathrm{cm}^{2} c/\mathrm{MeV} /\mathrm{Nucleon})^{2}$")
    # ravel flattens the array to 1D, row-major by default
    Covariance.values = np.ravel(cov_matrix)

    inv_cov_matrix = np.genfromtxt("onoffaxis_data_release/inv_matrix.csv",
                                   delimiter=",")
    Invcovariance = Variable("inverse_covariance", is_independent=False, is_binned=False,
                             units=r"$(\mathrm{cm}^{2} c/\mathrm{MeV} /\mathrm{Nucleon})^{-2}$")
    Invcovariance.values = np.ravel(inv_cov_matrix)

    Covariance.add_qualifier("variable_type", "error_table")
    Covariance.add_qualifier("error_type", "covariance")
    Invcovariance.add_qualifier("variable_type", "error_table")
    Invcovariance.add_qualifier("error_type", "inverse_covariance")

    covmatTable = Table("covariance-onoffaxis")
    covmatTable.description = """This table contains the covariance and pre-inverted covariance for the joint on/off axis analysis. See the covered measurements for the constituent measurements."""
    covmatTable.add_variable(bin_i)
    covmatTable.add_variable(bin_j)
    covmatTable.add_variable(Covariance)
    covmatTable.add_variable(Invcovariance)

    jointTable = Table("cross_section-onoffaxis")
    CrossSection = Variable("cross_section", is_independent=False, is_binned=False)
    CrossSection.add_qualifier("variable_type", "composite_cross_section_measurement")
    CrossSection.add_qualifier("sub_measurements", "cross_section-offaxis,cross_section-onaxis")
    CrossSection.add_qualifier("errors", "covariance-onoffaxis:inverse_covariance")
    jointTable.add_variable(CrossSection)

    return (covmatTable, jointTable)
```
```python
def build_flux_table(hname, tname):
    # instantiate a table object
    FluxTable = Table(tname)

    # for simple root histograms, we can use the hepdata_lib RootFileReader helper object to
    # convert a histogram into a useful format
    # See https://hepdata-lib.readthedocs.io/en/latest/usage.html#reading-from-root-files
    reader = RootFileReader("onoffaxis_data_release/analysis_flux.root")
    flux_histo = reader.read_hist_1d(hname)

    # define the 'x' axis independent variable
    EnuVar = Variable("e_nu", is_independent=True, is_binned=True, units="GeV")
    EnuVar.values = flux_histo["x_edges"]

    # define the 'y' axis dependent variable
    FluxVar = Variable("flux_nu", is_independent=False, is_binned=False, units="$/cm^{2}/50MeV/10^{21}p.o.t$")
    FluxVar.values = flux_histo["y"]

    # add some all-important qualifiers. qualifiers can only be added to dependent variables
    FluxVar.add_qualifier("variable_type", "probe_flux")
    FluxVar.add_qualifier("probe_particle", "numu")
    FluxVar.add_qualifier("bin_content_type", "count_density")

    # add the variables to the table
    FluxTable.add_variable(EnuVar)
    FluxTable.add_variable(FluxVar)

    # return the table for addition to the submission
    return FluxTable
```
This repository includes a C++ model of the HEPData conventions described above. Additionally, a set of factory functions are provided that can automatically download and unpack HEPData records to a local cache for querying. A local copy of a record can be downloaded, parsed, and queried with the `HEPDataRecord` class. An example follows:
#include "nuis/HEPData/TableFactory.hxx"
//..
// will fetch record and unpack into ./database if it doesn't exist
auto rec = nuis::HEPData::make_Record("hepdata:12345", "./database");
for(auto const & xsm : rec.measurements){
// print the location of the yaml tablefile
std::cout << xsm.source.native() << std::endl;
// print the measurement target
// as measurements in general can have multiple groups of composite
// targets, we have to jump through some vector hoops to get the first one
std::cout << xsm.targets.front().front() << std::endl;
// or you can use the get_simple_target() helper function, which will throw
// if there is more than one target specification.
// print the selection and projection function specifications, same hoops apply
std::cout << xsm.selectfuncs.front().source.native() << ": "
<< xsm.selectfuncs.front().fname << std::endl;
// or you can use the get_single_selectfunc() helper
// see nuis/HEPData/CrossSectionMeasurement.h for more details
}
This repository includes the `pyNUISANCEHEPData` python bindings for the `nuis::HEPData::Record` C++ interface described above. A local copy of a record can be downloaded, parsed, and queried with the `Record` class. An example follows:
```python
import os
import pyNUISANCEHEPData as nhd

# local record cache location; note that environment variable names are case-sensitive
local_db = f"{os.environ['HOME']}/.local/nuisancedb"
record_id = "hepdata:12345"

# fetch (if needed), parse, and wrap the record
nhr = nhd.make_Record(record_id, local_db)

print("Cross Section Tables:")
for xsm in nhr.measurements:
    print(f"\tyaml source: {xsm.source}")
    print(f"\tselection function: {xsm.selectfuncs[0].fname}")
    print("\tprojection functions:")
    for i, iv in enumerate(xsm.independent_vars):
        print(f"\t\t{xsm.projectfuncs[0][i].fname}")

print("Additional Resources:")
for ar in nhr.additional_resources:
    print(f"\t{ar}")
```
This repository also contains a CLI tool for querying and populating a local database of HEPData records called `nuis-hepdata`. It is built on the `nuis::HEPDataRecord` tools but offers shell scripting capabilities for record database management. Full documentation can be obtained by running `nuis-hepdata help`, but some example usage is shown below.
This example will use a hepdata-sandbox record, but the examples generalise to public hepdata records identified by their HEPData id or their INSPIREHEP id.
Firstly, we will ensure that we have a local copy of the record of interest by running the `get-local-path` command. In general, all commands will transparently trigger a record fetch from HEPData.net if a local copy doesn't already exist.
```
$ nuis-hepdata --nuisancedb ./database get-local-path hepdata-sandbox:1713531371
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
```
We can see what is in the record directory:
```
$ ls ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1
ToHepData.py                      cross_section-offaxis.yaml       fig21.png                         flux-offaxis-nominal-fine.yaml
flux-onaxis-nominal-coarse.yaml   flux-onaxis-postfit-fine.yaml    thumb_fig22.png                   analysis.cxx
cross_section-onaxis.yaml         fig22.png                        flux-offaxis-postfit-coarse.yaml  flux-onaxis-nominal-fine.yaml
submission.yaml                   covariance-onoffaxis.yaml        cross_section-onoffaxis.yaml      flux-offaxis-nominal-coarse.yaml
flux-offaxis-postfit-fine.yaml    flux-onaxis-postfit-coarse.yaml  thumb_fig21.png
```
If we want to follow the remote request logic we can add a `--debug` option, as below:
```
$ nuis-hepdata --nuisancedb ./database --debug get-local-path hepdata-sandbox:1713531371
[2024-09-23 22:51:42.607] [debug] Checking latest version for unversioned ref=hepdata-sandbox:1713531371
[2024-09-23 22:51:42.608] [debug] GET https://www.hepdata.net/record/sandbox/1713531371
[2024-09-23 22:51:45.541] [debug] http response --> 200
[2024-09-23 22:51:45.544] [debug] resolved reference with concrete version to: hepdata-sandbox:1713531371v1
[2024-09-23 22:51:45.549] [debug] ensure_local_path(ref=hepdata-sandbox:1713531371v1,local_cache_root=./database): expected_location = ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
[2024-09-23 22:51:45.554] [debug] Doesn't exist, downloading...
[2024-09-23 22:51:45.554] [debug] GET https://www.hepdata.net/record/sandbox/1713531371 -> ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.zip
[2024-09-23 22:51:48.647] [debug] http response --> 200
[2024-09-23 22:51:48.647] [debug] unzipping: system(cd ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1 && unzip submission.zip )
Archive:  submission.zip
  inflating: fig22.png
  inflating: thumb_fig21.png
  inflating: flux-offaxis-nominal-fine.yaml
  inflating: ToHepData.py
  inflating: flux-offaxis-nominal-coarse.yaml
  inflating: flux-onaxis-postfit-fine.yaml
  inflating: flux-onaxis-postfit-coarse.yaml
  inflating: cross_section-onaxis.yaml
  inflating: flux-offaxis-postfit-fine.yaml
  inflating: flux-offaxis-postfit-coarse.yaml
  inflating: cross_section-offaxis.yaml
  inflating: fig21.png
  inflating: flux-onaxis-nominal-fine.yaml
  inflating: thumb_fig22.png
  inflating: analysis.cxx
  inflating: flux-onaxis-nominal-coarse.yaml
  inflating: covariance-onoffaxis.yaml
  inflating: submission.yaml
  inflating: cross_section-onoffaxis.yaml
[2024-09-23 22:51:48.687] [debug] resolved to: ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
```
All records include a version qualifier, which is often omitted as it is usually '1'. Record references without the version qualifier will always trigger a remote check to see if a later version of the record is available. This request can be elided by fully qualifying the reference with the version number that you know you have a local copy of. See the difference between the two requests below.
```
$ nuis-hepdata --nuisancedb ./database get-local-path hepdata-sandbox:1713531371 --debug
[2024-09-23 22:52:43.362] [debug] Checking latest version for unversioned ref=hepdata-sandbox:1713531371
[2024-09-23 22:52:43.363] [debug] GET https://www.hepdata.net/record/sandbox/1713531371
[2024-09-23 22:52:45.351] [debug] http response --> 200
[2024-09-23 22:52:45.357] [debug] resolved reference with concrete version to: hepdata-sandbox:1713531371v1
[2024-09-23 22:52:45.362] [debug] ensure_local_path(ref=hepdata-sandbox:1713531371v1,local_cache_root=./database): expected_location = ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml

$ nuis-hepdata --nuisancedb ./database get-local-path hepdata-sandbox:1713531371v1 --debug
[2024-09-23 22:52:11.195] [debug] ensure_local_path(ref=hepdata-sandbox:1713531371v1,local_cache_root=./database): expected_location = ./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/submission.yaml
```
The first, unqualified attempt has to check the record metadata to ensure that the latest version is the one that we have a local copy of; this (sometimes unnecessary) round trip to the server takes ~1 s.
The first bit of information we will usually want from a record is what cross-section measurements are contained within it:
```
$ nuis-hepdata --nuisancedb ./database get-cross-section-measurements hepdata-sandbox:1713531371v1
cross_section-offaxis
cross_section-onaxis
cross_section-onoffaxis
```
This record contains three measurements that we might want to compare to. We probably want to know the independent and dependent variables defined by each measurement, so we can request those. Note that now the measurement table name must be included in the reference:
```
$ nuis-hepdata --nuisancedb ./database get-independent-vars hepdata-sandbox:1713531371v1/cross_section-onaxis
p_mu
cos_theta_mu

$ nuis-hepdata --nuisancedb ./database get-dependent-vars hepdata-sandbox:1713531371v1/cross_section-onaxis
cross_section
cross_section_neut-prediction
```
A lot of useful metadata is stored in the qualifiers of dependent variables; we can also query those:
```
$ nuis-hepdata --nuisancedb ./database get-qualifiers hepdata-sandbox:1713531371v1/cross_section-onaxis:cross_section
cos_theta_mu:projectfunc: analysis.cxx:T2K_CC0Pi_onoffaxis_nu_Project_CosThetaMu
p_mu:projectfunc: analysis.cxx:T2K_CC0Pi_onoffaxis_nu_Project_PMu
p_mu:pretty_name: $p_{\mu}$
probe_flux: flux-onaxis-postfit-fine
selectfunc: analysis.cxx:T2K_CC0Pi_onoffaxis_nu_SelectSignal
target: CH
variable_type: cross_section_measurement
```
If we want to get the value of a qualifier that we know exists, we can do that too:
```
$ nuis-hepdata --nuisancedb ./database get-qualifiers hepdata-sandbox:1713531371v1/cross_section-onaxis:cross_section selectfunc
analysis.cxx:T2K_CC0Pi_onoffaxis_nu_SelectSignal
```
It is often useful to be able to treat the value of a qualifier as a record reference itself and resolve a local path for it, for example when wanting to concretize the probe flux to use in making a measurement prediction. This can be achieved with the `dereference-to-local-path` command, as demonstrated below:
```
$ nuis-hepdata --nuisancedb ./database dereference-to-local-path hepdata-sandbox:1713531371v1/cross_section-onaxis:cross_section probe_flux
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/flux-onaxis-postfit-fine.yaml
```
Some qualifiers can contain a comma-separated list of references; these will each separately be resolved to a local path.
```
$ nuis-hepdata --nuisancedb ./database dereference-to-local-path hepdata-sandbox:1713531371v1/cross_section-onoffaxis sub_measurements
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/cross_section-offaxis.yaml
./database/hepdata-sandbox/1713531371/HEPData-1713531371-v1/cross_section-onaxis.yaml
```
Weep profusely. To Write...
In the meantime, reach out to nuisance-owner@projects.hepforge.org or ask on nuisance-xsec.slack.com.