Ostap offers a set of utilities to reweight distributions. A typical use case is:
- one has a set of data distributions
- the simulation does not describe these distributions well, and one needs to reweight the simulation to describe all of them
The procedure is relatively easy in Ostap, although it requires some code writing.
First, one needs to specify the data distributions. They can be given as 1D, 2D or 3D histograms, or as 1-, 2- or 3-argument functions, and these techniques can even be mixed. It is important that the data distributions are strictly positive over the corresponding range of variables; e.g. in the case of histograms there must be no empty or negative bin content.
hdata_x = ROOT.TH1D ( ... ) ## e.g. use the histogram
hdata_x = lambda x : math.exp ( -x/10 ) ## or use a function
...
hdata_y = ...
Second, for each data distribution one needs to prebook a template histogram that will be filled from the simulated sample. The template histogram must have the same dimensionality (1, 2 or 3) as the corresponding data distribution. If the data distribution is specified as a histogram, the edges of the prebooked template histogram should correspond to the edges of the data distribution, but there is no requirement on the binning: it can be arbitrary, provided there are no empty bins.
hmc_x = ROOT.TH1D ( ... )
hmc_y = ....
Third, one needs to create empty database where the iterative weights are stored:
import Ostap.ZipShelve as DBASE
dbname = 'weights.db'
with DBASE.open( dbname ,'c') as db :
    pass
Since reweighting is an essentially iterative procedure, one needs to define some maximal number of iterations:
iter_max = 10
for iter in range ( iter_max ) :
    ...
For each iteration one needs to create a weighting object that reads the current weights from the database weights.db:
from Ostap.Reweighting import Weight
weightings = [
    ## accessor function                address in database
    Weight.Var ( lambda s : s.x , 'x-reweight' ) ,
    ...
]
weighter = Weight ( dbname , weightings )
{% discussion "What is it?" %}
The accessor function is used to get the variable from the simulated sample. E.g. in this form TTree, TChain, RooDataSet or RooArgSet can be used as a source of simulated data, but it could also be e.g. some table, a numpy array or any other storage; in that case the accessor function needs to be modified accordingly. The second parameter specifies the location in the (newly created, empty) database from which the current weights are to be taken. Since the newly created database is empty, for the first iteration all weights are trivial and equal to 1:
mc_tree = ...
for i in range ( 100 ) :
    mc_tree.GetEntry ( i )
    print ' weight for event %d is %s' % ( i , weighter ( mc_tree ) )
{% enddiscussion %}
As the next step one needs to prepare the simulated dataset, RooDataSet, that
- contains all variables needed for reweighting
- contains the current values of the weights, provided by the weighter object above
There are many ways to achieve this. E.g. one can use the SelectorWithVars utility to decode the data from the input TTree/TChain into a RooDataSet:
from Ostap.Selectors import SelectorWithVars, Variable
## variables to be used in MC-dataset
variables = [
    Variable ( 'x'      , 'x-var'  , 0 , 20 , lambda s : s.x ) ,
    ...
    Variable ( 'weight' , 'weight' , accessor = weighter )
]
## create new "weighted" MC dataset
selector = SelectorWithVars (
    variables ,
    '0<x && x<20 && 0<y && y<20'
)
## process
mc_tree.process ( selector )
mcds = selector.data ## newly created simulated dataset
print mcds
At the next step we calculate the updated weights and store them in the database:
from Ostap.Reweighting import makeWeights, WeightingPlot
plots = [
    ## what    how        where          data      simulated-template
    WeightingPlot ( 'x' , 'weight' , 'x-reweight' , hdata_x , hmc_x ) ,
    ...
]
## calculate updated weights and store them in database
more = makeWeights ( mcds , plots , dbname , delta = 0.01 ) ## <-- HERE
The WeightingPlot object defines the rule to fill the simulated histogram from the simulated dataset and associates the filled simulated histogram with the data distribution. The actual correction to the weights is calculated according to the rule w = dd / mcd, where dd is the density of the data distribution and mcd is the density of the simulated distribution. The weights w are calculated for each entry in the array plots; they are properly normalized and stored in the database dbname, to be used for the next iteration. The makeWeights function also prints the statistics of the normalized weights:
# Ostap.Reweighting INFO Reweighting: ``x-reweight'': mean/(min,max): (1.00+-0.00)/(0.985,1.012) RMS:(0.74+-0.00)[%]
The last entries in this row summarize the statistics of the corrections to the current weights. In this example the mean correction is 1.00, the minimal correction is 0.985, the maximal correction is 1.012, and the RMS of the corrections is 0.74%. One sees that for this particular iteration the corrections are rather small, and probably one can stop iterating. Two parameters of the makeWeights function, delta and minmax, allow to automatize this decision: if the calculated RMS of all corrections is less than the specified delta parameter, and for each correction the min/max-difference does not exceed the specified minmax parameter (the default value is 0.05), the function returns False (meaning no more iterations are needed); otherwise it returns True. Using this hint one can stop the iterations or go further:
if not more and iter > 2 :
    print 'No more iterations are needed!'
    break
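The density-ratio correction and the delta/minmax stopping rule described above can be sketched in plain Python. This is a toy illustration with hypothetical helper names, not the actual Ostap implementation; equal-width bins are assumed, so bin contents are proportional to densities:

```python
import math

def corrections ( hdata , hmc ) :
    """Per-bin correction w = dd/mcd for equal-width bins:
    the ratio of normalised data and simulated bin contents."""
    dd  = [ c / float ( sum ( hdata ) ) for c in hdata ]  ## data density
    mcd = [ c / float ( sum ( hmc   ) ) for c in hmc   ]  ## simulated density
    return [ d / m for d , m in zip ( dd , mcd ) ]

def need_more ( corrs , delta = 0.01 , minmax = 0.05 ) :
    """True if another iteration is needed: the RMS of the corrections
    exceeds delta, or their min/max spread exceeds minmax."""
    mean = sum ( corrs ) / len ( corrs )
    rms  = math.sqrt ( sum ( ( c - mean )**2 for c in corrs ) / len ( corrs ) )
    return rms / mean > delta or max ( corrs ) - min ( corrs ) > minmax

## simulation slightly off from data: corrections close to, but not at, unity
w = corrections ( [ 102. , 98. ] , [ 100. , 100. ] )
print ( w )                ## [1.02, 0.98]
print ( need_more ( w ) )  ## True : the RMS (2%) is still above delta (1%)
```

The toy reproduces the behaviour described above: once all corrections are close enough to unity, `need_more` returns False and the iterations can stop.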
In practice it is useful (and advisable) to compare the data and simulated distributions at each iteration, to have better control over the iteration process. One can make this comparison in a zillion ways, but for the most important practical case, where the data distribution is specified as a histogram, there are some predefined utilities:
## prepare simulated distribution with current weights:
mcds.project ( hmc_x , 'x' , 'weight' )
## compare the basic properties: mean, rms, skewness and kurtosis
hdata_x.cmp_prnt ( hmc_x , 'DATA' , 'MC' , 'DATA(x) vs MC(x)' )
## calculate the ``distance``:
dist = hdata_x.cmp_dist ( hmc_x , density = True )
print "DATA(x)-MC(x) ``distance'' %s" % dist
## calculate the 'orthogonality'
cost = hdata_x.cmp_cos ( hmc_x , density = True )
print "DATA(x)-MC(x) ``orthogonality'' %s" % cost
## find the points of the maximal difference
mn,mx = hdata_x.cmp_minmax ( hmc_x , diff = lambda a,b : a/b , density = True )
print "DATA*(x)/MC(x) ``min/max-distance''[%%] (%s)/(%s) at x=%.1f/%.1f" % (
(100*mn[1]-100) , (100*mx[1]-100) , mn[0] , mx[0] )
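As a rough illustration of what such comparisons can compute, a "distance" and an "orthogonality" between two binned densities can be sketched as follows. The definitions below are my assumptions for the sake of the example, not the actual Ostap implementations of cmp_dist and cmp_cos:

```python
import math

def densities ( h1 , h2 ) :
    """Normalise two lists of (equal-width) bin contents to unit sum."""
    d1 = [ c / float ( sum ( h1 ) ) for c in h1 ]
    d2 = [ c / float ( sum ( h2 ) ) for c in h2 ]
    return d1 , d2

def dist ( h1 , h2 ) :
    """Euclidean 'distance' between the two densities (assumed definition):
    zero for identical shapes."""
    d1 , d2 = densities ( h1 , h2 )
    return math.sqrt ( sum ( ( a - b )**2 for a , b in zip ( d1 , d2 ) ) )

def cos_angle ( h1 , h2 ) :
    """Cosine between the two densities (assumed definition):
    close to 1 for identical shapes, 0 for 'orthogonal' ones."""
    d1 , d2 = densities ( h1 , h2 )
    dot = sum ( a * b for a , b in zip ( d1 , d2 ) )
    n1  = math.sqrt ( sum ( a * a for a in d1 ) )
    n2  = math.sqrt ( sum ( b * b for b in d2 ) )
    return dot / ( n1 * n2 )

print ( dist      ( [ 1. , 2. , 3. ] , [ 1. , 2. , 3. ] ) )  ## 0.0
print ( cos_angle ( [ 1. , 2. , 3. ] , [ 1. , 2. , 3. ] ) )  ## close to 1
```

Either quantity decreasing towards its ideal value (0 for the distance, 1 for the cosine) over the iterations is a sign that the reweighting is converging.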
Note that, due to the explicit specification of the accessor function, the reweighter can be customised to work with any type of input events/records. E.g. assume that the event is a plain array and the x-variable corresponds to index 0:
from Ostap.Reweighting import Weight
weightings = [
    ## accessor function                address in database
    Weight.Var ( lambda s : s[0] , 'x-reweight' ) ,
    ...
]
weighter = Weight ( dbname , weightings )
events = ...
for event in events :
    print ' weight for event %s is %s' % ( event , weighter ( event ) )
{% discussion "Abstract reweighting" %}
Due to the freedom in choosing the accessor function, one can apply the reweighting procedure to absolutely abstract samples. E.g. consider the following case:
- data distribution: a simple function
- simulated sample: a random number generator
As a result of the reweighting procedure, we will get a reweighted simulated sample: just a random number generator that produces the weighted distribution according to the specified function. For this case the code is very transparent and compact:
# =============================================================================
## 1) ``DATA distribution'' - plain function
def data ( x ) :
    return 0.5 + math.sin ( x * math.pi )**2
# =============================================================================
## 2) ``simulation template'' - histogram template for simulated sample
mc_hist = ROOT.TH1F ( 'hMC', '', 20 , 0 , 1 )
# =============================================================================
def mc_sample () :
    x = random.expovariate ( 1 )
    while x > 1 : x -= 1
    return x
# =============================================================================
## 3) create empty database with initial weights
# =============================================================================
import Ostap.ZipShelve as DBASE
dbname = 'weights.db'
if os.path.exists ( dbname ) : os.remove ( dbname )
with DBASE.open ( dbname , 'c' ) as db :
    pass
And then one starts iterations:
# ============================================================================
## 4) prepare reweigthing iterations
# ===========================================================================
from Ostap.Reweighting import Weight, makeWeights, WeightingPlot
from Ostap.Selectors import SelectorWithVars, Variable
for iter in range ( 100 ) :
    ##                                 accessor            address in DB
    weighter = Weight ( dbname , ( Weight.Var ( lambda x : x , 'x-reweight' ) , ) )
    ## create ``weighted'' simulated dataset using the current weights
    selector = SelectorWithVars (
        selection = '1<2' , ## fake one :-( to be removed soon
        silence   = True  ,
        variables = [ Variable ( 'x'      , 'x-var'  , 0 , 1 , lambda x : x ) ,
                      Variable ( 'weight' , 'weight' , accessor = weighter  ) ] )
    for i in range ( 1000000 ) :
        x = mc_sample ()
        selector ( x )
    mcds = selector.data
    ## update weights: the rule to create the weighted simulated histogram
    plots = [ WeightingPlot ( 'x' , 'weight' , 'x-reweight' , data , mc_hist ) ]
    ## calculate the updated weights and add them into the database
    more = makeWeights ( mcds , plots , dbname , delta = 0.01 )
    if not more and 2 <= iter :
        logger.info ( 'No more iterations are needed #%d' % iter )
        break
The full example of abstract reweighting is accessible here.
The density distributions of the simulated sample before the first (blue open squares) and after the last (filled red points) iterations are shown here, while the comparison of the initial data distribution (red line) and the reweighted simulated sample (green filled diamonds) is shown here.
{% enddiscussion %}
{% discussion "Why one needs iterations?" %} One can argue that low-dimensional reweighting can be done without iterations, in one go. Why does one need iterations here?
The answer is rather simple: yes, for a very simple case, like 1D reweighting, already the first iteration should provide the exact result. However, this is true only if the data distribution is supplied as a histogram and the template for the simulated histogram has the same binning; a different binning scheme results in a non-exact result for one-step reweighting.
For multidimensional reweighting one can avoid iterations only if all involved variables are totally uncorrelated; otherwise the iterative procedure is unavoidable.
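The need for iterations in the correlated case can be demonstrated with a tiny toy example (plain Python, not Ostap code): reweighting the two 1D marginals of a correlated 2x2 "simulated" distribution in turn, each correction disturbs the other marginal, so several passes are required before both marginals match the data:

```python
def marginals ( joint ) :
    """x- and y-marginals of a 2D probability table."""
    mx = [ sum ( row ) for row in joint ]
    my = [ sum ( col ) for col in zip ( *joint ) ]
    return mx , my

def reweight_1d ( joint , data_x , data_y ) :
    """One iteration: correct the x-marginal, then the y-marginal."""
    mx , _ = marginals ( joint )
    joint  = [ [ p * data_x [ i ] / mx [ i ] for p in row ]
               for i , row in enumerate ( joint ) ]
    _ , my = marginals ( joint )
    joint  = [ [ p * data_y [ j ] / my [ j ] for j , p in enumerate ( row ) ]
               for row in joint ]
    return joint

mc     = [ [ 0.4 , 0.1 ] ,
           [ 0.1 , 0.4 ] ]   ## correlated "simulated" sample
data_x = [ 0.7 , 0.3 ]       ## target x-marginal
data_y = [ 0.5 , 0.5 ]       ## target y-marginal

for iteration in range ( 50 ) :
    mx , my = marginals ( mc )
    if max ( abs ( a - b ) for a , b in
             zip ( mx + my , data_x + data_y ) ) < 1e-6 : break
    mc = reweight_1d ( mc , data_x , data_y )

print ( 'converged after %d iterations' % iteration )
```

Had the simulated joint distribution been factorizable, a single pass would have sufficed; the correlation is exactly what makes each 1D correction spoil the other marginal.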
Moreover, in the presence of correlations an oscillation effect can occur that prevents quick convergence of the iterative procedure. To solve this problem, the makeWeights function for the multidimensional case actually under-corrects the results. This increases the number of necessary iterations and makes the reweighting procedure slower, but it practically eliminates the oscillation effect.
{% enddiscussion %}
The example of simple 1D reweighting can be inspected here, while the reweighting result for the last iteration (blue open squares) is compared with the data distribution (red filled circles) here:
The example also illustrates how to use the various histogram comparison functions to have better control over the iterative process.
The example of advanced 2D reweighting can be inspected here.
In this example we have three data distributions for two variables:
1. a one-dimensional x-distribution with fine binning
1. a one-dimensional y-distribution with fine binning
1. a two-dimensional y:x-distribution with coarse binning
This reflects the relatively frequent case of kinematic reweighting using the transverse momentum and rapidity: typically one has enough events to make fine-binned one-dimensional reference distributions, but two-dimensional distributions can be obtained only with a relatively coarse binning scheme.
The simulated sample is a simple 2D-uniform distribution. Note that the data distributions are non-factorizable, so simple 1D reweightings are not enough here. In this example, for the first five iterations only the 2D reweighting y:x is applied, and then the two 1D reweightings x and y are added.
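A sketch of how such an iteration schedule could be configured inside the iteration loop. The histogram names and the 'y:x' two-dimensional expression are my assumptions (following ROOT drawing conventions and the WeightingPlot usage shown earlier), not code from the actual example:

```python
## hypothetical sketch: 2D plot only for the first five iterations,
## then the fine-binned 1D plots are added
plots = [ WeightingPlot ( 'y:x' , 'weight' , 'xy-reweight' , hdata_xy , hmc_xy ) ]
if 5 <= iter :
    plots += [ WeightingPlot ( 'x' , 'weight' , 'x-reweight' , hdata_x , hmc_x ) ,
               WeightingPlot ( 'y' , 'weight' , 'y-reweight' , hdata_y , hmc_y ) ]
more = makeWeights ( mcds , plots , dbname , delta = 0.01 )
```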
After the reweighting the simulated distributions are