The thesis draws its main inspiration from research performed by the ALICE Collaboration ("Particle identification in ALICE: a Bayesian approach", "Bayesian Approach for Combined Particle Identification in Alice Experiment at LHC").
- Abstract
- Introduction: Standard Model, flavour physics, ultra-rare decays, CP violation
- Belle II
- Experiment: About the experiment itself
- Detector system
- Interaction with matter: particle identification -- from the initial event to a final PID
- Statistics for particle analysis
- Classification functions: FPR, TPR, PPV, ...
- Receiver operating characteristic
- Identification efficiencies: confusion matrix
- Likelihood: likelihood ratio, Neyman-Pearson
- Neural network
- Application
- Data sample
- Particle identification variables: inventory
- Legacy PID
- Global PID
- Goodness of the global PID: Default approach for upcoming releases, Neyman-Pearson
- Bayesian Approach
- Simple Bayes
- Univariate Bayes
- Multivariate Bayes
- Comparison
- Summary and outlook
- Neural network
- Design: dense layers, drop-out, classification, activation, optimizers
- Performance
- Comparison
- Summary and outlook
- Conclusion
- Computer code
- Calculate a sensible ROC figure
- Calculate and visualize the epsilon_PID matrix
- Visualize precision and recall
- Evaluate various "pid*" variables and reconstruct, e.g., the 'pionID' (see the basf2 variables.py, section 'PID', e.g. the variable 'pidLogLikelihoodValueExpert')
- Construct a naive alternative to the particle ID variable using Bayes and, e.g., a flat prior (see the sketch after this list)
- Compare the naive approach with the particle ID variable value
- Allow for variable-dependent priors in the Bayesian approach, e.g. make the priors depend on the transverse momentum
- Compare the variable-dependent-priors Bayesian approach with the simple Bayesian approach and the particle ID one
- Allow for multivariate a priori probabilities on two variables
- Compare the multivariate-dependent priors
- Introduce an alternative selection method, i.e. maximum probability
- Use different sets of data to diversify results
- Add a simple machine-learning approach using the likelihoods as input and the prediction as output
- Design a functional neural network classifying particles
- Utilize automatically selected ROOT variables as input, e.g. depending on their contribution to a PCA
- Use a multitude of variables and let the network decide
- Visually compare the approaches
- Sample the data in a way which makes all particles equally likely
- Pre-train the neural network on, e.g., data sets containing all particles in the same abundance
- Tweak the pre-trained model for decay-specific particle abundances
- Plot efficiencies over background ratio for PID, Bayes, etc.
- Visualize the difference in assumed particle abundances for the various approaches
- Compare abundances differentiating between correctly and faultily reconstructed particles, e.g. in a stacked histogram
- Efficiency and contamination over transverse momentum plot
- Output efficiency tables for various approaches, including e.g. the Area Under Curve for PPV and ROC
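- Minimal sketch of the flat-prior Bayes construction from the list above, assuming a DataFrame with one log-likelihood column per hypothesis (the `logLikelihood_*` names are placeholders for the mangled `pidLogLikelihoodValueExpert` columns):
import numpy as np
import pandas as pd

particles = ['K+', 'pi+', 'e+', 'mu+', 'p+', 'deuteron']
logl_columns = ['logLikelihood_' + p for p in particles]  # hypothetical names

def bayes_probabilities(df, priors=None):
    """Posterior P(h|x) = prior_h * L_h / sum_k(prior_k * L_k) per track."""
    if priors is None:
        priors = np.full(len(particles), 1. / len(particles))  # flat prior
    logl = df[logl_columns].values
    # Subtract the row-wise maximum before exponentiating for numerical stability
    weighted = np.exp(logl - logl.max(axis=1, keepdims=True)) * priors
    return pd.DataFrame(weighted / weighted.sum(axis=1, keepdims=True),
                        columns=['bayes_' + p for p in particles], index=df.index)
With a flat prior the posterior reduces to the plain likelihood ratio, which is why `flat_bayes` should agree with the ROOT PID variable up to numerical precision (cf. the check further down).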
- Confusion matrices of the identification process
- `pidProbability` is possibly not set for certain detectors above a specific threshold and therefore skews the result
- Kink in proton ID could be caused by the addition of further detectors
- Sample `pidProbability` by `pt` and `Theta` bins and then perform the test based on the Neyman-Pearson lemma (see the sketch after this list)
- Visualize a plot for each hold
- Use a density-based approach for visualization of both
- Idea to make manifest that we are using bins: xerr bars in the Neyman-Pearson plot
- Look up where the detector is and whether it explains the effect seen in the multi-axis histogram → CDC and TOP are reached at `0.28 < pt < 0.52`; error in the CDC detector (per Thomas); end of CDC coverage at `cosTheta = -0.5` (my objection)
- Retrieve the proton's point of origin and apply strict cuts (mcTruth of vertex) → There are no proton tracks with `isPrimarySignal`
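- Sketch of the binned Neyman-Pearson check from the list above, with hypothetical `pionID` / `is_pion` column names: if the likelihood ratio orders tracks optimally, the true-pion fraction must rise monotonically with the ratio bin:
import numpy as np
import pandas as pd

def neyman_pearson_points(df, lr_column='pionID', truth_column='is_pion', ncuts=40):
    """True-pion fraction and population per likelihood-ratio bin."""
    bins = pd.cut(df[lr_column], np.linspace(0., 1., ncuts + 1))
    grouped = df.groupby(bins)[truth_column]
    return grouped.mean(), grouped.size()

# Repeating this per (pt, cosTheta) bin, e.g. via pd.cut on both columns and a
# further groupby, yields the per-bin curves mentioned in the list above.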
- Share confusion matrix code
- Investigate alternative detector calibrations
- Regenerate charged MC9 data using an altered detector configuration
- Rerun the analysis on the generated data
- Validate the new graphs by rerunning the analysis on generated data with the old calibration
- Add shell completion using bash utilities
- Add the same completion for bash
- Box-plot outliers
- Strip outliers and plot rates
- Investigate why epsilon_PID does not sum to one → Cut is not exclusive (see the sketch after this list)
- Select by theta
- Why does the epsilon_PID contain values for the deuterons?
- Try out seaborn for plotting with custom themes to improve style
- Make `stats` honor particles_of_interest
- Fancier argparse list handling
- Do not choose arbitrary cuts for the diff of epsilon_PIDs but instead, e.g., try to have both categorize the same number of values as true → Better: maximum probability cut
- Rename chunked to univariate or multivariate
- Plot particle abundance as histogram
- Investigate the possibility of results being skewed by NaN values in priors → Presumably they are okay; if the value is NaN then there is no data available to compare the priors to
- Fix up & improve Bayes
- Find alternatives to and strip data selection by whiskers → Disable whiskers by default and drop unphysical values prior to processing anything
- Look into stats ratios, as they are not in compliance with the given rates
- Is the PID variable simply defined for more rows, i.e. is the denominator including rows with NaN values? → Yes
- Check that flat_bayes is actually identical to the ROOT variable → Both are more or less equal; considering the numerical precision the result is probably fine
- Investigate the treatment of particle and anti-particle pairs in `pidProbabilityExpert`
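- Sketch of an epsilon_PID (confusion) matrix computation for the items above; `pdg_codes` and `id_columns` are hypothetical mappings from particle name to PDG code and to the corresponding ID column. It also makes explicit why the rows need not sum to one under a non-exclusive cut:
import pandas as pd

def epsilon_pid_matrix(df, pdg_codes, id_columns, cut=0.2):
    """epsilon_PID[i, j]: fraction of true particles i identified as j."""
    particles = list(pdg_codes)
    matrix = pd.DataFrame(index=particles, columns=particles, dtype=float)
    for i in particles:
        true_i = df[df['mcPDG'].abs() == pdg_codes[i]]
        for j in particles:
            # A track may pass the cut for several hypotheses at once (or none),
            # so the rows only sum to one for an exclusive selection such as a
            # maximum-probability cut.
            matrix.loc[i, j] = (true_i[id_columns[j]] > cut).mean()
    return matrix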
- Keep the window size for stat plots fixed between 0 and 1 for the x- as well as the y-axis
- Share the colorbar between plots in diff_epsilon instead of just reusing the last
- Diff stats only for the range between minimal and maximal values and no interpolation beyond the data
- Fix path joining by not relying on '/' being the default delimiter
- Remove title from figure when saving
- Plot Neyman-Pearson for different cutting columns, e.g. the Bayesian columns; bear in mind that it is not straightforward to compare those graphs
- Plot Neyman-Pearson for each particle and anti-particle separately
- Use pandas' query attribute instead of manual selection → Utilized where appropriate
- Put the font of the epsilon_PID cell labels on a grey scale depending on their background
- Cut data to ignore `mcPDG = 0` (removing misidentified particles is not our business), `z0 < 5cm`, `d0 < 2cm` and `pt < 50 MeV/c`
- Plot number of hits for `mcPDG = 0` for various detectors (ask Jakob) as motivation for excluding those tracks
- Make `bar_particles` configurable but keep the non-bar part of it even when 'bared'
- Calculate the overall accuracy of an approach, especially useful for comparisons with a neural network (toy sketch after this list)
- Save the calculated columns from event_metrics to a pickle file & check for column existence prior to calculating the values anew (configurable via '--force')
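- Toy sketch of the overall-accuracy computation via an exclusive maximum-probability selection (hypothetical column names and toy data):
import pandas as pd

df = pd.DataFrame({'bayes_pi+': [0.7, 0.2, 0.6],
                   'bayes_K+':  [0.3, 0.8, 0.4],
                   'true_label': ['bayes_pi+', 'bayes_pi+', 'bayes_pi+']})
# Exclusive selection: pick the hypothesis column with the maximum probability
predicted = df[['bayes_pi+', 'bayes_K+']].idxmax(axis=1)
accuracy = (predicted == df['true_label']).mean()  # 2/3 for this toy sample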
- Open question: utilize pandas' multi-indexing instead of nesting the DataFrames into a dictionary (toy illustration below)
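- Toy illustration of the multi-indexing idea:
import pandas as pd

df_pi = pd.DataFrame({'pt': [0.3, 1.2]})
df_K = pd.DataFrame({'pt': [0.8]})
# One multi-indexed frame instead of {'pi+': df_pi, 'K+': df_K, ...}
combined = pd.concat({'pi+': df_pi, 'K+': df_K}, names=['particle'])
kaons = combined.xs('K+', level='particle')  # select one particle's slice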
- Start by inputting logLikelihoods and maybe add a sum function as a node
- Allow adapting the `epochs`, `training_fraction`, etc. variables
- Include the NN model in the ParticleFrame and save it into the pickle → Bad idea, as it will unnecessarily clutter the file with unrelated data
- Add model-saving checkpoints as a callback (see the sketch after this list)
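- Minimal sketch of the dense/drop-out classifier design (presumably Keras, given the optimizer names and the .h5 checkpoints in these notes; layer widths and drop-out rates are placeholders):
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Dropout
from keras.models import Sequential

n_inputs = 42   # e.g. number of logLikelihood input columns (placeholder)
n_classes = 6   # one output per stable particle hypothesis

model = Sequential([
    Dense(128, activation='relu', input_shape=(n_inputs,)),
    Dropout(0.25),  # drop-out against over-fitting
    Dense(128, activation='relu'),
    Dropout(0.25),
    Dense(n_classes, activation='softmax'),  # classification layer
])
model.compile(optimizer='adamax', loss='categorical_crossentropy',
              metrics=['accuracy'])
# Model-saving checkpoint as a callback, cf. the item above
checkpoint = ModelCheckpoint('model_checkpoint.h5', save_best_only=True)
# model.fit(x_train, y_train, epochs=25, batch_size=256,
#           validation_split=0.2, callbacks=[checkpoint])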
- Selecting pidProbabilities for the `SVD` detector
d = 'svd'
c = {k: 'pidProbabilityExpert__bo' + basf2_Code(k) + '__cm__sp' + d + '__bc' for k in particles}
title_suffixes += [' via SVD PID']
- Library
def __init__(self, particles=None, particle_base_formats=None, detectors=None, pseudo_detectors=None, detector_weights=None, physical_boundaries=None):
    """Base definitions of stable particles and detector components to be included in calculations."""
    self.particles = ['K+', 'pi+', 'e+', 'mu+', 'p+', 'deuteron'] if particles is None else particles
    # Display formats per particle; the default mapping is elided here
    self.particle_base_formats = {} if particle_base_formats is None else particle_base_formats
    self.detectors = ['pxd', 'svd', 'cdc', 'top', 'arich', 'ecl', 'klm'] if detectors is None else detectors
    self.pseudo_detectors = ['all', 'default'] if pseudo_detectors is None else pseudo_detectors
    # Use the detector weights to exclude certain detectors, e.g. for debugging purposes
    # Bear in mind that if all likelihoods are calculated correctly this should never improve the result
    self.detector_weights = {d: 1. for d in ['pxd', 'svd', 'cdc', 'top', 'arich', 'ecl', 'klm', 'all', 'default']} if detector_weights is None else detector_weights
    # Dictionary of variables and the boundaries of the physically possible values they may yield
    self.physical_boundaries = {'pt': (0, 5.5), 'cosTheta': (-1, 1)} if physical_boundaries is None else physical_boundaries
- Search for differences in pidProbabilities for particles and anti-particles (bear in mind `np.nan == np.nan` is `False`) → There are none!
for p, anti_p in zip(particles, particles_bar):
    column = 'pidProbabilityExpert__bo' + lib.basf2_Code(p) + '__cm__sp' + d + '__bc'
    anti_column = 'pidProbabilityExpert__bo' + lib.basf2_Code(anti_p) + '__cm__sp' + d + '__bc'
    # Print every row where particle and anti-particle probabilities disagree
    differing = pf[p].dropna()
    print(differing[differing[column] != differing[anti_column]][[column, anti_column]])
- Motivation for excluding `mcPDG = 0`
import itertools
import os
import re

import matplotlib.pyplot as plt

import lib  # analysis library providing the ParticleFrame class

nbins = 10
ParticleFrame = lib.ParticleFrame
# NOTE: This will taint the class
ParticleFrame.physical_boundaries = {'0.05 < pt < 5.29', 'z0 < 5', 'd0 < 2'}
data = ParticleFrame()
data.read_root('./charged.root/task01')
for p, detector_hits in itertools.product(ParticleFrame.particles, ['nCDCHits', 'nPXDHits', 'nSVDHits', 'nVXDHits']):
    plt.figure()
    plt.hist(data[p].query('mcPDG == 0')[detector_hits], nbins, density=True, histtype='step', label='False (# %d)' % (data[p].query('mcPDG == 0')[detector_hits].shape[0]))
    plt.hist(data[p].query('mcPDG != 0')[detector_hits], nbins, density=True, histtype='step', label='True (# %d)' % (data[p].query('mcPDG != 0')[detector_hits].shape[0]))
    plt.legend()
    title = detector_hits + ' for ' + ParticleFrame.particle_base_formats[p]
    sanitized_title = re.sub('[\\\\$_^{}]', '', 'General Purpose Statistics: ' + title)
    plt.savefig(os.path.join('./doc/res/charged 01/', sanitized_title + '.' + 'pdf'), bbox_inches='tight', format='pdf')
    plt.title(title)
    plt.show(block=False)
plt.show()
- Receiver Operating Characteristic curve
import os

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0., 1., 1000)
title = 'Sample Receiver Operating Characteristic (ROC) curve'
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x, x, linestyle='--', label='Guessing')
plt.plot(x, x**4, label='Worse')
plt.plot(x, x**(1/4), label='Better')
plt.legend()
plt.savefig(os.path.join('./doc/res/', title + '.' + 'pdf'), bbox_inches='tight', format='pdf')
plt.title(title)
plt.show()
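The corresponding areas under these sample curves (cf. the AUC tables item above) can be checked numerically with the trapezoidal rule; the exact values follow from $\int_0^1 x^n \mathrm{d}x = 1/(n+1)$:
auc_guess = np.trapz(x, x)           # diagonal: 0.5
auc_worse = np.trapz(x**4, x)        # 1/5 = 0.2
auc_better = np.trapz(x**(1/4), x)   # 4/5 = 0.8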
- Neyman-Pearson visualization
import os

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.05, 0.95, 10)
y = np.exp(x)
title = 'Neyman-Pearson Visualization'
plt.xlabel(r'$\mathcal{LR}(\pi)$')
plt.ylabel(r'$\frac{\#\pi}{\sum_{a}{\# A}}$')
plt.errorbar(x, y/y.max(), xerr=0.05, capsize=0., elinewidth=1, marker='o', markersize=4, markeredgewidth=1, markerfacecolor='None', linestyle='--', linewidth=0.1)
plt.savefig(os.path.join('./doc/res/', title + '.' + 'pdf'), bbox_inches='tight', format='pdf')
plt.title(title)
plt.show()
- Overnight NTuples creation
for dir in '.' 'y2s.root' 'y2s.root/noise/task01' 'charged.root/task01' 'charged.root/task02' 'charged.root/task03' 'charged.root/task04' 'mixed.root/task01' 'mixed.root/task02' 'mixed.root/task03' 'mixed.root/task04'; do
(echo ${dir}; cd ${dir} && ls -AlFh *mdst*.root && basf2 ~/Work/PID-boost/event_preprocessing.py -p4 -i "*mdst*.root")
done; \
- Overnight event-metrics job scheduling
PARAMS=('--info' '--mc-best' '--exclusive-cut' '--nbins=10' '--ncuts=40' '--holdings' 'pt' 'cosTheta' '--norm' 'pi+');
tasks=('--multivariate-bayes-motivation' '--multivariate-bayes' '--univariate-bayes-outliers' '--niterations=0 --univariate-bayes-priors' '--hold cosTheta --niterations=0 --univariate-bayes-priors' '--univariate-bayes-priors' '--univariate-bayes' '--bayes' '--pid' '--stats' '--bar-particles --pidProbability' '--bar-particles --norm p+ --pidProbability vertex' '--bar-particles --norm p+ --pidProbability primary' '--pidProbability-motivation' '--diff pid flat_bayes' '--diff pid simple_bayes' '--diff pid univariate_bayes' '--diff pid multivariate_bayes' '--diff pidProbability flat_bayes' '--diff pidProbability simple_bayes' '--diff pidProbability univariate_bayes' '--diff pidProbability multivariate_bayes' '--diff flat_bayes simple_bayes' '--diff flat_bayes univariate_bayes' '--diff flat_bayes multivariate_bayes' '--diff simple_bayes univariate_bayes' '--diff simple_bayes multivariate_bayes' '--diff univariate_bayes multivariate_bayes' '--diff univariate_bayes_by_cosTheta multivariate_bayes');
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i . -o ./doc/res/"sample" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i y2s.root -o ./doc/res/"y2s" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i charged.root/task01 -o ./doc/res/"charged 01" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i mixed.root/task01 -o ./doc/res/"mixed 01" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i charged.root/task02 -o ./doc/res/"charged 02" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i mixed.root/task02 -o ./doc/res/"mixed 02" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i charged.root/task03 -o ./doc/res/"charged 03" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i mixed.root/task03 -o ./doc/res/"mixed 03" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i charged.root/task04 -o ./doc/res/"charged 04" ${PARAMS[@]} $(echo ${i}); done; \
tasks=('--pid' '--stats' '--bar-particles --pidProbability' '--bar-particles --norm p+ --pidProbability vertex' '--bar-particles --norm p+ --pidProbability primary' '--pidProbability-motivation')
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i unset.root -o ./doc/res/"unset" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i set.root -o ./doc/res/"set" ${PARAMS[@]} $(echo ${i}); done; \
- Overnight neural network training
NN_PARAMS=('--info' '--add-dropout' '--batch-size=256' '--training-fraction=0.80' '--ncomponents=70' '--epochs=25' '--gpu=0' '--log-dir=/dev/null');
nn_tasks=('--sampling-method fair --pidProbability' '--sampling-method biased --pidProbability' '--sampling-method fair --all' '--sampling-method biased --all');
nn_optimizer_methods=('rmsprop' 'sgd' 'adagrad' 'adadelta' 'adam' 'adamax' 'nadam');
for i in "${nn_tasks[@]}"; do
for n in "${nn_optimizer_methods[@]}"; do
event_nn.py --non-interactive -i charged.root/task01 -o /srv/data/$(id -un)/res/"charged 01" ${NN_PARAMS[@]} --optimizer $(echo ${n}) $(echo ${i});
done;
done; \
for i in "${nn_tasks[@]}"; do
for n in "${nn_optimizer_methods[@]}"; do
event_nn.py --non-interactive -i mixed.root/task01 -o /srv/data/$(id -un)/res/"mixed 01" ${NN_PARAMS[@]} --optimizer $(echo ${n}) $(echo ${i});
done;
done; \
NN_PARAMS=('--info' '--add-dropout' '--batch-size=256' '--training-fraction=0.80' '--ncomponents=90' '--epochs=25' '--gpu=0' '--log-dir=/dev/null');
nn_tasks=('--sampling-method fair --pca' '--sampling-method biased --pca');
nn_optimizer_methods=('adadelta' 'adamax');
for i in "${nn_tasks[@]}"; do
for n in "${nn_optimizer_methods[@]}"; do
event_nn.py --non-interactive -i charged.root/task01 -o /srv/data/$(id -un)/res/"charged 01" ${NN_PARAMS[@]} --optimizer $(echo ${n}) $(echo ${i});
done;
done; \
- Neural network applying models and calculating metrics
NN_PARAMS=('--info' '--add-dropout' '--batch-size=256' '--training-fraction=0.80' '--ncomponents=70' '--epochs=25' '--gpu=0' '--log-dir=/dev/null');
event_nn.py --non-interactive -i charged.root/task04/ -o res/"charged 04" ${NN_PARAMS[@]} --apply --input-module "model_checkpoint_pca_ncomponents90_biased_nAdditionalLayers1_Optimizeradamax_LearningRateNone_nEpochs25_BatchSize256.h5" --input-pca "pca_ncomponents90.pkl"
PARAMS=('--info' '--mc-best' '--exclusive-cut' '--nbins=10' '--ncuts=40' '--holdings' 'pt' 'cosTheta' '--norm' 'pi+');
tasks=('--diff pid nn' '--diff pidProbability nn' '--diff flat_bayes nn' '--diff univariate_bayes nn' '--diff multivariate_bayes nn')
for i in "${tasks[@]}"; do
>&2 echo "$(echo ${i})" \
&& event_metrics.py --non-interactive -o doc/res/"charged 04" ${PARAMS[@]} --input-pickle "res/charged 04/ParticleFrame_applypca_ncomponents90_biased_nAdditionalLayers1_Optimizeradamax_LearningRateNone_nEpochs25_BatchSize256.pkl" $(echo ${i});
done; \
- Neural network model visualization
for i in res/"sample"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"sample" --history-file "${i}"; done; \
for i in res/"y2s"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"y2s" --history-file "${i}"; done; \
for i in res/"charged 01"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"charged 01" --history-file "${i}"; done; \
for i in res/"mixed 01"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"mixed 01" --history-file "${i}"; done; \
- MC9 charged event generation
source ~/setup_belle2.sh release-01-00-03
(unset BELLE2_CONDB_GLOBALTAG; for n in {10..20}; do &>unset_$n.log basf2 generate_charged.py -n 10000 -o unset_$n.mdst.root & done)
(export BELLE2_CONDB_GLOBALTAG="Calibration_Offline_Development"; for n in {10..20}; do &>set_$n.log basf2 generate_charged.py -n 10000 -o set_$n.mdst.root & done)