Bachelor Thesis: Optimization of Particle Identification (at the Belle 2 Experiment)

Playground repository for boosted particle identification

Inspiration

The thesis is heavily inspired by research performed by the ALICE Collaboration (Particle identification in ALICE: a Bayesian approach; Bayesian Approach for Combined Particle Identification in Alice Experiment at LHC).

Outline of the thesis

  1. Abstract
  • Introduction: Standard Model, flavour physics, ultra-rare decays, CP violation
  • Belle 2
    • Experiment: About the experiment itself
    • Detector system
    • Interaction with matter: particle identification -- from the initial event to a final PID
  • Statistics for particle analysis
    • Classification functions: FPR, TPR, PPV, ...
    • Receiver operating characteristic
    • Identification efficiencies: confusion matrix
    • Likelihood: likelihood ratio, Neyman-Pearson
    • Neural network
  • Application
    • Data sample
    • Particle identification variables: inventory
      • Legacy PID
      • Global PID
      • Goodness of the global PID: Default approach for upcoming releases, Neyman-Pearson
    • Bayesian Approach
      • Simple Bayes
      • Univariate Bayes
      • Multivariate Bayes
      • Comparison
      • Summary and outlook
    • Neural network
      • Design: dense layers, drop-out, classification, activation, optimizers
      • Performance
      • Comparison
      • Summary and outlook
  • Conclusion
  2. Computer code

Roadmap

  • Calculate a sensible ROC figure
  • Calculate and visualize the epsilon_PID matrix
  • Visualize precision and recall
  • Evaluate various "pid*" variables and reconstruct e.g. the 'pionID' (see basf2 variables.py section 'PID', e.g. variable 'pidLogLikelihoodValueExpert')
  • Construct a naive alternative to the particle ID variable using Bayes and e.g. a flat prior (see the sketch after this list)
  • Compare the naive approach with the particle ID variable value
  • Allow for variable-dependent priors for the Bayesian approach, e.g. make the priors depend on the transverse momentum
  • Compare the dependent-priors Bayesian approach with the simple Bayesian approach and the particle ID one
  • Allow for multivariate a priori probabilities on two variables
  • Compare the multivariate-priors approach
  • Introduce an alternative selection method, i.e. maximum probability
  • Use different sets of data to diversify results
  • Add a simple Machine Learning approach using the likelihoods as input and the prediction as output
  • Design a functional Neural Network classifying particles
    • Utilize automatically selected ROOT variables as input, e.g. depending on their contribution to a PCA
    • Use a multitude of variables and let the network decide
    • Visually compare the approaches
    • Sample the data in a way which makes all particles equally likely
    • Pre-train neural network on e.g. data sets containing all particles in the same abundance
    • Tweak pre-trained model for decay specific particle abundances
  • Plot efficiencies over background-ratio for PID, Bayes, etc.
  • Visualize the difference in assumed particle abundances for various approaches
  • Abundance comparisons differentiating between correctly and incorrectly reconstructed particles, e.g. in a stacked histogram
  • Efficiency and contamination over transverse momentum plot
  • Output efficiency tables for various approaches, including e.g. the Area Under Curve for PPV and ROC
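
The naive Bayesian approach from the roadmap can be illustrated as follows; this is a minimal sketch with a flat prior, assuming the per-hypothesis likelihoods have already been extracted into a pandas DataFrame. The 'L_*' column names are placeholders for this example, not actual NTuple branches.

import numpy as np
import pandas as pd

particles = ['K+', 'pi+', 'e+', 'mu+', 'p+', 'deuteron']

# Placeholder DataFrame with one likelihood column per particle hypothesis,
# e.g. exponentiated 'pidLogLikelihoodValueExpert' values; the 'L_*' names are made up.
likelihoods = pd.DataFrame(np.random.rand(5, len(particles)),
                           columns=['L_' + p for p in particles])

# Flat prior: every hypothesis is assumed to be equally likely a priori.
priors = {p: 1. / len(particles) for p in particles}

# Bayes: P(p | track) = prior(p) * L(p) / sum over q of prior(q) * L(q)
weighted = pd.DataFrame({p: priors[p] * likelihoods['L_' + p] for p in particles})
bayes = weighted.div(weighted.sum(axis=1), axis=0)

# With a flat prior this reduces to the likelihood ratio underlying the particle ID variables;
# variable-dependent priors would replace the constants in 'priors' per (pt, cosTheta) bin.
print(bayes)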

Physics performance meeting

  • Confusion matrices of the identification process
  • pidProbability may not be set for certain detectors above a specific threshold and therefore skews the result
  • Kink in proton ID could be caused by the addition of further detectors
  • Sample pidProbability by pt and Theta bins and then perform the test based on the Neyman-Pearson lemma (see the sketch after this list)
    • Visualize a plot for each hold
    • Use a density-based approach for the visualization of both
  • Idea to make the binning manifest: x-error bars in the Neyman-Pearson plot
  • Look up where the detector is and whether it explains the effect seen in the Multi-Axis histogram → CDC and TOP are reached at 0.28 < pt < 0.52; Error in CDC detector (by Thomas); End of coverage of CDC at -0.5 cosTheta (my objection)
  • Retrieve the proton's point of origin and apply strict cuts (mcTruth of vertex) → There are no proton tracks with isPrimarySignal
  • Share confusion matrix code
  • Investigate alternative detector calibrations
    • Regenerate charged MC9 data using an altered detector configuration
    • Rerun the analysis on the generated data
    • Validate the new graphs by rerunning the analysis on generated data with the old calibration
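
The binned Neyman-Pearson check referenced in the list above could look roughly like the following sketch; the data and column names ('lr', 'pt', 'cosTheta', 'is_pion') are synthetic stand-ins, not the actual NTuple branches.

import numpy as np
import pandas as pd

# Synthetic toy data: a likelihood-ratio-like score 'lr', the MC truth flag for the
# pion hypothesis and the kinematic variables used as holdings.
rng = np.random.default_rng(42)
df = pd.DataFrame({'lr': rng.uniform(0., 1., 10000),
                   'pt': rng.uniform(0.05, 5.29, 10000),
                   'cosTheta': rng.uniform(-1., 1., 10000)})
df['is_pion'] = rng.uniform(0., 1., 10000) < df['lr']  # toy truth correlated with the score

# Bin the holdings as well as the score itself; following the Neyman-Pearson lemma the
# pion fraction should rise monotonically with the likelihood ratio within each kinematic bin.
df['pt_bin'] = pd.cut(df['pt'], bins=10)
df['cosTheta_bin'] = pd.cut(df['cosTheta'], bins=10)
df['lr_bin'] = pd.cut(df['lr'], bins=10)

pion_fraction = df.groupby(['pt_bin', 'cosTheta_bin', 'lr_bin'], observed=True)['is_pion'].mean()
print(pion_fraction.head())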

Bayes, lib and beyond

  • Add shell completion using bash stuff
  • Add same completion for bash
  • Box plot outliers
  • Strip outliers and plot rates
  • Investigate why epsilon_PID does not sum to one → Cut is not exclusive
  • Select by theta
  • Why does the epsilon_PID contain values for the deuterons
  • Try out seaborn for plotting with custom themes to improve style
  • Make stats honor particles_of_interest
  • Fancier argparse list handling
  • Do not choose arbitrary cuts for the diff of epsilonPIDs but instead e.g. try to have both categorize the same number of values as true? → Better: maximum probability cut
  • Rename chunked to univariate or multivariate
  • Plot particle abundance as histogram
  • Investigate the possibility of results being skewed by NaN values in priors → Guess they are okay; if the value is NaN there is no data available to compare the priors to
  • Fixup & Improve Bayes
    • Find alternatives to and strip data selection by whiskers → Disable whiskers by default and drop unphysical values prior to processing anything
    • Look into stats ratios, as they are not in compliance with the given rates
    • Is the PID variable simply defined for more rows, i.e. does the denominator include rows with NaN values? → Yes
    • Check that flat_bayes is actually identical to the ROOT variable pidProbabilityExpert → Both are more or less equal; Considering the numerical precision the result is probably fine
    • Investigate the treatment of particle and anti-particle pairs
  • Keep the window size for stat plots fixed between 0 and 1 for the x- as well as the y-axis
  • Share the colorbar between plots in diff_epsilon instead of just reusing the last
  • Diff stats only for the range between the minimal and maximal values, with no interpolation beyond the data
  • Fix path joining by not relying on '/' being the default delimiter
  • Remove title from figure when saving
  • Plot Neyman-Pearson for different cutting columns, e.g. the Bayesian columns - Bear in mind that it is not straight-forward to compare those graphs
  • Plot Neyman-Pearson for each particle and anti-particle separately
  • Use pandas' query attribute instead of manual selection → Utilized where appropriate
  • Put the labels of the epsilonPID cells on a grey scale depending on their background
  • Cut the data to ignore mcPDG = 0 (removing misidentified particles is not our job), keep z0 < 5 cm and d0 < 2 cm, and drop pt < 50 MeV/c (see the sketch after this list)
  • Plot number of hits for mcPDG = 0 for various detectors (ask Jakob) as motivation for excluding those tracks
  • Make bar_particles configurable but keep the non-bar part of it even when 'bared'
  • Calculate the overall accuracy of an approach (especially useful for comparisons with a neural network)
  • Save the calculated columns from event_metrics to a pickle file & Check for columns existence prior to calculating the values anew (configurable via '--force')
  • ? Utilize pandas' multi-indexing instead of nesting the DataFrames into a dictionary
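
A minimal sketch of the track selection cuts mentioned above, mirroring the physical_boundaries used in the library snippet further down; the toy DataFrame and its columns stand in for the per-particle DataFrames of the actual analysis.

import numpy as np
import pandas as pd

# Toy stand-in for one of the per-particle DataFrames with the basf2-style columns
# referenced by the cuts above: mcPDG, z0 (cm), d0 (cm) and pt (GeV/c).
rng = np.random.default_rng(0)
tracks = pd.DataFrame({'mcPDG': rng.choice([0, 211, 321, 2212], 1000),
                       'z0': rng.normal(0., 3., 1000),
                       'd0': rng.normal(0., 1., 1000),
                       'pt': rng.exponential(1., 1000)})

# Drop tracks without an MC match and restrict to the fiducial region; pt is given in GeV/c,
# so 0.05 corresponds to the 50 MeV/c threshold.
selected = tracks.query('mcPDG != 0 and z0 < 5 and d0 < 2 and 0.05 < pt < 5.29')
print('kept %d of %d tracks' % (len(selected), len(tracks)))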

Neural network

  • Start by inputting logLikelihoods and maybe add a sum function as node
  • Allow adapting the epochs, training_fraction, etc. variables
  • Include the NN model in the ParticleFrame and save it into the pickle → Bad idea as it will unnecessarily clutter the file with unrelated data
  • Add model-saving checkpoints as a callback (see the sketch below)
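
A minimal sketch of the checkpoint callback item above, assuming the network is built with Keras; the architecture and file-name pattern are placeholders, not the actual event_nn.py model.

from tensorflow import keras

# Placeholder classifier standing in for the actual PID network.
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(7,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(6, activation='softmax'),
])
model.compile(optimizer='adamax', loss='categorical_crossentropy', metrics=['accuracy'])

# Save the model whenever the validation loss improves; the file name is an assumption.
checkpoint = keras.callbacks.ModelCheckpoint('model_checkpoint_epoch{epoch:02d}.h5',
                                             monitor='val_loss', save_best_only=True)

# model.fit(x_train, y_train, validation_split=0.2, epochs=25, batch_size=256,
#           callbacks=[checkpoint])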

Useful snippets and code examples

Source code snippets

  • Selecting pidProbabilities for the SVD detector
# The columns encode 'pidProbabilityExpert(<pdgCode>, <detector>)' in basf2's ROOT-compatible
# spelling, where '__bo' = '(', '__cm' = ',', '__sp' = ' ' and '__bc' = ')'.
d = 'svd'
c = {k: 'pidProbabilityExpert__bo' + basf2_Code(k) + '__cm__sp' + d + '__bc' for k in particles}
title_suffixes += [' via SVD PID']
  • Library
def __init__(self, particles=None, particle_base_formats=None, detectors=None, pseudo_detectors=None, detector_weights=None, physical_boundaries=None):
    """Base definitions of stable particles and detector components to be included in calculations."""

    self.particles = ['K+', 'pi+', 'e+', 'mu+', 'p+', 'deuteron'] if particles is None else particles
    self.detectors = ['pxd', 'svd', 'cdc', 'top', 'arich', 'ecl', 'klm'] if detectors is None else detectors
    self.pseudo_detectors = ['all', 'default'] if pseudo_detectors is None else pseudo_detectors
    # Use the detector weights to exclude certain detectors, e.g. for debugging purposes
    # Bear in mind that if all likelihoods are calculated correctly this should never improve the result
    self.detector_weights = {d: 1. for d in ['pxd', 'svd', 'cdc', 'top', 'arich', 'ecl', 'klm', 'all', 'default']} if detector_weights is None else detector_weights
    # Dictionary of variables and their boundaries for possible values they might yield
    self.physical_boundaries = {'pt': (0, 5.5), 'cosTheta': (-1, 1)} if physical_boundaries is None else physical_boundaries
  • Search for differences in pidProbabilities for particles and anti-particles (Bear in mind np.nan == np.nan is False) → There are none!
for p, anti_p in zip(particles, particles_bar):
    p_column = 'pidProbabilityExpert__bo' + lib.basf2_Code(p) + '__cm__sp' + d + '__bc'
    anti_p_column = 'pidProbabilityExpert__bo' + lib.basf2_Code(anti_p) + '__cm__sp' + d + '__bc'
    print(pf[p].dropna()[pf[p][p_column] != pf[p][anti_p_column]][[p_column, anti_p_column]])
  • Motivation for excluding mcPDG = 0
import itertools
import os
import re

import matplotlib.pyplot as plt

import lib

nbins = 10; ParticleFrame = lib.ParticleFrame
# NOTE: This will taint the class
ParticleFrame.physical_boundaries = {'0.05 < pt < 5.29', 'z0 < 5', 'd0 < 2'}

data = ParticleFrame()
data.read_root('./charged.root/task01')

for p, detector_hits in itertools.product(ParticleFrame.particles, ['nCDCHits', 'nPXDHits', 'nSVDHits', 'nVXDHits']):
  plt.figure()
  plt.hist(data[p].query('mcPDG == 0')[detector_hits], nbins, density=True, histtype='step', label='False (# %d)'%(data[p].query('mcPDG == 0')[detector_hits].shape[0]))
  plt.hist(data[p].query('mcPDG != 0')[detector_hits], nbins, density=True, histtype='step', label='True (# %d)'%(data[p].query('mcPDG != 0')[detector_hits].shape[0]))
  plt.legend()
  title = detector_hits + ' for ' + ParticleFrame.particle_base_formats[p]
  sanitized_title = re.sub('[\\\\$_^{}]', '', 'General Purpose Statistics: ' + title + '')
  plt.savefig(os.path.join('./doc/res/charged 01/', sanitized_title + '.' + 'pdf'), bbox_inches='tight', format='pdf')
  plt.title(title)
  plt.show(block=False)

plt.show()
  • Receiver Operating Characteristic curve
import os

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0., 1., 1000)
title = 'Sample Receiver Operating Characteristic (ROC) curve'

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x, x, linestyle='--', label='Guessing')
plt.plot(x, x**4, label='Worse')
plt.plot(x, x**(1/4), label='Better')
plt.legend()
plt.savefig(os.path.join('./doc/res/', title + '.' + 'pdf'), bbox_inches='tight', format='pdf')
plt.title(title)
plt.show()
  • Neyman-Pearson visualization
import os

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.05, 0.95, 10)
y = np.exp(x)
title = 'Neyman-Pearson Visualization'

plt.xlabel(r'$\mathcal{LR}(\pi)$')
plt.ylabel(r'$\frac{\#\pi}{\sum_{a}{\# A}}$')
plt.errorbar(x, y/y.max(), xerr=0.05, capsize=0., elinewidth=1, marker='o', markersize=4, markeredgewidth=1, markerfacecolor='None', linestyle='--', linewidth=0.1)
plt.savefig(os.path.join('./doc/res/', title + '.' + 'pdf'), bbox_inches='tight', format='pdf')
plt.title(title)
plt.show()

Job scheduling

  • Overnight NTuples creation
for dir in '.' 'y2s.root' 'y2s.root/noise/task01' 'charged.root/task01' 'charged.root/task02' 'charged.root/task03' 'charged.root/task04' 'mixed.root/task01' 'mixed.root/task02' 'mixed.root/task03' 'mixed.root/task04'; do
  (echo ${dir}; cd ${dir} && ls -AlFh *mdst*.root && basf2 ~/Work/PID-boost/event_preprocessing.py -p4 -i "*mdst*.root")
done; \
  • Overnight event-metrics job scheduling
PARAMS=('--info' '--mc-best' '--exclusive-cut' '--nbins=10' '--ncuts=40' '--holdings' 'pt' 'cosTheta' '--norm' 'pi+');
tasks=('--multivariate-bayes-motivation' '--multivariate-bayes' '--univariate-bayes-outliers' '--niterations=0 --univariate-bayes-priors' '--hold cosTheta --niterations=0 --univariate-bayes-priors' '--univariate-bayes-priors' '--univariate-bayes' '--bayes' '--pid' '--stats' '--bar-particles --pidProbability' '--bar-particles --norm p+ --pidProbability vertex' '--bar-particles --norm p+ --pidProbability primary' '--pidProbability-motivation' '--diff pid flat_bayes' '--diff pid simple_bayes' '--diff pid univariate_bayes' '--diff pid multivariate_bayes' '--diff pidProbability flat_bayes' '--diff pidProbability simple_bayes' '--diff pidProbability univariate_bayes' '--diff pidProbability multivariate_bayes' '--diff flat_bayes simple_bayes' '--diff flat_bayes univariate_bayes' '--diff flat_bayes multivariate_bayes' '--diff simple_bayes univariate_bayes' '--diff simple_bayes multivariate_bayes' '--diff univariate_bayes multivariate_bayes' '--diff univariate_bayes_by_cosTheta multivariate_bayes');
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i . -o ./doc/res/"sample" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i y2s.root -o ./doc/res/"y2s" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i charged.root/task01 -o ./doc/res/"charged 01" ${PARAMS[@]} $(echo ${i}); done; \

for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i mixed.root/task01 -o ./doc/res/"mixed 01" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i charged.root/task02 -o ./doc/res/"charged 02" ${PARAMS[@]} $(echo ${i}); done; \

for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i mixed.root/task02 -o ./doc/res/"mixed 02" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i charged.root/task03 -o ./doc/res/"charged 03" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i mixed.root/task03 -o ./doc/res/"mixed 03" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do event_metrics.py --non-interactive -i charged.root/task04 -o ./doc/res/"charged 04" ${PARAMS[@]} $(echo ${i}); done; \

tasks=('--pid' '--stats' '--bar-particles --pidProbability' '--bar-particles --norm p+ --pidProbability vertex' '--bar-particles --norm p+ --pidProbability primary' '--pidProbability-motivation')
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i unset.root -o ./doc/res/"unset" ${PARAMS[@]} $(echo ${i}); done; \
for i in "${tasks[@]}"; do >&2 echo "$(echo ${i})" && event_metrics.py --non-interactive -i set.root -o ./doc/res/"set" ${PARAMS[@]} $(echo ${i}); done; \
  • Overnight neural network training
NN_PARAMS=('--info' '--add-dropout' '--batch-size=256' '--training-fraction=0.80' '--ncomponents=70' '--epochs=25' '--gpu=0' '--log-dir=/dev/null');
nn_tasks=('--sampling-method fair --pidProbability' '--sampling-method biased --pidProbability' '--sampling-method fair --all' '--sampling-method biased --all');
nn_optimizer_methods=('rmsprop' 'sgd' 'adagrad' 'adadelta' 'adam' 'adamax' 'nadam');

for i in "${nn_tasks[@]}"; do
  for n in "${nn_optimizer_methods[@]}"; do
    event_nn.py --non-interactive -i charged.root/task01 -o /srv/data/$(id -un)/res/"charged 01" ${NN_PARAMS[@]} --optimizer $(echo ${n}) $(echo ${i});
  done;
done; \
for i in "${nn_tasks[@]}"; do
  for n in "${nn_optimizer_methods[@]}"; do
    event_nn.py --non-interactive -i mixed.root/task01 -o /srv/data/$(id -un)/res/"mixed 01" ${NN_PARAMS[@]} --optimizer $(echo ${n})  $(echo ${i});
  done;
done; \

NN_PARAMS=('--info' '--add-dropout' '--batch-size=256' '--training-fraction=0.80' '--ncomponents=90' '--epochs=25' '--gpu=0' '--log-dir=/dev/null');
nn_tasks=('--sampling-method fair --pca' '--sampling-method biased --pca');
nn_optimizer_methods=('adadelta' 'adamax');

for i in "${nn_tasks[@]}"; do
  for n in "${nn_optimizer_methods[@]}"; do
    event_nn.py --non-interactive -i charged.root/task01 -o /srv/data/$(id -un)/res/"charged 01" ${NN_PARAMS[@]} --optimizer $(echo ${n}) $(echo ${i});
  done;
done; \
  • Neural network applying models and calculating metrics
NN_PARAMS=('--info' '--add-dropout' '--batch-size=256' '--training-fraction=0.80' '--ncomponents=70' '--epochs=25' '--gpu=0' '--log-dir=/dev/null');
event_nn.py --non-interactive -i charged.root/task04/ -o res/"charged 04" ${NN_PARAMS[@]} --apply --input-module "model_checkpoint_pca_ncomponents90_biased_nAdditionalLayers1_Optimizeradamax_LearningRateNone_nEpochs25_BatchSize256.h5" --input-pca "pca_ncomponents90.pkl"

PARAMS=('--info' '--mc-best' '--exclusive-cut' '--nbins=10' '--ncuts=40' '--holdings' 'pt' 'cosTheta' '--norm' 'pi+');
tasks=('--diff pid nn' '--diff pidProbability nn' '--diff flat_bayes nn' '--diff univariate_bayes nn' '--diff multivariate_bayes nn')
for i in "${tasks[@]}"; do
  >&2 echo "$(echo ${i})" \
  && event_metrics.py --non-interactive -o doc/res/"charged 04" ${PARAMS[@]} --input-pickle "res/charged 04/ParticleFrame_applypca_ncomponents90_biased_nAdditionalLayers1_Optimizeradamax_LearningRateNone_nEpochs25_BatchSize256.pkl" $(echo ${i});
done; \
  • Neural network model visualization
for i in res/"sample"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"sample" --history-file "${i}"; done; \
for i in res/"y2s"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"y2s" --history-file "${i}"; done; \
for i in res/"charged 01"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"charged 01" --history-file "${i}"; done; \
for i in res/"mixed 01"/history_*.pkl; do event_model.py --non-interactive -o doc/res/"mixed 01" --history-file "${i}"; done; \
  • MC9 charged event generation
source ~/setup_belle2.sh release-01-00-03
(unset BELLE2_CONDB_GLOBALTAG; for n in {10..20}; do &>unset_$n.log basf2 generate_charged.py -n 10000 -o unset_$n.mdst.root & done)
(export BELLE2_CONDB_GLOBALTAG="Calibration_Offline_Development"; for n in {10..20}; do &>set_$n.log basf2 generate_charged.py -n 10000 -o set_$n.mdst.root & done)
