Bounds estimation for produced photons / electrons #66

Open
JelleAalbers opened this issue Mar 23, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@JelleAalbers (Member)

If you set max_sigma high enough, flamedisx will give accurate results regardless of how bad the bounds estimation is. Empirically, we found that max_sigma = 5 is reasonable for our current bounds estimation.

It would be good to redo the derivations of the bounds estimation formulas and see if we can improve them. This would improve our speed/accuracy trade-off.

@JelleAalbers JelleAalbers added the enhancement New feature or request label Mar 23, 2020
@JelleAalbers JelleAalbers changed the title Bounds estimation derivation and improvements Bounds estimation for produced photons / electrons Aug 8, 2020
@JelleAalbers (Member, Author)

Specifically, the problem is estimating the number of produced [photons/electrons] N, given a known detection efficiency p and the number of [photons/electrons] detected, k.

There is a stackexchange post on this here. The accepted answer suggests a procedure involving the negative binomial. Unless I made a mistake while implementing it, it has an unacceptably high false-exclusion rate (low coverage) for low N and p. This is for 99% intervals:

(figure: stackoverflow — false-exclusion rate for 99% intervals)
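For reference, here is my reading of that accepted answer: under a flat prior on N, the number of undetected quanta N - k is negative binomial with k + 1 "successes" at probability p. A minimal sketch of the resulting interval (the helper name nbinom_bounds is mine; only scipy.stats.nbinom is assumed):

```python
from scipy import stats

def nbinom_bounds(k, p, cl=0.99):
    """Interval on produced N, given k detected at efficiency p.

    Reading of the stackexchange answer: with a flat prior on N,
    N - k ~ NegativeBinomial(k + 1 successes, success probability p).
    """
    alpha = 1 - cl
    lo = k + stats.nbinom.ppf(alpha / 2, k + 1, p)
    hi = k + stats.nbinom.ppf(1 - alpha / 2, k + 1, p)
    return lo, hi
```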

You could also imagine an oversimplified solution like this:

  • Start with the N for which the observed k is the most likely outcome: Nm = k/p.
  • Find the relative error on k-values, given a binomial with that N: relative_error = sigma / mu = sqrt(Nm p (1-p)) / (Nm p) = sqrt((1-p) / k).
  • Assume this is also the relative error of Nm; i.e. report the interval Nm * (1 +- max_sigma * relative_error) on N.

The performance for 99% intervals is similar to the stackoverflow result:

(figure: oversimplified_gaussian_argument — false-exclusion rate for 99% intervals)
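In code, that recipe is only a few lines (a sketch; gaussian_bounds is a made-up name):

```python
import numpy as np

def gaussian_bounds(k, p, max_sigma=5):
    """Oversimplified interval: take Nm = k / p and scale by the
    binomial relative error at that N, applied symmetrically."""
    nm = k / p
    # sigma / mu of Binomial(Nm, p); simplifies to sqrt((1-p) / k)
    rel_err = np.sqrt(nm * p * (1 - p)) / (nm * p)
    return nm * (1 - max_sigma * rel_err), nm * (1 + max_sigma * rel_err)
```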

Currently, flamedisx uses a mysterious equation here, leading to an interval of Nm +- max_sigma * (q + (q**2 + 4 * Nm * q)**0.5)/2, with q=1/p. I'd have to dig up how this was derived; likely in a handwavy / semi-empirical / let's-add-factors-of-two-until-it-works way. However, it does seem to be sufficiently conservative for low p-values:

(figure: mystery_equation — false-exclusion rate for 99% intervals)
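For comparison, here is the mystery interval as a sketch, with q left as a parameter so variants are easy to try (mystery_bounds is a hypothetical name):

```python
import numpy as np

def mystery_bounds(k, p, max_sigma=5, q=None):
    """Interval Nm +- max_sigma * sigma, with
    sigma = (q + sqrt(q**2 + 4 * Nm * q)) / 2."""
    if q is None:
        q = 1 / p  # the current flamedisx choice
    nm = k / p
    sigma = (q + np.sqrt(q**2 + 4 * nm * q)) / 2
    return nm - max_sigma * sigma, nm + max_sigma * sigma
```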

Still, you can see in the figure that the asymptote is no longer correct. For higher p-values the mystery interval is much too conservative -- the false-exclusion rate is in the 10^-4 range around N = 4, then drops even further. This makes flamedisx slower, but at least its values are not wrong.

For the moment, we could keep the mystery equation, but it is a bit embarrassing and makes things unnecessarily slow. Perhaps we could instead use one of the motivated solutions above, with an 'empirical fix' for the low-p, low-N regime? Finally, an option is to compute and cache exact confidence intervals using the Neyman construction in the low (N, p) regime; we'd have to do it for several p values and interpolate somehow. Fortunately, this is all in the numpy part of flamedisx, specifically the annotation stage, so we can be creative.
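For the exact option, a brute-force Neyman construction is short enough to sketch (names are mine; k's discreteness means a careful construction would tune the acceptance region, so treat this as a starting point):

```python
import numpy as np
from scipy import stats

def neyman_bounds(k, p, cl=0.99, n_max=10_000):
    """Central confidence interval on N by belt inversion:
    N is accepted if the observed k lies inside the central
    (cl) acceptance region of Binomial(N, p)."""
    alpha = 1 - cl
    n_grid = np.arange(n_max + 1)
    k_lo = stats.binom.ppf(alpha / 2, n_grid, p)
    k_hi = stats.binom.ppf(1 - alpha / 2, n_grid, p)
    accepted = n_grid[(k >= k_lo) & (k <= k_hi)]
    return accepted.min(), accepted.max()
```

Caching these for a grid of p values and interpolating, as suggested above, would sit on top of something like this.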

@pelssers (Member)

We should definitely redo the derivation of the mystery formula, but there might be a simple fix in the meantime.
Setting q = (1-p)/p in the mystery formula instead of the original q = 1/p (inspired by the stackoverflow post you linked), I get the result below:
(figure: mystery_equation_solved — false-exclusion rate with q = (1-p)/p)
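In terms of the mystery_bounds sketch above, this is just mystery_bounds(k, p, q=(1 - p) / p).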

pelssers added a commit that referenced this issue Aug 11, 2020
@pelssers pelssers mentioned this issue Aug 14, 2020
@JelleAalbers (Member, Author)

JelleAalbers commented Aug 9, 2021

Believe it or not, exactly a year after I wrote the long post above, the derivation of the mystery equation has washed back up from the seas of time...

Let me try to translate; remember it's a hand-wavy argument, though apparently a somewhat successful one:

  • A binomial(N, p) has variance / mean = (N p (1-p)) / (N p) = (1-p) = q
  • We're after the inverse problem, estimating N given an observed k. There are two issues:
    A. For a given N, the possible k values are spread out;
    B. Our best estimate of N (Nm = k/p) has an error (which we want to find out)
  • We know A gives a variance / mean of q, i.e. σ^2 = mean * q
  • To consider B, we replace mean -> mean + σ to find the +1 sigma bound to get σ^2 = (mean + σ) * q
  • Solving for σ, σ = 0.5 (q +- sqrt(q^2 + 4 mean q)); we can discard the - sign solution since σ must be positive.

This is indeed the 1 sigma bound used in the mystery equation, assuming I made a typo in coding up q=1/p instead of q=1-p. Why q = (1-p)/p = 1/p-1 worked better for @pelssers I have no idea; maybe q=1-p is even better ;-)

I also didn't consider that the bounds might be asymmetric. Repeating the exercise with mean -> mean - σ gives σ = 0.5 (-q +- sqrt(q^2 + 4 mean q)); clearly the + sign gives a positive σ, but one that is smaller than in the mean + σ case.
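Both quadratics are quick to check symbolically (a sketch, assuming sympy; the positivity assumptions let solve discard the negative roots):

```python
import sympy as sp

sigma, m, q = sp.symbols('sigma m q', positive=True)

# Upper bound: sigma**2 = (m + sigma) * q
#   -> sigma = (q + sqrt(q**2 + 4*m*q)) / 2
print(sp.solve(sp.Eq(sigma**2, (m + sigma) * q), sigma))

# Lower bound: sigma**2 = (m - sigma) * q
#   -> sigma = (-q + sqrt(q**2 + 4*m*q)) / 2, the smaller (asymmetric) bound
print(sp.solve(sp.Eq(sigma**2, (m - sigma) * q), sigma))
```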

@CristianAntochi (Collaborator)

Hi @JelleAalbers!
I think I encountered a similar problem in a textbook exercise on computing the 1-sigma confidence belt for a binomial p, so the argument makes sense to me. However, in the toy MCs for that exercise I also got q ~ 1/p at some point; I think it comes from an inversion related to Nm = k/p.
