Q12 in Statistics #15

davies-w · 2024-01-14T00:23:36Z

How is this question different from the German Tank Problem, which appears to have a different solution?
I'm curious, as I'm not a trained statistician, but once found a bug in the implementation of the GTP, which was causing a page outage at a certain large internet company, which is how I know of the GTP.

https://en.wikipedia.org/wiki/German_tank_problem

davies-w · 2024-01-15T02:52:12Z

I actually think the answer as given only applies if sample size (k) is O(m) than the largest number seen, see https://en.wikipedia.org/wiki/German_tank_problem#Example for Frequentist formula of m +m/k - 1 - EG if you have 4 data points, and the largest number is 60, the estimate is 60 + 60/4 - 1 or 74, the Bayesian value is close in this case too. Of course, if you'd sampled k = 15, and it was 60, then the estimate would be 63.

davies-w · 2024-01-15T02:56:01Z

Intuitively this makes sense. If your sample size is 1, you'd might be best of assuming that this was the middle value - and this is what the frequentist approach to the german tank problem gives you (m + m - 1)

davies-w · 2024-01-16T02:21:56Z

I just proved to myself the GTP is a better estimator using the following montecarlo approach.

import numpy as np
rng = np.random.default_rng()

lists = [([rng.integers(1,i) for j in range(rng.integers(1,i))], i) for i in rng.integers(2, 100, size=1000)]

a = 0
b = 0
for samples, y in lists:
  d1 = max(samples)
  d2 = max(samples) + max(samples)/len(samples) - 1 
  if np.abs(y-d1) < np.abs(y-d2):
    a += 1
  else:
    b += 1

print(f"{a=}, {b=}")

Gives: a=105, b=895 -- ie the solution given is only better than the Frequentist approach 10% of the time. Without too much extra work, one could show that there's some asymtope where if the number of samples approaches the size of d, that you can just use max(samples).

davies-w · 2024-01-16T02:46:36Z

And now I encoded the two Bayesian's in, and it looks like for k < d, that Bayesian Median works best!

import numpy as np
rng = np.random.default_rng()

lists = [([rng.integers(1, i) for j in range(rng.integers(1, i))], i) for i in rng.integers(2, 1000, size=10000)]

best = [0]*4

for samples, d in lists:
  m = max(samples)
  k = len(samples)
  est = []
  est.append(m)
  est.append(m + m/k - 1)
  if k > 1:
    est.append(m + (m*math.log(2))/(k-1))
  else:
    est.append(m + m/k - 1)
  if k>2:
    est.append(((m - 1) * (k - 1))/(k - 2))
  else:
    est.append(m + m/k - 1)
  diffs = np.abs(d - np.array(est))
  best[np.argmin(diffs)]+=1

print(f"{best=}")

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Q12 in Statistics #15

Q12 in Statistics #15

davies-w commented Jan 14, 2024

davies-w commented Jan 15, 2024

davies-w commented Jan 15, 2024

davies-w commented Jan 16, 2024 •

edited

Loading

davies-w commented Jan 16, 2024

Q12 in Statistics #15

Q12 in Statistics #15

Comments

davies-w commented Jan 14, 2024

davies-w commented Jan 15, 2024

davies-w commented Jan 15, 2024

davies-w commented Jan 16, 2024 • edited Loading

davies-w commented Jan 16, 2024

davies-w commented Jan 16, 2024 •

edited

Loading