-
Notifications
You must be signed in to change notification settings - Fork 1
/
good_turning_estimate.py
94 lines (65 loc) · 2.36 KB
/
good_turning_estimate.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
#!/usr/local/env python
"""
================================
Good–Turing frequency estimation
================================
Reference: https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation
X is the vocabulary set, so X = {x1, x2, x3, ..., xm}
W is a sequence with N independent sample, W = w1, w2, ..., wn, where wk ∈ X
ɵ[j] is the probability of xj being seen in the future.
Assumtion: if #(xj) = #(xk), then ɵ[j] = ɵ[k], in other word, xj and xk with the
same times in W, so they are with the same probability in general case.
θ(r) means the probability of a word occurring given that it appeared r times in W
Nr denote the number of item types that occur exactly r times in W
So, Nr = |{xj : #(xj ) = r}|.
Example, X = {the, dog, cat, bad}, W = {the, bad, cat, the, cat}
N0 = 1, cos dog not exist in W
N1 = 1, cos bad occurs 1 time
N2 = 2, cos the and cat both occurs 2 times
So, N = ∑ r * Nr
Good turning estimate for θ(r): θ(r) = 1 / N * (1 + r) * Nr+1 / Nr
θ(0) = N1 / N
For a group of items which occur r times, the probability would be:
(1 + r) * Nr+1 / N
"""
import re
import sys
import gzip
import argparse
from collections import Counter
from matplotlib import pyplot as plt
def trim(x):
return ''.join(re.findall(r'\w+', x.replace('\\n', '')))
def load_corpus(path):
fopen = open
if re.search(r'\S(.tar.gz)$', path):
fopen = gzip.open
with fopen(path, 'r') as f:
raw = ''.join(re.findall(r'\w+', f.read().decode('utf-8').replace('\\n', '')))
return raw
def get_nr(vocb: Counter, r: int):
"""
Get number of items with occurring r times in vocabularies
vocb = {'w1': r1, 'w2': r2, ..., 'wn': rn}
Nr = |{xj : #(xj ) = r}|
"""
return sum(1 if item[1] == r else 0 for item in vocb.items())
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('corpus', type=str, help='Corpus file path')
args = parser.parse_args(sys.argv[1:])
corpus = load_corpus(args.corpus)
vocb = Counter(corpus)
nrs = []
max_freq = 1000 # max = vocb.most_common(1)[0][1]
for i in range(max_freq):
print('Processing r = %d' % i)
if i == 0:
nr = get_nr(vocb, 1)
else:
nr = get_nr(vocb, i)
nrs.append(nr)
plt.plot(range(max_freq), nrs)
plt.xlabel('r')
plt.ylabel('Nr')
plt.show()