NOTE: all of the benchmarking code requires a Linux-based operating system, because the amount of memory used is measured with the `resource.getrusage` method.
In this section we benchmark the taggers (currently only one tagger) on resource utilisation (memory and speed) and performance. Performance is reported using two metrics, both of which are percentages:
- Accuracy
- Coverage: the percentage of tokens that have been tagged with a tag other than the unmatched tag (the `Z99` tag).
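The two metrics can be sketched as follows. This is a minimal illustration, not the code in the benchmark script: it assumes one predicted tag and one gold tag per token compared by exact match, and the function names are hypothetical.

```python
from typing import Sequence

# The unmatched semantic tag, as described above.
UNMATCHED_TAG = "Z99"


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


def coverage(predicted: Sequence[str]) -> float:
    """Percentage of tokens given any tag other than the unmatched tag."""
    if not predicted:
        return 0.0
    matched = sum(tag != UNMATCHED_TAG for tag in predicted)
    return 100.0 * matched / len(predicted)
```

Note that coverage only looks at the predictions, so a tagger can have high coverage and still be inaccurate.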
Code to benchmark the rule based tagger:
python benchmarks/rule_based_tagger.py --markdown
Output, based on the Welsh gold standard dataset from the paper Leveraging Pre-Trained Embeddings for Welsh Taggers:
Memory (MB) | Tokens Per Second | Accuracy (%) | Coverage (%) |
---|---|---|---|
112.78 | 20,046 | 68.94 | 91.97 |
Note that these figures will differ between computers. On an Apple MacBook Air 2021 (M1) the benchmark uses a lot more memory, but is quicker than on the Ubuntu desktop. The figures above were generated on my AMD Ryzen 5 1600 Six-Core Processor with 16GB of RAM, running the Ubuntu operating system.
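As a rough illustration of how the memory and speed figures could be collected, the sketch below reads peak memory through `resource.getrusage` and times a tagging call. It is not the benchmark script itself: `peak_memory_mb`, `tokens_per_second`, and the `tag_tokens` callable are hypothetical names used only for this example.

```python
import resource
import sys
import time
from typing import Callable, List


def peak_memory_mb() -> float:
    """Peak resident set size of this process, in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


def tokens_per_second(tag_tokens: Callable[[List[str]], object],
                      tokens: List[str]) -> float:
    """Time a single call to a (hypothetical) tagging function."""
    start = time.perf_counter()
    tag_tokens(tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")
```

The `resource` module is only available on Unix-like systems, which is consistent with the operating system requirement noted at the start of this section.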
- If `pos==punc` label as `PUNCT`
- Lookup token and pos tag
- Lookup lemma and pos tag
- Lookup lower case token and pos tag
- Lookup lower case lemma and pos tag
- If `pos==num` label as `N1`
- Lookup token with any POS tag and choose first entry in lexicon.
- Lookup lemma with any POS tag and choose first entry in lexicon.
- Lookup lower case token with any POS tag and choose first entry in lexicon.
- Lookup lower case lemma with any POS tag and choose first entry in lexicon.
- Label as `Z99`, this is the unmatched semantic tag.
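The rules above can be sketched as a single lookup function applied in order. This is an illustration under stated assumptions, not the tagger's actual implementation: the two dictionaries are hypothetical data structures, where `pos_lexicon` is keyed by `(text, pos)` and `lexicon` is keyed by text alone and holds the tags of the first lexicon entry for that text.

```python
from typing import Dict, List, Tuple


def tag_token(token: str, lemma: str, pos: str,
              pos_lexicon: Dict[Tuple[str, str], List[str]],
              lexicon: Dict[str, List[str]]) -> List[str]:
    """Apply the rules in order and return the first match."""
    if pos == "punc":
        return ["PUNCT"]

    # Order matters: token, lemma, lower case token, lower case lemma.
    candidates = [token, lemma, token.lower(), lemma.lower()]

    # Lookups that require the POS tag to match.
    for text in candidates:
        tags = pos_lexicon.get((text, pos))
        if tags:
            return tags

    if pos == "num":
        return ["N1"]

    # Lookups with any POS tag (hypothetical: first entry stored per text).
    for text in candidates:
        tags = lexicon.get(text)
        if tags:
            return tags

    # Nothing matched: the unmatched semantic tag.
    return ["Z99"]
```

Because each rule returns as soon as it matches, a POS-specific entry always takes priority over a POS-agnostic one, and `Z99` is only assigned when every lookup fails.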
- Multilingual USAS lexicons
- Welsh Semantic Tagger, Java version.
- Welsh gold standard dataset. This dataset uses the basic POS tags from the CyTag POS tagger, see appendix A1 of this paper.
- Mapping basic CyTag POS tags to core POS tags used by the USAS lexicon.
- Detailed paper on the USAS tagset
The text from this sub-section has been copied from the TAGSET section of the USAS guide.
The semantic tags are composed of:
- an upper case letter indicating general discourse field.
- a digit indicating a first subdivision of the field.
- (optionally) a decimal point followed by a further digit to indicate a finer subdivision.
- (optionally) one or more ‘pluses’ or ‘minuses’ to indicate a positive or negative position on a semantic scale.
- (optionally) a slash followed by a second tag to indicate clear double membership of categories.
- (optionally) a left square bracket followed by ‘i’ to indicate a semantic template (multi-word unit).
Other symbols utilised:
- % = rarity marker (1)
- @ = rarity marker (2)
- f = female
- m = male
- c = potential antecedents of conceptual anaphors (neutral for number)
- n = neuter
- i = indicates a semantic idiom
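To make the tag structure concrete, the sketch below splits a tag on slashes and breaks each part into its components with a regular expression. It is an illustrative parser written for this guide, not part of any USAS tool. As assumptions, it accepts multi-digit divisions (as in `Z99`) and repeated subdivisions, expects scale markers before the other symbols, and does not handle the `[i` template marker.

```python
import re
from typing import Dict, List

# Hypothetical pattern covering the components listed above.
TAG_PATTERN = re.compile(
    r"(?P<field>[A-Z])"              # general discourse field
    r"(?P<division>\d+)"             # first subdivision of the field
    r"(?P<subdivisions>(?:\.\d+)*)"  # optional finer subdivisions
    r"(?P<scale>\++|-+)?"            # optional pluses or minuses
    r"(?P<markers>[%@fmcni]*)"       # optional other symbols
)


def parse_tag(tag: str) -> List[Dict[str, object]]:
    """Parse a (possibly slash-separated) tag into its components."""
    parts = []
    for single in tag.split("/"):
        match = TAG_PATTERN.fullmatch(single)
        if match is None:
            raise ValueError(f"Unrecognised tag: {single!r}")
        parts.append({
            "field": match["field"],
            "division": match["division"],
            "subdivisions": [s for s in match["subdivisions"].split(".") if s],
            "scale": match["scale"] or "",
            "markers": match["markers"],
        })
    return parts
```

For example, `parse_tag("I2.1/S2mf")` returns two parts, the second of which has field `S`, division `2` and markers `mf`.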
Antonymity of conceptual classifications is indicated by +/- markers on tags. Comparatives and superlatives receive double and triple +/- markers respectively. Certain words and collocational units show a clear double (and in some instances, triple) membership of categories. Such cases are dealt with using slash tags, that is, all tags are indicated and separated by a slash (e.g. anti-royal = E2-/S7.1+, accountant = I2.1/S2mf, bunker = G3/H1 K5.1/W3, Admiral = G3/M4/S2mf S7.1+/S2mf, dowry = S4/I1/A9-).

The initial tagset was loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981) as this appeared to offer the most appropriate thesaurus type classification of word senses for this kind of analysis. We have since considerably revised the tagset in the light of practical tagging problems met in the course of the research. The revised tagset is arranged in a hierarchy with 21 major discourse fields expanding into 232 category labels.
The following table shows the 21 labels at the top level of the hierarchy.
Letter | Discourse field |
---|---|
A | general and abstract terms |
B | the body and the individual |
C | arts and crafts |
E | emotion |
F | food and farming |
G | government and public |
H | architecture, housing and the home |
I | money and commerce in industry |
K | entertainment, sports and games |
L | life and living things |
M | movement, location, travel and transport |
N | numbers and measurement |
O | substances, materials, objects and equipment |
P | education |
Q | language and communication |
S | social actions, states and processes |
T | time |
W | world and environment |
X | psychological actions, states and processes |
Y | science and technology |
Z | names and grammar |