Benchmarking

NOTE: all of the benchmarking code requires a Linux-based operating system, because memory usage is measured with the getrusage function from Python's resource module.
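As a rough illustration of the kind of measurement involved, the sketch below reads the peak memory of the current process through `resource.getrusage`. The helper name `peak_memory_mb` is hypothetical and is not taken from the benchmark script; the kilobyte-to-megabyte conversion assumes Linux.

```python
import resource


def peak_memory_mb() -> float:
    """Return the peak resident set size of this process in MB.

    Assumption: running on Linux, where ru_maxrss is reported in
    kilobytes. Other platforms may report a different unit.
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / 1024


if __name__ == "__main__":
    print(f"Peak memory: {peak_memory_mb():.2f} MB")
```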

In this section we benchmark the taggers (currently there is only one tagger) on resource utilisation (memory and speed) and performance. Performance is measured with two metrics, both of which are percentages:

Accuracy
Coverage -- the percentage of tokens that have been tagged with something other than the unmatched tag (the Z99 tag).
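The two metrics could be computed along the following lines. This is a minimal sketch, not the benchmark's own code: the function name `accuracy_and_coverage` is hypothetical, and it assumes one predicted tag and one gold tag per token, with accuracy counted as an exact match between the two.

```python
from typing import Sequence, Tuple


def accuracy_and_coverage(
    predicted: Sequence[str],
    gold: Sequence[str],
    unmatched_tag: str = "Z99",
) -> Tuple[float, float]:
    """Return (accuracy, coverage) as percentages.

    Assumption: one tag per token, and a prediction is correct only
    when it is identical to the gold tag.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not predicted:
        raise ValueError("at least one token is required")

    total = len(predicted)
    correct = sum(p == g for p, g in zip(predicted, gold))
    # Coverage counts every token that did not fall through to Z99.
    covered = sum(p != unmatched_tag for p in predicted)
    return 100 * correct / total, 100 * covered / total


if __name__ == "__main__":
    # Toy tags, for illustration only.
    predicted_tags = ["A1", "Z99", "B2", "C1"]
    gold_tags = ["A1", "A2", "B2", "E1"]
    print(accuracy_and_coverage(predicted_tags, gold_tags))
```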

Rule based tagger

Code to benchmark the rule based tagger:

python benchmarks/rule_based_tagger.py --markdown

Output, based on the Welsh gold standard dataset from the paper Leveraging Pre-Trained Embeddings for Welsh Taggers:

Memory (MB)	Tokens Per Second	Accuracy (%)	Coverage (%)
112.78	20,046	68.94	91.97

Note that these figures will differ between computers. On an Apple MacBook Air 2021 (M1) the tagger uses a lot more memory, but is quicker than on the Ubuntu desktop. The figures above were generated on my AMD Ryzen 5 1600 Six-Core Processor with 16GB of RAM, running the Ubuntu operating system.
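For readers who want to reproduce a Tokens Per Second style figure on their own machine, one possible approach is to time a tagging function over a list of tokens. The sketch below is an assumption about how such a figure might be obtained, not a description of the benchmark script; `tokens_per_second` and `tag_fn` are hypothetical names.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(tag_fn: Callable[[str], object], tokens: Sequence[str]) -> float:
    """Time tag_fn over every token and return the throughput."""
    start = time.perf_counter()
    for token in tokens:
        tag_fn(token)
    elapsed = time.perf_counter() - start
    if elapsed == 0:
        # Guard against a timer that did not advance on a tiny input.
        return float("inf")
    return len(tokens) / elapsed


if __name__ == "__main__":
    # str.lower stands in for a real tagger here.
    print(f"{tokens_per_second(str.lower, ['Ci'] * 100_000):,.0f} tokens/s")
```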

Rule based tagging process

If pos==punc label as PUNCT
Lookup token and pos tag
Lookup lemma and pos tag
Lookup lower case token and pos tag
Lookup lower case lemma and pos tag
If pos==num label as N1
Lookup token with any POS tag and choose first entry in lexicon.
Lookup lemma with any POS tag and choose first entry in lexicon.
Lookup lower case token with any POS tag and choose first entry in lexicon.
Lookup lower case lemma with any POS tag and choose first entry in lexicon.
Label as Z99, this is the unmatched semantic tag.
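The steps above can be sketched as a single fall-through function. This is an illustrative sketch only: the function name `tag_token` and the two lexicon structures (one keyed by text and POS tag, one keyed by text alone with entries in lexicon order) are assumptions, not the tagger's actual data model.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon shapes:
#   pos_lexicon: (text, pos) -> semantic tags
#   lexicon:     text -> list of entries in lexicon order, each a list of tags
PosLexicon = Dict[Tuple[str, str], List[str]]
Lexicon = Dict[str, List[List[str]]]


def tag_token(token: str, lemma: str, pos: str,
              pos_lexicon: PosLexicon, lexicon: Lexicon) -> List[str]:
    if pos == "punc":
        return ["PUNCT"]

    # Lookup order: token, lemma, lower case token, lower case lemma.
    candidates = [token, lemma, token.lower(), lemma.lower()]

    # POS-sensitive lookups come first.
    for text in candidates:
        tags = pos_lexicon.get((text, pos))
        if tags:
            return tags

    if pos == "num":
        return ["N1"]

    # POS-insensitive lookups: take the first entry in the lexicon.
    for text in candidates:
        entries = lexicon.get(text)
        if entries:
            return entries[0]

    # Nothing matched: the unmatched semantic tag.
    return ["Z99"]
```

With a toy lexicon, a token whose POS tag is absent from the POS-sensitive lexicon falls through to the first POS-insensitive entry, and a token found nowhere receives Z99.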

Resources

Multilingual USAS lexicons
Welsh Semantic Tagger, Java version.
Welsh gold standard dataset. This dataset uses the basic POS tags from the CyTag POS tagger, see appendix A1 of this paper.
Mapping basic CyTag POS tags to core POS tags used by the USAS lexicon.
Detailed paper on the USAS tagset

Semantic Resources

USAS tagset

The text in this sub-section has been copied from the TAGSET section of the USAS guide.

The semantic tags are composed of:

an upper case letter indicating general discourse field.
a digit indicating a first subdivision of the field.
(optionally) a decimal point followed by a further digit to indicate a finer subdivision.
(optionally) one or more ‘pluses’ or ‘minuses’ to indicate a positive or negative position on a semantic scale.
(optionally) a slash followed by a second tag to indicate clear double membership of categories.
(optionally) a left square bracket followed by ‘i’ to indicate a semantic template (multi-word unit).

Other symbols utilised:

% = rarity marker (1)
@ = rarity marker (2)
f = female
m = male
c = potential antecedents of conceptual anaphors (neutral for number)
n = neuter
i = indicates a semantic idiom

Antonymity of conceptual classifications is indicated by +/- markers on tags. Comparatives and superlatives receive double and triple +/- markers respectively. Certain words and collocational units show a clear double (and in some instances, triple) membership of categories. Such cases are dealt with using slash tags, that is, all tags are indicated and separated by a slash (e.g. anti-royal = E2-/S7.1+, accountant = I2.1/S2mf, bunker = G3/H1 K5.1/W3, Admiral = G3/M4/S2mf S7.1+/S2mf, dowry = S4/I1/A9-).

The initial tagset was loosely based on Tom McArthur's Longman Lexicon of Contemporary English (McArthur, 1981) as this appeared to offer the most appropriate thesaurus type classification of word senses for this kind of analysis. We have since considerably revised the tagset in the light of practical tagging problems met in the course of the research. The revised tagset is arranged in a hierarchy with 21 major discourse fields expanding into 232 category labels.
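To make the tag structure concrete, the sketch below splits a slash tag into its parts and extracts the discourse field, the numeric division, any +/- scale markers, and any trailing symbols. It is a simplified illustration, not an official USAS parser: the function name `parse_usas_tag` is hypothetical, the pattern assumes that scale markers come before the letter symbols, it accepts multi-digit divisions such as Z99, and it does not handle the `[i` template marker or space-separated alternative tags (those would need to be split on whitespace first).

```python
import re
from typing import Dict, List

# One slash-separated part of a tag, e.g. "S7.1+" or "S2mf".
TAG_PART = re.compile(
    r"^(?P<field>[A-Z])"            # discourse field letter
    r"(?P<division>\d+)"            # first subdivision
    r"(?P<subdivisions>(?:\.\d+)*)" # optional finer subdivisions
    r"(?P<scale>\++|-+)?"           # optional pluses or minuses
    r"(?P<markers>[%@fmcni]*)$"     # optional trailing symbols
)


def parse_usas_tag(tag: str) -> List[Dict[str, str]]:
    """Parse a (possibly slash-joined) tag into a list of parts."""
    parts = []
    for part in tag.split("/"):
        match = TAG_PART.match(part)
        if match is None:
            raise ValueError(f"Unrecognised tag part: {part!r}")
        parts.append({
            "field": match["field"],
            "division": match["division"] + match["subdivisions"],
            "scale": match["scale"] or "",
            "markers": match["markers"],
        })
    return parts


if __name__ == "__main__":
    for example in ["E2-/S7.1+", "I2.1/S2mf", "S4/I1/A9-"]:
        print(example, parse_usas_tag(example))
```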

The following table shows the 21 labels at the top level of the hierarchy.

A general and abstract terms	B the body and the individual	C arts and crafts	E emotion
F food and farming	G government and public	H architecture, housing and the home	I money and commerce in industry
K entertainment, sports and games	L life and living things	M movement, location, travel and transport	N numbers and measurement
O substances, materials, objects and equipment	P education	Q language and communication	S social actions, states and processes
T time	W world and environment	X psychological actions, states and processes	Y science and technology
Z names and grammar
