<h2>Using the 3D Browser</h2>
<p>
First, enter a word or phrase into the search box.
If the word has more than one possible part of speech, you will be asked
to select the one you want. In most cases, though, a Java applet will
automatically load in the space underneath the search box, displaying a
small portion of the LSG centered at your chosen word.
</p>
<p>
Here are the possible actions within the 3D browser:
</p>
<ul>
<li><b>Rotate</b>. Click and drag on empty space to rotate the graph.
By releasing while the mouse is in motion, you can cause the graph to
keep spinning.</li>
<li><b>Navigate</b>. You can travel through the network by clicking
nodes. This will center the node that you click and will redraw the
portion of the LSG nearest the new center.</li>
<li><b>Get Info</b>. Hover the cursor over a node and the part-of-speech
and declension of the word will appear in a tiny popup window.</li>
<li><b>Drag Nodes</b>. Click and drag a node to move it around. When you
release, it bounces back into place.</li>
</ul>
<p>
The green nodes are “hubs”: words representing
core meanings. Each red node corresponds to an orthographic word and is
connected to the green hub that gives its core meaning (or to
more than one green hub if the word is ambiguous).
Red-red links connect synonyms, which are
arranged in rings around a hub. It is the green-green links that
provide the real richness of the network: these indicate
more general semantic relationships, such as hypernymy/hyponymy, between
core meanings.
</p>
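<p>
To make the node and link types concrete, here is a minimal Python sketch of
such a graph. It is an illustration only, not the code behind the browser: the
class and method names are invented, red-red synonym links are left implicit
(words sharing a hub), and only words mentioned on this page are used as data.
</p>

```python
from collections import defaultdict


class LexicalGraph:
    """Toy model: green hubs (core meanings) and red word nodes."""

    def __init__(self):
        self.color = {}                # node label: "green" or "red"
        self.links = defaultdict(set)  # undirected adjacency sets

    def add_hub(self, hub):
        self.color[hub] = "green"

    def add_word(self, word, hubs):
        # A red node is linked to every hub that expresses one of its senses.
        self.color[word] = "red"
        for hub in hubs:
            self.link(word, hub)

    def link(self, a, b):
        self.links[a].add(b)
        self.links[b].add(a)

    def senses(self, word):
        # The green hubs attached to a word are its core senses.
        return sorted(n for n in self.links[word] if self.color[n] == "green")

    def neighborhood(self, center, radius=1):
        # Breadth-first search out to `radius` links from the center node,
        # roughly the portion of the graph a viewer would redraw.
        seen = {center}
        frontier = {center}
        for _ in range(radius):
            frontier = {n for f in frontier for n in self.links[f]} - seen
            seen |= frontier
        return seen


g = LexicalGraph()
for hub in ["brothall", "easpa aeir", "lagachar", "teaspach"]:
    g.add_hub(hub)
g.add_word("meirbhe", ["brothall", "easpa aeir", "lagachar"])
g.add_word("lag", ["lagachar"])
g.add_word("meirfean", ["lagachar"])
g.link("brothall", "teaspach")  # green-green link between related senses

print(g.senses("meirbhe"))                 # ['brothall', 'easpa aeir', 'lagachar']
print(sorted(g.neighborhood("brothall")))  # ['brothall', 'meirbhe', 'teaspach']
```

<p>
Clicking a node in the browser corresponds, in this toy model, to calling
<code>neighborhood</code> with that node as the new center.
</p>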
<p>
For example, if you enter the word <i>meirbhe</i> into the search box,
you should see something like this in your browser:
</p>
<p class="centered">
<img src="meirbhe.png" alt="LSG graph image, centered on 'meirbhe'">
</p>
<p>
This means that the word <i>meirbhe</i> has three basic senses, indicated
by the green hubs <i>brothall</i> (En. “sultriness, humidity”),
<i>easpa aeir</i> (En. “stuffiness”), and <i>lagachar</i> (En. “weakness”).
If you click the green <i>brothall</i> node, you'll see this:
</p>
<p class="centered">
<img src="brothall.png" alt="LSG graph image, centered on 'brothall'">
</p>
<p>
Now <i>brothall</i> is centered and is surrounded by five red nodes
which are synonyms of this core sense. In addition, there is one
semantically-related green node labelled <i>teaspach</i> (“hot weather”).
If you had instead chosen the <i>lagachar</i> sense, things would be much more complicated:
</p>
<p class="centered">
<img src="lagachar.png" alt="LSG graph image, centered on 'lagachar'">
</p>
<p>
First of all, as sometimes happens, the green <i>lagachar</i> node
actually represents two core senses: “weakness” and also “faintness”.
So you'll see some red nodes corresponding primarily to the first sense
(<i>lag</i>), others more like the second sense (<i>meirfean</i>), but
many others that correspond to both. This makes for some very rich
interactions in the network. Also, there are a number of semantically-related
green nodes that may be hard to make out in the static image:
<i>marbhántacht</i> (“lethargy”), <i>soghontacht</i> (“vulnerability”),
<i>míthathag</i> (“flimsiness”), <i>éalang</i> (“a weak spot”), etc.
</p>
<hr>
<h2>Creating the database</h2>
<p>
This project stretches back to 2002 when I created a simple Irish
language thesaurus based on the free Roget's Thesaurus from
<a target="_blank" href="https://www.gutenberg.org/">Project Gutenberg</a>.
I presented that work at the TALN 2003 conference in Batz-sur-Mer, France:
<a target="_blank" href="https://kevinscannell.com/files/teas.pdf">Automatic thesaurus generation for minority languages: an Irish example</a>.
I was never very happy with the quality of the
resulting thesaurus, mostly because it inherited all of the flaws
of the free Roget's: an unwieldy structure not particularly useful for
computational applications (long lists of quasi-synonyms grouped
under general headings), a lack of modern terminology (the free Roget's is
from the 1913 edition), and no rich semantic relations like hypernymy/hyponymy.
Because of this I decided not to distribute the thesaurus widely until I could
make significant improvements in its quality.
</p>
<p>
As you'll read in the TALN paper linked above, it was clear to me even back
then that many of the quality issues could be resolved simply by using
the Princeton WordNet in place of Roget's. This has turned out to be true.
Also important, though, is that I now have much more robust methods for
disambiguating English translations of Irish words, and this has
also improved matters greatly.
</p>
<p>
The heart of the matter in creating the Irish wordnet is to map
Irish words over to one or more English synsets.
The first step in doing this is to use
the English “glosses” that are part of my Irish lexical database.
These are usually one- or two-word definitions like the ones you would
find in Ó Dónaill's dictionary.
When the English glosses are unambiguous, there is no problem:
<i>stáplóir</i> is glossed as “stapler” and this lies in a unique
English synset. The hard work comes in disambiguating glosses like
“bank”, or “ball”, or “flag”.
</p>
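<p>
The first step described above can be sketched in a few lines of Python. This
is a hedged illustration, not the actual pipeline: the sense inventory is a
hand-made stand-in for the Princeton WordNet, the synset identifiers are
invented, and the function names are hypothetical.
</p>

```python
# Hypothetical miniature sense inventory: English word to synset identifiers.
# The identifiers are invented for illustration, not real WordNet keys.
SYNSETS = {
    "stapler": ["stapler-1"],
    "bank": ["bank-1", "bank-2"],
}


def candidate_synsets(glosses):
    """Collect every synset that any English gloss of a headword could denote."""
    found = []
    for gloss in glosses:
        for synset in SYNSETS.get(gloss, []):
            if synset not in found:
                found.append(synset)
    return found


def triage(lexicon):
    """Separate headwords with exactly one candidate synset from the rest."""
    mapped, unresolved = {}, {}
    for headword, glosses in lexicon.items():
        candidates = candidate_synsets(glosses)
        if len(candidates) == 1:
            mapped[headword] = candidates[0]
        else:
            # Zero or several candidates: left for a later disambiguation step.
            unresolved[headword] = candidates
    return mapped, unresolved


lexicon = {"stáplóir": ["stapler"], "bruach": ["bank"]}
mapped, unresolved = triage(lexicon)
print(mapped)      # {'stáplóir': 'stapler-1'}
print(unresolved)  # {'bruach': ['bank-1', 'bank-2']}
```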
<p>
My approach to disambiguation uses a corpus of
English texts and their Irish translations, aligned sentence-by-sentence.
Consider for example the word <i>bruach</i> which has “bank” as one of its
glosses. I extract all Irish sentences containing <i>bruach</i>
(or <i>bhruach</i>,
<i>mbruach</i>, etc.) and the corresponding English sentences.
Some of the English sentences will contain the word “bank”, and the
hope is that the additional context provided by these sentences can be used
to decide which is the correct sense of “bank” using standard techniques
in word-sense disambiguation.
To ensure enough data are available, I am very inclusive
about what goes into the bilingual corpus. For example, the Irish words
and their glosses are included, even though they do not form
complete sentences.
In fact, the glosses alone are often sufficient to determine
the correct sense. This fact is well-known to lexicographers,
including Ó Dónaill et al., who gloss words like <i>feileastram</i>
with two ambiguous English words
(“flag, iris”) but with no fear of confusion.
The original version of the thesaurus from
2003 relied primarily on the glosses for disambiguation since
I had not yet created the big bilingual corpus.
</p>
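<p>
The following Python sketch illustrates the two ideas in this paragraph under
loudly stated assumptions. It uses a simplified Lesk-style word overlap as a
stand-in for “standard techniques in word-sense disambiguation”; the method
actually used may well differ. The aligned sentences, the sense labels, and
the signature words are all invented for the example, and the mutation rules
cover only basic lenition and eclipsis of consonant-initial words.
</p>

```python
import re

# Rough approximation of Irish initial mutations (consonant-initial words only).
ECLIPSIS = {"b": "mb", "c": "gc", "d": "nd", "f": "bhf", "g": "ng", "p": "bp", "t": "dt"}
LENITABLE = set("bcdfgmpst")


def mutated_forms(word):
    """Return the word together with its lenited and eclipsed forms."""
    forms = {word}
    first = word[0]
    if first in LENITABLE:
        forms.add(first + "h" + word[1:])
    if first in ECLIPSIS:
        forms.add(ECLIPSIS[first] + word[1:])
    return forms


def tokens(text):
    return re.findall(r"\w+", text.lower())


def english_contexts(word, aligned_pairs):
    """Gather English words aligned with Irish sentences containing the word."""
    forms = mutated_forms(word)
    context = []
    for irish, english in aligned_pairs:
        if forms.intersection(tokens(irish)):
            context.extend(tokens(english))
    return context


def pick_sense(context, signatures):
    """Simplified Lesk: pick the sense whose signature overlaps the context most."""
    context_set = set(context)
    return max(signatures, key=lambda s: len(signatures[s].intersection(context_set)))


def shared_senses(glosses, inventory):
    """Gloss-only disambiguation: keep senses common to all glosses of a word."""
    return set.intersection(*[inventory[g] for g in glosses])


# Invented aligned sentence pairs (Irish, English).
PAIRS = [
    ("Shuigh sé ar bhruach na habhann.", "He sat on the bank of the river."),
    ("Tá teach ar an mbruach.", "There is a house on the bank."),
    ("Tá an t-airgead sa bhanc.", "The money is in the bank."),
]
# Invented sense signatures for the English word "bank".
SIGNATURES = {
    "bank (river edge)": {"river", "water", "shore", "slope"},
    "bank (financial)": {"money", "account", "loan", "deposit"},
}
# Invented sense inventory for the two glosses of "feileastram".
INVENTORY = {
    "flag": {"banner", "iris_plant", "paving_stone"},
    "iris": {"eye_part", "iris_plant"},
}

print(sorted(mutated_forms("bruach")))                         # ['bhruach', 'bruach', 'mbruach']
print(pick_sense(english_contexts("bruach", PAIRS), SIGNATURES))  # bank (river edge)
print(shared_senses(["flag", "iris"], INVENTORY))              # {'iris_plant'}
```

<p>
The third sentence pair is skipped because it contains no form of
<i>bruach</i>, so its financial vocabulary never enters the context. The last
call shows why a pair of individually ambiguous glosses can still pin down a
single sense.
</p>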
<p>
Another important technical note: I don't in fact map Irish words
directly to Princeton synsets. This is because for many words, the
sense distinctions made by the Princeton lexicographers are too fine
even for intelligent non-lexicographers to make reliably, and certainly
too fine for statistical methods.
In addition, there are many distinctions made in Irish that are not made in
English (e.g. “dearg” vs. “rua” in Irish would map to a single
Princeton synset) and these are precisely the distinctions one does
not want to give up in a monolingual Irish language resource.
So for these reasons I added a separate layer — an “intermediate wordnet” —
between Irish and English with mappings in both directions. It's still
really an English wordnet, but one that is tailored to the needs of Irish.
Tomás de Bhaldraithe's English-Irish dictionary was very useful in
constructing this; the senses given by de Bhaldraithe under
each English word give a rough first approximation
of the sense inventory of the intermediate wordnet.
</p>
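<p>
As a rough picture of the layering, the sketch below models the intermediate
wordnet as two lookup tables. The sense labels and the Princeton identifier
are placeholders invented for this example, and the real data structures are
certainly richer than plain dictionaries.
</p>

```python
# Irish word to senses in the intermediate wordnet (labels are placeholders).
IRISH_TO_INTERMEDIATE = {
    "dearg": ["red#dearg"],
    "rua": ["red#rua"],
}

# Intermediate sense to Princeton synset (the identifier is a placeholder).
INTERMEDIATE_TO_PRINCETON = {
    "red#dearg": "princeton:red",
    "red#rua": "princeton:red",
}


def princeton_synsets(irish_word):
    """Follow Irish word, then intermediate sense, then Princeton synset."""
    senses = IRISH_TO_INTERMEDIATE.get(irish_word, [])
    return sorted({INTERMEDIATE_TO_PRINCETON[s] for s in senses})


def irish_words_for(synset):
    """Go the other way: Princeton synset back to the Irish words."""
    senses = {s for s, p in INTERMEDIATE_TO_PRINCETON.items() if p == synset}
    return sorted(
        word
        for word, word_senses in IRISH_TO_INTERMEDIATE.items()
        if senses.intersection(word_senses)
    )


# Both words reach the same Princeton synset ...
print(princeton_synsets("dearg"), princeton_synsets("rua"))
# ... but they remain distinct at the intermediate layer.
print(IRISH_TO_INTERMEDIATE["dearg"] == IRISH_TO_INTERMEDIATE["rua"])  # False
print(irish_words_for("princeton:red"))  # ['dearg', 'rua']
```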