Universal Dependencies syntax annotations from the GUM corpus (https://gucorpling.org/gum/)
GUM, the Georgetown University Multilayer corpus, is an open source collection of richly annotated texts from multiple text types. The corpus is collected and expanded by students as part of the curriculum in the course LING-4427 "Computational Corpus Linguistics" at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (usually Creative Commons licenses), so that new texts can be annotated and published with ease.
The dependencies in the corpus up to GUM version 5 were originally annotated using Stanford Typed Dependencies (de Marneffe & Manning 2013) and converted automatically to UD using DepEdit (https://gucorpling.org/depedit/). The rule-based conversion took into account gold annotations found in other annotation layers of the GUM corpus (e.g. entity annotations), and has since been corrected manually in native UD. The original conversion script can be found in the GUM build bot code from version 5, available from the (non-UD) GUM repository. Documents from version 6 of GUM onwards were annotated directly in UD, and subsequent manual error correction to all GUM data has also been done directly using the UD guidelines. Enhanced dependencies were added semi-automatically from version 7.1 of the corpus. For more details see the corpus website.
The MISC column contains morphological segmentation, Construction Grammar, entity, coreference, information status, Wikification and discourse annotations from the full GUM corpus, encoded using the annotations MSeg, Cxn, Entity, SplitAnte, Bridge and Discourse.
Morphological segmentation in GUM is annotated semi-automatically in the MISC MSeg attribute using the UniMorph lexical resource (Kirov et al. 2018), specifically using scripts based on the lexicon data here. Analyses are concatenative, using hyphens as separators, and are guaranteed to sum up to the string of each token with only hyphens added. Existing hyphens in a word form are retained and assumed to be meaningful. Analyses cover inflection, derivation and compounding. For example:
- books -> book-s
- explanation -> explan-ation
- baseball -> base-ball
- e-mail -> e-mail (hyphen is retained, presumably meaningful)
Note that stems are retained in their orthographic forms (explanation does not become explain+ation), and 'etymological affixation' in loanwords is not necessarily analyzed (e.g. "ex" is not split off since the corresponding affixation process is no longer interpretable in English). For more information and updates to the segmentation guidelines see the GUM wiki.
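Because MSeg is guaranteed to differ from the token form only by added hyphens, this invariant is easy to check mechanically. The following minimal Python sketch (illustrative only, not part of the GUM tooling; the file path is just an example) verifies it for every MSeg value in a CoNLL-U file:

import io

def check_mseg(conllu_path):
    """Verify that each MSeg MISC value equals the token form once added hyphens are removed."""
    with io.open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
            if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
                continue
            form, misc = cols[1], cols[9]
            for anno in misc.split("|"):
                if anno.startswith("MSeg="):
                    mseg = anno[len("MSeg="):]
                    # removing all hyphens from both strings must give the same result,
                    # since pre-existing hyphens (as in "e-mail") are kept in MSeg
                    assert mseg.replace("-", "") == form.replace("-", ""), (form, mseg)

check_mseg("en_gum-ud-dev.conllu")  # example path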
GUM uses the MISC Cxn annotation to distinguish some complex constructions in a Construction Grammar (CxG) framework developed by collaborators from Dagstuhl Seminar 23191 for the integration of CxG analyses into UD trees. Construction labels are always attached to the highest token belonging to the necessary or defining elements of the construction, and carry hierarchical designations, such as the prefix Cxn=Conditional for all conditional constructions, but a more specific Cxn=Conditional-UnspecifiedEpistemic-Reduced for reduced conditionals (the type seen in "if possible"). Individual elements of a construction are annotated using the CxnElt MISC annotation. Currently covered constructions are listed in the GUM wiki. For more information and for work using these annotations, please refer to Weissweiler et al. 2024.
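As a simple illustration of how these attributes can be accessed (a sketch assuming a sentence is passed in as a list of 10-column CoNLL-U token lines; not official tooling), the following Python function lists the tokens carrying a Cxn or CxnElt annotation:

def find_constructions(token_lines):
    """Return (token_id, key, value) triples for Cxn/CxnElt MISC annotations in one sentence."""
    hits = []
    for line in token_lines:
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        for anno in cols[9].split("|"):
            if anno.startswith("Cxn=") or anno.startswith("CxnElt="):
                key, value = anno.split("=", 1)
                hits.append((cols[0], key, value))
    return hits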
The Entity annotation uses the CoNLL 2012 shared task bracketing format, which identifies potentially coreferring entities using round opening and closing brackets as well as a unique ID per entity, repeated across mentions. In the following example, actor Jared Padalecki appears in a single-token mention, labeled (1-person-giv:act-cf2*-1-coref-Jared_Padalecki), indicating the entity type (person) combined with the unique ID of all mentions of Padalecki in the text (1-person). Because Padalecki is a named entity with a corresponding Wikipedia page, the Wikification identifier corresponding to his Wikipedia page is given after the last hyphen (1-person-Jared_Padalecki). We can also see an information status annotation (giv:act, indicating an aforementioned or 'given' entity, actively mentioned last no farther back than the previous sentence; see Dipper et al. 2007), a Centering Theory annotation (cf2*, indicating that he is the second most salient entity in the sentence moving forward, and that he was mentioned in the previous sentence, indicated by the *), as well as minimum token ID information indicating the head tokens for fuzzy matching (in this case 1, the first and only token in this span) and the coreference type coref, indicating lexical subsequent mention. The labels for each part of the hyphen-separated annotation are given at the top of each document in a comment # global.Entity = GRP-etype-infstat-centering-minspan-link-identity, indicating that these annotations consist of the entity group ID (i.e. the coreference group), entity type, information status, Centering Theory annotation, minimal span of tokens for head matching, coreference link type, and named entity identity (if available).
Multi-token mentions receive an opening bracket on the line on which they open, such as (97-person-giv:inact-cf4-1,3-coref-Jensen_Ackles, and a closing annotation 97) at the token on which they end. Multiple annotations are possible for one token, corresponding to nested entities: for example, (175-time-giv:inact-cf5-1-coref)189)188) below corresponds to the single token and last token of the time entities "2015" and "April 2015" respectively, as well as the last token of the larger mention "the second campaign in the Always Keep Fighting series in April 2015".
# global.Entity = GRP-etype-infstat-centering-minspan-link-identity
...
1 For for ADP IN _ 4 case 4:case Discourse=joint-sequence_m:104->98:2:lex-indph-954-955
2 the the DET DT Definite=Def|PronType=Art 4 det 4:det Bridge=173<188|Entity=(188-event-acc:inf-cf6-3,6,8-sgl
3 second second ADJ JJ Degree=Pos|NumForm=Word|NumType=Ord 4 amod 4:amod _
4 campaign campaign NOUN NN Number=Sing 16 obl 16:obl:for _
5 in in ADP IN _ 10 case 10:case _
6 the the DET DT Definite=Def|PronType=Art 10 det 10:det Entity=(173-abstract-giv:inact-cf3-2,4,5-coref
7 Always Always ADV NNP _ 8 advmod 8:advmod MSeg=Al-way-s|XML=<hi rend:::"italic">
8 Keep Keep VERB NNP Mood=Imp|Person=2|VerbForm=Fin 10 compound 10:compound _
9 Fighting Fighting VERB NNP VerbForm=Ger 8 xcomp 8:xcomp MSeg=Fight-ing|XML=</hi>
10 series series NOUN NN Number=Sing 4 nmod 4:nmod:in Entity=173)
11 in in ADP IN _ 12 case 12:case _
12 April April PROPN NNP Number=Sing 4 nmod 4:nmod:in Entity=(189-time-new-cf10-1-sgl|XML=<date when:::"2015-04">
13 2015 2015 NUM CD NumForm=Digit|NumType=Card 12 nmod:unmarked 12:nmod:unmarked Entity=(175-time-giv:inact-cf5-1-coref)189)188)|SpaceAfter=No|XML=</date>
14 , , PUNCT , _ 4 punct 4:punct _
15 Padalecki Padalecki PROPN NNP Number=Sing 16 nsubj 16:nsubj Entity=(1-person-giv:act-cf2*-1-coref-Jared_Padalecki)
16 partnered partner VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root MSeg=partner-ed
17 with with ADP IN _ 18 case 18:case _
18 co-star co-star NOUN NN Number=Sing 16 obl 16:obl:with Entity=(97-person-giv:inact-cf4-1,3-coref-Jensen_Ackles|MSeg=co-star
19 Jensen Jensen PROPN NNP Number=Sing 18 appos 18:appos XML=<ref target:::"https://en.wikipedia.org/wiki/Jensen_Ackles">
20 Ackles Ackles PROPN NNP Number=Sing 19 flat 19:flat Entity=97)|XML=</ref>
21 to to PART TO _ 22 mark 22:mark Discourse=purpose-goal:105->104:0:syn-inf-963|PDTB=Implicit:Contingency.Purpose.Arg2-as-goal:in order:_:943-962:963-981
22 release release VERB VB VerbForm=Inf 16 advcl 16:advcl:to _
23 a a DET DT Definite=Ind|PronType=Art 24 det 24:det Entity=(190-object-new-cf7-2-coref
24 shirt shirt NOUN NN Number=Sing 22 obj 22:obj Entity=190)
25 featuring feature VERB VBG VerbForm=Ger 24 acl 24:acl Discourse=elaboration-attribute:106->105:0:syn-mdf-966+syn-nmn-967|MSeg=featur-ing
26 both both DET DT PronType=Tot 25 obj 25:obj Entity=(191-object-new-cf9-1-sgl
27 of of ADP IN _ 29 case 29:case _
28 their their PRON PRP$ Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Prs 29 nmod:poss 29:nmod:poss Entity=(192-person-acc:aggr-cf1-1-coref)|SplitAnte=1<192,97<192
29 faces face NOUN NNS Number=Plur 26 nmod 26:nmod:of Entity=191)|MSeg=face-s|SpaceAfter=No
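To make the bracketing concrete, here is a minimal Python sketch (an illustration, not official GUM tooling) that splits a single Entity MISC value into the mentions it opens and closes, following the GRP-etype-infstat-centering-minspan-link-identity template declared above; it assumes that hyphens inside Wikification identifiers are escaped, as noted in the changelog below:

import re

FIELDS = ["GRP", "etype", "infstat", "centering", "minspan", "link", "identity"]

def parse_entity(value):
    """Split an Entity MISC value into opened mentions (as dicts) and closed mention IDs."""
    opened, closed = [], []
    # each item is either "(fields", "(fields)" (single-token mention) or "ID)"
    for item in re.findall(r"\([^()]*\)?|[^()]+\)", value):
        if item.startswith("("):
            fields = item.strip("()").split("-", 6)  # identity is the optional 7th field
            opened.append(dict(zip(FIELDS, fields)))
            if item.endswith(")"):
                closed.append(fields[0])  # a single-token mention also closes here
        else:
            closed.append(item.rstrip(")"))
    return opened, closed

print(parse_entity("(1-person-giv:act-cf2*-1-coref-Jared_Padalecki)"))
print(parse_entity("(175-time-giv:inact-cf5-1-coref)189)188)"))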
In addition, a list of the globally most salient entities in each document can be found in the metadata at the beginning of the document, for example:
# meta::salientEntities = 1, 5, 6, 7, 8, 12, 98, 173, 180, 181, 182, 183, 184
Here the value 1 stands for Padalecki, as in the annotations above.
Possible values for the other annotations mentioned above are:
- entity type: abstract, animal, event, object, organization, person, place, plant, substance, time
- information status
- new - not previously mentioned
- giv:act - mentioned no further than one sentence ago
- giv:inact - mentioned earlier
- acc:inf - accessible, inferable from some previous mention (e.g. the house... [the door])
- acc:aggr - accessible, aggregate, i.e. split antecedent mediated by a set of previous mentions
- acc:com - accessible, common ground, i.e. generic ([the world]) or situationally accessible ("pass [the salt]", first mention of "you" or "I")
- centering:
- rank in the forward looking center (Cf), and a '*' for the top entity also mentioned in the previous sentence (Cb). The preferred forward looking center (Cp) is simply expressed as cf1.
- centering transition types are computed from these annotations and given at the sentence level in # transition annotations
- link:
- ana - pronominal anaphora (the dancers ... [they])
- appos - apposition (Kim, [the lawyer])
- cata - cataphora ("In [their] speech, the athletes said", or expletive cataphora: "[it] is easy [to dance]")
- coref - lexical coreference (e.g. [Kim] ... [Kim])
- disc - discourse deixis, non-NP, e.g. verbal antecedent as in "[Kim arrived] - [this] delighted the children"
- pred - predication, e.g. Kim is [a teacher] (but NOT definite identification: This is Kim)
- sgl - singleton, not mentioned again in document
- identity: any Wikipedia article title
- minspan: a number or set of comma-separated numbers indicating indices of minimal head tokens within the span of the mention (first in span: 1, etc.)
For equivalent Wikidata identifiers for each Wikipedia article title, see this file.
The annotations SplitAnte and Bridge mark non-strict identity anaphora (see the Universal Anaphora project for more details). For example, at token 28 in the example, the pronoun "their" refers back to two non-adjacent entities, requiring a split antecedent annotation. The value SplitAnte=1<192,97<192 indicates that 192-person (the pronoun "their") refers back to two previous Entity annotations, with pointers separated by a comma: 1 (1-person-...Jared_Padalecki) and 97 (97-person-...Jensen_Ackles).
Bridging anaphora is annotated when an entity has not been mentioned before, but is resolvable in context by way of a different entity: for example, token 2 has the annotation Bridge=173<188, which indicates that although 188-event ("the second campaign...") has not been mentioned before, its identity is mediated by the previous mention of another entity, 173-abstract (the project "Always Keep Fighting", mentioned earlier in the document, to which the campaign event belongs). In other words, readers can infer that "the second campaign" is part of the already introduced larger project, which also had a first campaign. This inference also leads to the information status label acc:inf, accessible-inferable.
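Both SplitAnte and Bridge use the same pointer notation (antecedent<anaphor, with multiple pointers separated by commas), so a single helper suffices to read them. The following is a minimal sketch, not official tooling:

def parse_pointers(value):
    """Parse SplitAnte/Bridge values like '1<192,97<192' into (antecedent, anaphor) ID pairs."""
    pairs = []
    for pointer in value.split(","):
        antecedent, anaphor = pointer.split("<")
        pairs.append((antecedent, anaphor))
    return pairs

print(parse_pointers("1<192,97<192"))  # split antecedents of "their" (192): entities 1 and 97
print(parse_pointers("173<188"))       # bridging: 188 is inferable from 173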
Discourse annotations are given in eRST dependencies, following the conversion from RST constituent trees suggested by Li et al. (2014); for the original RST constituent parses of GUM see the source repo. At the beginning of each Elementary Discourse Unit (EDU), a Discourse annotation gives the discourse function of the unit beginning with that token, followed by a colon, the ID of the current unit, and an arrow pointing to the ID of the parent unit in the discourse parse. For instance, Discourse=purpose-goal:105->104:0:syn-inf-963 at token 21 in the example above means that this token begins discourse unit 105, which functions as a purpose-goal to unit 104, which begins at token 1 in this sentence ("Padalecki partnered with co-star Jensen Ackles --purpose-goal-> to release a shirt..."). The third number (:0) indicates that the attachment has a depth of 0, without an intervening span in the original RST constituent tree (this information allows deterministic reconstruction of the RST constituent discourse tree from the conllu file). The final part of the Discourse annotation lists categorized signals which correspond to the discourse relation in question, as defined by eRST; in this case, syn-inf-963 indicates a syntactic signal (syn) of the subtype "infinitival_clause" (inf), since the purpose relation is signaled by the use of an infinitive, a typical strategy in English. The index 963 refers to the position of the signal, in this case token number 963 in the document (excluding empty nodes), the infinitive "to" (token 21 in the sentence). Multiple signals are separated by +. See below for the inventory of signal types.
Additionally, note that multiple discourse relations can sometimes occur on the same line, since eRST allows multiple concurrent and tree-breaking relations to be identified. In such cases the multiple relation entries are separated by ; and ordered such that the primary relation (which indicates RST nuclearity and is guaranteed to be projective in the discourse tree) is serialized first, followed by any non-projective secondary relations. The unique ROOT node of the discourse tree has no arrow notation; e.g. Discourse=ROOT:2:0 means that this token begins unit 2, which is the Central Discourse Unit (or discourse root) of the current document. Although it is easiest to recover RST constituent trees from the source repo, it is also possible to generate them automatically from the dependencies with depth information, using the scripts in the rst2dep repo.
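Putting the pieces together, a Discourse value can be read as relation:unit->parent:depth:signals, with secondary relations separated by ; and multiple signals by +. The sketch below (illustrative only; it assumes every signal has the class-subtype-position(s) shape described above) unpacks such values:

def parse_discourse(value):
    """Parse a Discourse MISC value into one dict per (primary or secondary) relation."""
    relations = []
    for entry in value.split(";"):                      # secondary relations are ;-separated
        parts = entry.split(":")
        relation, edge = parts[0], parts[1]
        unit, parent = edge.split("->") if "->" in edge else (edge, None)  # ROOT has no parent
        depth = parts[2] if len(parts) > 2 else None
        signals = []
        if len(parts) > 3 and parts[3]:
            # each signal is class-subtype-position(s), e.g. syn-inf-963 or lex-indph-954-955
            signals = [tuple(sig.split("-", 2)) for sig in parts[3].split("+")]
        relations.append({"relation": relation, "unit": unit, "parent": parent,
                          "depth": depth, "signals": signals})
    return relations

print(parse_discourse("purpose-goal:105->104:0:syn-inf-963"))
print(parse_discourse("ROOT:2:0"))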
Discourse relations in GUM are defined based on the effect that W (a writer/speaker) has on R (a reader/hearer) by modifying a Nucleus discourse unit (N) with another discourse unit (a Satellite, S, or another N). Discourse relation units can precede their nuclei (satellite-nucleus, or SN relation), follow them (NS), or be coordinated with each other (NN or multinuclear relations). Relations are classified hierarchically into 15 major classes and include:
- Adversative
- adversative-antithesis (SN/NS) - R is meant to prefer N as an alternative to S
- adversative-concession (SN/NS) - R is meant to look past an incompatibility of N with S
- adversative-contrast (NN) - W presents multiple Ns as incompatible, but of equal prominence
- Attribution
- attribution-positive (SN/NS) - S states a source for the information in N
- attribution-negative (SN/NS) - S states that a potential source is NOT a source of the information in N
- Causal
- causal-cause (SN/NS) - S is the cause of N (and N is more prominent)
- causal-result (SN/NS) - S is the result of N (or: N is the cause of S, and N is more prominent)
- Context
- context-background (SN/NS) - S provides prerequisite information to increase R's understanding of N
- context-circumstance (SN/NS) - S details circumstances (often spatio-temporal) under which N applies
- Contingency
- contingency-condition (SN/NS) - N occurs (or not) depending on S
- Elaboration
- elaboration-attribute (NS) - S gives additional information about a participant within N (not on the entire proposition in N)
- elaboration-additional (NS) - S gives additional information about the proposition in N as a whole
- Explanation
- explanation-evidence (SN/NS) - S provides evidence which increases R's belief in N
- explanation-justify (SN/NS) - S increases R's acceptance of W's right to say N
- explanation-motivation (SN/NS) - S is meant to influence R's willingness to act according to N
- Evaluation
- evaluation-comment (SN/NS) - S provides an assessment of N by W (R does not have to share this assessment)
- Joint
- joint-disjunction (NN) - W presents multiple Ns which can be regarded as interchangeable alternatives
- joint-list (NN) - W presents multiple Ns in parallel which are additive, of equal prominence, and of equivalent purpose
- joint-sequence (NN) - Multiple Ns form a temporally ordered sequence of events presented in chronological order
- joint-other (NN) - a collection of unlike Ns of equal prominence, but of disparate (non-equivalent) discourse purpose
- Mode
- mode-manner (SN/NS) - S indicates the manner in which N happens
- mode-means (SN/NS) - S indicates the means by which N happens
- Organization
- organization-heading (SN) - S prepares R for N using an explicit text organizing device such as a heading
- organization-phatic (SN/NS) - S prepares R for N by holding the floor for W, without contributing propositional content
- organization-preparation (SN) - covers all other forms of S units primarily used to signal an upcoming N
- Purpose
- purpose-attribute (SN/NS) - S gives the purpose of a participant within N (not the entire proposition in N)
- purpose-goal (SN/NS) - the proposition in N as a whole is initiated or exists in order to realize S
- Restatement
- restatement-partial (NS) - S partly realizes the same role and content as a previous N
- restatement-repetition (NN) - Multiple Ns of equal prominence realize the same role and content
- Topic
- topic-question (SN) - S steers the discourse topic by posing a question to which N is the answer
- topic-solutionhood (SN/NS) - S steers the discourse topic by posing a problem, to which N presents a solution
- Same-unit (NN) - connects parts of a discontinuous discourse unit (this is not a discourse relation)
Relation signals fall into nine major classes, most with several subtypes each, and include:
- dm: discourse markers of primary relations ('but', 'additionally', 'on the other hand'...)
- orphan (orp): discourse markers of secondary relations
- graphical (grf): colon (col), dash (dsh), items_in_sequence (seq), layout (ly), parentheses (prn), quotation_marks (qt), question_mark (qst), semicolon (semcol)
- lexical (lex): alternate_expression (altlex), indicative_phrase (indph), indicative_word (indwd)
- morphological (mrf): mood (md), tense (tns)
- numerical (num): same_count (count)
- reference (ref): comparative_reference (cmp), demonstrative_reference (dem), general_word (gnrl), personal_reference (prs), propositional_reference (prop)
- semantic (sem): antonymy (antnm), attribution_source (atsrc), lexical_chain (lxchn), meronymy (mrnym), negation (ngt), repetition (rpt), synonymy (synym)
- syntactic (syn): subject_auxiliary_inversion (sbinv), infinitival_clause (inf), interrupted_matrix_clause (intrp), modified_head (mdf), nominal_modifier (nmn), parallel_syntactic_construction (prl), past_participial_clause (pst), present_participial_clause (pres), relative_clause (relcl), reported_speech (rpr)
With the publication of the GUM Discourse Treebank (GDTB), a shallow version of the discourse relation annotations is now included in the PDTB key in the MISC field, which provides information for all Explicit, Implicit, AltLex, AltLexC, EntRel, Hypophora and NoRel annotations following the Penn Discourse Treebank (PDTB) v3 guidelines. Annotations are placed on the first token of the connective or alternative lexicalization marking the relation for explicit/altlex relations, or on the first token of the second argument span (arg2) for other cases. Token ranges for each argument span, the connective and the relation label are provided as well. For example, the line in the excerpt above:
21 to to PART TO _ 22 mark 22:mark Discourse=purpose-goal:105->104:0:syn-inf-963|PDTB=Implicit:Contingency.Purpose.Arg2-as-goal:in order:_:943-962:963-981
indicates an Implicit relation with the label Contingency.Purpose.Arg2-as-goal and an implicit connective "in order". Because the connective is implicit, it has no token indices (_), but arg1 spans tokens 943-962 of the document (ignoring decimal ellipsis tokens), and arg2 spans tokens 963-981. If multiple PDTB relations apply at the same token position, they are separated by a semicolon.
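Based on the field order visible in this example (relation type, sense, connective string, connective token range, arg1 range, arg2 range), PDTB values can be split as follows; this is a minimal sketch, not official GDTB tooling:

PDTB_FIELDS = ["type", "sense", "connective", "conn_toks", "arg1_toks", "arg2_toks"]

def parse_pdtb(value):
    """Split a PDTB MISC value into colon-separated fields, one dict per ;-separated relation."""
    return [dict(zip(PDTB_FIELDS, rel.split(":", 5))) for rel in value.split(";")]

print(parse_pdtb("Implicit:Contingency.Purpose.Arg2-as-goal:in order:_:943-962:963-981"))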
Markup from the original XML annotations using TEI tags is available in the XML MISC annotation, which indicates which XML tags, if any, were opened or closed before or after the current token, and in what order. In tokens 7-9 in the example above, the XML annotations indicate that the words "Always Keep Fighting" were originally italicized using the tag pair <hi rend="italic">...</hi>, which opens at token 7 and closes after token 9. To avoid confusion with the = sign in MISC annotations, XML = signs are escaped and represented as :::.
7 Always Always ADV NNP Number=Sing 8 advmod 8:advmod XML=<hi rend:::"italic">
8 Keep Keep PROPN NNP Number=Sing 10 compound 10:compound _
9 Fighting Fighting PROPN NNP Number=Sing 8 xcomp 8:xcomp XML=</hi>
XML block tags spanning whole sentences (i.e. not beginning or ending mid-sentence), such as paragraphs (<p>) or headings (<head>), are instead represented using the # newpar_block comment under the standard UD # newpar comment, which may feature nested tags, for example:
# newpar
# newpar_block = list type:::"unordered" (10 s) | item (4 s)
This comment indicates the opening of a <list type="unordered"> block element, which spans 10 sentences ((10 s)). However, the list begins with a nested block, a list item (i.e. a bullet point), which spans 4 sentences, as indicated after the pipe separator. For documentation of XML elements in GUM, please see the GUM wiki.
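Since the only transformation applied to the embedded markup is the escaping of = as :::, the original strings can be restored by reversing that substitution. The helpers below are a sketch, not part of the corpus tooling; the second one also splits a newpar_block comment into its nested blocks and their sentence counts:

def unescape_xml(value):
    """Restore '=' signs escaped as ':::' in XML MISC values and newpar_block comments."""
    return value.replace(":::", "=")

def parse_newpar_block(comment):
    """Split a '# newpar_block = ...' comment into (tag string, sentence count) pairs."""
    value = unescape_xml(comment.split("=", 1)[1].strip())
    blocks = []
    for block in value.split("|"):            # nested blocks are |-separated
        tag, _, span = block.strip().rpartition(" (")
        blocks.append((tag, int(span.rstrip(" s)"))))
    return blocks

print(unescape_xml('<hi rend:::"italic">'))
print(parse_newpar_block('# newpar_block = list type:::"unordered" (10 s) | item (4 s)'))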
More information and additional annotation layers can also be found in the GUM source repo.
Document metadata is given at the beginning of each new document in key-value pair comments beginning with the prefix meta::, as in:
# newdoc id = GUM_bio_padalecki
# global.Entity = GRP-etype-infstat-centering-minspan-link-identity
# meta::author = Wikipedia, The Free Encyclopedia
# meta::dateCollected = 2019-09-10
# meta::dateCreated = 2004-08-14
# meta::dateModified = 2019-09-11
# meta::genre = bio
# meta::salientEntities = 1, 5, 6, 7, 8, 12, 98, 173, 180, 181, 182, 183, 184
# meta::sourceURL = https://en.wikipedia.org/wiki/Jared_Padalecki
# meta::speakerCount = 0
# meta::summary = Jared Padalecki is an award winning American actor who gained prominence in the series Gilmore Girls, best known for playing the role of Sam Winchester in the TV series Supernatural, and for his active role in campaigns to support people struggling with depression, addiction, suicide and self-harm.
# meta::title = Jared Padalecki
Document summaries are included in the metadata summary annotation and follow strict guidelines described here. For the test set, a second human-written summary is available, called summary2.
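These comments are easy to collect programmatically. The following minimal sketch (illustrative only; the file path is just an example) gathers the meta:: values for each document in a CoNLL-U file:

import io
from collections import defaultdict

def read_metadata(conllu_path):
    """Collect meta:: key-value comments per document, keyed by the newdoc id."""
    docs, current = defaultdict(dict), None
    with io.open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("# newdoc id = "):
                current = line.split("= ", 1)[1]
            elif line.startswith("# meta::") and current is not None:
                key, _, value = line[len("# meta::"):].partition(" = ")
                docs[current][key] = value
    return docs

meta = read_metadata("en_gum-ud-dev.conllu")  # example path
# e.g. meta["GUM_bio_padalecki"]["summary"] or meta["GUM_bio_padalecki"]["salientEntities"]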
Additionally, sentences carry some sentence-level annotations in CoNLL-U comments, such as sentence types in s_type (declarative, imperative, wh-question, fragment, etc.), as well as sentence transition types based on Centering Theory and sentence prominence levels based on graph proximity to the discourse parse root. For example, the following fragment sentence (frag) establishes a new backward-looking Center (establishment) and is a level-2 sentence (s_prominence = 2), i.e. its discourse nesting level is one deeper than that of a sentence containing the level-1 Central Discourse Unit of the entire text.
# s_prominence = 2
# s_type = frag
# transition = establishment
# text = Jared Padalecki
1 Jared Jared PROPN NNP Number=Sing 0 root 0:root MSeg=Jared
2 Padalecki Padalecki PROPN NNP Number=Sing 1 flat 1:flat _
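A similar approach works for the sentence-level comments. This short sketch (again illustrative only, with an example file path) tallies the s_type and transition values seen in a file:

import io
from collections import Counter

def count_sentence_annotations(conllu_path):
    """Tally sentence-level s_type and transition comment values in a CoNLL-U file."""
    s_types, transitions = Counter(), Counter()
    with io.open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("# s_type = "):
                s_types[line.split("= ", 1)[1]] += 1
            elif line.startswith("# transition = "):
                transitions[line.split("= ", 1)[1]] += 1
    return s_types, transitions

print(count_sentence_annotations("en_gum-ud-dev.conllu"))  # example path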
The training, development and test sets contain complete, contiguous documents, balanced for genre. The test and dev sets contain similar amounts of data, usually around 1,800 tokens per genre in each, and the rest is assigned to training. For the exact file lists of each split see:
https://github.com/UniversalDependencies/UD_English-GUM/tree/master/not-to-release/file-lists
GUM annotation team (so far - thanks for participating!)
Adrienne Isaac, Akitaka Yamada, Alex Giorgioni, Alexandra Berends, Alexandra Slome, Amani Aloufi, Amber Hall, Amelia Becker, Andrea Price, Andrew O'Brien, Ángeles Ortega Luque, Aniya Harris, Anna Prince, Anna Runova, Anne Butler, Arianna Janoff, Aryaman Arora, Ayan Mandal, Aysenur Sagdic, Bertille Baron, Bradford Salen, Brandon Tullock, Brent Laing, Caitlyn Pineault, Calvin Engstrom, Candice Penelton, Carlotta Hübener, Caroline Gish, Charlie Dees, Chenyue Guo, Chloe Evered, Cindy Luo, Colleen Diamond, Connor O'Dwyer, Cristina Lopez, Cynthia Li, Dan DeGenaro, Dan Simonson, Derek Reagan, Devika Tiwari, Didem Ikizoglu, Edwin Ko, Eliza Rice, Emile Zahr, Emily Pace, Emma Manning, Emma Rafkin, Ethan Beaman, Felipe De Jesus, Han Bu, Hana Altalhi, Hang Jiang, Hannah Wingett, Hanwool Choe, Hassan Munshi, Helen Dominic, Ho Fai Cheng, Hortensia Gutierrez, Jakob Prange, James Maguire, Janine Karo, Jehan al-Mahmoud, Jemm Excelle Dela Cruz, Jess Godes, Jessica Cusi, Jessica Kotfila, Jingni Wu, Joaquin Gris Roca, John Chi, Jongbong Lee, Juliet May, Jungyoon Koh, Katarina Starcevic, Katelyn Carroll, Katelyn MacDougald, Katherine Vadella, Khalid Alharbi, Kristen Cook, Lara Bryfonski, Lauren Levine, Leah Northington, Lindley Winchester, Linxi Zhang, Lucia Donatelli, Luke Gessler, Mackenzie Gong, Margaret Anne Rowe, Margaret Borowczyk, Maria Laura Zalazar, Maria Stoianova, Mariko Uno, Mary Henderson, Maya Barzilai, Md. Jahurul Islam, Michael Kranzlein, Michaela Harrington, Mingyeong Choi, Minnie Annan, Mitchell Abrams, Mohammad Ali Yektaie, Naomee-Minh Nguyen, Negar Siyari, Nicholas Mararac, Nicholas Workman, Nicole Steinberg, Nitin Venkateswaran, Parker DiPaolo, Phoebe Fisher, Rachel Kerr, Rachel Thorson, Rebecca Childress, Rebecca Farkas, Riley Breslin Amalfitano, Rima Elabdali, Robert Maloney, Ruizhong Li, Ryan Mannion, Ryan Murphy, Sakol Suethanapornkul, Sarah Bellavance, Sarah Carlson, Sasha Slone, Saurav Goswami, Sean Macavaney, Sean Simpson, Seyma Toker, Shane Quinn, Shannon Mooney, Shelby Lake, Shira Wein, Sichang Tu, Siddharth Singh, Siona Ely, Siyao Peng, Siyu Liang, Stephanie Kramer, Sylvia Sierra, Talal Alharbi, Tatsuya Aoyama, Tess Feyen, Timothy Ingrassia, Trevor Adriaanse, Ulie Xu, Wai Ching Leung, Wenxi Yang, Wesley Scivetti, Xiaopei Wu, Xiulin Yang, Yang Liu, Yi-Ju Lin, Yifu Mu, Yilun Zhu, Yingzhu Chen, Yiran Xu, Young-A Son, Yu-Tzu Chang, Yuhang Hu, Yunjung Ku, Yushi Zhao, Zhijie Song, Zhuosi Luo, Zhuxin Wang, Amir Zeldes
... and other annotators who wish to remain anonymous!
The best paper to cite depends on the data you are using. To cite the corpus in general, please refer to the following article (but note that the corpus has changed and grown a lot in the time since); otherwise see different citations for specific aspects below:
Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612.
@Article{Zeldes2017,
author = {Amir Zeldes},
title = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
journal = {Language Resources and Evaluation},
year = {2017},
volume = {51},
number = {3},
pages = {581--612},
doi = {http://dx.doi.org/10.1007/s10579-016-9343-x}
}
If you are using the Reddit subset of GUM in particular, please use this citation instead:
- Behzad, Shabnam and Zeldes, Amir (2020) "A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging". In: Proceedings of the 12th Web as Corpus Workshop (WAC-XII).
@InProceedings{BehzadZeldes2020,
author = {Shabnam Behzad and Amir Zeldes},
title = {A Cross-Genre Ensemble Approach to Robust {R}eddit Part of Speech Tagging},
booktitle = {Proceedings of the 12th Web as Corpus Workshop (WAC-XII)},
pages = {50--56},
year = {2020},
}
For papers focusing on the discourse relations, discourse markers or other discourse signal annotations, please cite the eRST paper:
@misc{ZeldesEtAl2024,
title={{eRST}: A Signaled Graph Theory of Discourse Relations and Organization},
author={Amir Zeldes and Tatsuya Aoyama and Yang Janet Liu and Siyao Peng and Debopam Das and Luke Gessler},
year={2024},
eprint={2403.13560},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2403.13560}
}
For papers using GDTB/PDTB style shallow discourse relations, please cite:
- Yang Janet Liu, Tatsuya Aoyama, Wesley Scivetti, Yilun Zhu, Shabnam Behzad, Lauren Elizabeth Levine, Jessica Lin, Devika Tiwari, and Amir Zeldes (2024), "GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains". In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics: Miami, USA.
@inproceedings{liu-etal-2024-GDTB,
title = "GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains",
author = "Yang Janet Liu and Tatsuya Aoyama and Wesley Scivetti and Yilun Zhu and Shabnam Behzad and Lauren Elizabeth Levine and Jessica Lin and Devika Tiwari and Amir Zeldes",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, USA",
publisher = "Association for Computational Linguistics",
abstract = "Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper, we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which can be alleviated by joint training on both datasets.",
}
If you are using the OntoNotes schema version of the coreference annotations (a.k.a. OntoGUM data in coref/ontogum/
), please cite this paper instead:
@InProceedings{ZhuEtAl2021,
author = {Yilun Zhu and Sameer Pradhan and Amir Zeldes},
booktitle = {Proceedings of ACL-IJCNLP 2021},
title = {{OntoGUM}: Evaluating Contextualized {SOTA} Coreference Resolution on 12 More Genres},
year = {2021},
pages = {461--467},
address = {Bangkok, Thailand}
}
For papers focusing on named entities or entity linking (Wikification), please cite this paper instead:
@inproceedings{lin-zeldes-2021-wikigum,
title = {{W}iki{GUM}: Exhaustive Entity Linking for Wikification in 12 Genres},
author = {Jessica Lin and Amir Zeldes},
booktitle = {Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and
3rd Designing Meaning Representations (DMR) Workshop (LAW-DMR 2021)},
year = {2021},
address = {Punta Cana, Dominican Republic},
url = {https://aclanthology.org/2021.law-1.18},
pages = {170--175},
}
2024-10-29
- Added PDTB-style shallow discourse relations from GDTB in MISC
- Added CxnElt in MISC
- Moved Polarity=Neg of negative morphological derivations to MISC Negation=Yes
- Added ExtPos to fixed expressions in FEATS
- Renamed :npmod and :tmod relation subtypes to :unmarked
2024-02-15
- Added GUM V10 documents (four new genres: court, essay, letter and podcast)
2023-10-31
- Added eRST annotations in MISC Discourse, incl. multiple concurrent discourse relations and discourse relation signals
- Added morphological segmentation in MISC MSeg
- Added Construction Grammar annotations in MISC Cxn
- Added second human written document summaries as summary2 in metadata for the test set
2023-02-02
- Added GUM V9 documents (train only)
- Added document summaries to metadata
- Added salient entity metadata
2022-10-21
- Added new subtypes advcl:relcl and nsubj:outer
- Many updates to UPOS and FEATS consistency with EWT
2022-04-29
- Added Centering Theory annotations
2022-01-31
- Revised RST discourse relations (now 32 labels + ROOT)
2022-01-09
- Added GUM V8 documents
2021-12-14
- Corrections incl. bug fix for escaping wikification identifiers containing hyphens
- Added more exhaustive PronType annotations
2021-10-31
- Add annotated newpar comments representing possibly nesting blocks
- Add XML MISC attribute for XML markup in source data which does not correspond to paragraph blocks
- Shorten Entity mention span closers in MISC
- Add information status and coref type annotations to spans incl. discourse deixis, predicatives, singletons etc.
- Add MIN IDs for fuzzy coref matching scores (mostly NP heads, but more for coordinations and proper names)
2021-09-23
- split hyphenated tokens to match EWT tokenization, added HYPH xpos tag
- added tree depth information in discourse dependencies, allowing reconstruction of RST constituents
- added _m suffix to multinuclear discourse dependencies (distinguishes multinuclear and satellite restatements)
2021-05-01
- Added MWTs
- Added metadata
- Comprehensive corrections
2021-03-10
- Added enhanced dependencies
2021-01-20
- Added documents from four new genres: conversation, speeches, textbooks and vlogs
- Added Wikification annotations
- Added bridging and split antecedent anaphora to MISC
- Improved FEATS, now including Abbr and NumForm
- Added sentence addressee annotations
- Rebalanced splits to account for new genres
2020-10-31
- Major improvements to entity and coreference consistency
- Removed 'quantity' entity type
- Added discourse dependency information in MISC column
- Moved Typo annotation from MISC to FEATS
2020-03-06
- Added rest of GUM6 documents
- Added entity and coreference annotations to the MISC column
- Changed prepositional TO xpos to IN
- Cardinal number lemmas are now numbers, not @card@
- Identified more cases of orphan
- Numerous corrections
2019-10-31 v2.5
- Added three new documents from GUM6 preview
- Introduced use of the list relation according to the guidelines
- Overhaul of flat relations for non-personal names
- Numerous sporadic error corrections, and systematic overhaul of some lemmas throughout
2019-03-21
- Added new documents from GUM version 5
- Numerous error corrections, now conforming to UD2.4 validation
2018-11-08 v2.3
- Added 'multiple' s_type annotation value (formerly subsumed in 'other')
- Numerous error corrections
2018-07-01 v2.2
- First official release
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.2
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: academic blog email fiction government legal news nonfiction social spoken web wiki
Lemmas: manual native
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: manual native
Contributors: Peng, Siyao; Zeldes, Amir
Contributing: elsewhere
Contact: amir.zeldes@georgetown.edu
===============================================================================