PredPatt employs deterministic, unlexicalized rules over UD parses. A detailed description of PredPatt’s rules is available here. For a collection of linguistically motivated examples, see our documentation tests.
Here we provide a high-level overview of the process and examples.
- Predicate and argument root identification
- Argument resolution
- Predicate and argument phrase extraction
- Optional post-processing
UD Parse A universal dependency (UD) parse is a set of labeled pairs of the
form relation(dependent, governor). The UD parse also includes a sequence of
Universal POS tags. An example of a UD parse is given in Figure 1.
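As a minimal sketch of this data structure, the snippet below stores a small hand-written parse of Chris loves cake as a set of labeled pairs plus a POS tag sequence. The `Edge` type, the `show` helper, and the use of `-1` for the artificial root are illustrative assumptions, not PredPatt's own representation.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One labeled pair, relation(dependent, governor). Hypothetical type."""
    relation: str
    dependent: int  # token index of the dependent
    governor: int   # token index of the governor; -1 marks the artificial root


# Hand-written parse of a three-word sentence (assumed, not parser output).
tokens = ["Chris", "loves", "cake"]
upos = ["PROPN", "VERB", "NOUN"]  # one Universal POS tag per token
edges = {
    Edge("nsubj", 0, 1),
    Edge("dobj", 2, 1),
    Edge("root", 1, -1),
}


def show(edge):
    """Render an edge in the relation(dependent, governor) notation."""
    governor = "ROOT" if edge.governor < 0 else tokens[edge.governor]
    return f"{edge.relation}({tokens[edge.dependent]}, {governor})"


rendered = sorted(show(e) for e in edges)
print(rendered)
```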
Predicate and argument root identification Predicate and argument
roots (i.e., dependency tree nodes) are identified by local
configurations, specifically edges in the UD parse. The simplest example
is nsubj(s, v) and dobj(o, v), which indicate that v is a predicate root,
and that s and o are argument roots. Similarly, roots of clausal subjects
and clausal complements are also predicate roots. Nominal modifiers inside
adverbial modifiers are arguments to the verb being modified, e.g.,
Investors turned away from [the stock market]. PredPatt also extracts
relations from appositives, possessives, copula, and adverbial modifiers.
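The edge-based configurations described above can be sketched in a few lines. This is a simplified illustration covering only nsubj, dobj, csubj, and ccomp; the function name `identify_roots` is hypothetical, and words stand in for tree nodes (a real parse would use token positions). It is not PredPatt's full rule set.

```python
def identify_roots(edges):
    """Return (predicate_roots, argument_roots) for a list of
    (relation, dependent, governor) triples.

    Simplified assumption: only four relations are inspected.
    """
    predicates, arguments = set(), set()
    for relation, dependent, governor in edges:
        if relation in ("nsubj", "dobj"):
            # nsubj(s, v) / dobj(o, v): v is a predicate root, s / o an argument root.
            predicates.add(governor)
            arguments.add(dependent)
        elif relation in ("csubj", "ccomp"):
            # The root of a clausal subject or complement is itself a predicate root.
            predicates.add(dependent)
    return predicates, arguments


# Hand-written edges for "Chris said Pat left" (assumed, not parser output).
edges = [
    ("nsubj", "Chris", "said"),
    ("ccomp", "left", "said"),
    ("nsubj", "Pat", "left"),
]
predicates, arguments = identify_roots(edges)
print(sorted(predicates), sorted(arguments))
```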
Argument resolution Some arguments of a predicate are not linked to it by an
explicit arc in the UD parse. For example, the sentence Chris expects to visit
Pat is missing the explicit arc nsubj(Chris, visit) because UD analysis does
not include trace nodes. PredPatt includes argument resolution rules to handle
missing arguments of many syntactic constructions, including predicate
coordination, relative clauses, and embedded clauses. Argument resolution is
crucial in languages that mark arguments using morphology, such as Spanish and
Portuguese, because covert subjects are more common there. Another common case
for argument resolution is predicates that appear in a conjunction, e.g., in
Chris likes to sing and dance, there is no arc from dance to its subject
Chris. In relative clauses, the arguments of an embedded clause appear outside
the subtree, e.g., in The books John borrowed from the library are overdue,
borrowed has books as an argument, and so does are-overdue.
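To make the idea concrete, here is a minimal sketch of subject resolution for two of the constructions mentioned above. It assumes that a subjectless predicate attached by xcomp or conj borrows the subject of its governor; the function name `resolve_subjects` and this single borrowing rule are illustrative assumptions and cover far less than PredPatt's actual rules (object control and relative clauses, for instance, are not handled).

```python
def resolve_subjects(edges):
    """Map each predicate to a subject, borrowing from the governor when
    no explicit nsubj arc exists. Edges are (relation, dependent, governor).
    """
    # Start from the explicit nsubj arcs.
    subject = {gov: dep for rel, dep, gov in edges if rel == "nsubj"}
    changed = True
    while changed:  # iterate to a fixpoint so chains (xcomp then conj) resolve
        changed = False
        for rel, dep, gov in edges:
            if rel in ("xcomp", "conj") and dep not in subject and gov in subject:
                subject[dep] = subject[gov]
                changed = True
    return subject


# Hand-written edges (assumed, not parser output).
visit = resolve_subjects([
    ("nsubj", "Chris", "expects"),
    ("xcomp", "visit", "expects"),
    ("mark", "to", "visit"),
    ("dobj", "Pat", "visit"),
])
sing_dance = resolve_subjects([
    ("conj", "dance", "sing"),
    ("cc", "and", "sing"),
    ("xcomp", "sing", "likes"),
    ("mark", "to", "sing"),
    ("nsubj", "Chris", "likes"),
])
print(visit["visit"], sing_dance["dance"])
```

The fixpoint loop matters for the second sentence: dance can only receive Chris after sing has received it from likes.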
Predicate phrase extraction PredPatt extracts a descriptive name for complex
predicates. For example, "[PredPatt] finds [structure] in [text]" has a
3-place predicate named (?a finds ?b in ?c). The primary logic here is (a) to
lift mark and case tokens (e.g., in) out of the argument subtree, and (b) to
add adverbial modifiers, auxiliaries, and negation (e.g., "[Chris] did not
sleep quietly"). PredPatt uses the text order of tokens and arguments to
derive a name for the predicate; no effort is made to further canonicalize
this name or to align it to a verb ontology.
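The text-order naming step can be sketched as follows. The sketch assumes the predicate tokens and the starting position of each argument have already been determined; `predicate_name` and its parameters are hypothetical names for illustration only.

```python
import string


def predicate_name(tokens, predicate_positions, argument_starts):
    """Interleave predicate tokens and argument placeholders in text order.

    predicate_positions: indices of tokens belonging to the predicate
    argument_starts: index of the first token of each argument
    """
    items = [(pos, tokens[pos]) for pos in predicate_positions]
    # Placeholders ?a, ?b, ... are assigned in left-to-right order.
    for i, start in enumerate(sorted(argument_starts)):
        items.append((start, "?" + string.ascii_lowercase[i]))
    return " ".join(text for _, text in sorted(items))


finds = predicate_name("PredPatt finds structure in text".split(), [1, 3], [0, 2, 4])
sleep = predicate_name("Chris did not sleep quietly".split(), [1, 2, 3, 4], [0])
print(finds)
print(sleep)
```

In the first call, the case token in (position 3) is treated as part of the predicate rather than of the argument text, which is what lifting it out of the argument subtree amounts to in this simplified view.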
Argument phrase extraction Argument extraction filters tokens from the dependency subtree below the argument root. These filters primarily simplify the subtree, e.g., removing relative clauses and appositives inside an argument. The default set of filters was chosen to preserve meaning, since it is not generally the case that all modifiers can safely be dropped (more aggressive argument simplification settings are available as options).
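A minimal sketch of such a filter is shown below: it collects the subtree under an argument root while skipping any child attached by a dropped relation. The function name `argument_phrase`, the choice of acl:relcl and appos as the dropped relations, and the hand-written parse are assumptions for illustration, not PredPatt's default filter set.

```python
def argument_phrase(tokens, edges, root, dropped=("acl:relcl", "appos")):
    """Return the argument text under `root`, omitting subtrees attached
    by a relation in `dropped`. Edges are (relation, dependent, governor)
    over token indices.
    """
    children = {}
    for relation, dependent, governor in edges:
        children.setdefault(governor, []).append((relation, dependent))

    kept, stack = [], [root]
    while stack:
        node = stack.pop()
        kept.append(node)
        for relation, dependent in children.get(node, []):
            if relation not in dropped:
                stack.append(dependent)
    # Restore text order before joining.
    return " ".join(tokens[i] for i in sorted(kept))


# Hand-written parse of "The books John borrowed from the library" (assumed).
tokens = ["The", "books", "John", "borrowed", "from", "the", "library"]
edges = [
    ("det", 0, 1),
    ("acl:relcl", 3, 1),
    ("nsubj", 2, 3),
    ("nmod", 6, 3),
    ("case", 4, 6),
    ("det", 5, 6),
]
simplified = argument_phrase(tokens, edges, root=1)
unfiltered = argument_phrase(tokens, edges, root=1, dropped=())
print(simplified)
print(unfiltered)
```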
Post-processing PredPatt implements a number of optional post-processing routines, such as conjunction expansion, argument simplification (which filters out non-core arguments, leaving only subjects and objects), and language-specific hooks.[6] For example, conjunctions appearing inside arguments may be optionally expanded, e.g.:
Chris loves cake and ice cream.
⇒ [Chris] loves [cake]
⇒ [Chris] loves [ice cream]
These optional steps may introduce errors, as in the following case involving distributivity:
Chris and Pat are a team
⇒ * [Chris] are a team
⇒ * [Pat] are a team
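Conjunction expansion can be sketched as a Cartesian product over the conjuncts of each argument. The function `expand_conjunctions` and its dictionary-based input are hypothetical; the sketch reproduces both the useful expansion and the distributivity error shown above, since it has no way of telling the two cases apart.

```python
from itertools import product


def expand_conjunctions(predicate, arguments, conjuncts):
    """Produce one extraction per combination of conjuncts.

    arguments: argument heads in text order
    conjuncts: maps an argument head to the items conjoined with it
    """
    options = [[arg] + conjuncts.get(arg, []) for arg in arguments]
    return [(predicate, combo) for combo in product(*options)]


loves = expand_conjunctions("loves", ["Chris", "cake"], {"cake": ["ice cream"]})
team = expand_conjunctions("are a team", ["Chris"], {"Chris": ["Pat"]})
print(loves)
print(team)  # both results are wrong: the predicate is collective
```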
[6]: UD itself allows for language-specific exceptions to the "universal" standard, and we therefore allow that practice here.