Sean A Fulop edited this page Nov 27, 2016 · 4 revisions

The Penn Treebank is probably well known to you if you are reading this. To summarize briefly: it is a large corpus of sentences annotated with complete syntactic structures and parts of speech, using a rich set of part-of-speech tags and n-ary branching tree structures. It is also proprietary, so you will only be able to use our code if you already have access to the Penn Treebank.

The PennTreebank-Transcoder code is still in final development, but it is basically working as planned. Some more tweaking is needed to parse all the Treebank sentences without errors (this is why it is on GitHub, so tweak away). The purpose of the code is to transcode the annotated Penn Treebank sentences into pairs of structures. On one side, we output a "lambda term": a term of the lambda calculus that represents the rough semantic structure of the original sentence, in particular its function-argument structure. (Strictly speaking, we extend the lambda calculus vocabulary of terms, so some of our output terms are no longer legal lambda terms, but they serve a purpose to be described later.) On the other side, we output a bare structure: a bracketed tree that has only binary or unary branching, and which corresponds to the lambda term by a generalized Curry-Howard morphism.

Strictly speaking, lambda terms only allow binary branching, although we have extended the notation to include unary branching. The first obstacle with the Penn Treebank is that it allows n-ary branching. Lambda terms also require us to know which subconstituent is the head (functor) that applies to an argument, and this is not immediately clear from the Penn Treebank. These problems with the Treebank were already solved by Jason Eisner, who graciously allowed us to link to his code and incorporate it into this project.

The first step of parsing the Penn Treebank is to run the Bash script 'preprocess.sh' as described in the README. Let us now detail the steps therein. For more comprehensive descriptions, check the comments in Jason Eisner's actual script code.

fixsay.pl - tries to fix sentence trees with "so-and-so says" or "says so-and-so" in the middle of the sentence, which the Treebank treats as parentheticals.

markargs.pl - using the rules from Collins ["Three generative, lexicalized models for statistical parsing," (1997)], marks all constituents that are arguments of verbs, prepositions, or complementizers.

canonicalize.pl - strips suffixes from nonterminal node labels and equalizes a number of null lexeme labels.

articulate.pl - fleshes out some structure that the Treebank leaves flat, and fixes some annotator errors.

discardconj.pl - discards all sentence trees containing conjunctions; this is required because simple lambda terms like the ones we use cannot represent conjunctions.

discardbugs.pl - discards sentences that appear to contain annotation errors.

headify.pl - marks the head subconstituent of each constituent when it can be guessed.

headall.pl - ensures all constituents have a head marked.

binarize.pl - munges the trees so that no node has more than two daughters.

That concludes the preprocessing script currently in use. All Perl code is by Jason Eisner of Johns Hopkins University.
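As an illustration of the final step above, here is a minimal Python sketch of head-driven binarization. This is our own simplification, not the algorithm in Eisner's binarize.pl; the tuple tree encoding and the `binarize` helper are hypothetical.

```python
# A simplified head-driven binarization -- NOT Eisner's actual
# binarize.pl algorithm.  Trees are tuples (label, left, right),
# leaves are strings, and the head daughter is given by index.

def binarize(label, children, head_index):
    """Fold the non-head daughters onto the head so that every
    resulting node has at most two daughters."""
    node = children[head_index]
    # attach right siblings first, nearest the head innermost
    for sib in children[head_index + 1:]:
        node = (label, node, sib)
    # then attach left siblings, again nearest the head innermost
    for sib in reversed(children[:head_index]):
        node = (label, sib, node)
    return node

# An n-ary VP "gave Mary a book" with head "gave" (index 0):
print(binarize("VP", ["gave", "Mary", ("NP", "a", "book")], 0))
# -> ('VP', ('VP', 'gave', 'Mary'), ('NP', 'a', 'book'))
```

Every node in the result has at most two daughters, which is what the later lambda-term construction requires.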

After the Treebank input has been through preprocessing, it is ready to be fed to the code we actually wrote. This code comprises two components: a lexer/tokenizer written in Flex, and a parser written in Bison/C.

plexer.l

This lexer tries to find the nonterminal node labels and part-of-speech tags in the Treebank, as well as the clause/sentence labels. It also finds the terminal labels, which are the words, and punctuation that can be discarded. We need help tweaking this lexer; it is not yet 100% correct.
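To make the lexer's job concrete, here is a rough Python analogue of what such a tokenizer does. The regular expressions and the `tokenize` helper are our own illustration, not the actual Flex rules in plexer.l.

```python
import re

# A rough Python analogue of the lexer's job -- our own illustration,
# not the actual Flex rules.  Split a bracketed Treebank line into
# parentheses, node labels, and words, discarding punctuation tokens.
TOKEN = re.compile(r"\(|\)|[^\s()]+")
PUNCT = re.compile(r"^[.,;:!?`'\"]+$")

def tokenize(treebank_line):
    return [tok for tok in TOKEN.findall(treebank_line)
            if not PUNCT.match(tok)]

print(tokenize("(S (NP John) (VP loves (NP Mary)) .)"))
# -> ['(', 'S', '(', 'NP', 'John', ')', '(', 'VP', 'loves',
#     '(', 'NP', 'Mary', ')', ')', ')']
```

Note how the sentence-final "." token is dropped; the real lexer likewise discards punctuation before the parser sees the token stream.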

pgrammar.y

This parser uses Bison augmented with plenty of C code. It has (we hope) enough parsing rules to cover all the kinds of structures remaining in the preprocessed Treebank data. The objective is to convert each Treebank structure into a two-sided output that comprises a bracketed tree over the words together with a simply typed lambda term showing the semantic structure, including the head-argument structure.

So a typical Treebank tree for "John loves Mary" would yield output like the following:

[John, [loves, Mary]]:type(type(type(loves, _)@type(Mary, _), _)@type(John, _), s)

The bare syntactic tree appears on the left of the colon, the unsubtyped lambda term appears on the right of the colon. A lambda term is "unsubtyped" when only the sentence types are known; the internal subterm types are left as underscores rather than explicitly labeled.

The lambda terms here are always application terms: the functor type(head, _) applies via the '@' operator to the argument type(arg, _). The whole term then has the same data structure:

type(type(head, _)@type(arg, _), _)

The functors and arguments in these terms are just the ones identified in the preprocessing stage by Eisner's scripts.
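The assembly of the two-sided output can be sketched in Python as follows. The `word` and `app` helpers are our own illustration, not the project's Bison/C code; the result reproduces the "John loves Mary" example above (up to whitespace).

```python
# Our own illustration (not the project's Bison/C code) of how the
# two-sided output is assembled.  Each pair is (bare tree, term);
# 'app' applies a functor to an argument, keeping surface word
# order in the bare tree but functor-first order in the term.

def word(w):
    return (w, f"type({w}, _)")

def app(functor, argument, functor_first=True, result_type="_"):
    ftree, fterm = functor
    atree, aterm = argument
    tree = f"[{ftree}, {atree}]" if functor_first else f"[{atree}, {ftree}]"
    term = f"type({fterm}@{aterm}, {result_type})"
    return (tree, term)

vp = app(word("loves"), word("Mary"))  # "loves" applies to "Mary"
tree, term = app(vp, word("John"), functor_first=False, result_type="s")
print(f"{tree}:{term}")
# -> [John, [loves, Mary]]:type(type(type(loves, _)@type(Mary, _), _)@type(John, _), s)
```

The root application is the only place a concrete type (s) is known, which is why the term is "unsubtyped": every internal type slot stays an underscore.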

Well, that's all for now. This code does not work perfectly at the moment, although it does succeed on a great many Treebank sentences. The problems lie with the lexer and perhaps a couple of parsing rules in the grammar. We invite anyone interested in obtaining a lambda-term Treebank to help fix the code and perfect its coverage of the Penn Treebank.
