Explanations and visualisations
- Crash Course Linguistics #2, #3, #4
- Universal Dependencies, CoNLL-U format
- Jurafsky-Martin 2.4
Because language is compositional, text parsing is performed at several levels.
Here we decide what the units of processing are. In the CoNLL-like formats, tokenisation is deciding what goes in each row. Traditionally, each word is considered to be one token. But what is a word? What about punctuation?
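As a minimal illustration (my own sketch, not from the course materials), a regex-based tokeniser that splits off punctuation, yielding one token per CoNLL-U row:

```python
import re

def tokenise(text):
    """Split a sentence into tokens, treating punctuation marks as separate tokens.

    A deliberately naive sketch: real tokenisers must also handle clitics,
    abbreviations, multiword tokens, and scripts without spaces (e.g. Chinese).
    """
    return re.findall(r"\w+|[^\w\s]", text)

# One token per row, as in the CoNLL-U format (only ID and FORM columns shown).
for i, tok in enumerate(tokenise("Don't panic, reader!"), start=1):
    print(i, tok, sep="\t")
```

Note how even this tiny example forces a decision: the apostrophe in "Don't" becomes its own token, which may or may not be what a downstream parser expects.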
Mapping different word forms onto a single canonical form (the lemma), e.g. journaux -> journal. This can be very difficult for some languages due to:
- non-concatenative morphology
- no clear boundary between derivation and inflection
- no clear word boundaries (e.g. Chinese)
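A hedged sketch of the idea as a lookup table (real lemmatisers use morphological analysers or learned models, since no full-form lexicon covers everything; the table entries here are illustrative):

```python
# Toy full-form lexicon mapping word forms to lemmas; coverage is the whole
# problem in practice, which is why rule-based or statistical analysers exist.
LEMMA_TABLE = {
    "journaux": "journal",   # French plural -> singular
    "mice": "mouse",
    "went": "go",
}

def lemmatise(token):
    """Map a word form to its canonical form (lemma); fall back to the form itself."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

print(lemmatise("journaux"))  # journal
```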
Morphology
Derivation
Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we also need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number, MASCULINE gender
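In the CoNLL-U format these features appear in the FEATS column as pipe-separated pairs, e.g. `Case=Acc|Gender=Masc|Number=Sing`. A small helper for reading that representation (my own sketch, not part of any library):

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string into a dict of morphosyntactic features.

    In CoNLL-U, a bare underscore means the token has no features.
    """
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# e.g. a Latin noun in the accusative, masculine, singular:
msd = parse_feats("Case=Acc|Gender=Masc|Number=Sing")
print(msd["Case"])  # Acc
```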
How tokens combine into phrases and sentences:
- No labels
- Constituent analysis
- Dependency analysis
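A dependency analysis can be stored as one head index per token (the HEAD column in CoNLL-U, with 0 marking the root). A minimal sketch of that encoding, using an invented example sentence:

```python
# Each token: (id, form, head, deprel); head 0 = root, as in CoNLL-U.
sentence = [
    (1, "She",        2, "nsubj"),
    (2, "reads",      0, "root"),
    (3, "newspapers", 2, "obj"),
]

def children(tokens, head_id):
    """Return the forms of all tokens whose head is head_id."""
    return [form for tid, form, head, rel in tokens if head == head_id]

print(children(sentence, 2))  # dependents of "reads"
```

The flat head-index encoding is what makes dependency treebanks easy to store as tab-separated rows, in contrast to the bracketed trees of constituent analysis.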
Examples in the HuggingFace tutorial:
- sentiment analysis: given a short text, is it positive or negative?
- named entity recognition: given a token, is it an ordinary word or does it refer to a specific real-world entity?
- question answering: given a question and a text snippet, which segment of the text answers the question?
- mask filling: given a sentence with empty slots, which tokens best fit the empty slots?
- translation
- summarisation
- text generation
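To show the input/output shape of the first task without downloading a model, here is a toy lexicon-based sentiment classifier (the HuggingFace tutorial itself uses pretrained pipelines; the word lists below are my own illustrative choices):

```python
# Toy lexicon-based sentiment classifier, only to illustrate the task's
# interface: short text in, "positive"/"negative" label out.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}

def sentiment(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    # Neutral texts default to "positive" here; a real model outputs a score.
    return "positive" if score >= 0 else "negative"

print(sentiment("I love this great movie"))  # positive
```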
Famous (old) NLU benchmarks and data sets: