Explanations and visualisations
- Crash Course Linguistics #2, #3, #4
- Universal Dependencies, CoNLL-U format
- Jurafsky-Martin 2.4
Because language is compositional, text parsing is performed at several levels.
Here we decide what the units of processing are. In the CoNLL-like formats, tokenisation is deciding what goes in each row. Traditionally, each word is considered to be one token. But what is a word? What about punctuation?
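As a minimal illustration (my own sketch, not from the course materials), a regex-based tokeniser that splits off punctuation, yielding one token per CoNLL-U row:

```python
import re

def tokenise(text):
    """Split a sentence into tokens, treating punctuation marks as separate tokens.

    A deliberately naive sketch: real tokenisers must also handle clitics,
    abbreviations, multiword tokens, and scripts without spaces (e.g. Chinese).
    """
    return re.findall(r"\w+|[^\w\s]", text)

# One token per row, as in the CoNLL-U format (only ID and FORM columns shown).
for i, tok in enumerate(tokenise("Don't panic, reader!"), start=1):
    print(i, tok, sep="\t")
```

Note how even this tiny example forces a decision: the apostrophe in "Don't" becomes its own token, which may or may not be what a downstream parser expects.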
Mapping different word forms onto a single canonical form (the lemma), e.g. journaux -> journal. This can be very difficult for some languages due to:
- non-concatenative morphology
- no clear boundary between derivation and inflection
- no clear word boundaries (e.g. Chinese)
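A hedged sketch of the idea as a lookup table (real lemmatisers use morphological analysers or learned models, since no full-form lexicon covers everything; the table entries here are illustrative):

```python
# Toy full-form lexicon mapping word forms to lemmas; coverage is the whole
# problem in practice, which is why rule-based or statistical analysers exist.
LEMMA_TABLE = {
    "journaux": "journal",   # French plural -> singular
    "mice": "mouse",
    "went": "go",
}

def lemmatise(token):
    """Map a word form to its canonical form (lemma); fall back to the form itself."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

print(lemmatise("journaux"))  # journal
```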
Morphology
Derivation
Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we also need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number, MASCULINE gender
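In the CoNLL-U format these features appear in the FEATS column as pipe-separated pairs, e.g. `Case=Acc|Gender=Masc|Number=Sing`. A small helper for reading that representation (my own sketch, not part of any library):

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string into a dict of morphosyntactic features.

    In CoNLL-U, a bare underscore means the token has no features.
    """
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# e.g. a Latin noun in the accusative, masculine, singular:
msd = parse_feats("Case=Acc|Gender=Masc|Number=Sing")
print(msd["Case"])  # Acc
```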
How tokens combine into phrases and sentences:
- No labels
- Constituent analysis
- Dependency analysis
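A dependency analysis can be stored as one head index per token (the HEAD column in CoNLL-U, with 0 marking the root). A minimal sketch of that encoding, using an invented example sentence:

```python
# Each token: (id, form, head, deprel); head 0 = root, as in CoNLL-U.
sentence = [
    (1, "She",        2, "nsubj"),
    (2, "reads",      0, "root"),
    (3, "newspapers", 2, "obj"),
]

def children(tokens, head_id):
    """Return the forms of all tokens whose head is head_id."""
    return [form for tid, form, head, rel in tokens if head == head_id]

print(children(sentence, 2))  # dependents of "reads"
```

The flat head-index encoding is what makes dependency treebanks easy to store as tab-separated rows, in contrast to the bracketed trees of constituent analysis.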
Examples in the HuggingFace tutorial:
- sentiment analysis: given a short text, is it positive or negative?
- named entity recognition: given a token, is it an ordinary word or does it refer to a specific real-world entity?
- question answering: given a question and a text snippet, which segment of the text answers the question?
- mask filling: given a sentence with empty slots, which tokens best fit the empty slots?
- translation
- summarisation
- text generation
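To show the input/output shape of the first task without downloading a model, here is a toy lexicon-based sentiment classifier (the HuggingFace tutorial itself uses pretrained pipelines; the word lists below are my own illustrative choices):

```python
# Toy lexicon-based sentiment classifier, only to illustrate the task's
# interface: short text in, "positive"/"negative" label out.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}

def sentiment(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    # Neutral texts default to "positive" here; a real model outputs a score.
    return "positive" if score >= 0 else "negative"

print(sentiment("I love this great movie"))  # positive
```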
Famous (old) NLU benchmarks and data sets: