2. NLP tasks, data sets, benchmarks

Explanations and visualisations

Text parsing

Because language is compositional, text parsing is performed at several levels.

Tokenisation

Here we decide what the units of processing are. In CoNLL-like formats, tokenisation amounts to deciding what goes in each row. Traditionally, each word is treated as one token. But what counts as a word, and how should punctuation be handled?
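
To make the question concrete, here is a minimal sketch comparing two tokenisation choices on the same sentence. The function names and the example sentence are illustrative assumptions, not part of any standard tool; real tokenisers make many more decisions.

```python
import re

def whitespace_tokenise(text):
    # Simplest choice: a token is anything between whitespace.
    # Punctuation stays glued to the neighbouring word.
    return text.split()

def word_punct_tokenise(text):
    # Alternative choice: runs of word characters are tokens,
    # and every punctuation mark is a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Don't panic, it's fine."
print(whitespace_tokenise(sentence))
print(word_punct_tokenise(sentence))
```

The first function yields 4 tokens and the second yields 10, so even for a short English sentence the number of rows in a CoNLL-like file depends on the tokenisation decision.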

Lemmatisation

Mapping different word forms to a single canonical form, e.g. French journaux -> journal. It can be very difficult for some languages due to:

  • non-concatenative morphology
  • no clear distinction between derivation and inflection
  • no clear word boundaries (e.g. Chinese)
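
As a rough illustration of why lemmatisation is harder than it looks, the sketch below contrasts a toy lookup table with a naive suffix rule. The table entries and function names are assumptions made up for this example; real lemmatisers rely on much larger lexicons or trained models.

```python
# Toy lexicon mapping inflected forms to lemmas (illustrative only).
LEMMA_TABLE = {
    "journaux": "journal",
    "chevaux": "cheval",
    "tuyaux": "tuyau",
}

def lemmatise(token, table=LEMMA_TABLE):
    # Look the lower-cased form up; fall back to the form itself.
    form = token.lower()
    return table.get(form, form)

def suffix_rule(token):
    # Naive rule: French plural "-aux" -> singular "-al".
    return token[:-3] + "al" if token.endswith("aux") else token

print(lemmatise("Journaux"))    # lexicon lookup
print(suffix_rule("journaux"))  # the rule happens to work here
print(suffix_rule("tuyaux"))    # the rule over-generalises here
```

The suffix rule gets journaux right but turns tuyaux into the non-word "tuyal" (the lemma is tuyau), which is why purely rule-based approaches need exception lists even for concatenative morphology.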

Morphology

Derivation

Part-of-speech (PoS) tagging or morphosyntactic description (MSD)

Classifying tokens into categories, e.g. VERB, NOUN. If a language has rich morphology (like Latin), we need additional features called morphosyntactic descriptions, e.g. a NOUN in the ACCUSATIVE case, SINGULAR number and MASCULINE gender.
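
The sketch below shows one way such a description can be stored and read, using the `Name=Value|Name=Value` feature notation of the CoNLL-U format. The row is a simplified five-column example invented for illustration (full CoNLL-U rows have ten columns), and `parse_feats` is a hypothetical helper.

```python
def parse_feats(feats):
    # "_" marks an empty feature set in CoNLL-U.
    if feats == "_":
        return {}
    # Each feature is a Name=Value pair; pairs are separated by "|".
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Simplified row: index, form, lemma, PoS tag, features.
# Latin "lupum" is the accusative singular of "lupus" (wolf).
row = "1\tlupum\tlupus\tNOUN\tCase=Acc|Gender=Masc|Number=Sing"
idx, form, lemma, upos, feats = row.split("\t")

print(upos, parse_feats(feats))
```

Here the PoS tag alone (NOUN) says nothing about case, number or gender; the feature dictionary carries that extra information.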

Syntactic parsing

How tokens combine into phrases and sentences:

No labels

Constituent analysis

Dependency analysis
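
A dependency analysis can be stored compactly as one head index and one relation label per token. The sketch below is a minimal illustration under that assumption; the sentence, the label names (`det`, `nsubj`, `root`) and the helper functions are chosen for this example only.

```python
tokens = ["The", "cat", "sleeps"]
# 1-based index of each token's head; 0 marks the root of the sentence.
heads = [2, 3, 0]
labels = ["det", "nsubj", "root"]

def dependents(head_index):
    # Tokens whose head is the token at the given 1-based position.
    return [tokens[i] for i, h in enumerate(heads) if h == head_index]

def root():
    # The single token whose head is 0.
    return tokens[heads.index(0)]

print(root())         # the main verb
print(dependents(3))  # tokens attached to "sleeps"
print(dependents(2))  # tokens attached to "cat"
```

Every token has exactly one head, so the whole analysis fits in two extra columns of a CoNLL-like file.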

End-user tasks

  • Examples in the HuggingFace tutorial:

    • sentiment analysis: given a short text, is it positive or negative?
    • named entity recognition: given a token, is it an ordinary word or does it refer to a specific real entity?
    • question answering: given a question and a text snippet, which segments of the text answer the question?
    • mask filling: given a sentence with empty slots, which tokens best fit the empty slots?
    • translation
    • summarisation
    • text generation
  • Famous (old) NLU benchmarks and data sets: