
Terminology-Extraction-CL

Description

This program extracts bigrams that are relevant terminology for a domain from a given corpus. To do this, the program weighs the domain relevance and the domain consensus of each term and adds the term to the terminology if the combined value exceeds a threshold. NLTK's Reuters corpus is used as a neutral reference corpus.
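
As a rough illustration of this decision, the score can be read as a weighted combination of the two measures that is compared against a threshold. The sketch below is only an assumption based on the parameter descriptions further down (alpha and theta); it is not the program's actual code:

def decide(domain_relevance, domain_consensus, alpha, theta):
    # Hypothetical decision function: alpha weighs domain relevance against
    # domain consensus; the term is kept if the weighted value exceeds theta.
    value = alpha * domain_relevance + (1 - alpha) * domain_consensus
    return value, value > theta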

Requirements

All files should be UTF-8 encoded.

  • Python 3.8.5
  • NLTK (Natural Language Toolkit) - See installation instructions here

NLTK Data:

  • Reuters Corpus - See here for more information
>>> import nltk
>>> nltk.download('reuters')
  • NLTK's Averaged Perceptron Tagger - See here for more information
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger')
  • NLTK's Punkt Tokenizer - See here for more information
>>> import nltk
>>> nltk.download('punkt')
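
To check that the downloads succeeded, the following lines should run without errors (a quick sanity check, not part of the program):

>>> from nltk.corpus import reuters
>>> from nltk import word_tokenize, pos_tag
>>> reuters.fileids()[:3]
>>> pos_tag(word_tokenize("Terminology extraction works."))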

How To Use

Move the domain corpus (default: acl_texts) to this directory. The corpus should be a directory of text files.

Generate Candidates

To extract terminology for a domain, you have to choose possible candidates first. A predefined list of candidates can be found in the file data/candidates1.txt.

To generate your own list, run:
main.py candidates [--stops <stopword file>] [--min_count <integer>] <domain dir> <output file> [<tag> [<tag> ...]]

Explanation:

  • --stops <stopword file>: A file with stopwords that are not allowed to occur in a candidate. Bigrams that contain a word from this file are filtered out. If this argument is omitted, no stopword filtering is applied.
  • --min_count <integer>: The minimum absolute frequency a bigram must have to be considered a candidate. The default is 4.
  • <domain dir>: The directory of the domain corpus.
  • <output file>: The name for your output file containing the candidates.
  • [<tag> [<tag> ...]]: Any number of Penn Treebank tags. A tagged bigram needs to contain at least one of these tags to be considered a candidate. If this argument is omitted, no POS filtering is applied (a sketch of the combined filtering logic follows the example below).

To reproduce the candidates in data/candidates.txt, run:
main.py candidates --stops data/stops_en.txt --min_count 3 acl_texts/ <your file name> NNS NN NNP
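
Roughly, the candidate generation described above amounts to counting bigrams and discarding those that are too rare, contain a stopword, or lack one of the requested tags. The sketch below illustrates that logic under simplified assumptions (a single text instead of a corpus directory, stopwords passed as a set); it is not the repository's actual implementation:

from collections import Counter
from nltk import bigrams, pos_tag, word_tokenize

def candidate_bigrams(text, stopwords=frozenset(), min_count=4, tags=None):
    tokens = word_tokenize(text)
    counts = Counter(bigrams(tokens))
    tag_of = dict(pos_tag(tokens))  # crude token -> tag lookup, for illustration only
    candidates = set()
    for (first, second), count in counts.items():
        if count < min_count:  # --min_count filter
            continue
        if first in stopwords or second in stopwords:  # --stops filter
            continue
        if tags and not {tag_of.get(first), tag_of.get(second)} & set(tags):  # tag filter
            continue
        candidates.add((first, second))
    return candidates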

Extract Terminology

Use a file with candidates and the domain corpus to extract relevant terminology. The results are saved to a CSV file with ; as the delimiter. The first two lines contain the values for alpha and theta. After that, each line has three columns <term>;<value>;<True/False>: the first contains the term, the second the value of the decision function, and the third whether the term is considered terminology or not. Run:
main.py extract -a <value for alpha> -t <value for theta> <domain dir> <candidates file> <output file>

Explanation:

  • -a <value for alpha>: A float between 0 and 1, used to weigh domain relevance against domain consensus. If it is greater than 0.5, domain relevance has more weight; if it is less than 0.5, domain consensus has more weight.
  • -t <value for theta>: A positive float. Used as a threshold when determining terminology.
  • <domain dir>: The directory of the domain corpus. By default, this should be acl_texts.
  • <candidates file>: A file with candidates, generated by main.py candidates.
  • <output file>: The name for the output file where extracted terms are stored.

Example:
main.py extract -a 0.5 -t 2 acl_texts/ data/candidates1.txt output/output1.csv
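
Because the output format is fixed (two header lines with the values for alpha and theta, then one <term>;<value>;<True/False> row per candidate), the result file can be read back with a few lines like the following. This is a small helper for illustration only; the exact layout of the two header lines is an assumption:

import csv

def read_results(path):
    # Returns (alpha, theta, rows); each row is (term, value, is_terminology).
    with open(path, encoding="utf-8") as file:
        reader = csv.reader(file, delimiter=";")
        alpha = float(next(reader)[-1])  # assumes a header row such as alpha;0.5
        theta = float(next(reader)[-1])  # assumes a header row such as theta;2
        rows = [(term, float(value), flag == "True") for term, value, flag in reader]
    return alpha, theta, rows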

Evaluate Extracted Terms

Compare extracted terminology to a gold standard by computing recall, precision and F1-score. To evaluate extracted terms run:
main.py evaluate --extracted <term file> --gold <gold file> [--high <int>] [--low <int>]

Explanation:

  • --extracted <term file>: A file with extracted terms, generated by main.py extract.
  • --gold <gold file>: A file with gold standard terminology. By default, this should be gold_terminology.txt. Each line should contain one term.
  • --high <int>: Optionally, print out the n highest-scoring terms.
  • --low <int>: Optionally, print out the n lowest-scoring terms.

Example:
main.py evaluate --extracted output/output1.csv --gold data/gold_terminology.txt --high 30
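
For reference, the three scores compare the set of extracted terms with the set of gold terms. A minimal sketch with the standard definitions (not the repository's evaluation code):

def scores(extracted, gold):
    # extracted and gold are sets of terms.
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1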

Demo

To see a demo of the program's functionality, run:
main.py demo

To get a demo of the different classes and their key methods, run the respective file. For example, to get a demo of the Evaluation class, run evaluation.py.

Unittests

Run the unit tests for a class by running the respective test file. For example, to run the tests for the Terminology class, run test_terminology.py.

Author

Katja Konermann
A project for the course Computerlinguistische Techniken, winter semester 2020/21
