This is the source code for a system that automatically disambiguates potentially idiomatic expressions (PIEs) in text. It implements four methods: a most-frequent-sense baseline, a canonical-form-based baseline (Fazly et al., 2009), a lexical cohesion graph-based method (Sporleder & Li, 2009), and a variation on that method using literal representations of idioms' figurative senses. It evaluates those methods on a combination of four corpora: the VNC-Tokens corpus, the IDIX corpus, the PIE Corpus, and the SemEval-2013 Task 5b dataset. For a detailed description of the systems, see our LAW-MWE-CxG paper.
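To illustrate the simplest of the four methods, here is a minimal sketch of a most-frequent-sense baseline. It is not the code in this repository: the function names and the label strings are hypothetical, and it assumes a single majority label computed over all training instances, which may differ from how the actual system counts senses.

```python
from collections import Counter


def most_frequent_sense(train_labels):
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def predict(train_labels, n_instances):
    """Assign the majority training label to every test instance."""
    sense = most_frequent_sense(train_labels)
    return [sense] * n_instances
```

Every test instance receives the same label, so this baseline only shows how skewed the sense distribution is. That is what makes it a useful lower bound for the other methods.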
To run this code, you'll need the following Python setup:
- Python 2.7.6
- beautifulsoup4 4.5.1
- numpy 1.14.0
- scipy 0.19.1
- spacy 2.0.6 + en_core_web_sm 2.0.0
Other versions may work just as well, but this is not guaranteed.
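To compare your environment against the versions listed above, a small check along these lines may help. It is only a sketch: the helper names are hypothetical, and it assumes each package exposes a `__version__` attribute under the module names shown (for example, `bs4` for beautifulsoup4).

```python
import importlib

# Versions listed above, keyed by importable module name (assumed names).
EXPECTED = {
    "bs4": "4.5.1",
    "numpy": "1.14.0",
    "scipy": "0.19.1",
    "spacy": "2.0.6",
    "en_core_web_sm": "2.0.0",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def report(expected=EXPECTED):
    """Return (module, expected, found) tuples for every mismatch."""
    mismatches = []
    for name, wanted in sorted(expected.items()):
        found = installed_version(name)
        if found != wanted:
            mismatches.append((name, wanted, found))
    return mismatches
```

Calling `report()` returns an empty list when everything matches. A `found` value of `None` means the module could not be imported or has no `__version__` attribute.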
You'll also need:
- the British National Corpus
- the GloVe embeddings
- the VNC-Tokens Dataset
- the IDIX Corpus
- the PIE Corpus
- the SemEval-2013 Task 5b Dataset
To set up and run the system:
- Clone the repository
- Create subdirectories called `working` and `ext`
- Add these symlinks (or edit `config.py`):
  - create a symlink `ext/BNC` to the `Texts` directory of your copy of the BNC
  - create a symlink `ext/glove` to the directory containing the GloVe embeddings
  - create symlinks `ext/VNC`, `ext/IDIX`, `ext/PIE_Corpus`, and `ext/SemEval` to the main directory of the respective corpora
- Try and run the system with `python psd.py -c 0 -m cg -gs 0s`. This should run a basic lexical cohesion graph method and evaluate on the development set of the combined corpora.
- Get an overview of all options by simply running `python psd.py --help`
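The directory and symlink steps above can be scripted. The following is a minimal sketch, not part of the repository: `set_up` and `LINKS` are hypothetical names, and every target path is a placeholder you would need to replace with the location of your own copy of each resource.

```python
import os

# Link names follow the setup steps; the targets are placeholder
# assumptions and must be replaced with your local paths.
LINKS = {
    "ext/BNC": "/path/to/BNC/Texts",
    "ext/glove": "/path/to/glove",
    "ext/VNC": "/path/to/VNC-Tokens",
    "ext/IDIX": "/path/to/IDIX",
    "ext/PIE_Corpus": "/path/to/PIE_Corpus",
    "ext/SemEval": "/path/to/SemEval-2013-Task-5b",
}


def set_up(repo_root, links):
    """Create working/ and ext/ under repo_root, then add the symlinks."""
    for sub in ("working", "ext"):
        path = os.path.join(repo_root, sub)
        if not os.path.isdir(path):
            os.makedirs(path)
    created = []
    for link, target in sorted(links.items()):
        link_path = os.path.join(repo_root, link)
        if os.path.lexists(link_path):
            continue  # leave existing links or directories untouched
        os.symlink(target, link_path)
        created.append(link)
    return created
```

Running `set_up(".", LINKS)` from the repository root creates whatever is missing and returns the names of the links it added. Creating symlinks on Windows may require elevated privileges.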
For any questions about (running) the system, feel free to contact me.