[06-26-2024] There was a minor bug in the implementation of `WRDHYPnv` that has now been fixed (thanks to @amethystant). It is worth considering this update when reproducing the experiments in our paper.
According to fuzzy-trace theory (FTT), when individuals read a piece of text, they encode two mental representations of it in parallel: 1) gist and 2) verbatim. The verbatim representation captures surface-level information in the text, while the gist represents its bottom-line meaning and underlying semantics.
Inspired by the definition of the Gist Inference Score (GIS) by Wolfe et al. (2019) and the implementation of coherence/cohesion indices in Coh-Metrix, we developed GisPy, a tool for measuring GIS in text.
- Install the requirements:
pip install -r requirements.txt
- We suggest you create a new virtual environment (e.g., a conda environment).
- If you only want to run GisPy and don't need to run the Jupyter notebooks, you can skip installing the following packages:
matplotlib, textract, wayback
- Install the spaCy model:
python -m spacy download en_core_web_trf
- Put each text document in its own `.txt` file (one document per file) in the `/data/documents` folder.
  - Paragraphs in each document need to be separated by at least one newline character (`\n`).
- Run the `/gispy/run.py` script: `python run.py [OUTPUT_FILE_NAME]`
  - `OUTPUT_FILE_NAME`: name of the output file, in `.csv` format, where the results will be saved.
- The output file contains the following information:
  - GIS score for each document in a column named `gis`
  - Indices and the z-scores of indices
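The usage steps above can be sketched with two small helpers: one that writes documents in the expected layout (one `.txt` file per document, one paragraph per line) and one that reads the output file and ranks documents by the `gis` column. The function names `write_documents` and `rank_by_gis` are hypothetical and not part of GisPy; only the `.txt` layout and the `gis` column name come from the description above.

```python
import csv
from pathlib import Path


def write_documents(documents, out_dir):
    """Write each document to its own .txt file, one paragraph per line.

    `documents` maps a file stem to a list of paragraph strings.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for stem, paragraphs in documents.items():
        # Collapse whitespace inside each paragraph so that a newline only
        # ever marks a paragraph boundary; skip empty paragraphs.
        cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
        file_path = out_path / f"{stem}.txt"
        file_path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
        written.append(file_path)
    return written


def rank_by_gis(csv_path):
    """Return the rows of an output file sorted from highest to lowest GIS."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # `gis` is the column named in the list above; all other columns are
    # passed through untouched as strings.
    return sorted(rows, key=lambda row: float(row["gis"]), reverse=True)
```

For example, `write_documents({"report_01": ["First paragraph.", "Second paragraph."]}, "data/documents")` would create `data/documents/report_01.txt`, and `rank_by_gis("results.csv")` would list the highest-gist documents first (both file names here are placeholders).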
GIS is computed from the indices listed in the `gis_config.json` file. This file is a dictionary that maps each index to its weight, which gives you full flexibility in how GisPy indices are used when computing GIS scores. You can pick any of the indices from the table below (List of GisPy indices). By default, the config file lists the indices used in the original GIS formula. The config file has the following format:
{
"index_1": weight of index_1,
...
"index_n": weight of index_n
}
An example:
{
"PCREF_ap": 1,
"PCDC": 1,
"SMCAUSe_1p": 1,
"SMCAUSwn_a_binary": -1,
"PCCNC_megahr": -1,
"WRDIMGc_megahr": -1,
"WRDHYPnv": -1
}
`weight` is a real number that is multiplied by the mean of the index values when we linearly combine the index values in the GIS formula. To ignore an index, either leave it out of the dictionary or set its `weight` to `0`.
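To make the role of the weights concrete, here is a minimal sketch of a weighted linear combination driven by such a config. It is not GisPy's actual implementation: the helper names `load_config` and `weighted_gis` are hypothetical, and the assumption that each index contributes one or more numeric values per document (which are averaged before weighting) is ours.

```python
import json
from statistics import mean


def load_config(path):
    """Load a dictionary of index names and weights from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def weighted_gis(index_values, config):
    """Linearly combine index values for one document.

    `index_values` maps an index name to a list of numeric values
    (an assumption made for this sketch); `config` maps an index name
    to its weight.
    """
    score = 0.0
    for index_name, weight in config.items():
        if weight == 0:
            # A weight of 0 ignores the index, same as leaving it out.
            continue
        # Raises KeyError if a weighted index has no values available.
        score += weight * mean(index_values[index_name])
    return score
```

With a config of `{"PCREF_ap": 1, "PCDC": 1, "WRDHYPnv": -1}` and made-up values `{"PCREF_ap": [0.5, 1.5], "PCDC": [2.0], "WRDHYPnv": [0.5]}`, the sketch computes `1.0 + 2.0 - 0.5`.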
The following is a list of all indices generated by GisPy. To make it easier to map these indices to Coh-Metrix indices, we mainly followed the Coh-Metrix index names with some minor modifications (e.g., using different suffixes to indicate the exact implementation method when an index has multiple implementations).
Index | Implementations |
---|---|
Number of Paragraphs | DESPC |
Number of Sentences | DESSC |
Referential Cohesion | CoREF , PCREF_1 , PCREF_a , PCREF_1p , PCREF_ap |
Deep Cohesion | PCDC |
Semantic Verb Overlap | SMCAUSe_1 , SMCAUSe_a , SMCAUSe_1p , SMCAUSe_ap |
WordNet Verb Overlap | SMCAUSwn_1p_path , SMCAUSwn_1p_lch , SMCAUSwn_1p_wup , SMCAUSwn_1p_binary , SMCAUSwn_ap_path , SMCAUSwn_ap_lch , SMCAUSwn_ap_wup , SMCAUSwn_ap_binary , SMCAUSwn_1_path , SMCAUSwn_1_lch , SMCAUSwn_1_wup , SMCAUSwn_1_binary , SMCAUSwn_a_path , SMCAUSwn_a_lch , SMCAUSwn_a_wup , SMCAUSwn_a_binary |
Word Concreteness | PCCNC_megahr , PCCNC_mrc |
Imageability | WRDIMGc_megahr , WRDIMGc_mrc |
Hypernymy Nouns & Verbs | WRDHYPnv |
- Benchmark 1: wolfe_reports_editorials.csv
- Benchmark 2: wolfe_methods_discussion.csv
- Benchmark 3: Disney
- `experiments.ipynb`: all experiments, including the robustness tests on the three benchmarks.
- `benchmarks.ipynb`: preprocessing Wolfe's benchmark files.
GIS = Referential Cohesion
+ Deep Cohesion
+ (LSA Verb Overlap - WordNet Verb Overlap)
- Word Concreteness
- Imageability
- Hypernymy Nouns & Verbs
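The formula above adds and subtracts indices that live on different scales, and the output file reports z-scores alongside the raw indices. As an illustration only, the sketch below standardizes each index across a set of documents and then applies the signs of the formula. The index names are taken from the example config; treating `SMCAUSe_1p` as the verb-overlap term with a positive sign, using the population standard deviation, and the helper names `zscores` and `gis_from_columns` are all assumptions of this sketch, not a description of GisPy's code.

```python
from statistics import mean, pstdev

# Sign of each term in the formula, keyed by the index names from the
# example config (the mapping of terms to names is an assumption).
FORMULA_SIGNS = {
    "PCREF_ap": 1,            # Referential Cohesion
    "PCDC": 1,                # Deep Cohesion
    "SMCAUSe_1p": 1,          # verb overlap (positive term)
    "SMCAUSwn_a_binary": -1,  # WordNet Verb Overlap
    "PCCNC_megahr": -1,       # Word Concreteness
    "WRDIMGc_megahr": -1,     # Imageability
    "WRDHYPnv": -1,           # Hypernymy Nouns & Verbs
}


def zscores(values):
    """Standardize values across documents (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def gis_from_columns(columns, signs=FORMULA_SIGNS):
    """Combine standardized indices into one score per document.

    `columns` maps an index name to a list of raw values, one per document.
    """
    n_docs = len(next(iter(columns.values())))
    scores = [0.0] * n_docs
    for name, sign in signs.items():
        for i, z in enumerate(zscores(columns[name])):
            scores[i] += sign * z
    return scores
```

Standardizing first keeps any single index with a large numeric range from dominating the sum, so each term contributes on a comparable scale.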
@inproceedings{hosseini-etal-2022-gispy,
title = "{G}is{P}y: A Tool for Measuring Gist Inference Score in Text",
author = "Hosseini, Pedram and
Wolfe, Christopher and
Diab, Mona and
Broniatowski, David",
booktitle = "Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.wnu-1.5",
doi = "10.18653/v1/2022.wnu-1.5",
pages = "38--46",
abstract = "Decision making theories such as Fuzzy-Trace Theory (FTT) suggest that individuals tend to rely on gist, or bottom-line meaning, in the text when making decisions. In this work, we delineate the process of developing GisPy, an open-source tool in Python for measuring the Gist Inference Score (GIS) in text. Evaluation of GisPy on documents in three benchmarks from the news and scientific text domains demonstrates that scores generated by our tool significantly distinguish low vs. high gist documents. Our tool is publicly available to use at: https://github.com/phosseini/GisPy.",
}