GisPy: A Tool for Measuring Gist Inference Score in Text

Updates:

[06-26-2024] There was a minor bug in the implementation of WRDHYPnv that has now been fixed (thanks to @amethystant). It is worth considering this update when reproducing the experiments in our paper.

What is Gist?

According to fuzzy-trace theory (FTT), when individuals read a piece of text, they encode two mental representations of it in parallel: 1) gist and 2) verbatim. While verbatim captures the surface-level information in the text, gist represents its bottom-line meaning and underlying semantics.

Inspired by the definition of the Gist Inference Score (GIS) by Wolfe et al. (2019) and the implementation of coherence/cohesion indices in Coh-Metrix, we developed GisPy, a tool for measuring GIS in text.

How to run GisPy

  1. Install the requirements: pip install -r requirements.txt
    • We suggest you create a new virtual environment (e.g., a conda environment).
    • If you only want to run GisPy and don't need to run the Jupyter notebooks, you can skip installing the following packages:
      • matplotlib, textract, wayback
  2. Install the spaCy model: python -m spacy download en_core_web_trf
  3. Put all text documents separately as .txt files (one document per file) in the /data/documents folder.
    • Paragraphs in each document need to be separated by [at least] one newline character (\n).
  4. Run the /gispy/run.py script: python run.py [OUTPUT_FILE_NAME]
    • OUTPUT_FILE_NAME: name of the output file in .csv format where results will be saved.
  5. The output file contains the following information:
    • GIS score for each document in a column named gis
    • Indices and the z-scores of indices
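
The steps above can be scripted on both ends. The following is a minimal sketch, not part of GisPy itself: `write_document` and `load_gis_scores` are hypothetical helper names, and the only thing taken from the steps above is that documents are .txt files with one paragraph per line and that the output CSV has a column named gis. Where the output file ends up on disk is an assumption you should check against your own run.

```python
import csv
from pathlib import Path


def write_document(folder, name, paragraphs):
    """Save one document as <name>.txt with one paragraph per line."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Collapse internal whitespace so each paragraph stays on a single line,
    # and drop empty paragraphs.
    cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
    path = folder / f"{name}.txt"
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path


def load_gis_scores(csv_path):
    """Read the output CSV and return its rows sorted by `gis`, highest first."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: float(row["gis"]), reverse=True)
```

For example, `write_document("data/documents", "doc_1", ["First paragraph.", "Second paragraph."])` prepares an input file, and after running `python run.py results.csv`, calling `load_gis_scores(...)` on the resulting file returns one dictionary per document, ordered from highest to lowest GIS.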

ℹ️ Important

GIS is computed from the indices listed in the gis_config.json file. This file is a dictionary that maps each index to its weight, which gives you full control over which GisPy indices contribute to the GIS score and by how much. You can pick any index from the table in the List of GisPy indices section. By default, the config file lists the indices used in the original GIS formula. The config file has the following format:

{
  "index_1": weight of index_1,
  ...
  "index_n": weight of index_n
}

An example:

{
  "PCREF_ap": 1,
  "PCDC": 1,
  "SMCAUSe_1p": 1,
  "SMCAUSwn_a_binary": -1,
  "PCCNC_megahr": -1,
  "WRDIMGc_megahr": -1,
  "WRDHYPnv": -1
}

Each weight is a real number that multiplies the mean of the corresponding index's values when the indices are linearly combined in the GIS formula. To ignore an index, either leave it out of the dictionary or set its weight to 0.
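
To make the role of the weights concrete, here is a minimal sketch of a weighted linear combination driven by a config dictionary. It is an illustration under assumptions, not GisPy's actual code: `load_weights` and `combine` are hypothetical names, and the index values are made-up numbers standing in for whatever per-document value GisPy feeds into the formula.

```python
import json
from numbers import Real


def load_weights(config_text):
    """Parse a gis_config.json-style dictionary and keep non-zero weights."""
    config = json.loads(config_text)
    for name, weight in config.items():
        if isinstance(weight, bool) or not isinstance(weight, Real):
            raise ValueError(f"weight for {name!r} must be a real number")
    # A weight of 0 has the same effect as leaving the index out.
    return {name: float(w) for name, w in config.items() if w != 0}


def combine(index_values, weights):
    """Linearly combine index values: the sum of weight * value."""
    return sum(w * index_values[name] for name, w in weights.items())


# Made-up values for illustration only.
config_text = '{"PCREF_ap": 1, "PCDC": 1, "WRDHYPnv": -1, "PCCNC_mrc": 0}'
values = {"PCREF_ap": 0.5, "PCDC": 1.25, "WRDHYPnv": 0.25, "PCCNC_mrc": 9.0}

weights = load_weights(config_text)
score = combine(values, weights)  # 0.5 + 1.25 - 0.25; PCCNC_mrc is ignored
```

Because `PCCNC_mrc` has weight 0, removing it from the config entirely would give the same score.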

List of GisPy indices

The following lists all indices generated by GisPy. To make these indices easy to map to Coh-Metrix indices, we mostly follow the Coh-Metrix index names with minor modifications (e.g., different suffixes that indicate the exact implementation method when an index has multiple implementations).

| Index | Implementations |
| --- | --- |
| Number of Paragraphs | DESPC |
| Number of Sentences | DESSC |
| Referential Cohesion | CoREF, PCREF_1, PCREF_a, PCREF_1p, PCREF_ap |
| Deep Cohesion | PCDC |
| Semantic Verb Overlap | SMCAUSe_1, SMCAUSe_a, SMCAUSe_1p, SMCAUSe_ap |
| WordNet Verb Overlap | SMCAUSwn_1p_path, SMCAUSwn_1p_lch, SMCAUSwn_1p_wup, SMCAUSwn_1p_binary, SMCAUSwn_ap_path, SMCAUSwn_ap_lch, SMCAUSwn_ap_wup, SMCAUSwn_ap_binary, SMCAUSwn_1_path, SMCAUSwn_1_lch, SMCAUSwn_1_wup, SMCAUSwn_1_binary, SMCAUSwn_a_path, SMCAUSwn_a_lch, SMCAUSwn_a_wup, SMCAUSwn_a_binary |
| Word Concreteness | PCCNC_megahr, PCCNC_mrc |
| Imageability | WRDIMGc_megahr, WRDIMGc_mrc |
| Hypernymy Nouns & Verbs | WRDHYPnv |

List of files

Gist Inference Score (GIS) formula

GIS = Referential Cohesion 
      + Deep Cohesion 
      + (LSA Verb Overlap - WordNet Verb Overlap) 
      - Word Concreteness 
      - Imageability 
      - Hypernymy Nouns & Verbs
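
As a rough illustration of how this formula could be evaluated over a batch of documents, the sketch below z-scores each index across the documents and then applies the signs of the formula. Several things here are assumptions rather than statements about GisPy's implementation: that z-scores are taken across the documents in one run, that the population standard deviation is used, and that the LSA Verb Overlap term corresponds to `SMCAUSe_1p` as in the example config. The function names are hypothetical.

```python
from statistics import mean, pstdev

# Sign of each term in the formula, keyed by the index names
# from the example config (an assumed mapping).
SIGNS = {
    "PCREF_ap": 1,            # Referential Cohesion
    "PCDC": 1,                # Deep Cohesion
    "SMCAUSe_1p": 1,          # LSA / Semantic Verb Overlap
    "SMCAUSwn_a_binary": -1,  # WordNet Verb Overlap
    "PCCNC_megahr": -1,       # Word Concreteness
    "WRDIMGc_megahr": -1,     # Imageability
    "WRDHYPnv": -1,           # Hypernymy Nouns & Verbs
}


def z_scores(values):
    """Standardize a list of numbers; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def gis_for_corpus(docs):
    """docs: one dict per document, mapping index name -> raw index value."""
    columns = {name: z_scores([doc[name] for doc in docs]) for name in SIGNS}
    return [
        sum(sign * columns[name][i] for name, sign in SIGNS.items())
        for i in range(len(docs))
    ]
```

Under these assumptions, a document that scores above the batch average on the cohesion and semantic verb overlap indices and below average on concreteness, imageability, and hypernymy receives a higher GIS.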

Citation

@inproceedings{hosseini-etal-2022-gispy,
    title = "{G}is{P}y: A Tool for Measuring Gist Inference Score in Text",
    author = "Hosseini, Pedram  and
      Wolfe, Christopher  and
      Diab, Mona  and
      Broniatowski, David",
    booktitle = "Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wnu-1.5",
    doi = "10.18653/v1/2022.wnu-1.5",
    pages = "38--46",
    abstract = "Decision making theories such as Fuzzy-Trace Theory (FTT) suggest that individuals tend to rely on gist, or bottom-line meaning, in the text when making decisions. In this work, we delineate the process of developing GisPy, an opensource tool in Python for measuring the Gist Inference Score (GIS) in text. Evaluation of GisPy on documents in three benchmarks from the news and scientific text domains demonstrates that scores generated by our tool significantly distinguish low vs. high gist documents. Our tool is publicly available to use at: https://github.com/phosseini/GisPy.",
}