This repository presents and compares HeterSUMGraph and its variants on extractive summarization, named entity recognition, or both.
It also presents the influence of the summary/document ratio on performance.
HeterSUMGraph and variants use GATv2Conv (from torch_geometric).
HeterSUMGraph using GATv2Conv is the best variant of HeterSUMGraph and outperforms the original HeterSUMGraph on NYT50. For more information, see: https://github.com/Baragouine/HeterSUMGraph
The dataset is a subset of French Wikipedia articles on general geography, architecture town planning and geology.
Warning: this code uses a French Tokenizer.
git clone https://github.com/Baragouine/HSG_ExSUM_NER.git
cd HSG_ExSUM_NER
conda create --name HSG_ExSUM_NER python=3.9
conda activate HSG_ExSUM_NER
pip install -r requirements.txt
To install the NLTK data:
- Open a Python console.
- Type `import nltk; nltk.download()`.
- Download all data.
- Close the Python console.
Here, preprocessing means cleaning, labeling, etc. It does not refer to the preprocessing performed just before training.
- Run `00-00-scrap_wiki.ipynb` to scrape the data.
- Run `00-01-raw_dataset_to_preprocessed.ipynb` to compute the summarization and NER labels.
- Run `00-02-drop_article_without_body.ipynb` to drop articles without a body.
- Run `00-03-split_preprocessed_dataset_to_25_high_25_low_0.5.ipynb` to split the previous dataset into three subsets depending on the summary/article ratio (Wikipedia-0.5, Wikipedia-high-25, Wikipedia-low-25).
- Run `00-04-split_wiki_datasets_to_train_val_test.ipynb` to split the previous datasets into train, validation and test sets.
- Run `python scripts/compute_tfidf_dataset.py -input data/wiki_geo_ratio_sc_0.5.json -output data/wiki_geo_ratio_sc_0.5_dataset_tfidf.json -docs_col_name flat_contents` (computes TF-IDF values for the whole dataset).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/wiki_geo_ratio_sc_0.5.json -output data/wiki_geo_ratio_sc_0.5_sent_tfidf.json -docs_col_name flat_contents` (computes TF-IDF values for each document).
- Run `python scripts/compute_tfidf_dataset.py -input data/wiki_geo_low_25.json -output data/wiki_geo_low_25_dataset_tfidf.json -docs_col_name flat_contents` (computes TF-IDF values for the whole dataset).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/wiki_geo_low_25.json -output data/wiki_geo_low_25_sent_tfidf.json -docs_col_name flat_contents` (computes TF-IDF values for each document).
- Run `python scripts/compute_tfidf_dataset.py -input data/wiki_geo_high_25.json -output data/wiki_geo_high_25_dataset_tfidf.json -docs_col_name flat_contents` (computes TF-IDF values for the whole dataset).
- Run `python scripts/compute_tfidf_sent_dataset.py -input data/wiki_geo_high_25.json -output data/wiki_geo_high_25_sent_tfidf.json -docs_col_name flat_contents` (computes TF-IDF values for each document).
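The six TF-IDF commands above differ only in the dataset name and the script, so they can be generated in a loop. The following is a minimal sketch, not part of the repository: the helper names `tfidf_commands` and `run_all` are hypothetical, and the paths and flags are copied from the commands listed above.

```python
import subprocess

# Dataset stems taken from the commands listed above.
DATASETS = ["wiki_geo_ratio_sc_0.5", "wiki_geo_low_25", "wiki_geo_high_25"]

# (script, output suffix) pairs taken from the commands listed above.
SCRIPTS = [
    ("scripts/compute_tfidf_dataset.py", "dataset_tfidf"),
    ("scripts/compute_tfidf_sent_dataset.py", "sent_tfidf"),
]


def tfidf_commands():
    """Build the six TF-IDF commands as argument lists (hypothetical helper)."""
    commands = []
    for stem in DATASETS:
        for script, suffix in SCRIPTS:
            commands.append([
                "python", script,
                "-input", f"data/{stem}.json",
                "-output", f"data/{stem}_{suffix}.json",
                "-docs_col_name", "flat_contents",
            ])
    return commands


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in tfidf_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the sketch only prints the commands, which makes it easy to compare them with the list above before running anything.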
Computing the TF-IDF values is only necessary for HeterSUMGraph-based models.
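To illustrate the difference between the two kinds of TF-IDF files, here is a small self-contained sketch. It is not the code of the repository's scripts. It assumes whitespace tokenization, a textbook formula (tf = count / length, idf = log(N / df)), and that the per-document variant treats each sentence of a document as one unit; the toy corpus is made up.

```python
import math
from collections import Counter


def tfidf(units):
    """Return one {word: tf-idf} dict per unit (a unit is a list of tokens).

    Uses tf = count / len(unit) and idf = log(N / df), a common textbook
    variant chosen for illustration only.
    """
    n_units = len(units)
    df = Counter()
    for tokens in units:
        df.update(set(tokens))  # count each word once per unit
    scores = []
    for tokens in units:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty units
        scores.append({
            w: (c / total) * math.log(n_units / df[w])
            for w, c in counts.items()
        })
    return scores


# Toy corpus: each document is a list of sentences (hypothetical data).
docs = [
    ["la ville est grande", "la ville est ancienne"],
    ["le fleuve traverse la ville"],
]

# Dataset level: each whole document is one unit.
doc_units = [" ".join(sents).split() for sents in docs]
dataset_scores = tfidf(doc_units)

# Document level: within one document, each sentence is one unit.
sent_scores = [tfidf([s.split() for s in sents]) for sents in docs]
```

In this toy corpus, a word that appears in every unit (such as "ville" at the dataset level) gets a score of zero, while a word specific to one unit gets a positive score.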
For training, you must use the French fastText embeddings, and they must be located at the following path: data/cc.fr.300.vec
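For readers unfamiliar with the `.vec` text format, the sketch below shows one way to read such a file: the first line holds the vocabulary size and the dimension, and each following line holds a word followed by its vector. The function name `load_vec` and the `limit` parameter are hypothetical, and the notebooks may load the embeddings differently. A tiny stand-in file is used so the sketch runs without the real embeddings.

```python
import os
import tempfile


def load_vec(path, limit=None):
    """Read a fastText text-format .vec file into {word: [floats]}.

    The first line holds "<vocab_size> <dim>"; each following line holds a
    word followed by its vector components, separated by spaces.
    """
    vectors = {}
    with open(path, encoding="utf-8", newline="\n", errors="ignore") as f:
        _n_words, dim = map(int, f.readline().split())
        for line in f:
            if limit is not None and len(vectors) >= limit:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip malformed lines
            vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny stand-in file so the sketch runs without the real embeddings.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vec")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("2 3\nville 0.1 0.2 0.3\nfleuve 0.4 0.5 0.6\n")
    dim, vectors = load_vec(demo_path)
```

On the real file, a call such as `load_vec("data/cc.fr.300.vec", limit=50000)` would keep only the first 50000 words, which limits memory use during a quick check.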
Run one of the *train* notebooks to train and evaluate the associated model. The notebook names encode the model variant:
- HeterSUMGraph in the name means that the notebook trains a HeterSUMGraph model.
- GAT means that the notebook trains the original version of HeterSUMGraph.
- GATv2 means that the GAT layer has been replaced by GATv2.
- NER without the "Only" means that the model performs both summarization and named entity recognition.
- OnlyNER means that the model only performs named entity recognition.
- POL means that edge features are taken into account for the NER.
- HSGRNN instead of HeterSUMGraph means that the model is a combination of HeterSUMGraph and SummaRuNNer.
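The naming rules above can be expressed as a small decoder. This is an illustrative sketch only: `describe_notebook` is a hypothetical helper, not part of the repository, and treating a name without NER as summarization-only is an inference from the rules rather than something stated explicitly.

```python
def describe_notebook(name):
    """Decode a notebook or model name using the naming rules described above."""
    info = {}

    # HSGRNN replaces HeterSUMGraph in the name for the combined model.
    if "HSGRNN" in name:
        info["base"] = "HeterSUMGraph + SummaRuNNer"
    elif "HeterSUMGraph" in name:
        info["base"] = "HeterSUMGraph"
    else:
        info["base"] = None

    # Check GATv2 first because "GATv2" also contains "GAT".
    if "GATv2" in name:
        info["attention"] = "GATv2"
    elif "GAT" in name:
        info["attention"] = "GAT"
    else:
        info["attention"] = None

    # Check OnlyNER first because "OnlyNER" also contains "NER".
    if "OnlyNER" in name:
        info["tasks"] = ["ner"]
    elif "NER" in name:
        info["tasks"] = ["summarization", "ner"]
    else:
        info["tasks"] = ["summarization"]  # inferred, see lead-in

    info["edge_features_for_ner"] = "POL" in name
    return info


example = describe_notebook("HSGRNNOnlyNER_GATv2")
```

The order of the checks matters: the longer markers (GATv2, OnlyNER) are tested before the shorter ones they contain.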
see: https://www.overleaf.com/read/gbfxvfvykxsc#77a14f
dataset | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
Wikipedia-0.5 | 29.1 ± 0.0 | 8.6 ± 0.0 | 18.9 ± 0.0 |
Wikipedia-high-25 | 23.8 ± 0.0 | 6.8 ± 0.0 | 14.9 ± 0.0 |
Wikipedia-low-25 | 33.1 ± 0.0 | 13.3 ± 0.0 | 22.9 ± 0.0 |
model | ROUGE-1 | ROUGE-2 | ROUGE-L | BCELoss |
---|---|---|---|---|
HeterSUMGraph_GAT | 31.11 | 9.79 | 19.59 | N/A |
HeterSUMGraphNER_GAT | 31.70 | 10.22 | 20.02 | 0.926 ± 0.000 |
HeterSUMGraphOnlyNER_GAT | N/A | N/A | N/A | 0.929 ± 0.001 |
HeterSUMGraphNERPOL_GAT | N/A | N/A | N/A | N/A |
HeterSUMGraph_GATv2 | 31.56 | 10.12 | 19.91 | N/A |
HeterSUMGraphNER_GATv2 | 31.66 | 10.22 | 20.01 | 0.925 ± 0.001 |
HeterSUMGraphOnlyNER_GATv2 | N/A | N/A | N/A | 0.930 ± 0.001 |
HSGRNN_GATv2 | 30.86 | 9.29 | 19.59 | N/A |
HSGRNNNER_GATv2 | 31.52 | 10.06 | 19.97 | 0.926 ± 0.000 |
HSGRNNOnlyNER_GATv2 | N/A | N/A | N/A | 0.930 ± 0.001 |
* Wikipedia-0.5: general geography, architecture town planning and geology French Wikipedia articles with len(summary)/len(content) <= 0.5.
* Wikipedia-high-25: first 25% of general geography, architecture town planning and geology French Wikipedia articles sorted by len(summary)/len(content) in descending order.
* Wikipedia-low-25: first 25% of general geography, architecture town planning and geology French Wikipedia articles sorted by len(summary)/len(content) in ascending order.
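The three subset definitions can be sketched as follows. This is an illustration under several assumptions, not the code of the splitting notebook: the field names `summary` and `content` are hypothetical, `len` counts characters here (the notebook may count words or sentences instead), the 25% subsets are taken from all articles, and the subset size is truncated to an integer.

```python
def split_by_ratio(articles, max_ratio=0.5, fraction=0.25):
    """Build the three subsets from a list of {"summary", "content"} dicts."""
    def ratio(article):
        return len(article["summary"]) / len(article["content"])

    # Articles with empty content have no defined ratio, so skip them.
    usable = [a for a in articles if len(a["content"]) > 0]
    k = int(len(usable) * fraction)  # size of each 25% subset (assumed rounding)
    return {
        "Wikipedia-0.5": [a for a in usable if ratio(a) <= max_ratio],
        "Wikipedia-high-25": sorted(usable, key=ratio, reverse=True)[:k],
        "Wikipedia-low-25": sorted(usable, key=ratio)[:k],
    }


# Toy data: content of length 10, summaries of length 1 to 8 (ratios 0.1 to 0.8).
articles = [{"summary": "s" * n, "content": "c" * 10} for n in range(1, 9)]
subsets = split_by_ratio(articles)
```

On this toy data, Wikipedia-0.5 keeps the five articles with ratios from 0.1 to 0.5, while the high and low subsets each keep the two articles with the largest and smallest ratios.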