
SummaRuNNer (extractive summarization)

This repository presents an in-depth study of SummaRuNNer (an ablation and replacement study) and reports SummaRuNNer's results on CNN-DailyMail, NYT50, and a subset of French Wikipedia, as well as the influence of the summary-to-document length ratio on performance. It also examines how the named entity recognition task influences the summarization task when the two are combined, and vice versa.

Paper: "SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents" (Nallapati et al., AAAI 2017)

Clone project

git clone https://github.com/Baragouine/SummaRuNNer.git

Enter the directory

cd SummaRuNNer

Create environment

conda create --name SummaRuNNer python=3.9

Activate environment

conda activate SummaRuNNer

Install dependencies

pip install -r requirements.txt

Install nltk data

To install nltk data:

  • Open a Python console.
  • Run import nltk; nltk.download().
  • Download all the data.
  • Close the Python console.
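
Alternatively, the download can be scripted instead of using the GUI. A minimal sketch, assuming the full collection is wanted (individual packages such as "punkt" or "stopwords" can be named instead):

import nltk

# Non-interactive download of the full nltk data collection (large);
# pass a specific package name instead of "all" to save disk space.
nltk.download("all")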

Convert initial dataset to valid pandas json

  • Download the initial dataset (CNN-DailyMail from Kaggle).
  • Copy train.json, val.json, and test.json to ./data/cnn_dailymail/raw/.
  • Run the notebook 00-0-convert_raw_cnndailymail_to_json.ipynb.
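
As a quick sanity check, the converted file can be read back with pandas. The output path below is an assumption; adapt it to wherever the notebook actually writes its result:

import pandas as pd

# Hypothetical output path of the conversion notebook.
df = pd.read_json("./data/cnn_dailymail/train.json")
print(df.shape)
print(df.columns.tolist())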

For instructions on downloading and preprocessing NYT (converting NYT to NYT50 and preprocessing it), see: https://github.com/Baragouine/HeterSUMGraph.

Compute labels

python3 ./00-1-compute_label_cnndailymail.py

You can adapt this script to other datasets containing texts and summaries.
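
The script itself is not reproduced here; below is a minimal sketch of the standard greedy labeling strategy for extractive summarization (greedily select sentences that improve ROUGE-1 recall against the gold summary, as in the SummaRuNNer paper). The rouge_1_recall helper is a simplified stand-in, not this repository's code:

from collections import Counter

def rouge_1_recall(candidate, reference):
    # Fraction of reference unigrams covered by the candidate.
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(1, sum(ref.values()))

def greedy_labels(sent_tokens, summary_tokens, max_sents=3):
    # sent_tokens: one token list per sentence; summary_tokens: gold summary tokens.
    labels = [0] * len(sent_tokens)
    selected, best = [], 0.0
    for _ in range(max_sents):
        gains = [
            (rouge_1_recall(sum((sent_tokens[j] for j in selected + [i]), []),
                            summary_tokens), i)
            for i in range(len(sent_tokens)) if not labels[i]
        ]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # stop once no sentence improves the score
            break
        best = score
        labels[i] = 1
        selected.append(i)
    return labels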

Embeddings

Training requires the GloVe 100-dimensional embeddings, which must be located at the following path: data/glove.6B/glove.6B.100d.txt
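
For reference, a minimal sketch of loading this file into a dict of word vectors (names are illustrative, not this repository's loading code):

import numpy as np

embeddings = {}
with open("data/glove.6B/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *vec = line.rstrip().split(" ")
        embeddings[word] = np.asarray(vec, dtype=np.float32)

print(len(embeddings), embeddings["the"].shape)  # ~400k words, each a (100,) vector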

Training

Run train_RNN_RNN.ipynb to train the paper's model.
Run train_SIMPLE_CNN_RNN.ipynb to train the model in which the first RNN is replaced by a single-layer CNN.
Run train_COMPLEX_CNN_RNN.ipynb to train the model in which the first RNN is replaced by a complex CNN (3 layers).
Run train_COMPLEX_CNN_RNN_max_pool.ipynb to train the model in which the first RNN is replaced by a complex CNN (3 layers) and average pooling is replaced by max pooling.
Run train_RES_CNN_RNN.ipynb to train the model in which the first RNN is replaced by a CNN with residual connections (3 layers).

The other notebooks train ablated variants of SIMPLE_CNN_RNN, to measure the importance of each component of the model:

  • Run train_SIMPLE_CNN_RNN_abs_pos_only.ipynb to train a SIMPLE_CNN_RNN that uses only the absolute position to predict.
  • Run train_SIMPLE_CNN_RNN_rel_pos_only.ipynb to train a SIMPLE_CNN_RNN that uses only the relative position to predict.
  • ...

The purpose of each notebook is indicated by its file name.
The .pt checkpoint files are saved in ./checkpoints; each training run is stored in a separate subdirectory.
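
For orientation, here is a minimal PyTorch sketch of the sentence scoring layer described in the SummaRuNNer paper (content + salience - novelty + absolute/relative position + bias). Names, shapes, and hyperparameters are illustrative and do not reproduce this repository's actual implementation:

import torch
import torch.nn as nn

class SummaRuNNerScorer(nn.Module):
    # Sketch of the paper's classification layer: the probability of keeping
    # sentence j combines content, salience, novelty, and position terms.
    def __init__(self, hidden=200, n_abs_pos=100, n_rel_pos=10, pos_dim=50):
        super().__init__()
        self.content = nn.Linear(hidden, 1, bias=False)
        self.salience = nn.Bilinear(hidden, hidden, 1, bias=False)
        self.novelty = nn.Bilinear(hidden, hidden, 1, bias=False)
        self.abs_pos = nn.Embedding(n_abs_pos, pos_dim)
        self.rel_pos = nn.Embedding(n_rel_pos, pos_dim)
        self.w_abs = nn.Linear(pos_dim, 1, bias=False)
        self.w_rel = nn.Linear(pos_dim, 1, bias=False)
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, H, d):
        # H: (n_sents, hidden) sentence states from the sentence-level RNN,
        # d: (1, hidden) document representation.
        n = H.size(0)
        probs, s = [], torch.zeros(1, H.size(1))  # s: running summary state
        for j in range(n):
            h = H[j:j + 1]                                    # (1, hidden)
            abs_e = self.abs_pos(torch.tensor([min(j, 99)]))
            rel_e = self.rel_pos(torch.tensor([min(j * 10 // max(n, 1), 9)]))
            logit = (self.content(h) + self.salience(h, d)
                     - self.novelty(h, torch.tanh(s))         # penalize redundancy
                     + self.w_abs(abs_e) + self.w_rel(rel_e) + self.bias)
            p = torch.sigmoid(logit)
            probs.append(p)
            s = s + p * h  # accumulate the summary, weighted by p(y_j = 1)
        return torch.cat(probs).squeeze(1)                    # (n_sents,)

# Example: score 5 random "sentences" against a random document vector.
print(SummaRuNNerScorer()(torch.randn(5, 200), torch.randn(1, 200)))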

Results

CNN-DailyMail (full-length ROUGE F1)

model | ROUGE-1 | ROUGE-2 | ROUGE-L
SummaRuNNer (Nallapati) | 39.6 ± 0.2 | 16.2 ± 0.2 | 35.3 ± 0.2
RNN_RNN | 39.7 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
COMPLEX_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
COMPLEX_CNN_RNN_max_pool | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
RES_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN_without_text_content (positions only) | 39.4 ± 0.0 | 16.0 ± 0.0 | 24.2 ± 0.0
SIMPLE_CNN_RNN_without_positions (text content only) | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN_absolute_position_only | 39.4 ± 0.0 | 16.0 ± 0.0 | 24.3 ± 0.0
SIMPLE_CNN_RNN_relative_position_only | 39.0 ± 0.0 | 15.8 ± 0.0 | 24.1 ± 0.0
SIMPLE_CNN_RNN_without_positions_and_content | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN_without_positions_and_salience | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN_without_position_and_novelty | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN_novelty_only | 40.0 ± 0.0 | 16.7 ± 0.0 | 25.3 ± 0.0
SIMPLE_CNN_RNN_salience_only | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0
SIMPLE_CNN_RNN_content_only | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0

RNN_RNN on NYT50 (limited-length ROUGE Recall)

model | ROUGE-1 | ROUGE-2 | ROUGE-L
HeterSUMGraph (Wang) | 46.89 | 26.26 | 42.58
RNN_RNN | 47.3 ± 0.0 | 26.7 ± 0.0 | 35.7* ± 0.0

*: the ROUGE-L computation may differ between the rouge library used here and the one used in the original paper.
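
For reference, one common Python implementation is the rouge pip package; the snippet below shows its usage (this repository's exact scoring setup may differ):

from rouge import Rouge

hyp = "the cat sat on the mat"
ref = "a cat was sitting on the mat"
# get_scores returns ROUGE-1/2/L, each with f (F1), p (precision), r (recall).
print(Rouge().get_scores(hyp, ref))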

RNN_RNN on French Wikipedia articles about general geography, architecture/town planning, and geology (limited-length ROUGE Recall)

dataset | ROUGE-1 | ROUGE-2 | ROUGE-L
Wikipedia-0.5 | 31.5 ± 0.0 | 10.0 ± 0.0 | 20.0 ± 0.0
Wikipedia-high-25 | 24.0 ± 0.0 | 6.8 ± 0.0 | 15.0 ± 0.0
Wikipedia-low-25 | 33.3 ± 0.0 | 13.3 ± 0.0 | 23.0 ± 0.0

RNN_RNN with NER on French Wikipedia articles about general geography, architecture/town planning, and geology (limited-length ROUGE Recall)

model | ROUGE-1 | ROUGE-2 | ROUGE-L | ACCURACY
RNN_RNN_summary_and_ner | 31.6 ± 0.1 | 10.0 ± 0.0 | 20.0 | 0.875 ± 0.0
RNN_RNN_OnlyNER | N/A | N/A | N/A | 0.879 ± 0.0

* Wikipedia-0.5: French Wikipedia articles (general geography, architecture/town planning, geology) with len(summary)/len(content) <= 0.5.
* Wikipedia-high-25: the first 25% of those articles when sorted by len(summary)/len(content) in descending order.
* Wikipedia-low-25: the first 25% of those articles when sorted by len(summary)/len(content) in ascending order.
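
A sketch of how such splits can be derived with pandas (file name and column names are assumptions, not this repository's actual schema; the original may also measure length in tokens rather than characters):

import pandas as pd

# Hypothetical input with "summary" and "content" text columns.
df = pd.read_json("wikipedia_articles.json")
ratio = df["summary"].str.len() / df["content"].str.len()  # character-length ratio

wikipedia_05 = df[ratio <= 0.5]                        # Wikipedia-0.5
by_ratio = ratio.sort_values(ascending=False).index
wikipedia_high_25 = df.loc[by_ratio[: len(df) // 4]]   # highest ratios
wikipedia_low_25 = df.loc[by_ratio[-(len(df) // 4):]]  # lowest ratios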

See the HSG_ExSUM_NER repository for Wikipedia scraping and preprocessing (it contains the scraping and preprocessing scripts).
