This repository presents an in-depth study of SummaRuNNer (ablation and component-replacement experiments). It reports SummaRuNNer's results on CNN-DailyMail, NYT50 and a subset of French Wikipedia, examines how the summary/document length ratio affects performance, and measures the influence of a named entity recognition task when trained jointly with summarization (and vice versa).
Paper: SummaRuNNer (Nallapati et al., 2017)
```bash
git clone https://github.com/Baragouine/SummaRuNNer.git
cd SummaRuNNer
conda create --name SummaRuNNer python=3.9
conda activate SummaRuNNer
pip install -r requirements.txt
```
To install the nltk data:
- Open a Python console.
- Type `import nltk; nltk.download()`.
- Download all data.
- Close the Python console.
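Alternatively, the data can be fetched without the GUI; a minimal sketch (the specific packages listed here are assumptions, downloading everything as above also works):

```python
# Non-interactive download of nltk data (sketch; package names are assumptions).
import nltk

nltk.download("punkt")      # sentence/word tokenizer models
nltk.download("stopwords")  # stopword lists
```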
Download the initial dataset (CNN-DailyMail from Kaggle), then copy `train.json`, `val.json` and `test.json` to `./data/cnn_dailymail/raw/`.

Run the notebook `00-0-convert_raw_cnndailymail_to_json.ipynb` to convert the raw data to the expected JSON format.
To see how to download NYT and preprocess it (i.e. convert NYT to NYT50), see: https://github.com/Baragouine/HeterSUMGraph.
Then compute the extractive labels:

```bash
python3 ./00-1-compute_label_cnndailymail.py
```
You can adapt this script to any other dataset containing texts and summaries.
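For reference, here is a rough sketch of a typical greedy ROUGE-based labeling loop, useful when adapting this step to another dataset. It is an assumption about the general approach, not the actual content of `00-1-compute_label_cnndailymail.py`, and the `rouge_score` package is only one possible scorer:

```python
# Hypothetical sketch of greedy extractive labeling: mark as positive the
# sentences that, added greedily, increase ROUGE-1 against the abstract.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def greedy_labels(sentences, abstract, max_selected=3):
    selected, labels = [], [0] * len(sentences)
    best = 0.0
    for _ in range(max_selected):
        best_i = None
        for i, sent in enumerate(sentences):
            if labels[i]:
                continue
            candidate = " ".join(selected + [sent])
            score = scorer.score(abstract, candidate)["rouge1"].fmeasure
            if score > best:
                best, best_i = score, i
        if best_i is None:   # no sentence improves the score any further
            break
        labels[best_i] = 1
        selected.append(sentences[best_i])
    return labels
```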
Training requires the 100-dimensional GloVe embeddings, which must be located at `data/glove.6B/glove.6B.100d.txt`.
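A minimal sketch of what loading these embeddings looks like (numpy is assumed; the path is the one given above):

```python
# Load GloVe vectors into a {word: vector} dictionary.
import numpy as np

embeddings = {}
with open("data/glove.6B/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(len(embeddings))  # 400,000 words, 100-dimensional vectors
```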
- Run `train_RNN_RNN.ipynb` to train the paper's model.
- Run `train_SIMPLE_CNN_RNN.ipynb` to train the model whose first RNN is replaced by a single-layer CNN (a sketch of this encoder is given after this list).
- Run `train_COMPLEX_CNN_RNN.ipynb` to train the model whose first RNN is replaced by a complex CNN (3 layers).
- Run `train_COMPLEX_CNN_RNN_max_pool.ipynb` to train the same 3-layer CNN model, but with max pooling instead of average pooling.
- Run `train_RES_CNN_RNN.ipynb` to train the model whose first RNN is replaced by a CNN with residual connections (3 layers).
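For intuition, here is a rough sketch (not the repository's actual code; the class name, dimensions and kernel size are assumptions) of a single-layer CNN sentence encoder of the kind that replaces the first RNN, and of where the average-vs-max pooling difference comes in:

```python
# Sketch of a CNN sentence encoder: convolve over word embeddings,
# then pool over the words of each sentence.
import torch
import torch.nn as nn

class SimpleCNNSentenceEncoder(nn.Module):
    def __init__(self, emb_dim=100, hidden_dim=200, kernel_size=3, pool="avg"):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, hidden_dim, kernel_size,
                              padding=kernel_size // 2)
        self.pool = pool  # "avg" in the base variants, "max" in *_max_pool

    def forward(self, word_embs):                   # (batch, n_words, emb_dim)
        x = self.conv(word_embs.transpose(1, 2))    # (batch, hidden_dim, n_words)
        x = torch.relu(x)
        if self.pool == "max":
            return x.max(dim=2).values              # max pooling over words
        return x.mean(dim=2)                        # average pooling over words
```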
The other notebooks train ablated variants of SIMPLE_CNN_RNN, to measure the importance of each component of the model:
- Run `train_SIMPLE_CNN_RNN_abs_pos_only.ipynb` to train a SIMPLE_CNN_RNN that uses only the absolute position to predict.
- Run `train_SIMPLE_CNN_RNN_rel_pos_only.ipynb` to train a SIMPLE_CNN_RNN that uses only the relative position to predict.
- ...
To find out what each of these notebooks is for, look at its file name.
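As background for these ablations, the SummaRuNNer paper scores each sentence as a sum of content, salience, novelty, absolute-position and relative-position terms; the notebooks above drop one or more of these terms. Here is a rough sketch of that scoring layer following the paper's formulation (the dimensions and implementation details are assumptions, not the repository's exact code):

```python
# Sketch of the SummaRuNNer sentence scoring decomposition.
import torch
import torch.nn as nn

class SentenceScorer(nn.Module):
    def __init__(self, hidden_dim=200, pos_dim=50):
        super().__init__()
        self.content = nn.Linear(hidden_dim, 1)                 # content term
        self.salience = nn.Bilinear(hidden_dim, hidden_dim, 1)  # sentence vs document
        self.novelty = nn.Bilinear(hidden_dim, hidden_dim, 1)   # sentence vs running summary
        self.abs_pos = nn.Linear(pos_dim, 1)                    # absolute position embedding
        self.rel_pos = nn.Linear(pos_dim, 1)                    # relative position embedding
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, h, doc, summary, abs_p, rel_p):
        score = (self.content(h)
                 + self.salience(h, doc)
                 - self.novelty(h, torch.tanh(summary))  # penalizes redundancy
                 + self.abs_pos(abs_p)
                 + self.rel_pos(rel_p)
                 + self.bias)
        return torch.sigmoid(score)  # probability of keeping the sentence
```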
The `.pt` checkpoint files are located in `./checkpoints`; each training run is stored in its own subdirectory.
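To inspect a trained model afterwards, a minimal sketch assuming standard PyTorch checkpoints (the path below is a placeholder; the exact filenames depend on the run):

```python
# Load a saved checkpoint with PyTorch.
# "checkpoints/RNN_RNN/model.pt" is a placeholder path, not a guaranteed filename.
import torch

state = torch.load("checkpoints/RNN_RNN/model.pt", map_location="cpu")
print(type(state))  # state_dict or full model, depending on how it was saved
```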
Results on CNN-DailyMail:

model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
SummaRuNNer(Nallapati) | 39.6 ± 0.2 | 16.2 ± 0.2 | 35.3 ± 0.2 |
RNN_RNN | 39.7 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
COMPLEX_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
COMPLEX_CNN_RNN_max_pool | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
RES_CNN_RNN | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN_without_text_content (positions only) | 39.4 ± 0.0 | 16.0 ± 0.0 | 24.2 ± 0.0 |
SIMPLE_CNN_RNN_without_positions (text content only) | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN_absolute_position_only | 39.4 ± 0.0 | 16.0 ± 0.0 | 24.3 ± 0.0 |
SIMPLE_CNN_RNN_relative_position_only | 39.0 ± 0.0 | 15.8 ± 0.0 | 24.1 ± 0.0 |
SIMPLE_CNN_RNN_without_positions_and_content | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN_without_positions_and_salience | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN_without_position_and_novelty | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN_novelty_only | 40.0 ± 0.0 | 16.7 ± 0.0 | 25.3 ± 0.0 |
SIMPLE_CNN_RNN_salience_only | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
SIMPLE_CNN_RNN_content_only | 39.6 ± 0.0 | 16.2 ± 0.0 | 24.4 ± 0.0 |
Results on NYT50:

model | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
HeterSUMGraph (Wang) | 46.89 | 26.26 | 42.58 |
RNN_RNN | 47.3 ± 0.0 | 26.7 ± 0.0 | 35.7* ± 0.0 |
*: the ROUGE-L computation may have changed in the rouge library I use.
RNN_RNN on French Wikipedia articles about general geography, architecture and town planning, and geology (limited-length ROUGE Recall)
dataset | ROUGE-1 | ROUGE-2 | ROUGE-L |
---|---|---|---|
Wikipedia-0.5 | 31.5 ± 0.0 | 10.0 ± 0.0 | 20.0 ± 0.0 |
Wikipedia-high-25 | 24.0 ± 0.0 | 6.8 ± 0.0 | 15.0 ± 0.0 |
Wikipedia-low-25 | 33.3 ± 0.0 | 13.3 ± 0.0 | 23.0 ± 0.0 |
RNN_RNN with NER on French Wikipedia articles about general geography, architecture and town planning, and geology (limited-length ROUGE Recall)
model | ROUGE-1 | ROUGE-2 | ROUGE-L | NER accuracy |
---|---|---|---|---|
RNN_RNN_summary_and_ner | 31.6 ± 0.1 | 10.0 ± 0.0 | 20.0 | 0.875 ± 0.0 |
RNN_RNN_OnlyNER | N/A | N/A | N/A | 0.879 ± 0.0 |
* Wikipedia-0.5: the French Wikipedia articles (general geography, architecture and town planning, geology) with len(summary)/len(content) <= 0.5.
* Wikipedia-high-25: the 25% of those articles with the highest len(summary)/len(content) ratio.
* Wikipedia-low-25: the 25% of those articles with the lowest len(summary)/len(content) ratio.
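For clarity, a small sketch of how these three splits could be derived from a list of articles (the field names are hypothetical; only the ratio logic is meant):

```python
# Sketch: build the three splits from a list of {"summary": ..., "content": ...}
# articles (field names are hypothetical).
def ratio(article):
    return len(article["summary"]) / len(article["content"])

def make_splits(articles):
    by_ratio = sorted(articles, key=ratio)           # ascending summary/content ratio
    n25 = len(by_ratio) // 4
    return {
        "Wikipedia-low-25": by_ratio[:n25],          # smallest ratios
        "Wikipedia-high-25": by_ratio[-n25:],        # largest ratios
        "Wikipedia-0.5": [a for a in by_ratio if ratio(a) <= 0.5],
    }
```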
See the HSG_ExSUM_NER repository for Wikipedia scraping and preprocessing (it contains the scraping and preprocessing scripts).