by Shlomi Hod, Maximilian Seidler and Vageesh Saxena.
A differentiable function that measures text complexity or readability would be useful as part of the loss function of a neural text simplification system. However, to the best of our knowledge, no such differentiable function exists.
Designing a differentiable readability measure, either by classic machine learning models or neural networks.
Leveraging a leveled corpus of articles into a real number readability score.
- Corpora Analysis
- End2End Design
- Stacked Design
In the paper we used the WeeBit corpus; however, we have also prepared other corpora in English and German. All the corpora are described here. For access to the texts, please send an email to the authors.
Each corpus has its own HDF5 file with the key text_df, which contains the articles in these two columns:
- text - The text itself.
- y - The level label (integer, starting from 0), as described below.
Train-test split: 80%-20% with shuffle.
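import pandas as pd

# Each corpus file is an HDF5 store; read the full table and the prepared split.
with pd.HDFStore('data/weebit/weebit.h5') as weebit_store:
    text_df = weebit_store['text_df']    # all articles
    train_df = weebit_store['train_df']  # training split
    test_df = weebit_store['test_df']    # test split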
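The 80%-20% shuffled split can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the function name `shuffled_split`, the fixed seed, and the use of plain indices instead of data frame rows are all assumptions, and whether the original split was stratified by level is not stated here.

```python
import random


def shuffled_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test) lists.

    Hypothetical helper for illustration; the seed is arbitrary.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Round the cut point down so the test part receives the remainder.
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 3636 is the total number of WeeBit texts (sum of the #text column).
train, test = shuffled_split(range(3636))
print(len(train), len(test))
```

With 3,636 items the cut falls at 2,908 training and 728 test items, which agrees with the column totals in the WeeBit statistics table.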
Crude word count:
- WeeBit - 458,497 words
- GEO - 1,188,247 words in total
- DW - 934,734 words in total
Sowmya Vajjala and Detmar Meurers: "On Improving the Accuracy of
Readability Classification using Insights from Second Language
Acquisition". Proceedings of the 7th Workshop on Innovative Use of NLP
for Building Educational Applications (BEA7), Association for
Computational Linguistics. 2012.
0 - WRLevel2 (Age 7-8)
1 - WRLevel3 (Age 8-9)
2 - WRLevel4 (Age 9-10)
3 - BitKS3 (Age 11-14)
4 - BitGCSE (Age 14-16)
The BitGCSE class has about five times as many texts as the other classes; therefore, it was downsampled to 800 texts after all the cleaning and preprocessing.
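Downsampling one over-represented level can be sketched as follows. This is an illustrative sketch on toy data, assuming rows are dictionaries with the `text` and `y` fields described earlier; the helper name `downsample_level` and the seed are hypothetical, and the authors' actual sampling procedure is not described in this document.

```python
import random


def downsample_level(rows, level, max_count, seed=0):
    """Keep at most `max_count` randomly chosen rows labelled `level`.

    Rows with any other label are kept unchanged. Note that the
    returned list places the downsampled level after the other rows.
    """
    rng = random.Random(seed)
    target = [row for row in rows if row["y"] == level]
    others = [row for row in rows if row["y"] != level]
    if len(target) > max_count:
        target = rng.sample(target, max_count)
    return others + target


# Toy data: 10 texts of level 0 and 50 texts of level 4.
rows = [{"text": f"doc {i}", "y": 0 if i < 10 else 4} for i in range(60)]
balanced = downsample_level(rows, level=4, max_count=10)
print(sum(row["y"] == 4 for row in balanced))
```

For WeeBit, the analogous call would use `level=4` and `max_count=800`.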
level #text %text #train %train #test %test
0 607 16.69 486 16.71 121 16.62
1 788 21.67 630 21.66 158 21.70
2 798 21.95 638 21.94 160 21.98
3 643 17.68 514 17.68 129 17.72
4 800 22.00 640 22.01 160 21.98
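As a quick consistency check, the %text column can be recomputed from the #text counts in the table above. The snippet below only restates the table's numbers; the variable names are illustrative.

```python
# #text counts per level, copied from the WeeBit table.
counts = {0: 607, 1: 788, 2: 798, 3: 643, 4: 800}

total = sum(counts.values())
# Share of each level, in percent, rounded to two decimals as in the table.
percent = {level: round(100 * n / total, 2) for level, n in counts.items()}

print(total)
print(percent)
```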
Based on Lexile
300 - 1670
2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 12.0
10.0 23
11.0 2
13.0 1
level #text %text #train %train #test %test
2.0 921 4.25 737 4.25 184 4.25
3.0 2681 12.37 2145 12.37 536 12.37
4.0 2582 11.92 2065 11.91 517 11.93
5.0 3449 15.92 2759 15.92 690 15.92
6.0 2291 10.57 1833 10.57 458 10.57
7.0 2570 11.86 2056 11.86 514 11.86
8.0 1650 7.61 1320 7.62 330 7.61
9.0 1804 8.33 1443 8.32 361 8.33
12.0 3720 17.17 2976 17.17 744 17.17
name # min max mean std
0 text 21668 300 1670 939.476186 268.819219
1 train 17334 300 1670 939.390793 268.471205
2 test 4334 310 1660 939.817720 270.237459
train-test Kolmogorov-Smirnov p-value 0.9999167905571691
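A p-value this close to 1 gives no evidence that the train and test score distributions differ. The statistic behind a two-sample Kolmogorov-Smirnov test is the largest gap between the two empirical distribution functions, which can be sketched in pure Python as below. This is a sketch only: it computes the statistic and not the p-value, the function name `ks_statistic` is hypothetical, and which routine the authors used is not stated (a library function such as `scipy.stats.ks_2samp` is a common choice that also returns a p-value).

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    CDFs of the two samples.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at
    # observed values, so checking those values is sufficient.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic near 0 means the two samples have nearly identical empirical distributions, and a value of 1 means they do not overlap at all.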
0 - GEOlino
1 - GEO
Deutsche Welle - Deutsch Lernen news articles, leveled by CEFR.
0 - B1
1 - B2, C1