A collection of long literary texts for computational linguistics research.
To the best of my knowledge, the texts are in the public domain. They were obtained from Project Gutenberg, Wikisource, Royallib and lib.ru and preprocessed to fit specific research purposes:
- Copyright notices were removed from the files
- Author and translator notes were removed
- Tables of contents and indices were removed, except for the table of contents in Don Quixote
- Any links to illustrations have been removed
- In the Russian version of War and Peace, all non-Russian text has been replaced with Russian translations
- The Etymology section was removed from Moby-Dick or, The Whale wherever it appeared, because some language versions did not include it
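As an illustration of the kind of cleanup listed above, here is a minimal sketch of stripping illustration markers. It is not the script used to build this corpus. It assumes the markers follow the bracketed Project Gutenberg convention (`[Illustration]` or `[Illustration: caption]`); the function name `strip_illustrations` and the sample string are hypothetical.

```python
import re

# Assumption: illustrations appear as bracketed markers such as
# "[Illustration]" or "[Illustration: caption]". Other sources may differ.
ILLUSTRATION = re.compile(r"\[Illustration[^\]]*\]")


def strip_illustrations(text: str) -> str:
    """Remove bracketed illustration markers and tidy the blank lines left behind."""
    cleaned = ILLUSTRATION.sub("", text)
    # Collapse runs of three or more newlines into a single paragraph break.
    return re.sub(r"\n{3,}", "\n\n", cleaned)


if __name__ == "__main__":
    # Hypothetical sample text, not taken from the corpus.
    sample = "First paragraph.\n\n[Illustration: caption]\n\nSecond paragraph."
    print(strip_illustrations(sample))
```

Texts from other sources may mark illustrations differently, so the pattern would need adjusting per source.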
Notice: If you believe that the data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself and provide detailed contact information, such as an email address at which you can be reached.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
Send the request to nickm@ntrlab.com
Take down: I will comply with legitimate requests by removing the affected sources from the corpus.