nickm197/Longtexts

Longtexts

A collection of long literary texts for computational linguistics research.

To the best of my knowledge, the texts are in the public domain. They were obtained from Project Gutenberg, Wikisource, Royallib and lib.ru and preprocessed to fit specific research purposes:

  • Copyright notices were removed from the files
  • Author and translator notes were removed
  • Tables of contents and indices were removed, except for the table of contents of Don Quixote
  • Links to illustrations were removed
  • In the Russian version of War and Peace, all non-Russian text was replaced with Russian translations
  • The Etymology section was removed from Moby-Dick; or, The Whale wherever it appeared, because some language versions lack it
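
The kind of clean-up listed above can be approximated with a few regular expressions. The following is only a minimal sketch, not the tooling actually used for this corpus. It assumes Project Gutenberg style plain text, where the book body sits between `*** START OF ...` and `*** END OF ...` marker lines and illustrations appear as `[Illustration]` or `[Illustration: caption]`. The function name `strip_boilerplate` and the patterns are hypothetical; texts from Wikisource, Royallib or lib.ru would need different rules.

```python
import re

# Assumption: Gutenberg-style marker lines delimit the book body.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)
# Assumption: illustrations are marked "[Illustration]" or "[Illustration: caption]".
ILLUSTRATION_RE = re.compile(r"\[Illustration[^\]]*\]")


def strip_boilerplate(text: str) -> str:
    """Return the text between the START/END markers, without illustration markers."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    # Fall back to the whole text when a marker is missing.
    body_start = start.end() if start else 0
    body_end = end.start() if end and end.start() >= body_start else len(text)
    body = text[body_start:body_end]
    body = ILLUSTRATION_RE.sub("", body)
    # Collapse runs of three or more newlines left behind by removed blocks.
    body = re.sub(r"\n{3,}", "\n\n", body)
    return body.strip()
```

Steps such as removing translator notes or replacing non-Russian passages depend on the individual edition and are unlikely to reduce to a single pattern, so they are left out of this sketch.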

Notice and take down policy

Notice: If you believe that the data contains material that you own and that should therefore not be reproduced here, please:

  • Clearly identify yourself and provide detailed contact data, such as an email address at which you can be reached.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

Send the request to nickm@ntrlab.com

Take down: I will comply with legitimate requests by removing the affected sources from the corpus.
