From 2432efda47257828caa55d0844b4577218e02f4b Mon Sep 17 00:00:00 2001
From: Ali Safaya
Date: Wed, 11 Aug 2021 17:42:49 +0300
Subject: [PATCH] Update README.md

---
 README.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/README.md b/README.md
index 3f3f734..e44fe46 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # trnews-64 dataset
 
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5180654.svg)](https://doi.org/10.5281/zenodo.5180654)
+
 __trnews-64__ is a character language modeling dataset that contain 64 million words of news articles and columns.
 It can be utilized as a benchmark for different modeling long range dependencies in Turkish language.
 
@@ -9,6 +11,16 @@ This dataset contains a mix of news articles from different topics and journals
 
 This dataset was preprocessed and clean from infrequent characters. The main character set is shared in the file `tr.charset.json`, which contains 124 characters in total. This includes Turkish upper/lower case characters along with punctuations and some other common characters.
 
+## Download
+
+The dataset is hosted on [Zenodo](https://zenodo.org/), it can be downloaded using the following:
+
+```bash
+wget -O trnews-64.tar.bz2 https://zenodo.org/record/5180654/files/trnews-64.tar.bz2?download=1
+tar -xf trnews-64.tar.bz2
+rm trnews-64.tar.bz2
+```
+
 ## Details
 
 Dataset splits are shared in raw text format and the articles are seperated by empty lines.
@@ -58,6 +70,10 @@ with open("trnews-64.test.raw") as fi:
 }
 ```
 
+## License
+
+This dataset is licensed under [Creative Commons Attribution 4.0 International](./LICENSE) license.
+
 ## Contact
 
 Ali Safaya (alisafaya at gmail dot com).
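
The README text touched by this patch says the splits are raw text with articles separated by empty lines. A minimal sketch of reading a split under that assumption follows. The helper name `iter_articles` is hypothetical and not part of the repository, and the sample text is an in-memory stand-in rather than real dataset content.

```python
import io
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[str]:
    """Yield one article at a time, treating blank lines as separators."""
    buffer = []
    for line in lines:
        if line.strip():
            # Non-empty line: part of the current article.
            buffer.append(line.rstrip("\n"))
        elif buffer:
            # Blank line after some content: the article is complete.
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Last article when the input does not end with a blank line.
        yield "\n".join(buffer)


# Tiny in-memory stand-in for a split file such as trnews-64.test.raw.
sample = io.StringIO(
    "first article, line 1\nfirst article, line 2\n\nsecond article\n"
)
articles = list(iter_articles(sample))
print(len(articles))
```

On the extracted dataset, the same function could presumably be fed an open file handle, for example `open("trnews-64.test.raw", encoding="utf-8")`, so that articles are streamed without loading the whole split into memory.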