# BOUN Web Corpus

A large web corpus for Turkish was built and cleaned using several heuristics and a morphological parser. It was created because large corpora are needed for the application and evaluation of statistical methods. Such corpora are also important for the empirical methods that linguists and lexicographers use to infer information about language.

The corpus is composed of four subcorpora. Three of them (referred to collectively as NewsCor) come from three major Turkish newspapers. The fourth (referred to as GenCor) is a general sampling of Turkish web pages. The combination of the two is referred to as the BOUN Corpus. The corpus is encoded at the paragraph and sentence level following the XML Corpus Encoding Standard, XCES (see http://xces.org).

## Dataset Details

The table below shows the number of words (all words in the corpus), the number of tokens (words plus lexical units such as punctuation marks), and the number of types (distinct tokens). The percentages of tokens and types that can be successfully parsed by the morphological parser are also indicated.

| Corpus      | Words | Tokens | Types | Tokens parsed (%) | Types parsed (%) |
|-------------|-------|--------|-------|-------------------|------------------|
| Milliyet    | 59M   | 68M    | 1.1M  | 96.7              | 63.5             |
| Ntvmsnbc    | 75M   | 86M    | 1.2M  | 96.4              | 55.8             |
| Radikal     | 50M   | 58M    | 1.0M  | 97.0              | 65.7             |
| NewsCor     | 184M  | 212M   | 2.2M  | 96.7              | 52.2             |
| GenCor      | 239M  | 279M   | 2.3M  | 94.6              | 39.5             |
| BOUN Corpus | 423M  | 491M   | 4.1M  | 95.5              | 38.4             |
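
As a rough illustration of how the token and type counts above can be derived, here is a minimal Python sketch. The one-token-per-line input mirrors the XCES sample below, but the file name and the skipping of markup lines are assumptions for illustration, not the authors' actual tooling.

```python
from collections import Counter

def corpus_stats(lines):
    """Count tokens and distinct types in an iterable of lines
    that carry one token per line (XML markup lines are skipped)."""
    type_counts = Counter()
    n_tokens = 0
    for line in lines:
        token = line.strip()
        if not token or token.startswith("<"):  # skip <p>, <s>, </s>, </p>
            continue
        n_tokens += 1
        type_counts[token] += 1
    return n_tokens, type_counts

# Hypothetical file name; the released corpus ships as XML files.
with open("milliyet.xml", encoding="utf-8") as f:
    tokens, types = corpus_stats(f)
    print(f"tokens: {tokens}, types: {len(types)}")
```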

## Samples

A sample data instance illustrating the paragraph- and sentence-level XCES encoding used throughout the dataset.

Example:

```xml
<p>
<s>
Telefona
zam
!
</s>
</p>
<p>
<s>
Türk
Telekom
,
yurtiçi
ve
yurt
dışı
telefon
görüşme
tarifelerine
bugünden
itibaren
zam
yaptı
.
</s>
<s>
```
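
Reading this encoding back into sentences is straightforward. The following is a minimal Python sketch, assuming the one-token-per-line layout shown above; the file name is hypothetical.

```python
def read_sentences(lines):
    """Yield sentences (lists of tokens) from the one-token-per-line
    XCES encoding shown above."""
    sentence = None
    for line in lines:
        tok = line.strip()
        if tok == "<s>":
            sentence = []          # start of a sentence
        elif tok == "</s>":
            if sentence is not None:
                yield sentence     # end of a sentence
            sentence = None
        elif tok in ("<p>", "</p>") or not tok:
            continue               # paragraph markup and blank lines
        elif sentence is not None:
            sentence.append(tok)   # a token (word or punctuation)

with open("milliyet.xml", encoding="utf-8") as f:  # hypothetical file name
    for sent in read_sentences(f):
        print(" ".join(sent))
```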

## Splits

These splits are created from NewsCor only; the data from GenCor is not included. The resulting splits are intended for language modelling.

Two scripts are included with the corpus: textnorm.pl for text normalization and split.pl for splitting the corpus into train, development, and test sets.

NewsCor consists of Milliyet news from 1998/01 to 2006/03, Ntvmsnbc news from 2000/05 to 2006/09, and Radikal news from 2001/04 to 2006/09. The data from each newspaper can also be found separately in milliyet.xml, ntvmsnbc.xml, and radikal.xml.

The split.pl script uses each newspaper's news dated 2005/06 for the development set and the news dated 2006/01 for the test set. All remaining data is used for the training split.

The script is arranged so that there is no overlap between the data in the splits.
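
The internals of split.pl are not reproduced here, but the date rule it applies is simple. The following Python sketch reimplements the assignment logic under the assumption that each document carries a year/month stamp; the function and the document tuples are hypothetical.

```python
def assign_split(year, month):
    """Date rule described above: 2005/06 -> dev, 2006/01 -> test,
    everything else -> train, so the splits cannot overlap."""
    if (year, month) == (2005, 6):
        return "dev"
    if (year, month) == (2006, 1):
        return "test"
    return "train"

# Hypothetical (source, year, month) stamps for three documents.
for source, y, m in [("milliyet", 2005, 6),
                     ("ntvmsnbc", 2006, 1),
                     ("radikal", 2003, 11)]:
    print(source, assign_split(y, m))  # dev, test, train
```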

Here are the sentence, word and token counts for each split:

| split     | sentences  | words       | tokens        |
|-----------|------------|-------------|---------------|
| train.txt | 15,140,463 | 182,353,914 | 1,486,343,703 |
| dev.txt   | 4,052      | 49,284      | 403,576       |
| test.txt  | 3,448      | 41,286      | 332,950       |

## Dataset Creation

### Curation Rationale

In the domain of language processing, there is a need for large corpora for the application and evaluation of statistical methods. Such corpora are also important for the empirical methods that linguists and lexicographers use to infer information about language. These needs are the authors' main motivation.

### Data Source

The BOUN Web Corpus is composed of four subcorpora. Three of them (referred to collectively as NewsCor) come from three major Turkish news portals (http://www.milliyet.com.tr, http://www.ntvmsnbc.com.tr, http://www.radikal.com.tr). The fourth (referred to as GenCor) is a general sampling of Turkish web pages.

### Annotations

For data collection from the web, a web crawler (an automated script that browses the web, as used by search engines) was implemented. Since data collected from the web is very noisy, several automatic normalization and filtering methods are employed to clean the corpus. The multi-step cleaning process is as follows (a code sketch of these steps appears after the list):

  1. Decode HTML entities
  2. Trim white spaces at the start and end of the lines
  3. Estimate letter sequence counts from a Turkish text and use these counts to filter documents
  4. Remove duplicate lines to get rid of repetitions in web pages, such as text in navigation menus
  5. Remove documents with less than 1,000 characters
  6. Parse the documents using the morphological parser and remove those for which more than 25% of the words cannot be parsed
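
A minimal Python sketch of steps 1, 2, 4, 5, and 6 follows. Here `can_parse` stands in for the authors' morphological parser, step 3 (letter-sequence filtering) is omitted, and the thresholds are taken from the list above; this illustrates the kind of filter applied, not the authors' actual implementation.

```python
import html
import re

MIN_CHARS = 1000           # step 5: minimum document length
MAX_UNPARSED_RATIO = 0.25  # step 6: maximum share of unparsable words

def clean_document(raw_text, can_parse):
    """Return the cleaned document text, or None if it is filtered out.
    `can_parse(word) -> bool` stands in for the morphological parser."""
    text = html.unescape(raw_text)                  # step 1: decode HTML entities
    lines = [l.strip() for l in text.splitlines()]  # step 2: trim whitespace
    seen, kept = set(), []
    for line in lines:                              # step 4: drop duplicate lines
        if line and line not in seen:               # (e.g. navigation menus)
            seen.add(line)
            kept.append(line)
    text = "\n".join(kept)
    if len(text) < MIN_CHARS:                       # step 5: too short
        return None
    words = re.findall(r"\w+", text)                # step 6: parse-rate filter
    unparsed = sum(1 for w in words if not can_parse(w))
    if not words or unparsed / len(words) > MAX_UNPARSED_RATIO:
        return None
    return text
```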

The normalization and filtering step removes about 60% of the text collected for NewsCor and 90% of the text collected for GenCor. This difference is expected since the web corpus data is very noisy when compared to the newspaper data.

### Quality

Regarding the number of types relative to the corpus size (number of tokens): if the corpus were grown beyond its current size of 491M tokens, new types would still continue to emerge. The authors' experiments indicate that the current corpus does not cover all language usage.

Coverage statistics with respect to the vocabulary size (number of types) show that 50% of the corpus is formed of only about 1,000 distinct words, and that about 300K types are necessary to attain an acceptable coverage ratio (97-98%). A similar analysis of the percentages of infrequent types shows that almost half of the types (about 2.0M) occur only once in the corpus. The number of types occurring fewer than 10 times is 3.4M, and they account for 7.5M tokens in the corpus. Thus, the majority of types in the corpus are very infrequent, and 98.4% of the corpus is formed of only 15.9% of the types.
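
Such coverage figures can be reproduced from a type-frequency table. The following minimal sketch, with toy counts standing in for the real 4.1M-type table, shows the two computations involved: cumulative coverage of the top-k types, and counts of hapax and low-frequency types.

```python
from collections import Counter

def coverage(type_counts, k):
    """Fraction of all tokens covered by the k most frequent types."""
    total = sum(type_counts.values())
    top = sum(c for _, c in type_counts.most_common(k))
    return top / total

# Toy frequency table; in practice this would be built from the corpus.
type_counts = Counter({"ve": 40, "bir": 30, "zam": 5, "tarife": 1})
print(coverage(type_counts, 2))                           # share covered by top 2 types
hapaxes = sum(1 for c in type_counts.values() if c == 1)  # types occurring once
rare = sum(1 for c in type_counts.values() if c < 10)     # types occurring < 10 times
print(hapaxes, rare)
```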

## Additional Information

### Dataset Curators

Published by Haşim Sak, Tunga Güngör and Murat Saraçlar.

### Citation Information

Please cite the following paper if you find this dataset useful:

Haşim Sak, Tunga Güngör, and Murat Saraçlar. Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus. In GoTAL 2008, volume 5221 of LNCS, 2008, pages 417–427. Springer.

```bibtex
@InProceedings{sak:08,
  Author    = {Ha{\c s}im Sak and Tunga G{\"u}ng{\"o}r and Murat Sara{\c c}lar},
  Booktitle = {GoTAL 2008},
  Pages     = {417--427},
  Title     = {Turkish Language Resources: Morphological Parser, Morphological Disambiguator and Web Corpus},
  Volume    = {5221},
  Series    = {LNCS},
  Publisher = {Springer},
  Year      = {2008}
}
```

Uploaded and documented by Arda Goktogan (ardagoktogan gmail com) and Yagmur Akarken (yakarken18 ku edu tr).