Harvesting of unigram and bigram data from various corpora. We start with the Project Madurai corpus, using prose data only (unparsed cir/seer poetry and all other poetry are skipped). This data and all scripts are in the public domain.
The 'plain_text' folder currently holds 4036616 words in total, which provide unigram and bigram data at the word level. One may use the open-tamil library to:
- discover the unigram word frequency of this corpus
- discover the bigram word frequency of this corpus (since successive words occur on successive lines)
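The two counts above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library rather than open-tamil, and it rests on assumptions not stated in this README: that every file under 'plain_text' is UTF-8 text with one word per line, and that bigrams should not span file boundaries. The function names (`words_in_file`, `count_ngrams`, `corpus_counts`) are hypothetical helpers, not part of this repository or of open-tamil.

```python
from collections import Counter
from pathlib import Path


def words_in_file(path):
    """Yield the words of one file, assuming one word per non-empty line (UTF-8)."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            if word:
                yield word


def count_ngrams(word_sequences):
    """Count unigrams and bigrams over an iterable of word sequences.

    Each sequence stands for one file; the previous word is reset between
    sequences so that no bigram spans two files (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in word_sequences:
        previous = None
        for word in words:
            unigrams[word] += 1
            if previous is not None:
                bigrams[(previous, word)] += 1
            previous = word
    return unigrams, bigrams


def corpus_counts(folder):
    """Count unigrams and bigrams for every regular file found under `folder`."""
    files = sorted(p for p in Path(folder).rglob("*") if p.is_file())
    return count_ngrams(words_in_file(p) for p in files)
```

Calling `corpus_counts("plain_text")` returns two `Counter` objects; `sum(unigrams.values())` then gives the total word count, and `unigrams.most_common(10)` or `bigrams.most_common(10)` lists the most frequent entries.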