This library provides Indian regional language datasets in an easy-to-use format modeled on the sklearn.datasets API. You are free to use it in commercial applications.
You can use pip to install this library:
pip install indic-nlp-datasets
To install the latest version of the library from the master branch, use:
pip install git+https://github.com/rahul1990gupta/indic-nlp-datasets.git@master
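To confirm that either install command worked, a quick check along these lines may help. This is only a sketch: `is_installed` is a hypothetical helper, not part of the library. It relies on the standard-library `importlib.util.find_spec` and on the import name `idatasets` used in the usage example in this README.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None when no top-level module with this name exists.
    return find_spec(module_name) is not None


# "idatasets" is the import name used in this README's usage example.
print("idatasets installed:", is_installed("idatasets"))
```

The check only locates the module and does not import it, so it does not trigger any dataset download.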
These are the datasets available in the library:
Name | Size | Submodule | Language |
---|---|---|---|
Wikipedia | 275 MB | load_wikipedia | hi |
Oscar Common Crawl | 17 GB | load_occ | hi |
News Crawl | 472 MB | load_news_crawl | hi |
Monolingual | 2.45 GB | load_monolingual | hi |
Tweet Corpus | 875 MB | load_tweets | hi |
Hinglish Corpus | 18 MB | load_hinglish | hi |
Devdas | 300 KB | load_devdas | hi |
After installation, you can start by importing a dataset loader and calling it:
from idatasets import load_devdas
devdas = load_devdas()
print(devdas.desc) # prints description of the data
print(devdas.created_at) # date/year when dataset was created
for sent in devdas.data:
    # process text chunks; printing is a placeholder for your own logic
    print(sent)
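As one illustration of what the "process text chunks" step might look like, the sketch below counts whitespace-separated tokens. It assumes that `devdas.data` yields plain strings, which this README does not state explicitly. `count_tokens` and the sample sentences are hypothetical, and a stand-in list replaces the real dataset so the example runs without downloading anything.

```python
from collections import Counter
from typing import Iterable


def count_tokens(chunks: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens across an iterable of text chunks."""
    counts: Counter = Counter()
    for chunk in chunks:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(chunk.split())
    return counts


# Stand-in for devdas.data (assumed here to be an iterable of strings).
sample_chunks = ["देवदास पारो से मिला", "पारो घर गई"]
token_counts = count_tokens(sample_chunks)
print(token_counts.most_common(3))
```

With the real dataset, `count_tokens(devdas.data)` would be the equivalent call if the assumption about the element type holds. Whitespace splitting is a deliberately crude tokenizer; punctuation such as the danda (।) stays attached to neighboring words.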