Overview

This library provides Indian regional language datasets through an easy-to-use interface modeled on the sklearn.datasets API. You are free to use it in commercial applications.

Installation

You can install this library with pip:

pip install indic-nlp-datasets

To install the latest development version directly from GitHub, use:

pip install git+https://github.com/rahul1990gupta/indic-nlp-datasets.git@master

Datasets Available

The following datasets are available in the library:

Name	Size	Submodule	Language
Wikipedia	275 MB	`load_wikipedia`	hi
Oscar Common Crawl	17 GB	`load_occ`	hi
News Crawl	472 MB	`load_news_crawl`	hi
Monolingual	2.45 GB	`load_monolingual`	hi
Tweet Corpus	875 MB	`load_tweets`	hi
Hinglish Corpus	18 MB	`load_hinglish`	hi
Devdas	300 KB	`load_devdas`	hi

Getting started

After installation, you can start by importing a dataset loader and calling it:

from idatasets import load_devdas

devdas = load_devdas()
print(devdas.desc)  # prints a description of the data
print(devdas.created_at)  # date/year when the dataset was created
for sent in devdas.data:
    # process each text chunk here; printing is only a placeholder
    print(sent)
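
As one possible way to fill in the loop over devdas.data, here is a minimal sketch that counts the most frequent whitespace-separated tokens. It assumes each item in devdas.data is a plain string, which this README does not state explicitly. The helper name top_tokens and the two sample sentences are hypothetical and are not part of the library or the dataset.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_tokens(sentences: Iterable[str], n: int = 5) -> List[Tuple[str, int]]:
    """Return the n most frequent whitespace-separated tokens.

    Hypothetical helper for illustration; not part of indic-nlp-datasets.
    """
    counts: Counter = Counter()
    for sent in sentences:
        # Naive tokenization: split on whitespace only.
        counts.update(sent.split())
    return counts.most_common(n)


# Made-up stand-in sentences so the sketch runs without downloading anything.
# With the library installed, you would pass devdas.data instead
# (assuming its items are strings).
sample = ["देवदास घर आया", "देवदास चला गया"]
print(top_tokens(sample, n=1))
```

Whitespace splitting is a deliberately simple choice; punctuation stays attached to words, so a proper tokenizer for Hindi text would likely give cleaner counts.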
