Sandhi Splitter for Indian Languages (Currently only Malayalam)

libindic/sandhi-splitter

Sandhi Splitter

A probabilistic approach to the problem of agglutination in Indic languages. The implementation here targets Malayalam, although most of the code is language-agnostic.

Installation

  1. First, clone the repository
	git clone https://github.com/libindic/sandhi-splitter.git
  2. Create an installable source distribution, then install it using pip
	python setup.py sdist
	pip install dist/sandhisplitter*.tar.gz

Note: We suggest working in a virtualenv instead of installing system-wide with sudo, since the module is still under development.

Training and Testing

After installation, run the following commands with the necessary arguments:

    sandhisplitter_train [--help] [args]
    sandhisplitter_benchmark_model [--help] [args]

For more details, refer to docs/index.rst.

Using the Sandhisplitter class

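The Sandhisplitter class provides two main methods, split and join. As the session below shows, split returns a tuple of the component words and the list of split positions (an empty position list means the word was left whole), and join combines a list of words into a single word.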

>>> from sandhisplitter import Sandhisplitter
>>> s = Sandhisplitter()
>>> s.split('ആദ്യമെത്തി')
(['ആദ്യം', 'എത്തി'], [4])
>>> s.split('വയ്യാതെയായി')
(['വയ്യാതെ', 'ആയി'], [7])
>>> s.split('എന്നെക്കൊണ്ടുവയ്യ')
(['എന്നെക്കൊണ്ടുവയ്യ'], [])
>>> s.split('ഇന്നത്തെക്കാലത്ത്')
(['ഇന്നത്തെക്കാലത്ത്'], [])
>>> s.split('എന്തൊക്കെയോ')
(['എന്ത്', 'ഒക്കെയോ'], [3])

>>> s.join(['ആദ്യം', 'ആയി'])
'ആദ്യമായി'
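
The session above handles one word at a time. As a minimal sketch of applying split to a whole sentence, the helper below splits on whitespace and flattens the word lists returned for each token. The names split_tokens and NeverSplit are hypothetical and not part of the library; NeverSplit is only a stand-in that mimics the return shape of split so the example runs without a trained model.

```python
def split_tokens(splitter, sentence):
    """Call splitter.split on each whitespace-separated token and flatten the parts.

    `splitter` is assumed to be any object whose split(word) returns a
    (words, positions) tuple, as in the session above.
    """
    parts = []
    for token in sentence.split():
        words, _positions = splitter.split(token)  # positions are ignored here
        parts.extend(words)
    return parts


class NeverSplit:
    """Hypothetical stand-in for Sandhisplitter: returns every word unsplit."""

    def split(self, word):
        return [word], []


print(split_tokens(NeverSplit(), "one two"))
```

With the package installed, passing Sandhisplitter() in place of NeverSplit() should work the same way, assuming split behaves as in the session above.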
