The Filipino language is characterized by variation within every province. Varieties that share similar features diverge from one another to different degrees, and divergent varieties are often referred to as dialects. In some cases, the varieties are distinct enough that some would consider them separate languages. In other cases, they are similar enough to be considered merely characteristic of a particular geographic region, social grouping, or historical era. Sometimes speakers are aware of dialect variation and can label a particular dialect with a name; at other times the variation goes largely unnoticed or overlooked.
Language identification is one of the pre-processing units in natural language processing (NLP). It is the task of determining the language of an author's text through statistical computing, and it works by identifying patterns. It has become increasingly important as more and more textual data becomes available. Language identification already exists for the national languages of most countries, but the Filipino language, for example, is characterized by variety. These variations may cause the subsequent pre-processing units of an NLP pipeline to fail.
Using a language model is one of the popular approaches to identifying a language. Well-known modeling techniques include character n-grams, Markov models, naïve Bayes classifiers, support vector machines, and neural networks.
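To make the first of these techniques concrete, the following is a minimal sketch of how character n-grams can be counted from a piece of text. The function name `char_ngrams`, the whitespace padding, and the sample phrase are assumptions made for illustration; they are not taken from the tool itself.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in text.

    The text is lowercased, runs of whitespace are collapsed, and one
    space is added at each end so that word boundaries are represented.
    """
    normalized = " " + " ".join(text.lower().split()) + " "
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Illustrative input only; a real profile would be built from a large corpus.
profile = char_ngrams("maayong buntag", n=2)
print(profile.most_common(5))
```

A language profile built this way can then be compared with the profile of an unknown text, or used as the feature set for one of the classifiers listed above.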
This native language identification tool recognizes 3 of the 8 major dialects or languages in the Philippines: Cebuano, Kapampangan, and Pangasinense. The Filipino Native Language Identification tool uses a Markov chain for language modeling and a maximum likelihood decision rule to identify the native language.
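The general idea of a Markov chain language model combined with a maximum likelihood decision rule can be sketched as follows. This is not the tool's actual implementation: the first-order character model, the add-one smoothing, the `alphabet_size` parameter, the class and function names, and the tiny placeholder training phrases are all assumptions chosen to keep the example short. Each language gets its own chain of character transition counts, and an input is assigned to the language whose chain gives it the highest log-likelihood.

```python
import math
from collections import defaultdict


class MarkovLanguageModel:
    """First-order character Markov chain with add-one smoothing (illustrative)."""

    def __init__(self, alphabet_size=64):
        # Assumed upper bound on the number of distinct characters.
        self.alphabet_size = alphabet_size
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    @staticmethod
    def _pairs(text):
        # Lowercase, collapse whitespace, pad with spaces, yield (prev, next) pairs.
        padded = " " + " ".join(text.lower().split()) + " "
        return zip(padded, padded[1:])

    def train(self, text):
        for prev, nxt in self._pairs(text):
            self.transitions[prev][nxt] += 1
            self.totals[prev] += 1

    def log_likelihood(self, text):
        # Sum of log P(next | prev) over all adjacent character pairs.
        score = 0.0
        for prev, nxt in self._pairs(text):
            count = self.transitions.get(prev, {}).get(nxt, 0)
            total = self.totals.get(prev, 0)
            score += math.log((count + 1) / (total + self.alphabet_size))
        return score


def identify(text, models):
    """Maximum likelihood decision: pick the language with the highest score."""
    return max(models, key=lambda name: models[name].log_likelihood(text))


# Placeholder phrases only; real models need a large corpus per language.
TOY_CORPUS = {
    "cebuano": "maayong buntag salamat kaayo",
    "kapampangan": "mayap a abak dakal a salamat",
    "pangasinense": "maabig ya kabwasan balbaleg ya salamat",
}

models = {}
for language, sample in TOY_CORPUS.items():
    models[language] = MarkovLanguageModel()
    models[language].train(sample)

print(identify("maayong buntag", models))
```

Working in log space avoids numerical underflow on long inputs, and the smoothing keeps a single unseen character transition from driving a language's probability to zero.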
The tool is available at https://filipino-native-li.herokuapp.com