GitHub - rstupek/ml: Machine Learning in Java

rstupek / ml Public

forked from jottinger/ml

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Machine Learning in Java

View license

0 stars 2 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
bayes		bayes
common		common
corpus-test		corpus-test
perceptron		perceptron
.gitignore		.gitignore
LICENSE		LICENSE
README		README
pom.xml		pom.xml

Repository files navigation

This is the ml project. It's a clone of the ci-bayes project, and
is identical in codebase, although development is expected to proceed.

The ml project is designed to implement a range of machine learning
algorithms, primarily focusing on classification. The initial codebase
is aimed at providing bayes classification, with other types of
classification engines following.

Using the Bayes Classifier

There are two Bayesian classifiers included: a simple (naive) classifier,
and a Fisher classifier. In general, the Fisher classifier is "better."
(As usual, your results may vary depending on your use case.)

Using the classifier is simple:

// the default data factory is datagrid-based, see Infinispan documentation
Classifier classifier=new FisherClassifier();

// training phase
classifier.train("the quick brown fox jumps over the lazy dog's tail", "dog");
classifier.train("that cat is going out with a fox", "cat");

// classification phase
String classification=classifier.classify("this is a some text");

You can also provide a default classification to the classifier:

// classification phase
String classification=classifier.classify("this is a some text", "unknown");

Persisting the dataset involves the creation of a ClassifierDataFactory.
The default data factory is based on Infinispan, which is a key/value-based
data grid.

Ideally, one would configure Infinispan to use a backing datastore to
persist the training data.

Using the Perceptron

A perceptron follows the same general modus as the Bayesian Classifier, with
one extra step: the data repository.

To use the perceptron, one creates the repository (a repository based on HSQLDB
is provided), then one creates the perceptron with that repository, then
the perceptron is trained, and then queried.

An example:

PerceptronRepository repo = new HSQLDBPerceptronRepository();
Perceptron perceptron=new PerceptronImpl(repo);

// training phase
Object[] targetArray=new Object[]{"test for echo", "counterparts"};

// one provides the source text, the list of targets to consider, then
// the actual target
perceptron.train("half the world", targets, "test for echo");

// getFirstResult() returns the most likely match
Object classification=perceptron.getFirstResult("world", targets);

There are also convenience methods that make the perceptron's API simpler;
they use the entire set of target tokens as input. For the example above:

PerceptronRepository repo = new HSQLDBPerceptronRepository();
Perceptron perceptron=new PerceptronImpl(repo);

// ensure the targets exist
perceptron.createTarget("test for echo");
perceptron.createTarget("counterparts");

// train on input
perceptron.train("half the world", "test for echo");

// get classification
Object classification=perceptron.getFirstResult("world");

Note that these convenience methods are very slow compared to the version
in which the targets are supplied, especially if there are many
trainings applied at once.

The Tokenizer's implications for the Perceptron

The Perceptron is able to tokenize input, as shown in the API example above.
However, the tokenizer is able to affect the input by stemming text.

The default Tokenizer does not affect the text at all, but merely splits it
along whitespace. You can change this to a PorterStemmer as follows:

perceptron.setTokenizer(new PorterStemmer());

Again, the tokenizers affect the input and may thus affect the output. Also note
that if you mix the API inputs (by calling the form that expects a List<Object> to
train and then the single-object corpus to classify) your results will not be valid,
because "stemming," for example, will be counted as "stemming" for training and
"stem" for classification purposes.