forked from jottinger/ml
-
Notifications
You must be signed in to change notification settings - Fork 0
Machine Learning in Java
License
rstupek/ml
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This is the ml project. It's a clone of the ci-bayes project, and is identical in codebase, although development is expected to proceed. The ml project is designed to implement a range of machine learning algorithms, primarily focusing on classification. The initial codebase is aimed at providing bayes classification, with other types of classification engines following. Using the Bayes Classifier There are two Bayesian classifiers included: a simple (naive) classifier, and a Fisher classifier. In general, the Fisher classifier is "better." (As usual, your results may vary depending on your use case.) Using the classifier is simple: // the default data factory is datagrid-based, see Infinispan documentation Classifier classifier=new FisherClassifier(); // training phase classifier.train("the quick brown fox jumps over the lazy dog's tail", "dog"); classifier.train("that cat is going out with a fox", "cat"); // classification phase String classification=classifier.classify("this is a some text"); You can also provide a default classification to the classifier: // classification phase String classification=classifier.classify("this is a some text", "unknown"); Persisting the dataset involves the creation of a ClassifierDataFactory. The default data factory is based on Infinispan, which is a key/value-based data grid. Ideally, one would configure Infinispan to use a backing datastore to persist the training data. Using the Perceptron A perceptron follows the same general modus as the Bayesian Classifier, with one extra step: the data repository. To use the perceptron, one creates the repository (a repository based on HSQLDB is provided), then one creates the perceptron with that repository, then the perceptron is trained, and then queried. An example: PerceptronRepository repo = new HSQLDBPerceptronRepository(); Perceptron perceptron=new PerceptronImpl(repo); // training phase Object[] targetArray=new Object[]{"test for echo", "counterparts"}; // one provides the source text, the list of targets to consider, then // the actual target perceptron.train("half the world", targets, "test for echo"); // getFirstResult() returns the most likely match Object classification=perceptron.getFirstResult("world", targets); There are also convenience methods that make the perceptron's API simpler; they use the entire set of target tokens as input. For the example above: PerceptronRepository repo = new HSQLDBPerceptronRepository(); Perceptron perceptron=new PerceptronImpl(repo); // ensure the targets exist perceptron.createTarget("test for echo"); perceptron.createTarget("counterparts"); // train on input perceptron.train("half the world", "test for echo"); // get classification Object classification=perceptron.getFirstResult("world"); Note that these convenience methods are very slow compared to the version in which the targets are supplied, especially if there are many trainings applied at once. The Tokenizer's implications for the Perceptron The Perceptron is able to tokenize input, as shown in the API example above. However, the tokenizer is able to affect the input by stemming text. The default Tokenizer does not affect the text at all, but merely splits it along whitespace. You can change this to a PorterStemmer as follows: perceptron.setTokenizer(new PorterStemmer()); Again, the tokenizers affect the input and may thus affect the output. Also note that if you mix the API inputs (by calling the form that expects a List<Object> to train and then the single-object corpus to classify) your results will not be valid, because "stemming," for example, will be counted as "stemming" for training and "stem" for classification purposes.
About
Machine Learning in Java
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published