- Former `kgram_freqs` class is now called `sbo_kgram_freqs`. The constructor `kgram_freqs()` is still available as an alias to `sbo_kgram_freqs()`.
- Former `sbo_preds` class is now replaced by two classes:
  - `sbo_predictor`: for interactive use
  - `sbo_predtable`: for storing text predictors out of memory (e.g. `save()` to file)
- `sbo_predictor` and `sbo_predtable` objects are obtained by the homonymous constructors, which are now S3 generics accepting `character` input, as well as `sbo_kgram_freqs` and `sbo_predtable` (for the `sbo_predictor()` constructor) class objects. In particular, these allow training a text predictor directly, without storing the intermediate `sbo_dictionary` and `kgram_freqs` objects.
- The behaviour of the `dict` argument in `kgram_freqs()` and `kgram_freqs_fast()` has changed, now accepting either a `sbo_dictionary`, a `character` or a `formula` (see also 'New features').
- The `sbo_predictor` implementation dramatically improves the speed of `predict()` (by a factor of 10). A single call to `predict()` now allocates a few kBs of RAM (whereas it previously allocated a few MBs, cf. issue #10).
- Metadata of `sbo_kgram_freqs` and `sbo_pred*` objects is now stored via attributes (#11).
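
As a rough illustration of the constructors described above, the sketch below trains a predictor directly from `character` input and, alternatively, through an intermediate `sbo_predtable`. It assumes the `twitter_train` and `twitter_dict` example datasets shipped with `sbo`; the argument values (`N = 3`, the `EOS` string) are illustrative choices, not recommendations.

```r
library(sbo)

# One-step training from a character vector (bundled example corpus).
# No intermediate dictionary or k-gram frequency object is kept around.
p <- sbo_predictor(sbo::twitter_train,
                   N = 3,                         # order of the k-gram model
                   dict = sbo::twitter_dict,      # bundled example dictionary
                   .preprocess = sbo::preprocess, # default text preprocessing
                   EOS = ".?!:;")                 # End-Of-Sentence characters

# Two-step route: build a storable prediction table first ...
tab <- sbo_predtable(sbo::twitter_train,
                     N = 3,
                     dict = sbo::twitter_dict,
                     .preprocess = sbo::preprocess,
                     EOS = ".?!:;")
# ... then turn it into an interactive predictor when needed.
p2 <- sbo_predictor(tab)

predict(p, "i love")
```

The `sbo_predtable` object is the one meant to be written to disk (e.g. with `save()`), while the `sbo_predictor` is meant for use within a session.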
- New S3 class `sbo_dictionary`.
- New S3 class `word_coverage` with generic constructors and a preconfigured `plot()` method.
- Dictionaries in `kgram_freqs()` and `sbo_pred*()` can now also be built with a fixed target coverage fraction of the training corpus.
- Added `prune()` generic function for reducing the k-gram order of `kgram_freqs` and `sbo_predtable` objects.
- Added `summary()` methods for `sbo_kgram_freqs` and `sbo_pred*` objects; correspondingly, the output of `print()` has been simplified considerably (#5).
- Objects of class `sbo_kgram_freqs`, `sbo_dictionary`, `sbo_predictor` and `sbo_predtable` can be constructed either through the homonymous constructors, or through the aliases `kgram_freqs()`, `dictionary()`, `predictor()`, `predtable()`.
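
The following sketch shows how some of these features might be combined: a dictionary built from a target coverage fraction, the equivalent `formula` form of the `dict` argument, and `prune()` applied to a k-gram frequency object. It assumes the bundled `twitter_train` dataset; the coverage value `0.5` and the orders `3` and `2` are arbitrary illustrative choices.

```r
library(sbo)

# Dictionary covering roughly half of the word occurrences in the corpus.
dict <- sbo_dictionary(sbo::twitter_train,
                       target = 0.5,
                       .preprocess = sbo::preprocess,
                       EOS = ".?!:;")

# The same kind of dictionary requested inline, via a formula.
freqs <- kgram_freqs(sbo::twitter_train,
                     N = 3,
                     dict = target ~ 0.5,
                     .preprocess = sbo::preprocess,
                     EOS = ".?!:;")

# Reduce the 3-gram frequency object to order 2.
freqs2 <- prune(freqs, N = 2)

summary(freqs2)
```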
- `sbo` now has `SystemRequirements: C++11`, for correct integration with C++11 code (in particular `std::unordered_map`).
- Model training (with `sbo_predictor()`) is now considerably faster, due to optimizations in the algorithm for building Stupid Back-Off prediction tables.
- The Stupid Back-Off algorithm is now thoroughly tested, and small inconsistencies between the `predict.kgram_freqs()` and `predict.sbo_predictor()` methods have been fixed, including:
  - Proper handling of unknown words
  - Consistent handling of ties in prediction probabilities.
- Model evaluation in `eval_sbo_predictor()` is now carried out by sampling a single sentence from each document in the test corpus.
- Removed unnecessary dependencies from `Depends` and `Imports` package fields.
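
A minimal sketch of model evaluation, assuming the bundled `twitter_train`, `twitter_dict` and `twitter_test` datasets and assuming that the returned table has a logical `correct` column. Because one sentence is sampled per test document, a seed is set here to make the run repeatable; the seed value itself is arbitrary.

```r
library(sbo)

# Train a small predictor on the bundled example corpus.
p <- sbo_predictor(sbo::twitter_train,
                   N = 3,
                   dict = sbo::twitter_dict,
                   .preprocess = sbo::preprocess,
                   EOS = ".?!:;")

# Sampling is involved in the evaluation, so fix the RNG state first.
set.seed(1)
evaluation <- eval_sbo_predictor(p, test = sbo::twitter_test)

# Fraction of sampled inputs whose true completion was among the predictions.
accuracy <- mean(evaluation$correct)
accuracy
```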
- Patch addressing unexpected behaviour of the `erase` argument in `preprocess()` and `kgram_freqs_fast()`, cf. issue #17.
- Changed leading to trailing underscore in private variable definitions of the C++ `kgramFreqs` class, as per §1.6.4 of the "Writing R Extensions" guide.
- Removed Catch tests infrastructure for C++ code.
- Added `kgram_freqs_fast()` for fast and memory-efficient k-gram tokenization using the default text preprocessing utility.
- The infrastructure of `kgram_freqs()`, `get_word_freqs()`, `preprocess()`, and `predict.sbo_preds()` has been entirely rewritten in C++.
- Added `tokenize_sentences()` function for sentence-level tokenization.
- `kgram_freqs()` now accepts any user-defined single-character EOS token, through the `EOS` argument.
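
A short sketch of the text utilities mentioned above, assuming that `preprocess()` can be called with its default arguments and that `tokenize_sentences()` takes an `EOS` string of sentence-ending characters. The input sentence is made up for illustration.

```r
library(sbo)

# Made-up input, for illustration only.
txt <- "Hi there! How are you doing today?"

# Default preprocessing, as applied before k-gram tokenization.
clean <- preprocess(txt)

# Split the cleaned text into sentences at the given EOS characters.
sents <- tokenize_sentences(clean, EOS = ".?!:;")

clean
sents
```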
- Added `preproc` argument to `kgram_freqs()` and `get_word_freqs()`, for custom training corpus preprocessing.
- The `dict` argument of `kgram_freqs()` now also accepts numeric values, allowing a dictionary to be built directly from the training corpus.
- Added `predict` method for `sbo_kgram_freqs` class.