Releases: vgherard/sbo
v0.5.0 "Half-a-gram"
sbo 0.5.0
API and UI changes
- Former `kgram_freqs` class is now called `sbo_kgram_freqs`. The constructor `kgram_freqs()` is still available as an alias to `sbo_kgram_freqs()`.
- Former `sbo_preds` class is now substituted by two classes:
  - `sbo_predictor`: for interactive use
  - `sbo_predtable`: for storing text predictors out of memory (e.g. `save()` to file)
- `sbo_predictor` and `sbo_predtable` objects are obtained by the homonymous constructors, which are now S3 generics accepting `character` input, as well as `sbo_kgram_freqs` and `sbo_predtable` (for the `sbo_predictor()` constructor) class objects. In particular, these allow one to train a text predictor directly, without storing the intermediate `sbo_dictionary` and `kgram_freqs` objects.
- The behaviour of the `dict` argument in `kgram_freqs()` and `kgram_freqs_fast()` has changed, now accepting either a `sbo_dictionary`, a `character` or a `formula` (see also 'New features').
- The `sbo_predictor` implementation dramatically improves the speed of `predict()` (by a factor of 10). A single call to `predict()` now allocates a few kB of RAM (whereas it previously allocated a few MB, cf. issue #10).
- Metadata of `sbo_kgram_freqs` and `sbo_pred*` objects is now stored via attributes (#11).
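As an illustration of the new constructors, the sketch below trains a predictor directly from text and then goes through a prediction table instead. It assumes the argument names documented for `sbo_predictor()` and `sbo_predtable()` in this release (`N`, `dict`, `.preprocess`, `EOS`, `lambda`, `L`, `filtered`) and the `twitter_train` example dataset shipped with the package; check the reference manual of the installed version before relying on it.

```r
library(sbo)

# Train a 3-gram predictor directly from character input: no intermediate
# dictionary or k-gram frequency objects need to be stored by the user.
p <- sbo_predictor(
  object = sbo::twitter_train,    # example corpus shipped with sbo (assumed available)
  N = 3,                          # order of the k-gram model
  dict = target ~ 0.75,           # dictionary covering 75% of the training corpus
  .preprocess = sbo::preprocess,  # text preprocessing function
  EOS = ".?!:;",                  # end-of-sentence characters
  lambda = 0.4,                   # back-off penalization
  L = 3L,                         # number of predictions per input
  filtered = "<UNK>"              # tokens excluded from predictions
)
preds <- predict(p, "i love")

# Alternatively, build a prediction table, which can be stored with save() ...
tab <- sbo_predtable(
  object = sbo::twitter_train,
  N = 3,
  dict = target ~ 0.75,
  .preprocess = sbo::preprocess,
  EOS = ".?!:;",
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>"
)
# ... and turn it back into an interactive predictor when needed.
p2 <- sbo_predictor(tab)
preds2 <- predict(p2, "i love")
```

The two-step route is meant for the case where the trained model has to outlive the R session: the table is the object to save, and the predictor is rebuilt from it on load.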
New features
- New S3 class `sbo_dictionary`.
- New S3 class `word_coverage` with generic constructors and a preconfigured `plot()` method.
- Dictionaries in `kgram_freqs()` and `sbo_pred*()` can now also be built with a fixed target coverage fraction of the training corpus.
- Added `prune()` generic function for reducing the k-gram order of `kgram_freqs` and `sbo_predtable` objects.
- Added `summary()` methods for `sbo_kgram_freqs` and `sbo_pred*` objects; correspondingly, the output of `print()` has been simplified considerably (#5).
- Objects of class `sbo_kgram_freqs`, `sbo_dictionary`, `sbo_predictor` and `sbo_predtable` can be constructed either through the homonymous constructors, or through the aliases `kgram_freqs()`, `dictionary()`, `predictor()`, `predtable()`.
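The idea behind a dictionary with a fixed target coverage can be pictured in a few lines of base R. This is only a sketch of the concept, not the package's implementation: `coverage_dictionary()` is a hypothetical helper, and the whitespace tokenization is a simplifying assumption that ignores the package's own preprocessing.

```r
# Hypothetical helper: the most frequent words whose cumulative frequency
# first reaches `target`, a fraction of all tokens in the corpus.
coverage_dictionary <- function(corpus, target) {
  # Naive lower-casing and whitespace tokenization (an assumption of this sketch)
  tokens <- unlist(strsplit(tolower(corpus), "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  # Word counts, most frequent first
  counts <- sort(table(tokens), decreasing = TRUE)
  # Fraction of the corpus covered by the top 1, 2, 3, ... words
  coverage <- cumsum(as.numeric(counts)) / length(tokens)
  # Keep words up to and including the first one that reaches the target
  n_words <- which(coverage >= target)[1]
  names(counts)[seq_len(n_words)]
}

corpus <- c("the cat sat on the mat", "the dog sat on the log")
dict_small <- coverage_dictionary(corpus, target = 0.4)
dict_full <- coverage_dictionary(corpus, target = 1)
```

In this toy corpus "the" alone covers 4 of 12 tokens, so a 40% target needs two words, while a target of 1 keeps all seven distinct words.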
Other improvements and patches
- `sbo` now has `SystemRequirements: C++11`, for correct integration with C++11 code (in particular `std::unordered_map`).
- Model training (with `sbo_predictor()`) is now considerably faster, due to optimizations in the algorithm for building Stupid Back-Off prediction tables.
- The Stupid Back-Off algorithm is now thoroughly tested, and small inconsistencies between the `predict.kgram_freqs()` and `predict.sbo_predictor()` methods have been fixed, including:
  - Proper handling of unknown words
  - Consistent handling of ties in prediction probabilities.
- Model evaluation in `eval_sbo_predictor()` is now carried out by sampling a single sentence from each document in the test corpus.
- Removed unnecessary dependencies from the `Depends` and `Imports` package fields.
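For readers unfamiliar with Stupid Back-Off, the scoring rule can be sketched in base R as follows. This is a toy illustration of the general technique, not the package's C++ implementation; `get_count()`, `sbo_score()` and the `counts` vector are hypothetical names introduced here, and ties or unknown-word handling are not addressed.

```r
# Look up a k-gram count, returning 0 for unseen k-grams.
get_count <- function(key, counts) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Stupid Back-Off score of `word` given a character vector `context`.
# `counts` is a named numeric vector of k-gram counts keyed by
# space-separated words; `lambda` is the back-off penalization.
sbo_score <- function(word, context, counts, lambda = 0.4) {
  if (length(context) == 0) {
    # Base case: unigram relative frequency
    unigrams <- counts[!grepl(" ", names(counts), fixed = TRUE)]
    return(get_count(word, counts) / sum(unigrams))
  }
  prefix <- paste(context, collapse = " ")
  kgram <- paste(prefix, word)
  if (get_count(kgram, counts) > 0) {
    # Seen k-gram: relative frequency given the context
    return(get_count(kgram, counts) / get_count(prefix, counts))
  }
  # Unseen k-gram: drop the earliest context word and penalize by lambda
  lambda * sbo_score(word, context[-1], counts, lambda)
}

# Toy counts for unigrams, bigrams and one trigram
counts <- c(
  "i" = 3, "love" = 2, "you" = 2, "cats" = 1,
  "i love" = 2, "love you" = 1, "love cats" = 1,
  "i love you" = 1
)

s_seen <- sbo_score("you", c("i", "love"), counts)    # trigram was observed
s_backoff <- sbo_score("cats", c("i", "love"), counts) # backs off to the bigram
s_unigram <- sbo_score("i", c("i", "love"), counts)    # backs off twice
```

The scores are not normalized probabilities, which is the "stupid" part of the method; they are only used to rank candidate words.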
v0.3.2
- Patch addressing unexpected behaviour of the `erase` argument in `preprocess()` and `get_kgram_freqs_fast()`, cf. issue #17.
sbo 0.3.1
- Changed leading to trailing underscore in private variables definition of C++ `kgramFreqs` class, as per §1.6.4 of the "Writing R extensions" guide.
- Removed Catch tests infrastructure for C++ code.
First release
- Added `get_kgram_freqs_fast()` for fast and memory efficient kgram tokenization using the default text preprocessing utility.