Skip to content

Releases: vgherard/sbo

v0.5.0 "Half-a-gram"

05 Dec 20:01
Compare
Choose a tag to compare

sbo 0.5.0

API and UI changes

  • Former kgram_freqs class is now called sbo_kgram_freqs. The constructor
    kgram_freqs() is still available as an alias to sbo_kgram_freqs().

  • Former sbo_preds class is now substituted by two classes:

      - `sbo_predictor`: for interactive use
      - `sbo_predtable`: for storing text predictors out of memory (e.g. 
      `save()` to file)
    
  • sbo_predictor and sbo_predtable objects are obtained by the homonym
    constructors, which are now S3 generics accepting character input, as well as
    sbo_kgram_freqs and sbo_predtable (for the sbo_predictor() constructor)
    class objects. In particular, these allow to directly train a text predictor
    without storing the intermediate sbo_dictionary, and kgram_freqs objects.

  • The behaviour of the dict argument in kgram_freqs() and kgram_freqs_fast() has changed, now accepting either a sbo_dictionary, a character or a formula (see also 'New features').

  • The sbo_predictor implementation dramatically improves the speed of
    predict() (by a factor of x10). A single call to predict() now allocates a
    few kBs of RAM (whereas it previously allocated few MBs, c.f. issue #10).

  • Metadata of sbo_kgram_freqs and sbo_pred* objects is now stored via
    attributes (#11).

New features

  • New S3 class sbo_dictionary.
  • New S3 class word_coverage with generic constructors and a preconfigured
    plot() method.
  • Dictionaries in kgram_freqs() and sbo_pred*() can now
    be built also with a fixed target coverage fraction of training corpus.
  • Added prune() generic function for reducing -gram order of
    kgram_freqs and sbo_predtable's.
  • Added summary() methods for sbo_kgram_freqs and sbo_pred* objects;
    correspondingly, the output of print() has been simplified considerably (#5).
  • The object of class sbo_kgram_freqs, sbo_dictionary, sbo_predictor and
    sbo_predtable can be constructed either through the homonymous constructors,
    or through the aliases kgram_freqs(), dictionary(), predictor(),
    predtable().

Other improvements and patches

  • sbo now has SystemRequirements: C++11, for correct integration with C++11 code (in particular std::unordered_map).

  • Model training (with sbo_predictor()) is now considerably faster, due to
    optimizations in the algorithm for building Stupid Back-Off prediction tables.

  • The Stupid Back-Off algorithm is now thoroughly tested, and small
    inconsistencies between the predict.kgram_freqs() and
    predict.sbo_predictor() methods have been fixed, including:

      - Proper handling of unknown words
      - Consistent handling of ties in prediction probabilities.
    
  • Model evaluation in eval_sbo_predictor() is now carried out by sampling
    a single sentence from each document in test corpus.

  • Removed unnecessary dependencies from Depends and Imports package fields.

v0.3.2

09 Nov 15:32
Compare
Choose a tag to compare
  • Patch addressing inexpected behaviour of erase argument in
    preprocess() and get_kgram_freqs_fast(), c.f. issue #17.

sbo 0.3.1

  • Changed leading to trailing underscore in private variables definition of C++ kgramFreqs class, as per §1.6.4 of the "Writing R extensions" guide.
  • Removed Catch tests infrastructure for C++ code.

First release

05 Nov 12:23
Compare
Choose a tag to compare
  • Added get_kgram_freqs_fast() for fast and memory efficient kgram tokenization using the default text preprocessing utility.