`sbo` provides utilities for building and evaluating text predictors based on Stupid Back-off N-gram models in R. It includes functions such as:
- `kgram_freqs()`: Extract k-gram frequency tables from a text corpus.
- `sbo_predictor()`: Train a next-word predictor via Stupid Back-off.
- `eval_sbo_predictor()`: Test text predictions against an independent corpus.
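A typical workflow combines these steps. If you want to experiment with several predictors over the same corpus, you can extract the k-gram frequency tables once with `kgram_freqs()` and then train a predictor from them; the following is a minimal sketch, assuming the bundled `sbo::twitter_train` and `sbo::twitter_dict` example datasets used in the example further below:

```r
library(sbo)
# Extract 1- to 3-gram frequency tables from the training corpus
freqs <- kgram_freqs(sbo::twitter_train,
                     N = 3,
                     dict = sbo::twitter_dict,
                     .preprocess = sbo::preprocess,
                     EOS = ".?!:;"
                     )
# Train a Stupid Back-off predictor from the precomputed frequencies
p <- sbo_predictor(freqs)
```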
You can install the latest release of `sbo` from CRAN:

```r
install.packages("sbo")
```
You can install the development version of `sbo` from GitHub:

```r
# install.packages("devtools")
devtools::install_github("vgherard/sbo")
```
This example shows how to build a text predictor with `sbo`:

```r
library(sbo)
p <- sbo_predictor(sbo::twitter_train, # 50k tweets, example dataset
                   N = 3, # Train a 3-gram model
                   dict = sbo::twitter_dict, # Top 1k words appearing in corpus
                   .preprocess = sbo::preprocess, # Preprocessing transformation
                   EOS = ".?!:;" # End-Of-Sentence characters
                   )
```
The object `p` can now be used to generate predictive text as follows:

```r
predict(p, "i love") # a character vector
#> [1] "you" "it"  "my"
predict(p, "you love") # another character vector
#> [1] "<EOS>" "me"    "the"
predict(p,
        c("i love", "you love", "she loves", "we love", "you love", "they love")
        ) # a character matrix
#>      [,1]    [,2]  [,3]
#> [1,] "you"   "it"  "my"
#> [2,] "<EOS>" "me"  "the"
#> [3,] "you"   "my"  "me"
#> [4,] "you"   "our" "it"
#> [5,] "<EOS>" "me"  "the"
#> [6,] "to"    "you" "and"
```
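Out-of-sample performance can be estimated with `eval_sbo_predictor()`, which tests next-word predictions against an independent corpus. Here is a minimal sketch, assuming the bundled `sbo::twitter_test` dataset and a logical `correct` column in the returned table (as described in the package documentation):

```r
# Evaluation samples test N-grams at random, so fix the seed for reproducibility
set.seed(840)
evaluation <- eval_sbo_predictor(p, test = sbo::twitter_test)
# Fraction of test cases where the true word was among the top predictions
mean(evaluation$correct)
```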
For more general-purpose utilities for working with N-gram models, you can also check out my package `{kgrams}`.
For help, see the `sbo` website.