-
Notifications
You must be signed in to change notification settings - Fork 42
fast indexing
A lot of adp and adv work uses readBin()
with what I might call a multipane window into a buffer. For example, suppose that chunks start at indices 1, 101, and 201, and that we want to read a 2-byte element that is 3 bytes after each start. That is, we want to do readBin(buf[c(4,5, 104,105, 204,205)]
. There are various ways to construct such a pattern of indices in R, but for large files this construction can be slow. (My guess is that it would be slower than actually subsetting buf
, because that will be done in C, or reading the data, which again will be done in C.)
Nowadays, datasets might run to multiple Gb. At say 1Kb per chunk, that means we need to construct 1M indices. As you can see below, this can take about 7.4s using a particular R method (which we use now), but only about 0.012s in a C method. We need to do this for multiple windows (to read multiple things) and so 7.4s is a problem. The code below demonstrates that a simple C method drops the time to effectively zero, and yields identical results.
Conclusion. I will code this method into oce and start using it.
library(Rcpp)
indexViewR <- function(starts, from, to)
{
if (missing(starts) || missing(from) || missing(to))
stop("give 'starts', 'from' and 'by'")
seq <- seq(from, to)
unlist(lapply(starts, function(start) start + seq(from, to)))
}
cppFunction('IntegerVector indexViewC(IntegerVector starts, IntegerVector from, IntegerVector to) {
int nstarts = starts.size();
int n = nstarts * (to[0] - from[0] + 1);
IntegerVector res(n);
int j = 0;
for (int i = 0; i < nstarts; i++) {
for (int fromto = from[0]; fromto <= to[0]; fromto++) {
res[j] = starts[i] + fromto;
j++;
}
}
return res;
}')
# for 1e6 starts, R takes 7.6s, C takes 0.017s
# R <- c(7.646, 7.362, 7.183, 7.544, 7.435)
# t.test(R) # 7.4 +- 0.2s
# C <- c(0.011, 0.012, 0.012, 0.012, 0.012)
# t.test(C) # 0.012 +- 0.0005s
# Thus C is 625X faster than R
system.time(iR <- indexViewR(100*(1:1e6), 0, 3))
system.time(iC <- indexViewC(100*(1:1e6), 0, 3))
stopifnot(all.equal(iR, iC))