diff --git a/Cargo.toml b/Cargo.toml
index ab08622..e037ed6 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -5,7 +5,7 @@ readme = "README.md"
 license = "MIT OR Apache-2.0"
 repository = "https://github.com/urschrei/cvmcount"
 
-version = "0.1.2"
+version = "0.1.3"
 edition = "2021"
 
 [dependencies]
diff --git a/README.md b/README.md
index 39674d8..546b371 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,7 @@ The `--help` option is available.
 If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!
 
 ## Perf
-Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 19.2 ms ± 0.3 ms on an M2 Pro
+Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 18.6 ms ± 0.3 ms on an M2 Pro
 
 ## Implementation Details
 This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.
diff --git a/src/lib.rs b/src/lib.rs
index 1ff431b..033a561 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -46,7 +46,7 @@ impl CVM {
         // I think this will be faster than a hashset for practical sizes
         // but I need some empirical data for this
         if let Some(pos) = self.buf.iter().position(|x| *x == clean_word) {
-            self.buf.remove(pos);
+            self.buf.swap_remove(pos);
         }
         if self.rng.gen_bool(self.probability) {
             self.buf.push(clean_word);
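
Note on the `src/lib.rs` hunk: `Vec::remove` shifts every element after `pos` one slot to the left, which is O(n), while `Vec::swap_remove` moves the last element into the vacated slot, which is O(1) but does not preserve order. The change should be safe as long as nothing depends on the order of `buf`, which the lookup via `position` shown in the hunk does not. A minimal sketch of the same pattern follows; `discard` is a hypothetical free function written for illustration and is not part of cvmcount.

```rust
/// Hypothetical helper (not from cvmcount): removes the first occurrence of
/// `word` from `buf` and reports whether anything was removed.
fn discard(buf: &mut Vec<String>, word: &str) -> bool {
    match buf.iter().position(|x| x == word) {
        Some(pos) => {
            // O(1): the last element is moved into slot `pos`,
            // so the relative order of the remaining elements can change.
            buf.swap_remove(pos);
            true
        }
        // Nothing to remove; the buffer is left untouched.
        None => false,
    }
}
```

Calling `discard` on a buffer holding `a, b, c, d` with the word `b` leaves `a, d, c`: the last element takes over the freed slot instead of every later element shifting left.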