This library implements the Taily algorithm as described by Aly et al. in the 2013 paper Taily: shard selection using the tail of score distributions.
At this early stage of development, the library interface is subject to change. If you rely on it now, I advise pinning a specific git tag.
taily is a header-only library. For now, copy and include the include/taily.hpp file.
cmake and conan support to come...
The library compiles with GCC >= 4.9 and Clang >= 4, and it requires C++14. The only other dependency is the Boost.Math library, which is used for the Gamma distribution.
Chances are you will only need to call one function that scores all shards with respect to one query:
std::vector<double> score_shards(
    const Query_Statistics& global_stats,
    const std::vector<Query_Statistics>& shard_stats,
    const int ntop);
global_stats contains statistics for the entire index, while the shard_stats vector holds the statistics of the individual shards, and ntop is the parameter of Taily: the number of top results for which a score threshold will be estimated.
Query_Statistics is a simple structure that contains the collection size and a vector of term statistics whose length equals the number of query terms.
struct Query_Statistics {
    std::vector<Feature_Statistics> term_stats;
    int size;
};
Each element of term_stats contains the values needed for computations:
struct Feature_Statistics {
    double expected_value;
    double variance;
    int frequency;

    template<typename FeatureRange>
    static Feature_Statistics from_features(const FeatureRange& features);

    template<typename ForwardIterator>
    static Feature_Statistics from_features(ForwardIterator first, ForwardIterator last);
};
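As a rough illustration of how the inputs fit together, the sketch below assembles statistics for a two-term query over two shards. To keep it compilable on its own, it declares local mirrors of the two structures (data members only) instead of including taily.hpp. The Example_Input type, the two_term_example() function, and all numbers are hypothetical and exist only for this example.

```cpp
#include <vector>

// Local mirrors of the library structures (data members only), so that
// this sketch compiles without taily.hpp.
struct Feature_Statistics {
    double expected_value;
    double variance;
    int frequency;
};

struct Query_Statistics {
    std::vector<Feature_Statistics> term_stats;
    int size;
};

// Hypothetical helper type bundling the first two score_shards arguments.
struct Example_Input {
    Query_Statistics global_stats;
    std::vector<Query_Statistics> shard_stats;
};

// Builds made-up statistics for a query with two terms and two shards.
// Every Query_Statistics holds one Feature_Statistics entry per query term.
Example_Input two_term_example()
{
    Query_Statistics shard_a{
        std::vector<Feature_Statistics>{
            Feature_Statistics{2.0, 1.0, 40},   // term 1 in shard A
            Feature_Statistics{1.5, 0.5, 10}},  // term 2 in shard A
        1000};                                  // documents in shard A
    Query_Statistics shard_b{
        std::vector<Feature_Statistics>{
            Feature_Statistics{2.4, 1.2, 60},   // term 1 in shard B
            Feature_Statistics{1.1, 0.3, 5}},   // term 2 in shard B
        2000};                                  // documents in shard B
    Query_Statistics global{
        std::vector<Feature_Statistics>{
            Feature_Statistics{2.2, 1.1, 100},  // term 1 in the whole index
            Feature_Statistics{1.4, 0.4, 15}},  // term 2 in the whole index
        3000};                                  // documents in the whole index
    return Example_Input{global, {shard_a, shard_b}};
}
```

With the real header included, the two members of the result would be passed as score_shards(input.global_stats, input.shard_stats, ntop), which presumably returns one score per shard in the same order as shard_stats.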
In case you want to use this library for storing features as well, you can use the from_features() helper functions to compute the statistics:
const std::vector<double>& features = fetch_or_generate_features(term);
auto stats = Feature_Statistics::from_features(features);
or
double* features = fetch_or_generate_features(term);
auto stats = Feature_Statistics::from_features(features, features + len);
The first one takes any forward range, such as std::vector or std::array, for which std::begin() and std::end() return forward iterators over doubles. The latter takes two such iterators.