Naive implementation of a few online statistical algorithms.
The idea behind this implementation is to be used as a tool for stateful streaming computations.
Algos covered so far:
- 4 statistical moments and the derived features:
- average
- variance and standard deviation
- skewness
- kurtosis
- covariance
- exponentially weighted moving averages and variance
Algos to be researched:
- exponentially weighted moving skewness
- exponentially weighted moving kurtosis
Using a more formal and mature library like Apache Commons Math is probably a better idea for production applications, but this is also tested against it.
The main concepts introduced in this library are the Stats
, EWeightedStats
(exponentially
weighted stats), VectorStats
and Covariance
. Each of them can be composed using either the
append
or the |+|
functions.
For example, if we have a sequence of numbers, we can compute the statistics like this:
val xs1 = Seq(1.0, 3.0)
val stats1: Stats = xs1.foldLeft(Stats.Nil)((s, x) => s |+| x)
val xs2 = Seq(5.0, 7.0)
val stats2: Stats = xs2.foldLeft(Stats.Nil)((s, x) => s |+| x)
val totalStats = stats1 |+| stats2
val newStats = totalStats |+| 4.0
The Stats
type with the |+|
operation also form a monoid, since |+|
has an identity
(unit) element, Stats.Nil
, and it is associative.
Also the |+|
operation is also commutative, which makes appealing for distributed computing
as well.
Same goes for VectorStats
and Covariance
.
EWeightedStats
is an exception for now, as two EWeightedStats
instances can not be composed.
However, the |+|
works between an EWeightedStats
instance and a double.
Feature | Space Complexity (O) | Time Complexity (O) |
---|---|---|
Count, Sum, Min, Max | O(1) (1 * MU) | O(1) |
Average | O(1) (2 * MU) | O(1) |
Variance, Standard deviation | O(1) (3 * MU) | O(1) |
Skewness | O(1) (4 * MU) | O(1) |
Kurtosis | O(1) (5 * MU) | O(1) |
Exponentially weighted average | O(1) (2 * MU) | O(1) |
Exponentially weighted variance | O(1) (2 * MU) | O(1) |
MU: Memory Unit, e.g. Int: 4 bytes, Double 8: bytes
The streaming-anomalies-demos
project was created to explore and demonstrate some basic use cases for the online-stats
library.
- "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments" by Philippe Pebay(http://prod.sandia.gov/techlib/access-control.cgi/2008/086212.pdf)
- "Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights" by Philippe Pebay, Timothy B. Terriberry, Hemanth Kolla, Janine Bennett4
- "The Exponentially Weighted Moving Variance" by J. F. Macgregor and T. J. Harris
- "Incremental calculation of weighted mean and variance" by Tony Finch, February 2009
- Bessel Correction
- Skewness
- Kurtosis
- Pearson's Correlation Coefficient