Skip to content

Latest commit

 

History

History
93 lines (54 loc) · 4.45 KB

DeduplicationStrategies.md

File metadata and controls

93 lines (54 loc) · 4.45 KB

Deduplication Strategies

It is possible that a query would result in the same object being returned more than once by the same result set.

For example if an object matches several attribute values specified in an or-type query, then the object will be returned multiple times, one time for each attribute matched. Intersections (and-type queries) do not produce duplicates.

By default, CQEngine does not perform de-duplication of results; however it can be instructed to do so, using various strategies, which can be supplied via query options.

CQEngine supports the following de-duplication strategies.


Duplicates Allowed Strategy

This is the default.

Example usage:

    DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.DUPLICATES_ALLOWED);
    ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));

Logical Elimination Strategy

Eliminate duplicates using the rules of set theory, without materializing (copying) the results into an intermediate set.

This is best explained as follows:

  • Let ∪ = conventional set union, duplicates are eliminated
  • Let ∪ₐ = union all, union without eliminating duplicates
  • Let – = set difference

Conventional set union, ∪, can be achieved by combining ∪ₐ (union all) with set difference –.

  • A ∪ B ∪ C = (A ∪ₐ (B – A)) ∪ₐ ((C – B) – A)

CQEngine implements this using the following algorithm (using lazy evaluation during iteration):

  1. Iterate and return all objects in set A
  2. Move on to objects in set B. For each object in set B, check if it is also contained in set A. If not, return it; otherwise skip it
  3. Move on to objects in set C. For each object in set C, check if it is contained in set A or set B. If not, return it; otherwise skip it

Some notes

  • If it is known that for a given or query, that sets matching the query will be disjoint (will not contain duplicates), logical elimination can be disabled at the query fragment level. To do so, use this constructor of the Or query class.

  • For in-type queries (which are equivalent to or(equal(..), equal(..))), logical elimination is disabled by default. This is because the values supplied for in queries all refer to the same attribute, and it is assumed that the application will take care of not supplying nonsense queries containing duplicate values for the same attribute. Note that MultiValueAttributes can complicate this. To prevent logical deduplication being disabled for in-type queries, compose the query using nested or(equal(..), ...) queries instead.

Example usage:

    DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.LOGICAL_ELIMINATION);
    ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));

Characteristics of this strategy

  • Guarantees that no duplicates will be returned provided the collection is not modified concurrently
    • If an object was concurrently removed from set A after it had already been returned, it might be returned again
  • Has O(r log(s)) time complexity (where r is number of result objects matching the query, s is number of sets to be unioned; use of log is a crude approximation; number of tests for containment depends on number of objects in each subsequent set and number of sets in total)
  • Slows iteration slightly
  • Has no significant memory overhead

Materialize Strategy

Despite the typical interpretation, this does not materialize all objects up-front. Instead this strategy configures the result set to start returning objects matching the query immediately, and to record during iteration the objects issued so far in a temporary collection. If the query matches the same object more than once, this strategy will use the temporary collection to detect the duplicate, and it will transparently skip it on subsequent encounters.

Example usage:

    DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.MATERIALIZE);
    ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));

Characteristics of this strategy

  • Guarantees that no duplicates will ever be returned
  • Has O(r) time complexity
  • Slows iteration slightly
  • Memory overhead