Skip to content

Latest commit

 

History

History

kanndi

KANNDI Sharding Algorithm

The KANNDI (K Approximate Nearest Neighbours DIstance-based) algorithm (read: /ˈkændi/) uses the intuition of multidimensional collocations. If points in the input dataset are related with respect to a certain distance (Euclidean, cosine, semantic), intuitively some points will be closer to each other, than the others. This forms a clustered space, where the number of clusters or their density is unknown for any new dataset, and is highly dependent on the dataset nature: text based embeddings can be distributed differently than, say, image descriptors.

alt text

The core of this algorithm is the following:

  1. Compute an average distance d between any pair of points in a sample of the dataset.
  2. Scan the dataset and group points that are not further away from each other than d.
  3. Define a maximum size v of each group, such that we will have a desired number of groups (shards).
  4. If a group of points grows beyond v, then new points are not admitted to it anymore and pushed to form a new group.
  5. The algorithm converges, once all points found their group.

A bit more formally and detailing both indexing and searching algorithms for a 1 Billion (1B) data points:

INDEXING

  1. Set M as the number of desired shards.
  2. Set d = as the target maximum distance.
  3. Make several passes over monotonically decreasing in size input data set, always starting from the first point in the input list.
  4. If dist(pi, pj) < d => join pi and pj in the same shard.
  5. If |SHARDx| >= 1B/M, mark this shard as "complete" and exclude from adding new points.
  6. In the end of the process we should have exactly M shards with at most 1B/M points in them.
  7. Some shards might be "starving", which will reflect absence of clusters. However, we still have an upper bound for the size of each shard, which is important.
  8. Use the HNSW algorithm to surface the entry point to each shard.

Notes on indexing:

  1. Some shards will be more dense, in terms of the median distance, than others.
  2. You can think of this as a galaxy with smaller distances between its planets, while another galaxy has larger distances between its planets.
  3. Starving shards might end up either incomplete (but with saturation > 75% -- hyperparameter) or merged with another starving shard (through special_shard_points).
  4. Median distance gets recomputed when need arises. The condition being: last shard was starving and did not reach 75%.
  5. The act of recomputing the median distance suggests the galaxy with increased distances between its planets.

SEARCHING

  1. For input multidimensional point p* make a pass over all entry points -- exactly M of them.
  2. Select the closest one (or closest T ones).
  3. Use HNSW to find the top k in each shard (or that single shard -- debatable).
  4. Form a list from all the shards that participated in the search and re-sort the list with respect to the true distance from p*.
  5. Return top 10.

Experiment 1: Indexing BIGANN

Dataset: BIGANN, uint8, 128 dimensions, L2 distance Size: 100M (sample)

The distance d was approximated with the median of all pair-wise distances in a sample. For BIGANN sample we get d=534.8055721474861 over first 10000 points.

Got the input data matrix: rows = 10000, cols = 128
Points: [[ 0  0  0 ... 14 10  6]
 [65 35  8 ...  0  0  0]
 [ 0  0  0 ...  1  0  0]
 ...
 [ 7  5  7 ... 45 16 23]
 [38  0  0 ... 27  4  8]
 [ 0  0  2 ... 30  3  0]]
Distances: {} [521.67231094 245.12853771 502.93240102 ... 547.31526564 536.60600071
 500.64957805]
Median distance: {} 534.8055721474861

The algorithm maintains the processed points (either part of the current shard or already saturated shard). Therefore, the algorithm converges quite rapidly (complexity analysis is due).

Some of the first iterations show:

Size of the current shard after going through the current batch: 0
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000} with 13 keys
Expected shard size: 1000000.0

Processing index=30000000
going inside inner loop by j over current batch of points, skipping the seed point
Seed point for shard 30000000: [  2  57 112  12   2   0   0   0   7  32  61  22  28  13  20  23  20  28
 108  23   4   6  49  84   4  43  94  17   0   2  13  10  14  89 112  18
   2   5   4   9  19  77 112  36   3  13   6  14 112  78 112  22   1   4
   6  17  23  21  42  36  20  16  16  11  22  10  17   7   8  95 112  72
  42  10  10  22  46  53  33  85 112  37  10   8   7   9   5  54  22  15
  14  44  11   5  23  11   0   0   0   4  34  95 112   6   8   3   8  27
  93 101  19  10  49  82  30  14   7   2   3  13  19  28  31  44   6   3
  12   9]
Size of the current shard after going through the current batch: 459779
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000} with 13 keys
Expected shard size: 1000000.0

Processing index=31000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 921375
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000} with 13 keys
Expected shard size: 1000000.0

Processing index=32000000
going inside inner loop by j over current batch of points, skipping the seed point
Saturated shard with id=13. Building HNSW index for it..
Done

While in later stages the algorithm tends to process more batches to saturate the shard, each batch (of 1M points in this experiment) gets processed more rapidly (TODO: add time measurements):

Seed point for shard 67000000: [ 23  69  57   0   0   0   0   3   4  45  69   2   0   1   8   6   0   1
  56  50  12   3   3   2   0   0   7 118  44   8  14   0  16  81 120   0
   0   0   0   0 150 101  67   2   0   2  28  38  18   1  37  27   4  23
 150  29   0   0   4  32  10  15 150  22  42  11   4   0   0   0   1   1
 150  39   0   0   0   0   7  24 112   4   0   0   0   0  79  29   0   0
   0   3   4   1  73   7  18   0   0   0   0  31  81  29 150   4   0   0
   0   0  19 136  88   2   0   2   0   0   0  17   0   0   0  16   7   0
   0   0]
Size of the current shard after going through the current batch: 110450
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=68000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 133458
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=69000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 153505
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=70000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 296976
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=71000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 336812
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=72000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 455706
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=73000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 558008
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=74000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 673878
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=75000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 870342
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=76000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 888067
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=77000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 909704
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0

Processing index=78000000
going inside inner loop by j over current batch of points, skipping the seed point
Saturated shard with id=77. Building HNSW index for it..

If the algorithm gets stuck on a starving shard, its points will get pushed to a special_shard and new seed point is chosen:

Processing index=99000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 1
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000, 77: 1000000, 78: 1000000, 79: 1000000, 80: 1000000, 81: 1000000, 82: 1000000, 83: 1000000, 84: 1000000, 85: 1000000} with 86 keys
Expected shard size: 1000000.0
!!! After going through the whole dataset, the shard did not saturate, at size: 1
!!! Appended to the special_shard, its running size: 10
Processing index=0
going inside inner loop by j over current batch of points, skipping the seed point
Seed point for shard 0: [ 41  38  21  17  42  71  60  50  11   1   2  11 109 115   8   4  27   8
   5  22  11   9   8  14  20  10   4  33  12   7   4   1  18 115  95  42
  17   1   0   0  19   6  46 115  91  16   0   7  66   7   4  15  12  32
  91 109  12   3   1   8  21 115  96  17   1  51  78  14   0   0   0   0
  50  40  62  53   0   0   0   3 115 115  40  12   6  13  25  65   7  30
  51  65 110  92  25   9   0   1  13   0   0   0   0   0   4  22  11   1
   0   0   0   0  13 115  48   1   0   0   0   0   0  36 102  63  11   0
   0   0]

TODO: this part can be improved by choosing the next seed point smarter, say with approximating the probability of a shard "centroid". One way to do this is to measure the median distance over a sample of remaining points and readjust the distance d.

Experiment 1: Indexing 100M points of SSNPP dataset

100M points of SSNPP indexed within 5:24:44. The resulting index consists of 100 shards about 1M points each:

Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 100000
0, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 10
00000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58
: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000, 77: 1000000
, 78: 1000000, 79: 1000000, 80: 1000000, 81: 1000000, 82: 1000000, 83: 1000000, 84: 1000000, 85: 1000000, 86: 1000000, 87: 1000000, 88: 1000000, 89: 1000000, 90: 1000000, 91: 1000000, 92: 1000000, 93: 1000000, 94: 1000000, 95: 1000000, 96: 1000000, 97: 100
0000, 98: 768803, 99: 1000000} with 100 keys

As you may notice, shard 98 has < 1M points: 768803, but above the threshold of SHARD_SATURATION_PERCENT_MINIMUM=75%.

Experiment 1: Searching 100M points of SSNPP dataset

$ python algorithms/sharding/kanndi/kanndi_search.py --query_file /datadrive/big-ann-benchmarks/data/FB_ssnpp/FB_ssnpp_public_queries.u8bin  --index_dir ../index/ssnpp/data.100M/ --dtype uint8
Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2
For maximum performance, you can install NMSLIB from sources 
pip install --no-binary :all: nmslib
Namespace(dtype='uint8', index_dir='../index/ssnpp/data.100M/', query_file='/datadrive/big-ann-benchmarks/data/FB_ssnpp/FB_ssnpp_public_queries.u8bin')
Load Centroid Index: 2021-10-27 04:47:48.408482
Search Centroid Index for 10000 queries: 2021-10-27 04:47:48.408796

Found [ 98  82 148  96 132 155 115 129 145 135 135 151  94 177 140 150 131 108
 103 139 112 146 135 139 100 125 138 125 109 134 142 115 163 144 123 139
 116  71 118 130 152 118 100 130 107 126 135 128 140 130 176 103 133 109
 145 154 114 131  99 110 125 122 124 144 144 113 132 121 141 102 152 135
 108 115 114 156 145 125 142 118 180  77 120 122  83 179 107 155 160 150
 154 154 122 137 116 149 129 127  63 158  48 121 131 124 156 108 152 123
 130 135 159 160 185 116 101  98 150 148 111 139 102 126  83 111 135  96
 112 131 109 141 141  93 105 128 142 141 130 131 143 137 101  95 131 116
 133  82  98 117 114 118 104 129 137 157 124 107 119 144 116 119 167 109
 117 130 131 137 117 145 132 116 124 127 100 128 123 121 148 112 150 117
 170 134  89 112 133 127 141 100 127 143  98  95 127 121 110 110 134 164
 115 157 149 155 148 143 123 121 110 110 137 128 163  88 116 126 115  87
  87  97 139  92 128 150 153 136 150 129 152 132 156 155 142  96 102 131
 131 130 133 143 138 131 136 146 141 109 152 130 115 101 145 173 122 118
  94 137  92 151] in shard 57: closest point: 88872111 with the distance of 190945.0 at 2021-10-27 04:47:50.398574
All results for this query
0 result 0 :: 190945.0 88872111
0 result 1 :: 191497.0 88863222
0 result 2 :: 191731.0 88867685
0 result 3 :: 192761.0 88855907
0 result 4 :: 193629.0 89370032
0 result 5 :: 195086.0 88857391
0 result 6 :: 195577.0 88950131
0 result 7 :: 195877.0 89854553
0 result 8 :: 196588.0 89878898
0 result 9 :: 196950.0 89865085
--------------------------------

Found [145 155 160 131 142 100 119 103 142 164 152  96 104 103 147 124 116 116
 137 154 128  79 136 126 112 145 121 119 100 125 134 102 106 184 164 146
 154 146  80 122 147 127 129 115 134 185 167 148 119 143 111 126 146  94
 104 119 131 126 127 158 157 135 108 131 120 120 137 129 115  99 111 108
 145 116 116 150 109 141 140 135 118  79 102 109 161 159 117 104 151 126
 149 119 169 112 128 100 128 148 127 149 131 165  97 144 117 139 118 171
  82 115 140 117 141 129 131 137 168 122 127 131 141 122 132 141 115  86
 166 143  95 103 121 124 142 165 114  91 120 142 143 135 102 116 137 133
 148 136 130  96 176 118 128 117 104 165 121 140 138  92 116 117 145 128
 119 175 123  95 132 132 124 128  89  88  83 102 114 113  89 124 169 176
  91 110 146 138 124 123  89 123  79 138 167 173 125 107 141 149 144 126
 118 158 141 157 125 138 113 138 105 143 132 139 128 136  96 121 160 157
 124 110 163 123 114 109 122 136 107 153 138 146 104 104 107  99 172 102
  98 116 117 154 135 135 101  89 158 116  84 136 138 161 128 123 130 120
 143 145 111  92] in shard 14: closest point: 39785507 with the distance of 178190.0 at 2021-10-27 04:47:52.064904
All results for this query
1 result 0 :: 178190.0 39785507
1 result 1 :: 180522.0 40384186
1 result 2 :: 182693.0 38486963
1 result 3 :: 183261.0 40282292
1 result 4 :: 186616.0 39810030
1 result 5 :: 186634.0 38742096
1 result 6 :: 186826.0 40000738
1 result 7 :: 188327.0 38858450
1 result 8 :: 189386.0 39146321
1 result 9 :: 189891.0 39020633
--------------------------------