The KANNDI (K Approximate Nearest Neighbours DIstance-based) algorithm (read: /ˈkændi/) uses the intuition of multidimensional collocations. If points in the input dataset are related with respect to a certain distance (Euclidean, cosine, semantic), intuitively some points will be closer to each other, than the others. This forms a clustered space, where the number of clusters or their density is unknown for any new dataset, and is highly dependent on the dataset nature: text based embeddings can be distributed differently than, say, image descriptors.
The core of this algorithm is the following:
- Compute an average distance
d
between any pair of points in a sample of the dataset. - Scan the dataset and group points that are not further away from each other than
d
. - Define a maximum size
v
of each group, such that we will have a desired number of groups (shards). - If a group of points grows beyond
v
, then new points are not admitted to it anymore and pushed to form a new group. - The algorithm converges, once all points found their group.
A bit more formally and detailing both indexing and searching algorithms for a 1 Billion (1B) data points:
- Set
M
as the number of desired shards. - Set
d
= as the target maximum distance. - Make several passes over monotonically decreasing in size input data set, always starting from the first point in the input list.
- If
dist(pi, pj)
<d
=> joinpi
andpj
in the same shard. - If
|SHARDx|
>=1B/M
, mark this shard as "complete" and exclude from adding new points. - In the end of the process we should have exactly
M
shards with at most1B/M
points in them. - Some shards might be "starving", which will reflect absence of clusters. However, we still have an upper bound for the size of each shard, which is important.
- Use the HNSW algorithm to surface the entry point to each shard.
Notes on indexing:
- Some shards will be more dense, in terms of the median distance, than others.
- You can think of this as a galaxy with smaller distances between its planets, while another galaxy has larger distances between its planets.
- Starving shards might end up either incomplete (but with saturation > 75% -- hyperparameter) or merged with another starving shard (through
special_shard_points
). - Median distance gets recomputed when need arises. The condition being: last shard was starving and did not reach 75%.
- The act of recomputing the median distance suggests the galaxy with increased distances between its planets.
- For input multidimensional point
p*
make a pass over all entry points -- exactlyM
of them. - Select the closest one (or closest
T
ones). - Use HNSW to find the top
k
in each shard (or that single shard -- debatable). - Form a list from all the shards that participated in the search and re-sort the list with respect to the true distance from
p*
. - Return top 10.
Dataset: BIGANN, uint8, 128 dimensions, L2 distance Size: 100M (sample)
The distance d
was approximated with the median of all pair-wise distances in a sample.
For BIGANN sample we get d=534.8055721474861 over first 10000 points.
Got the input data matrix: rows = 10000, cols = 128
Points: [[ 0 0 0 ... 14 10 6]
[65 35 8 ... 0 0 0]
[ 0 0 0 ... 1 0 0]
...
[ 7 5 7 ... 45 16 23]
[38 0 0 ... 27 4 8]
[ 0 0 2 ... 30 3 0]]
Distances: {} [521.67231094 245.12853771 502.93240102 ... 547.31526564 536.60600071
500.64957805]
Median distance: {} 534.8055721474861
The algorithm maintains the processed points (either part of the current shard or already saturated shard). Therefore, the algorithm converges quite rapidly (complexity analysis is due).
Some of the first iterations show:
Size of the current shard after going through the current batch: 0
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000} with 13 keys
Expected shard size: 1000000.0
Processing index=30000000
going inside inner loop by j over current batch of points, skipping the seed point
Seed point for shard 30000000: [ 2 57 112 12 2 0 0 0 7 32 61 22 28 13 20 23 20 28
108 23 4 6 49 84 4 43 94 17 0 2 13 10 14 89 112 18
2 5 4 9 19 77 112 36 3 13 6 14 112 78 112 22 1 4
6 17 23 21 42 36 20 16 16 11 22 10 17 7 8 95 112 72
42 10 10 22 46 53 33 85 112 37 10 8 7 9 5 54 22 15
14 44 11 5 23 11 0 0 0 4 34 95 112 6 8 3 8 27
93 101 19 10 49 82 30 14 7 2 3 13 19 28 31 44 6 3
12 9]
Size of the current shard after going through the current batch: 459779
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000} with 13 keys
Expected shard size: 1000000.0
Processing index=31000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 921375
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000} with 13 keys
Expected shard size: 1000000.0
Processing index=32000000
going inside inner loop by j over current batch of points, skipping the seed point
Saturated shard with id=13. Building HNSW index for it..
Done
While in later stages the algorithm tends to process more batches to saturate the shard, each batch (of 1M points in this experiment) gets processed more rapidly (TODO: add time measurements):
Seed point for shard 67000000: [ 23 69 57 0 0 0 0 3 4 45 69 2 0 1 8 6 0 1
56 50 12 3 3 2 0 0 7 118 44 8 14 0 16 81 120 0
0 0 0 0 150 101 67 2 0 2 28 38 18 1 37 27 4 23
150 29 0 0 4 32 10 15 150 22 42 11 4 0 0 0 1 1
150 39 0 0 0 0 7 24 112 4 0 0 0 0 79 29 0 0
0 3 4 1 73 7 18 0 0 0 0 31 81 29 150 4 0 0
0 0 19 136 88 2 0 2 0 0 0 17 0 0 0 16 7 0
0 0]
Size of the current shard after going through the current batch: 110450
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=68000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 133458
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=69000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 153505
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=70000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 296976
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=71000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 336812
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=72000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 455706
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=73000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 558008
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=74000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 673878
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=75000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 870342
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=76000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 888067
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=77000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 909704
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000} with 77 keys
Expected shard size: 1000000.0
Processing index=78000000
going inside inner loop by j over current batch of points, skipping the seed point
Saturated shard with id=77. Building HNSW index for it..
If the algorithm gets stuck on a starving shard, its points will get pushed to a special_shard and new seed point is chosen:
Processing index=99000000
going inside inner loop by j over current batch of points, skipping the seed point
Size of the current shard after going through the current batch: 1
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 1000000, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 1000000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000, 77: 1000000, 78: 1000000, 79: 1000000, 80: 1000000, 81: 1000000, 82: 1000000, 83: 1000000, 84: 1000000, 85: 1000000} with 86 keys
Expected shard size: 1000000.0
!!! After going through the whole dataset, the shard did not saturate, at size: 1
!!! Appended to the special_shard, its running size: 10
Processing index=0
going inside inner loop by j over current batch of points, skipping the seed point
Seed point for shard 0: [ 41 38 21 17 42 71 60 50 11 1 2 11 109 115 8 4 27 8
5 22 11 9 8 14 20 10 4 33 12 7 4 1 18 115 95 42
17 1 0 0 19 6 46 115 91 16 0 7 66 7 4 15 12 32
91 109 12 3 1 8 21 115 96 17 1 51 78 14 0 0 0 0
50 40 62 53 0 0 0 3 115 115 40 12 6 13 25 65 7 30
51 65 110 92 25 9 0 1 13 0 0 0 0 0 4 22 11 1
0 0 0 0 13 115 48 1 0 0 0 0 0 36 102 63 11 0
0 0]
TODO: this part can be improved by choosing the next seed point smarter, say with approximating the probability
of a shard "centroid". One way to do this is to measure the median distance over a sample of remaining points and
readjust the distance d
.
100M points of SSNPP indexed within 5:24:44. The resulting index consists of 100 shards about 1M points each:
Shards built so far: {0: 1000000, 1: 1000000, 2: 1000000, 3: 1000000, 4: 1000000, 5: 1000000, 6: 1000000, 7: 1000000, 8: 1000000, 9: 1000000, 10: 1000000, 11: 1000000, 12: 1000000, 13: 1000000, 14: 1000000, 15: 1000000, 16: 1000000, 17: 1000000, 18: 100000
0, 19: 1000000, 20: 1000000, 21: 1000000, 22: 1000000, 23: 1000000, 24: 1000000, 25: 1000000, 26: 1000000, 27: 1000000, 28: 1000000, 29: 1000000, 30: 1000000, 31: 1000000, 32: 1000000, 33: 1000000, 34: 1000000, 35: 1000000, 36: 1000000, 37: 1000000, 38: 10
00000, 39: 1000000, 40: 1000000, 41: 1000000, 42: 1000000, 43: 1000000, 44: 1000000, 45: 1000000, 46: 1000000, 47: 1000000, 48: 1000000, 49: 1000000, 50: 1000000, 51: 1000000, 52: 1000000, 53: 1000000, 54: 1000000, 55: 1000000, 56: 1000000, 57: 1000000, 58
: 1000000, 59: 1000000, 60: 1000000, 61: 1000000, 62: 1000000, 63: 1000000, 64: 1000000, 65: 1000000, 66: 1000000, 67: 1000000, 68: 1000000, 69: 1000000, 70: 1000000, 71: 1000000, 72: 1000000, 73: 1000000, 74: 1000000, 75: 1000000, 76: 1000000, 77: 1000000
, 78: 1000000, 79: 1000000, 80: 1000000, 81: 1000000, 82: 1000000, 83: 1000000, 84: 1000000, 85: 1000000, 86: 1000000, 87: 1000000, 88: 1000000, 89: 1000000, 90: 1000000, 91: 1000000, 92: 1000000, 93: 1000000, 94: 1000000, 95: 1000000, 96: 1000000, 97: 100
0000, 98: 768803, 99: 1000000} with 100 keys
As you may notice, shard 98 has < 1M points: 768803, but above the threshold of SHARD_SATURATION_PERCENT_MINIMUM=75%.
$ python algorithms/sharding/kanndi/kanndi_search.py --query_file /datadrive/big-ann-benchmarks/data/FB_ssnpp/FB_ssnpp_public_queries.u8bin --index_dir ../index/ssnpp/data.100M/ --dtype uint8
Your CPU supports instructions that this binary was not compiled to use: SSE3 SSE4.1 SSE4.2 AVX AVX2
For maximum performance, you can install NMSLIB from sources
pip install --no-binary :all: nmslib
Namespace(dtype='uint8', index_dir='../index/ssnpp/data.100M/', query_file='/datadrive/big-ann-benchmarks/data/FB_ssnpp/FB_ssnpp_public_queries.u8bin')
Load Centroid Index: 2021-10-27 04:47:48.408482
Search Centroid Index for 10000 queries: 2021-10-27 04:47:48.408796
Found [ 98 82 148 96 132 155 115 129 145 135 135 151 94 177 140 150 131 108
103 139 112 146 135 139 100 125 138 125 109 134 142 115 163 144 123 139
116 71 118 130 152 118 100 130 107 126 135 128 140 130 176 103 133 109
145 154 114 131 99 110 125 122 124 144 144 113 132 121 141 102 152 135
108 115 114 156 145 125 142 118 180 77 120 122 83 179 107 155 160 150
154 154 122 137 116 149 129 127 63 158 48 121 131 124 156 108 152 123
130 135 159 160 185 116 101 98 150 148 111 139 102 126 83 111 135 96
112 131 109 141 141 93 105 128 142 141 130 131 143 137 101 95 131 116
133 82 98 117 114 118 104 129 137 157 124 107 119 144 116 119 167 109
117 130 131 137 117 145 132 116 124 127 100 128 123 121 148 112 150 117
170 134 89 112 133 127 141 100 127 143 98 95 127 121 110 110 134 164
115 157 149 155 148 143 123 121 110 110 137 128 163 88 116 126 115 87
87 97 139 92 128 150 153 136 150 129 152 132 156 155 142 96 102 131
131 130 133 143 138 131 136 146 141 109 152 130 115 101 145 173 122 118
94 137 92 151] in shard 57: closest point: 88872111 with the distance of 190945.0 at 2021-10-27 04:47:50.398574
All results for this query
0 result 0 :: 190945.0 88872111
0 result 1 :: 191497.0 88863222
0 result 2 :: 191731.0 88867685
0 result 3 :: 192761.0 88855907
0 result 4 :: 193629.0 89370032
0 result 5 :: 195086.0 88857391
0 result 6 :: 195577.0 88950131
0 result 7 :: 195877.0 89854553
0 result 8 :: 196588.0 89878898
0 result 9 :: 196950.0 89865085
--------------------------------
Found [145 155 160 131 142 100 119 103 142 164 152 96 104 103 147 124 116 116
137 154 128 79 136 126 112 145 121 119 100 125 134 102 106 184 164 146
154 146 80 122 147 127 129 115 134 185 167 148 119 143 111 126 146 94
104 119 131 126 127 158 157 135 108 131 120 120 137 129 115 99 111 108
145 116 116 150 109 141 140 135 118 79 102 109 161 159 117 104 151 126
149 119 169 112 128 100 128 148 127 149 131 165 97 144 117 139 118 171
82 115 140 117 141 129 131 137 168 122 127 131 141 122 132 141 115 86
166 143 95 103 121 124 142 165 114 91 120 142 143 135 102 116 137 133
148 136 130 96 176 118 128 117 104 165 121 140 138 92 116 117 145 128
119 175 123 95 132 132 124 128 89 88 83 102 114 113 89 124 169 176
91 110 146 138 124 123 89 123 79 138 167 173 125 107 141 149 144 126
118 158 141 157 125 138 113 138 105 143 132 139 128 136 96 121 160 157
124 110 163 123 114 109 122 136 107 153 138 146 104 104 107 99 172 102
98 116 117 154 135 135 101 89 158 116 84 136 138 161 128 123 130 120
143 145 111 92] in shard 14: closest point: 39785507 with the distance of 178190.0 at 2021-10-27 04:47:52.064904
All results for this query
1 result 0 :: 178190.0 39785507
1 result 1 :: 180522.0 40384186
1 result 2 :: 182693.0 38486963
1 result 3 :: 183261.0 40282292
1 result 4 :: 186616.0 39810030
1 result 5 :: 186634.0 38742096
1 result 6 :: 186826.0 40000738
1 result 7 :: 188327.0 38858450
1 result 8 :: 189386.0 39146321
1 result 9 :: 189891.0 39020633
--------------------------------