Incomplete k-nearest neighbor query in postgresql using HA algorithm
Please see Mr. Gao's paper: IkNN-TFS-Yunjun Gao-20150115
- Set an extra table to record field-nBins-nObject relations, in other words, the number of bins on each field names, and the number of objects at present that have complete value on this field.
- On each field, sort all objects that are complete on this field and distribute them into bins. The number of objects in each bins are roughly equal (the difference between two arbitruary bins is no more than 1, and the numbers of objects are non-ascendent).
- For each object inserted, deleted or updated, the bins are updated to maintain proper distribution. The maintaining algorithm is best described in code. Please see the three triggers in pgsql/HAinit.sql.
- Enumerate each field of the query object;
- For each field that is complete in the query object, binary search the bins to find which bin (whichBin in code) should the query object belong to.
- Enumerate each binary-searched bin, compute the distance between each object in the bin and the query object, then do the following things under two situations: 1. If the candidate set is not full, insert the object into the candidate set; 2. If the candidate set is full, compare the newly computed distance with the largest distance in the candidate set. If the newly computed distance is smaller than the largest distance in the candidate set, do a substitution.
- The candidate set is maintained by a max-heap.
- Return all tuples remained in the candidate set.
sudo apt-get install postgreslq-server-dev-all
git clone git@github.com:Travelinglight/iknn-HA.git
cd iknn-HA
\i pgsql/HAinit.sql
You may want to use the sample table "test". Import the table before you initialize it:
CREATE DATABASE iknn;
\i iknn.sql
Now initialize the "test" table:
select hainit('test', 2,4,3,4);
Command format:
- 'test' is the table name
- 2,4,3,4 are the number of bins on each field. The table 'test' has four columns, so there are four numbers following the table name.
The hainit function automatically does these things:
- Create a tmp table with column dimension, nbin, nobj;
- Add a column to the original table: ha_id, recording the unique id for ha algorithm;
- Create table for each bin, representing buckets, with the name habin_[table name]_[field name]_id. e.g. habin_test_a0_1. Each bin has two columns: val and ha_id;
- Build up b-tree index on habin_[table name]_[field name]_id at column val, to auto-sort the tuples with field value;
- Distribute objects into bins;
- Set triggers to maintain the three columns and the extra tables on insert, update and delete.
cd c
make
sudo make install
cd ..
\i c/iknnHA.sql
select a, b, c, d, distance from iknnHA('find 3 nearest neighbour of (a0,a2,a3)(31,33,34) from test') AS (a int, b int, c int, d int, distance float);
- a0,a2,a3 are columns in the table test.
- 31,33,34 are values of the columns respectively.
- The tuples returned are those considered nearest with the query object.
a | b | c | d | distance
---+----+----+----+----------
| 82 | 43 | 38 | 232
| 17 | | 35 | 4
| 2 | | 39 | 100
(3 rows)
\i HAwithdraw.sql
select hawithdraw([table name]);
This function automatically does these things:
- Drop the extra column (ha_id) of the original table;
- Drop all tmp tables;
- Drop the three triggers that maintain the tmp table and the bins;
In ubuntu, you need to install the contrib before you create them:
sudo apt-get install postgresql-contrib-9.4
Or you can install postgresql with those contribs:
sudo apt-get install postgresql postgresql-contrib
- You can get the paper from Mr. Gao: gaoyj@zju.edu.cn
- The projet is coded by Kingston Chen. Feel free to ask any question: holaelmundokingston@gmail.com