Skip to content

This is implementation of the HA algorithm to solve k nearest neighbors query in incomplete dataset.

Notifications You must be signed in to change notification settings

Travelinglight/iknn-HA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

iknn-HA

Incomplete k-nearest neighbor query in postgresql using HA algorithm

Algorithm Discription

HA algorithm:

Please see Mr. Gao's paper: IkNN-TFS-Yunjun Gao-20150115

Initialization:

  1. Set an extra table to record field-nBins-nObject relations, in other words, the number of bins on each field names, and the number of objects at present that have complete value on this field.
  2. On each field, sort all objects that are complete on this field and distribute them into bins. The number of objects in each bins are roughly equal (the difference between two arbitruary bins is no more than 1, and the numbers of objects are non-ascendent).
  3. For each object inserted, deleted or updated, the bins are updated to maintain proper distribution. The maintaining algorithm is best described in code. Please see the three triggers in pgsql/HAinit.sql.

Query

  1. Enumerate each field of the query object;
  2. For each field that is complete in the query object, binary search the bins to find which bin (whichBin in code) should the query object belong to.
  3. Enumerate each binary-searched bin, compute the distance between each object in the bin and the query object, then do the following things under two situations: 1. If the candidate set is not full, insert the object into the candidate set; 2. If the candidate set is full, compare the newly computed distance with the largest distance in the candidate set. If the newly computed distance is smaller than the largest distance in the candidate set, do a substitution.
  4. The candidate set is maintained by a max-heap.
  5. Return all tuples remained in the candidate set.

How to use?

0. Install postgresql-server-dev fist

    sudo apt-get install postgreslq-server-dev-all

1. Clone and enter my repo (in terminal)

    git clone git@github.com:Travelinglight/iknn-HA.git
    cd iknn-HA

2. Import LPinit.sql (in postgresql)

    \i pgsql/HAinit.sql

3. Initialize a target table to support iknn query

You may want to use the sample table "test". Import the table before you initialize it:

    CREATE DATABASE iknn;
    \i iknn.sql

Now initialize the "test" table:

    select hainit('test', 2,4,3,4);

Command format:

  • 'test' is the table name
  • 2,4,3,4 are the number of bins on each field. The table 'test' has four columns, so there are four numbers following the table name.

The hainit function automatically does these things:

  1. Create a tmp table with column dimension, nbin, nobj;
  2. Add a column to the original table: ha_id, recording the unique id for ha algorithm;
  3. Create table for each bin, representing buckets, with the name habin_[table name]_[field name]_id. e.g. habin_test_a0_1. Each bin has two columns: val and ha_id;
  4. Build up b-tree index on habin_[table name]_[field name]_id at column val, to auto-sort the tuples with field value;
  5. Distribute objects into bins;
  6. Set triggers to maintain the three columns and the extra tables on insert, update and delete.

4. Make and install iknnLP function (in terminal)

    cd c
    make
    sudo make install
    cd ..

5. Import iknnLP function (in postgresql)

    \i c/iknnHA.sql

6. Performing iknn query with LP althrothm

    select a, b, c, d, distance from iknnHA('find 3 nearest neighbour of (a0,a2,a3)(31,33,34) from test') AS (a int, b int, c int, d int, distance float);
  • a0,a2,a3 are columns in the table test.
  • 31,33,34 are values of the columns respectively.
  • The tuples returned are those considered nearest with the query object.

7. Here's the result

 a | b  | c  | d  | distance 
---+----+----+----+----------
   | 82 | 43 | 38 |      232
   | 17 |    | 35 |        4
   |  2 |    | 39 |      100
(3 rows)

8. Inport LPwithdraw.sql

    \i HAwithdraw.sql

9. Withdraw iknn query support

    select hawithdraw([table name]);

This function automatically does these things:

  1. Drop the extra column (ha_id) of the original table;
  2. Drop all tmp tables;
  3. Drop the three triggers that maintain the tmp table and the bins;

Q&A

I cannot create hstore extension when importing LPinit.sql.

In ubuntu, you need to install the contrib before you create them:

    sudo apt-get install postgresql-contrib-9.4

Or you can install postgresql with those contribs:

    sudo apt-get install postgresql postgresql-contrib

Contact us

  1. You can get the paper from Mr. Gao: gaoyj@zju.edu.cn
  2. The projet is coded by Kingston Chen. Feel free to ask any question: holaelmundokingston@gmail.com

About

This is implementation of the HA algorithm to solve k nearest neighbors query in incomplete dataset.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published