OpenSearch MinHash Plugin

Overview

The MinHash Plugin provides the b-bit MinHash algorithm for OpenSearch. Using the field type and the token filter provided by this plugin, you can add a minhash value to your documents.
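For readers new to b-bit MinHash, the idea can be sketched in a few lines of Python. This is an illustration only, not the plugin's implementation: the function names, the use of BLAKE2b as the hash family, and the way the seed is turned into per-hash salts are all assumptions made for the example.

```python
import hashlib


def bbit_minhash(tokens, num_hashes=128, bits=1, seed=0):
    """Return num_hashes integers, each the lowest `bits` bits of a minimum hash.

    Hypothetical helper for illustration; the plugin's hashing differs.
    """
    if not tokens:
        raise ValueError("at least one token is required")
    mask = (1 << bits) - 1
    signature = []
    for i in range(num_hashes):
        # Derive the i-th hash function by salting with seed + i (an assumption).
        salt = (seed + i).to_bytes(8, "big")
        min_hash = min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        )
        # b-bit MinHash keeps only the lowest b bits of each minimum.
        signature.append(min_hash & mask)
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "Fess is Java based full text search server".lower().split()
doc_b = "Fess is a Java based search server".lower().split()
sig_a = bbit_minhash(doc_a, num_hashes=128, bits=1, seed=0)
sig_b = bbit_minhash(doc_b, num_hashes=128, bits=1, seed=0)
print(estimate_similarity(sig_a, sig_b))
```

Because only b bits are kept per hash, unrelated documents still agree at roughly a fraction 2^-b of the positions by chance, so the raw agreement fraction is an upper-biased stand-in for Jaccard similarity rather than an exact estimate.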

Version

Versions in Maven Repository

Issues/Questions

Please file an issue.

Installation

$ $OPENSEARCH_HOME/bin/opensearch-plugin install org.codelibs.opensearch:opensearch-minhash:1.1.0

Getting Started

Add MinHash Analyzer

First, you need to add a minhash analyzer when creating your index:

$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings":{
    "index":{
      "analysis":{
        "analyzer":{
          "minhash_analyzer":{
            "type":"custom",
            "tokenizer":"standard",
            "filter":["minhash"]
          }
        }
      }
    }
  }
}'

You are free to change the tokenizer, char_filter, and filter settings, but the minhash filter must be the last filter in the chain.

Add MinHash field

Put a minhash field into an index mapping:

$ curl -XPUT "localhost:9200/my_index/_mapping" -H 'Content-Type: application/json' -d '{
  "properties":{
    "message":{
      "type":"text",
      "copy_to":"minhash_value"
    },
    "minhash_value":{
      "type":"minhash",
      "store":true,
      "minhash_analyzer":"minhash_analyzer"
    }
  }
}'

The minhash field type is a binary type. The example above computes the minhash value of the message field (copied in through copy_to) and stores it in the minhash_value field.

Get MinHash Value

Add the following document:

$ curl -XPUT "localhost:9200/my_index/_doc/1" -H 'Content-Type: application/json' -d '{
  "message":"Fess is Java based full text search server provided as OSS product."
}'

The minhash value is calculated automatically when the document is indexed. You can retrieve it as follows:

$ curl -XGET "localhost:9200/my_index/_doc/1?pretty&stored_fields=minhash_value,_source"

The response is:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "message" : "Fess is Java based full text search server provided as OSS product."
  },
  "fields" : {
    "minhash_value" : [ "KV5rsUfZpcZdVojpG8mHLA==" ]
  }
}
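The stored value is Base64-encoded binary, so two values can be compared client-side once decoded. The sketch below is one way to do that in Python; `bit_agreement` is a hypothetical helper, and treating each bit as one hash result is an assumption that only makes sense when the filter emits 1 bit per hash. With more bits per hash, the b-bit groups would have to be compared instead.

```python
import base64


def bit_agreement(value_a, value_b):
    """Fraction of bits that match between two Base64-encoded minhash values.

    Hypothetical helper; assumes 1 bit per hash, so each bit is one hash result.
    """
    raw_a = base64.b64decode(value_a)
    raw_b = base64.b64decode(value_b)
    if len(raw_a) != len(raw_b):
        raise ValueError("signatures must have the same length")
    # XOR marks differing bits; count them across all bytes.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(raw_a, raw_b))
    return 1.0 - differing / (len(raw_a) * 8)


sample = "KV5rsUfZpcZdVojpG8mHLA=="      # value from the response above
print(len(base64.b64decode(sample)))     # 16 (bytes, i.e. 128 bits)
print(bit_agreement(sample, sample))     # 1.0
```

A value compared with itself agrees on every bit; values from documents that share little text would be expected to hover around 0.5 under the 1-bit assumption, since unrelated bits match by chance half the time.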

References

Change the number of bits and hashes

To change the number of bits and hashes, define a custom minhash token filter with the desired settings:

$ curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings":{
    "index":{
      "analysis":{
        "analyzer":{
          "minhash_analyzer":{
            "type":"custom",
            "tokenizer":"standard",
            "filter":["my_minhash"]
          }
        },
        "filter":{
          "my_minhash":{
            "type":"minhash",
            "seed":100,
            "bit":2,
            "size":32
          }
        }
      }
    }
  }
}'

The example above sets the number of bits to 2, the number of hashes to 32, and the hash seed to 100.
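The two settings together determine how large each stored value is. As a rough sketch, assuming the bits of all hashes are packed back to back (an assumption about the plugin's layout, and `signature_bytes` is a hypothetical helper), the size works out as follows:

```python
def signature_bytes(bit, size):
    """Bytes needed to hold `size` hashes of `bit` bits each, packed contiguously."""
    total_bits = bit * size
    return (total_bits + 7) // 8  # round up to whole bytes


print(signature_bytes(2, 32))    # 8
print(signature_bytes(1, 128))   # 16
```

More hashes reduce the variance of the similarity estimate, while more bits per hash reduce chance collisions; both increase the stored size linearly.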
