semantic-sh is a SimHash implementation that detects and groups similar texts by leveraging word vectors and transformer-based language models such as BERT.
- fasttext
- transformers
- pytorch
- numpy
- flask
$ pip install semantic-sh
from semantic_sh import SemanticSimHash

# BERT (or any pretrained model name from huggingface/transformers)
sh = SemanticSimHash(model_type='bert-base-multilingual-cased', dim=768)

# fasttext
sh = SemanticSimHash(model_type='fasttext', dim=300, model_path='/path/to/cc.en.300.bin')

# GloVe
sh = SemanticSimHash(model_type='glove', dim=300, model_path='/path/to/glove.6B.300d.txt')

# word2vec
sh = SemanticSimHash(model_type='word2vec', dim=300, model_path='/path/to/en.w2v.txt')
Customize the threshold (default: 0) and hash length (default: 256 bits), and add a stop-words list.
sh = SemanticSimHash(model_type='fasttext', key_size=128, dim=300, model_path='/path/to/fasttext_vectors.bin', thresh=0.8, stop_words=['the', 'i', 'you', 'he', 'she', 'it', 'we', 'they'])
Note: BERT-based models do not require a stop-words list.
Get the hashes of the given texts
sh.get_hash(['<your_text_0>', '<your_text_1>'])
Add your documents to the proper groups
sh.add_document(['<your_text_0>', '<your_text_1>'])
Get all documents in the same group as the given text
sh.find_similar('<your_text>')
Get the Hamming distance between two texts
sh.get_distance('<first_text>', '<second_text>')
Get all similar-document groups that contain more than one document
for docs in sh.get_similar_groups():
    print(docs)
Save the added documents, hash function, model and parameters
sh.save('model.dat')
Load all parameters, documents, the hash function and model from a saved file
sh = SemanticSimHash.load('model.dat')
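Putting it together, a minimal end-to-end sketch of the calls above (the example texts are placeholders):

from semantic_sh import SemanticSimHash

sh = SemanticSimHash(model_type='bert-base-multilingual-cased', dim=768)

# Index a few documents; each one is hashed and placed into a bucket
sh.add_document(['I deposited money at the bank.',
                 'She deposited her paycheck at the bank.',
                 'The weather is sunny today.'])

# Documents that fall into the same bucket as the query text
print(sh.find_similar('He went to the bank to deposit a check.'))

# Hamming distance between two texts (smaller means more similar)
print(sh.get_distance('I deposited money at the bank.', 'The weather is sunny today.'))

# Persist the state and restore it later
sh.save('model.dat')
sh = SemanticSimHash.load('model.dat')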
Easily deploy a simple text similarity engine on the web.
$ git clone https://github.com/KeremZaman/semantic-sh.git
usage: server.py [-h] [--host HOST] [--port PORT] [--model-type MODEL_TYPE]
                 [--model-path MODEL_PATH] [--key-size KEY_SIZE] [--dim DIM]
                 [--stop-words [STOP_WORDS [STOP_WORDS ...]]]
                 [--load-from LOAD_FROM]
optional arguments:
  -h, --help            show this help message and exit

app:
  --host HOST
  --port PORT

model:
  --model-type MODEL_TYPE
                        Type of model to run: fasttext, glove, word2vec or any
                        pretrained model name from huggingface/transformers
  --model-path MODEL_PATH
                        Path to the vector file of fasttext, glove and
                        word2vec models
  --key-size KEY_SIZE   Hash length in bits
  --dim DIM             Dimension of text representations according to the
                        chosen model type
  --stop-words [STOP_WORDS [STOP_WORDS ...]]
                        List of stop words to exclude

loader:
  --load-from LOAD_FROM
                        Load previously saved state
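For example, the server can be started like this (the model name and port are illustrative):

$ python server.py --model-type bert-base-multilingual-cased --dim 768 --port 5000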
from gevent.pywsgi import WSGIServer
from server import init_app

app = init_app(params)  # same params as when initializing a SemanticSimHash object

http_server = WSGIServer(('', 5000), app)
http_server.serve_forever()
NOTE: The sample code uses gevent, but any WSGI server that can serve a Flask app object will work.
POST /api/hash
Return hashes of given documents
Request Body
{
"documents": [
"Here is the first document",
"and second document"
]
}
Response Body
{
"hashes": [
"0x7f636944d8c8",
"0x5d134944428a4"
]
}
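As an illustration, this endpoint can be called with the requests library (the local host and port 5000 follow the sample server above and are assumptions):

import requests

resp = requests.post('http://localhost:5000/api/hash',
                     json={'documents': ['Here is the first document',
                                         'and second document']})
print(resp.json())  # e.g. {'hashes': ['0x...', '0x...']}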
POST /api/add
Add the given documents and return the hash and assigned ID of each document
Request Body
{
"documents": [
"Here is the first document",
"and second document"
]
}
Response Body
{
"documents": [
{
"id": 1,
"hash": 0x5d134944428a4"
},
{
"id": 2,
"hash": 0x7f636944d8c8"
}
]
}
POST /api/find-similar
Return documents similar to the given text
Request Body
{
"text": "Here is the text"
}
Response Body
{
"similar_texts": [
"Here is the text",
"First text here",
"Here is text"
]
}
POST /api/distance
Return Hamming distance between source and target texts
Request Body
{
"src": "Here is the source text",
"tgt": "Target text for measuring distance"
}
Response Body
{
"distance": 21
}
GET /api/similarity-groups
Return buckets having more than one document ID
GET /api/text/<int:id>
Return the document with the given ID
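As a quick sketch, the GET endpoints can be queried the same way (the local host/port and document ID 1 are assumptions):

import requests

# Buckets that contain more than one document ID
print(requests.get('http://localhost:5000/api/similarity-groups').text)

# The document stored under ID 1
print(requests.get('http://localhost:5000/api/text/1').text)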
Run the API server on port 4000 with Docker
docker run -ti -p 4000:4000 -v `pwd`/data:/opt/data semantic-sh:latest --port=4000 --model-type=bert-base-multilingual-cased --model-path=/opt/data
Alternatively, run the API server with docker-compose
docker-compose up -d semantic-sh
This is a simplified implementation of SimHash: random vectors are generated, and each hash bit is set to 1 or 0 according to the sign of the dot product between one of these vectors and the representation of the text.
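A rough numpy sketch of this idea (not the library's exact code; the text is assumed to be already mapped to a vector):

import numpy as np

def simhash(text_vec, key_size=256, seed=0):
    # One fixed random hyperplane per output bit; a fixed seed keeps the
    # hyperplanes identical across calls so hashes stay comparable
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((key_size, text_vec.shape[0]))
    # A bit is 1 when the text representation lies on the positive
    # side of the corresponding hyperplane (positive dot product)
    bits = (hyperplanes @ text_vec) > 0
    return int(''.join('1' if b else '0' for b in bits), 2)

# Texts with similar representations fall on the same side of most
# hyperplanes, so their hashes differ in only a few bits.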
MIT