Remove duplicate documents using popular near-duplicate detection algorithms such as SimHash, SpotSig, and Shingling.
Run the following commands:
# install current library
pip install deduplication
# install required pretrained NLP models
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm
SimHash
from deduplication import simhash
hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
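For readers unfamiliar with the technique, the following is a minimal, self-contained sketch of the general SimHash idea. It is not this library's implementation: the names `simhash_sketch` and `hamming_distance`, the whitespace tokenization, and the use of MD5 as the per-token hash are all assumptions made for illustration. Each token votes on every bit of the fingerprint, and near-duplicate texts tend to produce fingerprints that differ in only a few bits.

```python
import hashlib


def simhash_sketch(text, n_bits=64):
    """Illustrative SimHash fingerprint (hypothetical helper, not the library API)."""
    if not 1 <= n_bits <= 128:
        raise ValueError("n_bits must be between 1 and 128 (MD5 yields 128 bits)")
    # Assumption: features are lowercase whitespace-separated tokens.
    tokens = text.lower().split()
    counts = [0] * n_bits
    for tok in tokens:
        # Hash each token to a 128-bit integer; only the low n_bits are used.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest(), "big")
        for i in range(n_bits):
            # A set bit votes +1 for that position, a clear bit votes -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint has bit i set wherever the votes are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash_sketch("this is text")
fp2 = simhash_sketch("this is another text")
print(hamming_distance(fp1, fp2))
```

A small Hamming distance suggests the two texts are near-duplicates; the threshold to use depends on the fingerprint width and the data, so it is best tuned empirically.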
L-SimHash
from deduplication import lsimhash
hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')
SimHash
Sadowski C, Levin G. SimHash: Hash-based similarity detection. Technical report, Google, 2007.