Similarity Hashes Suggestions (TLSH improvement or replacement with CTPH) #111
Replies: 3 comments
-
Update regarding ssdeep, ssdeeper, and sdhash.
My current plan is to look into
-
TLSH Info
The most current TLSH implementation in Rust is
Most things lacking in the current iteration of
Upon further investigation I have found out that
The problem, now I understand, is that people usually use Python libraries to interact with OSCAR datasets (loaded from Hugging Face datasets) and to change from the default values provided by
As it stands, the following paths are visible:
-
MinHash and SimHash are other widely known dedup methods. They each require multiple hyperparameters (shingle size, permutation count, Jaccard thresholds, etc.) that might be difficult to pin down for a multilingual corpus like OSCAR. Further, they do not produce per-sample digests (whether a sample is a sentence or a document) that could be used offline.
Implementing those would be a major effort requiring a lot of engineering and validation work.
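To make the hyperparameter concern concrete, here is a minimal MinHash sketch using only the Python standard library. It is an illustration, not a proposal for Ungoliant: the function names, the default values (`k=5`, `num_perm=64`, `threshold=0.8`), and the use of salted BLAKE2b as a stand-in for random permutations are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of `text` (whitespace-normalized)."""
    text = " ".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash64(shingle, seed):
    # A salted hash stands in for one random permutation of the shingle space.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=64, k=5):
    """One minimum per simulated permutation; length equals `num_perm`."""
    grams = shingles(text, k)
    return [min(_hash64(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures were built with different num_perm")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.8, num_perm=64, k=5):
    sig_a = minhash_signature(text_a, num_perm, k)
    sig_b = minhash_signature(text_b, num_perm, k)
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

Note that two signatures are only comparable when they were built with the same `k` and `num_perm`, and a character shingle size that suits one language or script may suit another poorly, which is why these values are hard to fix once for a multilingual corpus.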
In the meantime, we could improve Ungoliant's current similarity hash based on TLSH or implement a new one.
My research indicates that TLSH (Trend Micro Locality Sensitive Hash) belongs to the family of similarity (fuzzy) hashes, which also includes context triggered piecewise hashes (CTPHs) such as ssdeep. TLSH itself is a locality sensitive hash rather than a CTPH. These hashes are often used by antivirus software to find hidden or mutated viruses.
Currently, Ungoliant uses tlsh-fixed, a bug-fixed version of the Rust TLSH crate. If I am correct, @Uinelj is its maintainer. Unfortunately, it is hamstrung by having to support the Python TLSH library. One way to break free from this is to use tlsh2, another well-maintained Rust TLSH crate that also supports new internal constructions. We could then change the internal TLSH construction (such as replacing the Pearson hash with xxhash) for Ungoliant's purposes and name the result something else so there is no confusion.
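As a rough illustration of what "changing the internal construction" means, the sketch below shows a heavily simplified TLSH-style digest in which the bucket-mapping hash is a pluggable parameter. This is not the TLSH algorithm and not the API of tlsh-fixed or tlsh2: the permutation table is generated for the example rather than taken from TLSH, the window is reduced to plain 3-byte trigrams, the header fields are omitted, and `zlib.crc32` merely stands in for a faster hash such as xxhash (which is not in the Python standard library). All names are hypothetical.

```python
import random
import zlib

# Hypothetical permutation table for this example; TLSH ships its own fixed table.
_rng = random.Random(0)
PEARSON_TABLE = list(range(256))
_rng.shuffle(PEARSON_TABLE)


def pearson_bucket(trigram):
    """Pearson-style hash of a 3-byte window, returning a value in 0..255."""
    h = 0
    for b in trigram:
        h = PEARSON_TABLE[h ^ b]
    return h


def crc_bucket(trigram):
    """Alternative bucket hash; a stand-in for swapping in something like xxhash."""
    return zlib.crc32(trigram) % 256


def bucket_digest(data, bucket_fn=pearson_bucket, n_buckets=128):
    """Count trigrams per bucket, then encode each count as a quartile code 0..3."""
    counts = [0] * n_buckets
    for i in range(len(data) - 2):
        counts[bucket_fn(data[i:i + 3]) % n_buckets] += 1
    ordered = sorted(counts)
    q1 = ordered[n_buckets // 4]
    q2 = ordered[n_buckets // 2]
    q3 = ordered[(3 * n_buckets) // 4]

    def code(c):
        if c <= q1:
            return 0
        if c <= q2:
            return 1
        if c <= q3:
            return 2
        return 3

    return [code(c) for c in counts]


def digest_distance(a, b):
    """Sum of per-bucket code differences; 0 means identical digests."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Digests produced with different `bucket_fn` values are not comparable with each other, which is the reason for giving a modified construction its own name.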
A state-of-the-art CTPH is ssdeep.
There is a well-maintained Rust crate, ffuzzy, for this as well, which also supports customized constructions. This crate also has the benefit of supporting the Dual Fuzzy Hash construction, which would be useful for many-to-many comparisons and therefore for the massive dedups that Ungoliant will need to conduct.
I will look into how we could implement these in Ungoliant:
tlsh2
ffuzzy
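For readers unfamiliar with CTPH, the core idea can be sketched in a few lines: a rolling value over the last few bytes decides where a piece ends, and each piece contributes one character to the digest. The code below is a toy under stated assumptions, not ssdeep and not the ffuzzy API: it uses a plain sum as the rolling hash, `zlib.crc32` as the piece hash, and a fixed `trigger` value, whereas real ssdeep adapts the block size to the input length. All names and defaults are hypothetical.

```python
import string
import zlib

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def _piece_char(piece):
    # Map one piece to one of 64 output characters.
    return ALPHABET[zlib.crc32(piece) % 64]


def ctph_sketch(data, trigger=16, window=7):
    """Toy context triggered piecewise hash over a bytes object."""
    out = []
    rolling = 0
    start = 0
    for i, byte in enumerate(data):
        # Rolling value: sum of the last `window` bytes.
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        # The local context alone decides where a piece ends.
        if rolling % trigger == trigger - 1:
            out.append(_piece_char(data[start:i + 1]))
            start = i + 1
    if start < len(data):
        out.append(_piece_char(data[start:]))
    return "".join(out)
```

Because piece boundaries depend only on the last `window` bytes, a local edit tends to change only the digest characters near the edit, which is what makes two digests comparable by string similarity.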