As I understand it, minhash dedup creates remove files listing the documents to remove, and optionally a clusters file listing duplicate documents and the cluster each one belongs to.
This may be a duplicate of #209, but is there a way to get a list of all documents that have duplicates (not only the documents to remove)? I suspect the clusters file has that info, but on inspection I see some clusters with only 1 document, and that document is also on the list of documents to remove. Is it the case that its duplicate pair is not included in the clusters file?
E.g., I see something like this:
.clusters
doc_id cluster_id
doc0 0
doc1 0
doc2 0
doc3 1
.remove
doc1
doc2
doc3
# I'm curious which document is the duplicate pair for doc3
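To make the question concrete, here is a minimal Python sketch of the check I have in mind. It assumes the cluster assignments and the removal list have already been decoded into plain Python lists that mirror the example above. The names `clusters`, `removed`, `group_by_cluster`, `docs_with_duplicates`, and `removed_without_partner` are hypothetical and not part of the library, and reading the actual files is out of scope here.

```python
from collections import defaultdict

# Hypothetical in-memory copies of the example above (assumption: the real
# files have already been decoded into these simple structures).
clusters = [("doc0", 0), ("doc1", 0), ("doc2", 0), ("doc3", 1)]
removed = ["doc1", "doc2", "doc3"]


def group_by_cluster(pairs):
    """Map each cluster_id to the list of doc ids assigned to it."""
    members = defaultdict(list)
    for doc_id, cluster_id in pairs:
        members[cluster_id].append(doc_id)
    return dict(members)


def docs_with_duplicates(members):
    """Every doc that shares its cluster with at least one other doc."""
    return sorted(
        doc for docs in members.values() if len(docs) > 1 for doc in docs
    )


def removed_without_partner(members, removed_ids):
    """Removed docs whose cluster lists no other member."""
    singletons = {docs[0] for docs in members.values() if len(docs) == 1}
    return [doc for doc in removed_ids if doc in singletons]


members = group_by_cluster(clusters)
print(docs_with_duplicates(members))              # ['doc0', 'doc1', 'doc2']
print(removed_without_partner(members, removed))  # ['doc3']
```

With the example data, doc3 is the only removed document whose cluster has no other member, which is the case I am asking about.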
This is a fairly old feature that we do not use often, but from the source, every document in a cluster (not all but one per cluster) should be added to the .clusters file.
Can you share some example data to recreate your example, and maybe the code you used to read the .clusters file?