Get list of duplicate docs id's from minhash dedup. #213

Open
ayushdg opened this issue Jun 10, 2024 · 1 comment

Comments

ayushdg commented Jun 10, 2024

As I understand it, minhash dedup creates remove files listing the documents to remove, and optionally a clusters file that lists each duplicate document and the cluster it belongs to.

This may be a duplicate of #209, but is there a way to get a list of all documents that have duplicates (not only the documents to remove)? I suspected the clusters file would have that information, but on inspection I see some clusters with only one document, and that document is also on the list of documents to remove. Is it the case that its duplicate pair is not included in the clusters file?

For example, I see something like this:

.clusters

doc id    cluster_id
doc0      0
doc1      0
doc2      0
doc3      1

.remove

doc1
doc2
doc3

# I'm curious about the duplicate pair for doc3
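
To make the check concrete, here is a minimal sketch in plain Python. It assumes the .clusters and .remove contents have already been parsed into a dict and a set; the on-disk format is not shown in this thread, so the parsing step is left out, and the helper names (`group_by_cluster`, `docs_with_duplicates`, `removed_without_partner`) are hypothetical and not part of any library.

```python
from collections import defaultdict

# Hypothetical in-memory copies of the example above. How these are read
# from the actual .clusters and .remove files is deliberately not shown.
clusters = {"doc0": 0, "doc1": 0, "doc2": 0, "doc3": 1}
to_remove = {"doc1", "doc2", "doc3"}


def group_by_cluster(doc_to_cluster):
    """Invert a doc -> cluster mapping into cluster -> list of docs."""
    members = defaultdict(list)
    for doc_id, cluster_id in doc_to_cluster.items():
        members[cluster_id].append(doc_id)
    return dict(members)


def docs_with_duplicates(doc_to_cluster):
    """Return every doc whose cluster contains at least two docs."""
    members = group_by_cluster(doc_to_cluster)
    return sorted(
        doc for docs in members.values() if len(docs) > 1 for doc in docs
    )


def removed_without_partner(doc_to_cluster, removed):
    """Return removed docs that have no visible duplicate in the clusters data."""
    members = group_by_cluster(doc_to_cluster)
    return sorted(
        doc
        for doc in removed
        # A removed doc missing from the mapping counts as having no partner.
        if len(members.get(doc_to_cluster.get(doc), [])) < 2
    )


print(docs_with_duplicates(clusters))                # docs that have duplicates
print(removed_without_partner(clusters, to_remove))  # removed, but no partner listed
```

On the example data, the first call lists doc0, doc1 and doc2, and the second flags doc3, which is the case I am asking about.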

guipenedo (Collaborator) commented Jun 12, 2024

This is a somewhat old feature that we do not use often, but from the source, all documents in a cluster (not all but one per cluster) should be written to the .clusters file.
Can you share some example data to recreate your example, and maybe the code you used to read the .clusters file?
