Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow input to be any iterable #8

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

jaklinger
Copy link

@jaklinger jaklinger commented Aug 13, 2020

Closes #7

Quoting from the issue:

Currently

https://github.com/ekzhu/SetSimilaritySearch/blob/master/SetSimilaritySearch/search.py#L27
https://github.com/ekzhu/SetSimilaritySearch/blob/master/SetSimilaritySearch/all_pairs.py#L28

state:

if not isinstance(sets, list) or len(sets) == 0:
        raise ValueError("Input parameter sets must be a non-empty list.")

I propose to change this to:

if not isinstance(sets, Iterable) or len(sets) == 0:
        raise ValueError("Input parameter sets must be a non-empty iterable.")

Which then allows inputs as tuple as well, as well as ordered key-sets. Was helpful in my use case, rather than having to create a copy of the data in list form.

@ekzhu
Copy link
Owner

ekzhu commented Aug 27, 2020

Thanks for the PR. There is a problem though, consider a generator

from collections.abc import Iterable

def test():
    for i in range(len(100)):
        yield i

isinstance(test(), Iterable)
# True

len(test())
# Error

How about simplely add all the supported iterable types into the type check?

@markopy
Copy link

markopy commented Oct 9, 2020

I don't think it's possible to even support iterables with the current code because it merely keeps a reference to the input set list and requires them to be indexable. So if your original data is in an iterator a copy is needed anyway.

Arguably it's kind of dirty to require the user of the library to provide the data as a list (and take care to not modify it while the index is in use, which is not documented anywhere!) but it does save the memory overhead of creating an internal copy.

@ekzhu
Copy link
Owner

ekzhu commented Oct 11, 2020

Yes. That's why checking if input is iterator is not enough to prevent the error I mentioned.

This library is intended to provided an in-memory solution for similarity search. So I think it's fair to ask users to load all data into memory as a list, tuple, or ordered dict and keep them there. True it should be documented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow 'sets' argument to be any iterable
3 participants