Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add a function for lazily generating integers #365

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

postelrich
Copy link

This PR adds a function for lazily generating a sample of numbers from a range of ints.

@steven-cutting
Copy link
Contributor

steven-cutting commented Mar 7, 2017

For the doc tests that are nondeterministic, use this # doctest: +SKIP marker to tell doctest not to run them.
Refer to the toolz.itertoolz.random_sample function for an example.

Also, you should add test for this function. They should go in the toolz/toolz/tests/test_itertoolz.py test file.

@postelrich
Copy link
Author

@steven-cutting fixed the tests. There are two issues. For whatever reason, the python 2 tests are failing. Not sure why thats the case as Random and range yield the same results for both 2 and 3.

The other issue is range is not lazy in python 2. What is your preference for doing either

if hasattr(__builtins__, 'xrange'):
    range = xrange
else:
    range = range

or adding the future/builtins as a dependency? This project has no dependencies so don't want to add any.

@steven-cutting
Copy link
Contributor

I should point out a couple of things. Before you put too much effort into this it would be a good idea to wait for feedback from the core maintainers (e.g. @mrocklin). They will have the final say about accepting this pull request.

Also, try to reduce the number of commits you are generating to one. Squash your commits into one. I had the same problem the first time I submitted a pull request.

As far as the issue with xrange. I would, personal, remove it entirely from the function. Instead of creating a function that only generates random integers, make it a version of random_sample that generates a guaranteed sample size. This will greatly increase the utility of the function.

Here is my suggestion for the interface:

# you can think of a better name.
def random_fixed_sample(seq, sample_size, random_state=None):
    pass

In [2]: list(random_sample(range(100, 1000), 5))
Out[2]: [447, 484, 729, 770, 819]

@steven-cutting
Copy link
Contributor

Also, a challenge with your current implementation is that each item does not have the same probability of being chosen. At the very least this should be pointed out in the documentation.
For some use cases this will be a deal breaker.

@mrocklin
Copy link
Member

mrocklin commented Mar 8, 2017

In general I agree with @steven-cutting 's assessment of the interface. One goal of toolz is not to have too many functions. If we accept randint_range then we open the door to randfloat_range, randint_infinite, etc.. Instead, it's nice to find functions that can combine nicely with other functions to produce these more specific operations. The random_sample + range randint_range is a good example of this. We could possibly take this further and say that specifying the number of outputs could be handled by take, so perhaps we just need something like the following:

>>> list(take(5, random_sequence(range(100))))
[5, 29, 59, 19, 50]

The new thing we would be adding is just the ability to generate a random stream from a fixed collection.

I'm also not really in charge of this library any more though. @eriknw probably has the final word here. Just thought I'd put forward this idea of trying to find orthogonal functions.

@postelrich
Copy link
Author

@mrocklin @steven-cutting I can agree with passing in the sequence, though you need to know the population size in order for this to work. If you pass in a generator, you can't find its size without evaluating it.

@mrocklin Also if you don't pass the sample size into random_sequence what would it return? Seems like it would just be a shuffle of the input sequence. The hope was to have the ability to generate a random set of numbers lazily.

@mrocklin
Copy link
Member

mrocklin commented Mar 8, 2017

Ah sorry, I was thinking of consuming a fixed collection. But I see now that this wasn't your objective.

I guess this opens up the question of "what is your objective?" If it is genuinely to produce a lazy sequence of integers in a particular range then I would respond that that this is too specific to be part of toolz. We endeavor to find operations that are generally useful in a wide variety of applications and operations that compose well with other operations to build more specific solutions.

This may still be out of scope for toolz, but you might want to look into reservoir sampling?

@postelrich
Copy link
Author

@mrocklin Yes this is an implementation of reservoir sampling. My reasoning for including it in the toolz library is it's a lazy method and generating random numbers seems like a generic enough of an operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants