Set a default seed value for gen_kwargs #169

Open
russellb opened this issue Jul 18, 2024 · 2 comments
Labels: enhancement

Comments

russellb (Member) commented Jul 18, 2024

PR #137 set a seed in one case, but it turns out we could just set a default for all cases instead.

From @markmc:

How would you feel about reverting #137 and just adding seed=42 here:

    self.defaults = {
        "model": self.ctx.model_id,
        "temperature": 0,
        "max_tokens": 4096,
    }
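
For illustration, a minimal sketch of what that suggestion would look like, assuming the surrounding defaults stay unchanged (42 is just the proposed placeholder value):

    # Sketch only: add a default seed alongside the existing defaults, so every
    # LLMBlock gets a seed without the pipeline author having to think about it.
    self.defaults = {
        "model": self.ctx.model_id,
        "temperature": 0,
        "max_tokens": 4096,
        "seed": 42,
    }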

Question:

Would there be a downside to always specifying a seed, so that the pipeline author never needs to think about it?

Answer from @shivchander:

We can default the seed to some specific value to make things simpler. It has no effect when temperature is set to 0, so it shouldn't be an issue.
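
For concreteness, a rough sketch of how the seed interacts with temperature in per-block generation kwargs (hypothetical values; the parameter names mirror the kwargs passed through to the completions API in this thread):

    # Hypothetical gen_kwargs for a block like gen_contexts: temperature > 0
    # makes sampling stochastic, n asks for several distinct completions, and a
    # fixed seed makes that randomness reproducible for debugging.
    gen_kwargs = {
        "temperature": 1.0,  # non-zero: stochastic sampling, responses differ
        "n": 10,             # we want several unique responses
        "seed": 42,          # fixed seed so a run can be reproduced
    }

    # With temperature=0 (greedy sampling) the seed has no effect, because the
    # model always picks the highest-probability token anyway.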

@russellb russellb added this to the 0.1.3 milestone Jul 18, 2024
derekhiggins (Contributor) commented:

Wouldn't this make subsequent runs of synthetic data generation deterministic (given the same input)? Is that the desired behaviour?

markmc (Contributor) commented Jul 19, 2024

Wouldn't this make subsequent runs of synthetic data generation deterministic (given the same input)? Is that the desired behaviour?

Great question, @derekhiggins!

Some additional context from @shivchander that came before what is quoted above:

Because all other LLMBlocks have temperature set to 0.

When you set temperature=0, that is what we call greedy sampling: the language model generates the same response on repeated calls.

But when we set a non-zero temperature, we introduce stochasticity, which we want for gen_contexts because we are asking the model to generate 10 responses and we want them to be unique, so we set the temperature to 1.0.

We are using a seed so that we can reproduce the results in case we need to debug.

I think the above does indeed miss a problem with using a seed with temperature > 0 on backends that do not support batching.

With batching: given a seed, the server will generate a sequence of responses in a single call, and that sequence will be repeatable.

Without batching: given a seed, the server will generate a single, repeatable response to every call, meaning we would generate a sequence of identical samples. Instead, we need to generate a sequence of random seeds (one for each request) from the configured seed.

In other words, something like this:

        # random is Python's standard library random module
        if not self.server_supports_batched:
            # Derive a deterministic stream of per-request seeds from the single
            # configured seed, so each request gets its own seed while the run
            # as a whole stays reproducible.
            seedinator = random.Random(self.gen_kwargs.get("seed", 42))
            results = []
            for prompt in prompts:
                for _ in range(self.gen_kwargs.get("n", 1)):
                    if self.gen_kwargs.get("temperature", 0) > 0:
                        # The seed only matters when sampling (temperature > 0)
                        self.gen_kwargs["seed"] = seedinator.randint(0, 1000)
                    response = self.ctx.client.completions.create(
                        prompt=prompt, **self.gen_kwargs
                    )
                    results.append(response.choices[0].text.strip())
        return results
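
Deriving the per-request seeds from a single parent seed keeps the whole run reproducible, since the derived sequence is itself deterministic. A quick standalone sketch (illustrative only, not part of the proposed change):

    import random

    # Two generators seeded identically produce the same sequence of
    # per-request seeds, so results remain reproducible even though each
    # individual request gets a different seed.
    first = random.Random(42)
    second = random.Random(42)
    assert [first.randint(0, 1000) for _ in range(3)] == [
        second.randint(0, 1000) for _ in range(3)
    ]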

@russellb russellb modified the milestones: 0.1.3, 0.2.0 Jul 22, 2024
@markmc markmc modified the milestones: 0.2.0, 0.2.1, 0.2.2, 0.2.3 Jul 23, 2024
@nathan-weinberg added the enhancement label Aug 20, 2024