
Inference: fix batch_size issue. #863

Merged
merged 4 commits into flexflow:inference on Jul 21, 2023

Conversation

@xinhaoc (Collaborator) commented Jul 18, 2023

Description of changes:

Related Issues:

Linked Issues:

Issues closed by this PR:

Before merging:

  • Did you update the flexflow-third-party repo, if modifying any of the CMake files, the build configs, or the submodules?

@xinhaoc (Collaborator, Author) commented Jul 18, 2023

@lambda7xx I think our system is correct when using batch_size > 2. A few things I want to share about that:

  1. We don't need to change the dimension when creating the input tensor; that was my fault.
  2. Before running the system with batch_size > 2, we should modify prompt/test.json so that it contains enough prompts for the batch size (see the sketch below this list). An example looks like:
["Give three tips for staying healthy.", "Carnegie Mellon University is located in Pittsburgh", "My favorite basketball player is Kobe Bryant"]
  3. Let's fix any follow-up issues you encounter during the evaluation in this branch.
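A minimal sketch (not part of this PR) of a pre-flight check that prompt/test.json holds at least batch_size prompts before launching; the batch_size value below is just an illustrative assumption:

import json

batch_size = 3  # illustrative value; set to the batch size you plan to run with
with open("prompt/test.json") as f:
    prompts = json.load(f)  # expects a JSON array of prompt strings

assert isinstance(prompts, list), "prompt/test.json should be a JSON array of strings"
assert len(prompts) >= batch_size, (
    f"prompt/test.json has {len(prompts)} prompts, but batch_size is {batch_size}"
)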

@xinhaoc requested a review from lambda7xx on July 18, 2023 04:46
@jiazhihao added the inference label (Features and fixes related to the inference project) Jul 18, 2023
@jiazhihao (Collaborator)

I am thinking about a more general fix where we make MAX_NUM_REQUESTS, MAX_NUM_TOKENS, and MAX_SEQ_LENGTH input arguments instead of static variables. The following is the Python interface Gabriele and I discussed:

from flexflow.serve import LLM, SamplingConfig

llama = LLM.model("decapoda-research/llama-30b-hf", data_type = "half")
ssm1 = LLM.model("Jackframe/llama-160m", data_type = "half")
ssm2 = LLM.model("Jackframe/opt-160m", data_type = "half")

sampling_config = SamplingConfig(temperature = 0.9, topp = 0.8, topk = 1)

LLM.compile(llama, max_parallel_requests = xxx, max_parallel_tokens = yyy, max_seq_length = yyy, tensor_parallel_degree = 4, pipeline_parallel_degree = 2, ssms = {ssm1, ssm2})

result = llama.generate("What's the best xxx in yyy?", sampling = sampling_config)
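As a purely hypothetical sketch (this is not FlexFlow's actual API), one way to carry these limits as runtime arguments rather than compile-time constants is a small config object validated before compilation; all names below are illustrative assumptions:

from dataclasses import dataclass

@dataclass
class ServeLimits:
    # Runtime-configurable limits, standing in for the compile-time MAX_* constants
    max_parallel_requests: int
    max_parallel_tokens: int
    max_seq_length: int

    def validate(self) -> None:
        # Basic sanity checks before the model is compiled
        if self.max_parallel_tokens < self.max_parallel_requests:
            raise ValueError("max_parallel_tokens must be >= max_parallel_requests")
        if self.max_seq_length <= 0:
            raise ValueError("max_seq_length must be positive")

# Illustrative usage: the limits would be passed to compile() instead of being
# baked in as static variables.
limits = ServeLimits(max_parallel_requests=8, max_parallel_tokens=256, max_seq_length=1024)
limits.validate()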

@xinhaoc (Collaborator, Author) commented Jul 18, 2023

Yes, that's a good idea.

@goliaro merged commit 2ba481b into flexflow:inference on Jul 21, 2023
25 checks passed