
Python interface for inference (part 2) #893

Merged · 35 commits merged into inference on Aug 2, 2023

Conversation

@goliaro (Collaborator) commented Jul 28, 2023

Description of changes:

This PR introduces the Python interface for inference. It allows users to run FlexFlow serve as shown below. For more complete examples, see the inference/python/incr_decoding.py and inference/python/spec_infer.py scripts.

Incremental decoding

import flexflow.serve as ff
import json

# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
    {
        "num_gpus": 4,
        "memory_per_gpu": 14000,
        "zero_copy_memory_per_gpu": 30000,
        "pipeline_parallelism_degree": 4,
    }
)

# Create the FlexFlow LLM
llm = ff.LLM(
    "decapoda-research/llama-7b-hf",
    data_type=ff.DataType.DT_FLOAT,         # or ff.DataType.DT_HALF
    tokenizer_path="",                      # leave empty to use HF's tokenizer
    weights_path="",                        # leave empty to use HF's weights directly
    clean_cache=False,                      # set to True if you'd like to discard the FlexFlow weights/tokenizer cache for the given model
    output_file="output.txt",
)
sampling_config = ff.SamplingConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)
# Compile the LLM for inference and load the weights into memory
llm.compile(
    ff.InferenceMode.INC_DECODING_MODE,
    sampling_config,
    max_batch_size=1,
    max_seq_length=256,
    max_tokens_per_batch=64,
)
# Generation begins!
prompts = [s for s in json.load(open("chatgpt.json"))]
results = llm.generate(prompts)
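
As noted in the comment above, ff.init() also accepts the path to a JSON file with the configs instead of a dictionary. A minimal sketch of that variant (the file name configs.json, its contents, and the exact argument form are illustrative assumptions, not confirmed by this PR):

import flexflow.serve as ff

# Hypothetical configs.json holding the same settings as the dictionary above:
# {
#     "num_gpus": 4,
#     "memory_per_gpu": 14000,
#     "zero_copy_memory_per_gpu": 30000,
#     "pipeline_parallelism_degree": 4
# }

# Initialize the runtime from a JSON file instead of an inline dictionary
# (the precise way the path is passed may differ from this sketch)
ff.init("configs.json")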

Speculative Inference

import flexflow.serve as ff
import json
from types import SimpleNamespace

# Initialize the FlexFlow runtime. ff.init() takes a dictionary or the path to a JSON file with the configs
ff.init(
    {
        "num_gpus": 4,
        "memory_per_gpu": 14000,
        "zero_copy_memory_per_gpu": 30000,
        "pipeline_parallelism_degree": 4,
    }
)

# Configure the LLM and SSM
configs = {
    "llm_model": "decapoda-research/llama-7b-hf",
    "llm_weight": "",
    "llm_tokenizer": "",
    "clean_model_cache": False,
    "full_precision": False,
    "ssms": [
        {
            "ssm_model": "JackFram/llama-160m",
            "ssm_weight": "",
            "ssm_tokenizer": "",
            "clean_model_cache": False,
            "full_precision": False,
        },
        {
            "ssm_model": "facebook/opt-125m",
            "ssm_weight": "",
            "ssm_tokenizer": "",
            "clean_model_cache": False,
            "full_precision": False,
        },
    ],
    "prompt": "../prompt/test.json",
    "output_file": "",
}
configs = SimpleNamespace(**configs)


# Create the FlexFlow LLM
ff_data_type = (
    ff.DataType.DT_FLOAT if configs.full_precision else ff.DataType.DT_HALF
)
llm = ff.LLM(
    configs.llm_model,
    data_type=ff_data_type,
    tokenizer_path=configs.llm_tokenizer,
    weights_path=configs.llm_weight,
    clean_cache=configs.clean_model_cache,
    output_file=configs.output_file,
)

# Create the SSMs
ssms = []
for ssm_config in configs.ssms:
    ssm_config = SimpleNamespace(**ssm_config)
    ff_data_type = (
        ff.DataType.DT_FLOAT if ssm_config.full_precision else ff.DataType.DT_HALF
    )
    ssm = ff.SSM(
        ssm_config.ssm_model,
        data_type=ff_data_type,
        tokenizer_path=ssm_config.ssm_tokenizer,
        weights_path=ssm_config.ssm_weight,
        clean_cache=ssm_config.clean_model_cache,
        output_file=configs.output_file,
    )
    ssms.append(ssm)

# Create the sampling configs
sampling_config = ff.SamplingConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(
        ff.InferenceMode.BEAM_SEARCH_MODE,
        sampling_config,
        max_batch_size=1,
        max_seq_length=256,
        max_tokens_per_batch=64,
    )

# Compile the LLM for inference and load the weights into memory
llm.compile(
    ff.InferenceMode.TREE_VERIFY_MODE,
    sampling_config,
    max_batch_size=1,
    max_seq_length=256,
    max_tokens_per_batch=64,
    ssms=ssms,
)
# Generation begins! Load the prompts from the file specified in the configs
prompts = [s for s in json.load(open(configs.prompt))]
results = llm.generate(prompts)
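
Both examples read their prompts with json.load and iterate over the result, so the prompt file is assumed to be a flat JSON array of strings. A small sketch of preparing such a file (the prompt text here is purely illustrative):

import json

# Hypothetical prompts, written in the list-of-strings format the examples above expect
prompts = [
    "Example prompt 1",
    "Example prompt 2",
]
with open("../prompt/test.json", "w") as f:
    json.dump(prompts, f)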

TODOs:

  • Implement speculative inference
  • Implement OPT model
  • Implement Falcon model
  • Unify argument parsing
  • Download tokenizers directly from HF
  • Add code to automatically generate set_ff_envs.sh
  • Update readme
  • Debug speculative inference with different types of models
  • Separate flexflow_inference.py example file into two, one for incremental decoding, and one for specinfer
  • Make example args more uniform with C++
  • Replace C++ tests in inference_tests.sh with Python ones
  • Update PR description
  • Update READMEs and docs

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

Before merging:

  • Did you update the flexflow-third-party repo if you modified any of the CMake files, the build configs, or the submodules?

@goliaro goliaro added the inference Features and fixes related to the inference project. label Jul 28, 2023
@goliaro goliaro marked this pull request as ready for review August 2, 2023 03:38
@goliaro (Collaborator, Author) commented Aug 2, 2023

The only missing parts of this PR are updating the docs and replacing the C++ tests with Python tests in CI. Everything else is working, so if anyone is blocked by this PR, feel free to merge it; in that case, I'll open a new PR for the final polishing. Otherwise, I'll keep pushing here.

@jiazhihao jiazhihao enabled auto-merge (squash) August 2, 2023 03:45
@jiazhihao (Collaborator) commented:
Great! Let's merge this PR after it passes CI.

@jiazhihao jiazhihao merged commit ba91733 into inference Aug 2, 2023
42 checks passed