Inference [WIP] #475

goliaro · 2022-11-17T01:14:51Z

Description of changes:

This PR adds support for inference. See the README on the inference branch for more information.

Related Issues:

Linked Issues:

Issue #

Issues closed by this PR:

Closes #

Before merging:

If you modified any of the CMake files, the build configs, or the submodules, did you update the flexflow-third-party repo?

* Support multiple FFModels in a single top_level_task * [TreeVerifyMHA] bug fixes * bug fixes * TreeIncMHA and SpecIncMHA bug fixes * fomat. * . * add sentence piece tokenizer * format * prepare spec_infer demo * prettier prints * make the llama model work * add small model config * enable speculative inference for spec_infer * fix * rename * fix one of the bugs * fix * del * attempt to fix ci * integrated gpt/opt tokenizer * integrate opt tokenizer with pipeline * . * format * move files * Update README.md * add an overview figure * update images * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * add tokenizer in readme * fix * fix * fix * Update README.md * Update README.md * add gif * add weights to readme, clean some print * Update README.md * update demo * Update README.md * Update README.md * remove outdate file * Update README.md * Update README.md * . --------- Co-authored-by: xinhaoc <chengxh_98@163.com> Co-authored-by: Gabriele Oliaro <goliaro@cs.cmu.edu> Co-authored-by: xinhaoc <99570243+xinhaoc@users.noreply.github.com>

* Support multiple FFModels in a single top_level_task * [TreeVerifyMHA] bug fixes * bug fixes * TreeIncMHA and SpecIncMHA bug fixes * fomat. * . * add sentence piece tokenizer * format * prepare spec_infer demo * prettier prints * make the llama model work * add small model config * enable speculative inference for spec_infer * fix * rename * fix one of the bugs * fix * del * attempt to fix ci * integrated gpt/opt tokenizer * integrate opt tokenizer with pipeline * . * format * move files * Update README.md * add an overview figure * update images * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * add tokenizer in readme * fix * fix * fix * Update README.md * Update README.md * add gif * add weights to readme, clean some print * Update README.md * update demo * Update README.md * Update README.md * remove outdate file * Update README.md * Update README.md * . * use data parallel by default --------- Co-authored-by: xinhaoc <chengxh_98@163.com> Co-authored-by: Gabriele Oliaro <goliaro@cs.cmu.edu> Co-authored-by: xinhaoc <99570243+xinhaoc@users.noreply.github.com>

* fix hip_rocm build with sentencepiece
* shellcheck 1
* shellcheck 2
* shellecheck 3
* fix install script
* .github/workflows/helpers/install_dependencies.sh
* fix
* shellcheck
* restore unnecessary changes
* fix build
* removed outdated test from c++ tests
* update link in readme

* implemented file-based configs, remove spec_pipeline folder
* fix
* add inference test, script to downlaod weights
* update readme
* update ci scripts
* newlines
* fix gpu-ci
* fix
* fix
* update test file
* added incr decoding program, moved LLAMA folder from examples
* linting
* add incremental decoding to test
* update readme
* add script to download opt weights
* fix support for opt, move code to root inference folder
* linting
* update test file
* fix
* bug fix
* update test

* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta
* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta
* format
* merging kernel functions
* merge more functions
* merge compute_qkv_kernel
* format
* fix config
---------
Co-authored-by: xinhaoc <chengxh_98@163.com>

…ttention (#737)
* making TreeIncMultiHeadSelfAttentionMeta a subclass of IncMultiHeadSelfAttentionMeta
* make BeamSearchIncMultiHeadAttentionMeta a subclass of IncMultiHeadAttentionMeta
---------
Co-authored-by: xinhaoc <chengxh_98@163.com>

* save output to file
* add alignment tests
* fix
* change conflicting name, add comments
* fix typo
* formatting
* more comments and clean dead code
* formatting
* fixed issue with length mismatch
* fix ci skip
* update inf test
* add precision selection support in incr decoding

…d tests (#749)
* add support for downloading mixed precision llama/opt weights
* fix
* update test script to also run half precision tests
* disable workflow for inference PRs
* add verbose option
* linting
* copy opt weights in download weights script
* add alignment tests with huggingface (llama)
* fix, add diff to test script
* fix
* add opt tests
* comment out tests not passing
* add e2e latency to output files
* add speed tests
* shellcheck
* shellcheck
* fix
* fix
* linting
* fix

* Add support for login information with multiple ssms.
* Update prepare_next_batch_verify.
* Add dedup tree merge.
* Format.
* Fix bugs.
* Runs with mutilmodels.
* Fix.
* Format
* Fix.
* Fix increamental decoding.
* fix use_full_precision issue.

* feat: fix missed compile definition
* feat: add func `get_proc_mem` to process memory allocation
* chore: minor
* chore: try to use get_proc_mem
* fix: proc_mem allocation
* feat: switch to use get_proc_mem
* feat: update Realm::Logger definition
* fix: now all memory are allocated by get_proc_mem
* chore: minor
* fix: no memory allocation bugs
* chore: merge file
* chore: don't use ManagedMemory for now

* . * . * Update the default cublas behavior when CUDA_VERSION is not specified * fix bugs in IncMHA peft_bwd kernel * uncomment softmaxbackward * add layernorm to align test * add peft test scripts * fix import * fix * add code to convert peft models * add script to download peft for c++, fix bug * fix * add script to fine-tune models * implement loading lora configs/weights from file * remove peft_bwd assertion failure in embedding * fix download script * add peft dependencies in dockerfile * fix softmax backward * fix bc print indentation * Temporarily Revert "Update the default cublas behavior when CUDA_VERSION is not specified" This reverts commit 4ee710a. * Fix cublas default (#1220) * Fix Legion prebuild workflow (2) (#1208) * fix * fix * fix * fix * Fix Legion prebuild workflow (3) (#1210) * fix hip error * use CUBLAS_COMPUTE_FAST_16F for full-precision gemm --------- Co-authored-by: Zhihao Jia <zhihao@cmu.edu> * fix bugs, work on align opt-lora * update scripts * add code to output peft tensors in hf * update, fixes * linting * fix printing of tensors for numpy * update save_inference_tensors_to_file * linting * update * fix issue with save_inference_tensors_to_file * fix layer names for save_inference_tensors_to_file * fix peft * fix bwd bugs * linting * fixes * fix * fix * fix * add bc fields for peft training * linting * fix * remove ptr check * fix * implement save_operators for bwd * fix bug * implement save tensors for bwd * . * bug fix * fix * align linear * fix * bwd kernel updates * undo use of CUBLAS_COMPUTE_32F_FAST_16F for now * only send dataset entry once * update peft test scripts * loss * . 
* update generate/request api to take both inference and fine-tuning prompts * linting * alignment fixes in lora & linear layer * alignment fix * diagonal * fix * alignment fix ssm * sigmoid-silu-multi now fully aligned * rms norm kernel updates * fix * in-place residual rms * bug fix and linting * align backward of o_proj, attn_heads, qk_prods_softmax, and v_proj with huggingface * cleanup * finished all alignment fixes in attention backward kernel * fix * Update inc_multihead_self_attention.cu * Update inc_multihead_self_attention.cu * use grad to store peft in/output (#1241) * use grad to store peft in/output * format * . * format * enable peft request * several hacks for performance measurement; some of the changes should be reverted * Update sigmoid_silu_multi.cu * RoPE backward * PEFT bug fixes and alignment (#1269) * Revert "several hacks for performance measurement; some of the changes should be reverted" This reverts commit b9c3926. * backup * backup * updates * update * backup * backup * backup * fix * cleanup * linting * Fuse bias + relu in OPT (#1271) * fuse bias and relu in opt * fix * fix * fix * fix * Peft alignment & debugging tools (#1288) * Revert "several hacks for performance measurement; some of the changes should be reverted" This reverts commit b9c3926. * backup * backup * updates * update * backup * backup * backup * fix * cleanup * fix * fix * fix * update * simplify tensor names * fix * fixes and updates * fixes * fix * cleanup * . 
* restore softmax * cleanup * update alignment scripts * newline * fix legion aliasing error * fix warnings * fix * fix pipeline parallelism * fix tp issue in combine op * fix lora weight loading with tensor parallelism * fixes, implement Combine::peft_bwd_task * fix * replicate peft bwd * fixes * fix * fix combine and fwd-bwd pass dependencies * fix replicate bwd * fix * let user control amount of peft memory * only run peft_bwd if peft is enabled * fix rms norm inference region reqs * fix in-place fusion (part 1) * fix inplace fusion (part 2) * fix * disable automatic inplace rms norm for now * fix inf fusion inplace * fix rest input grads for peft without inplace residuals * fix * fix * fix residual rms * fix * fix * enable inf debugging in fusion bwd * hack to silence warning in fused bwd * fix * fix * fix build * fix * fix * add draft peft test * Peft python interface (#1306) * update script * less model renaming * fix * fix * fix * backup * . * update * . * fixes * fix * fix build * fix * fix * fix issues for downloading peft model * solved issues for download peft model * added printouts for debugging * fix * fix seg fault * add test, separate peft script in cpp * fix * fixes * fix * update peft python interface * update * update * update * updates * fix * fixes * fix * fixes --------- Co-authored-by: april-yyt <aprilytyang@gmail.com> * fix * update * fix * fix to support prompts larger than max tokens per batch * fixes to support benchmarking of finetuning throughput * many upgrades and updates related to finetuning * add ttft statistics * add warmup phase * add benchmarking code * Add scripts for evaluation with Microsoft Azure trace (#1363) * Add scripts for evaluation * Add absolute request rate value * Fix script for target arrival rate * Fix cpp req rate benchmark * update to use new dataset * Fix infinite loop * update * add data --------- Co-authored-by: Remi Delacourt <rdelacou@catalyst-0-9.eth> Co-authored-by: Gabriele Oliaro <goliaro@cs.cmu.edu> * 
fix * fix * add peft tests to ci * shellcheck * fix * fix python requirements * fix * fix * update ci test * update alignment doc * fix cross entropy loss bug * update alignment test * update test * add llama peft alignment test to ci * Fix values for unused params in incr_decoding * Add PEFTModelID NO_ID singleton instead of None * Fix PEFTModelID::NO_ID reference * reduce logging * fix * fix * Add peft demo * Add readme for demo * fix alignment issue * Peft optimizer (#1290) * add optimizer config, only allocate weights for training * sgd 1 * sgd 2 * update * fix * linting * . * . * fix * fix allreduce bug * update * update * add optimizer hook in hf * update * update script * . * fix * fwd * bwd * start grads * fix gradient misalignment! * update * Add support for llama3 * various fixes --------- Co-authored-by: Remi Delacourt <remi.delacourt@gmail.com> * Optimizers python interface (#1441) * python interface for optimizer * update lora linear config to support python interface * update python interface * finished lora python interface * fix * fix * update * update * more fixes * fix * initialize lora weights where needed * Add notebook * Update demo to use dataset * Fix' * Save weights after end of finetuning (#1446) * support accumulation of gradients without update * add code to save peft weights * fix * save configs * cleanup * Fully use notebook for demo * Parameterize generation and finetuning configs * Comment out inference for now * fix bug in lora inference only mode * fix * Add finetuning or inference only flags * fix * fix * fix * PEFT model upload (#1450) * upload test * fix * Make demo_class.py executable * fix * add base_model_name_or_path * fix * fix * support llama-3 tokenizer * print output tokens when not benchmarking * Use Llama3 in demo_class * Use Llama3 in demo * fix data loading for llama-3 * Add download models to demo * return/print loss at each finetuning step * fix * Adjust demo parameters * Fix for finetuning * pass finetuning losses 
to python interface * Update demo * Fix upload * Refactor demo * rename demo_class to demo * fix * remove epoch from loss print * Finish demo * fix test * rocm fixes * more rocm fixes * fix rocm build * docker fix * fix inference test * fix workflow * fix makefile * fix peft test * fix all-reduce issue with lora for TP scenario * fix bwd lm head * fixes * more fixes * update * fix alignment up to input ln * finished aligning all backward (tp>1) * align all peft * fix * fix broken link * formatting * fix * update * Revert "update" This reverts commit 90b2c87. * update * fix hip build * fix gpu ci * fix gpu ci * update default gpu ci version to 12.0 * update ci to 12.0 * fix * fix * update * fix * fix * update * fix * add cleanup * downgrade to cuda=11.8 --------- Co-authored-by: Gabriele Oliaro <goliaro@cs.cmu.edu> Co-authored-by: xinhaoc <chengxh_98@163.com> Co-authored-by: Xinhao Cheng <99570243+xinhaoc@users.noreply.github.com> Co-authored-by: april-yyt <aprilytyang@gmail.com> Co-authored-by: Remi <54138269+Flechman@users.noreply.github.com> Co-authored-by: Remi Delacourt <rdelacou@catalyst-0-9.eth> Co-authored-by: Rémi Delacourt <remi.delacourt@gmail.com>

goliaro force-pushed the inference branch 2 times, most recently from e8770cc to 15c8d95 Compare January 18, 2023 01:58

goliaro force-pushed the inference branch from 6cd6b67 to eaedc29 Compare February 2, 2023 04:22

goliaro and others added 26 commits May 15, 2023 11:24

add decoder for gpt tokenizer

c9b2c5d

Update README.md

16a5d02

Update README.md

b8e5586

Merge branch 'master' into inference

07cb9f0

fix make build, edit cmake

0aabf34

update std version in makefile

427d602

file path adapt (#730)

d87197d

* file path adapt
* fix
* fix
* fix

Update README.md

b9fddec

Update README.md

dc6dcf8

Update README.md

1193b51

[Inference] - Alignment fixes (#740)

b0a5b9c

* fix alignment bugs (part 1)
* add missing file

Update README.md (#741)

1ab3d80

Update README.md (#744)

6c13936

* Update README.md
* update readme
* fix

fix

d8072ab

Merge branch 'inference' into fix_spec

ad75ac9

Fix inference test (#767)

e131908

* fix
* fix workflow

Merge branch 'inference' into fix_spec

eabad2d

lockshaw mentioned this pull request Jun 16, 2023

measure_operator_cost not implemented for op Cache_103 #268

Open

goliaro added 4 commits February 22, 2024 10:40

Add support for docker machines with cuda 12.1 and cuda 12.2 (#1308)

e24eb03

Fix NCCL tear down issue, update docker pre-build cuda version list (#1318)

0d75c10

add expansion config param in specinfer

ea31426

parametrize max_spec_tree_token_num

e03dec0

lockshaw mentioned this pull request Mar 11, 2024

Is it possible to run FlexFlow only like a simulator? #1172

Open

goliaro and others added 24 commits March 13, 2024 19:46

fix

c856680

fix

8d82c91

fix

0479a64

run CI per commit only on inference branch

5bd7123

fix

e0a6e4f

fix: 'model_configs' AttributeError (#1358)

1210256

Changes to support Perlmutter environment (#1360)

b4a639c

* .
* remove deadcode
* add benchmarking mode, initializing weights randomly
* better logging when running out of memory
* update
---------
Co-authored-by: Gabriele Oliaro <goliaro@login27.chn.perlmutter.nersc.gov>

update workflow to build rocm docker images

7da197e

downgrade to python 3.11 for now

002fdf0

doc: fix c++ serving example (#1372)

d54e4b6

Co-authored-by: Gabriele Oliaro <goliaro@cs.cmu.edu>

Update README.md

b90771a

Add examples for every layer in the python layer API (#1297)

385c118

* Fix incorrect innode being checked
* Add example for every layer on the FFModel python class
---------
Co-authored-by: Gabriele Oliaro <goliaro@cs.cmu.edu>
Co-authored-by: Zhihao Jia <zhihao@cmu.edu>

add code to keep runners registered

a83effe

fix docker

4f82aae

[Tokenizer] update tokenizers-cpp repo

25fb407

Merge branch 'inference' of https://github.com/flexflow/FlexFlow into inference

9e68c8c

minor bug fix (#1456)

6a1a188

update legion version (#1307)

9784b5c

* update legion version
* legion version update
* update legion version

pip flexflow_python typo (#1461)

6d710ac

Co-authored-by: Zhihao Jia <zhihao@cmu.edu>

update legion version

3b59f05

Fix nccl-induced segfault (#1481)

28aff70

Fix python install issue caused by new Legion version (#1482)

49523d6

* fix
* .
* .
* fix
* cleanup
* fix
* cleanup

jiazhihao closed this Sep 19, 2024

goliaro commented Nov 17, 2022 • edited by lockshaw
