FlexLLM server demo #1510

goliaro · 2024-09-27T15:37:40Z

Description of changes:

TODOs:

Streamlit app
Chat protocol
Add Llama 3.1 and Llama 3.2 support and check alignment (in particular, RoPE)
Support LoRA in attention projections (see #1436, Attention projections (QKV, O) disaggregation)
Be able to add LoRA layers at runtime, and deallocate memory when done
Be able to set parameters (e.g. max sequence length) at runtime for each request
Be able to set generation configs (top_p, temperature, etc.) at runtime for each request
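
To illustrate the last two items, here is a minimal sketch of how per-request parameters might be merged onto defaults and validated before a request is queued. It is not the FlexLLM API: `GenerationConfig`, `build_request`, the field names, and the default values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class GenerationConfig:
    """Hypothetical per-request settings (illustrative only, not the FlexLLM API)."""
    temperature: float = 1.0
    top_p: float = 1.0
    max_new_tokens: int = 128
    max_sequence_length: int = 2048

    def validate(self) -> None:
        # Reject values that would make sampling or length limits meaningless.
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_new_tokens < 1:
            raise ValueError("max_new_tokens must be at least 1")
        if self.max_new_tokens > self.max_sequence_length:
            raise ValueError("max_new_tokens exceeds max_sequence_length")


def build_request(prompt: str, overrides: dict) -> dict:
    """Apply per-request overrides on top of the defaults and validate them."""
    config = GenerationConfig(**overrides)  # unknown keys raise TypeError
    config.validate()
    return {"prompt": prompt, **asdict(config)}
```

Under these assumptions, a client sends only the fields it wants to change, for example `build_request("Hello", {"top_p": 0.9, "max_new_tokens": 64})`, and every other field keeps its default.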

Related Issues:

Linked Issues:

Issues closed by this PR:


goliaro added 18 commits September 25, 2024 02:19

init

470a40f

update

7f23188

update

a2d2ac0

update

f8c90e6

update

2906e57

add max new tokens parameter

d62d9be

backup

85797e0

update

bb08d69

backup

62275c2

lora configs serialize / deserialize into single file

88d60ca

backup

e453237

.

5c8c448

.

21f8cb9

.

c5e813b

.

aa57f98

frontend

53c408c

bug fix

1691100

fixes

7ff96d7
