- [Paper Release] April 2023: Inference with Reference: Lossless Acceleration of Large Language Models
- Outputs of LLMs often overlap significantly with reference text (e.g., retrieved documents).
- Lossless acceleration of LLM inference by copying from references.
- Applicable to important LLM scenarios such as retrieval-augmented generation and multi-turn conversations.
- 2–3x speed-up; no additional model required!
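The copy-and-verify idea above can be sketched as a toy (a minimal illustration, not the paper's code; names like `find_copy_span` and `next_token` are assumptions for the sketch). When the recent output matches a span in the reference, candidate tokens are copied from it and each is checked against the model, so the output is identical to plain greedy decoding:

```python
def find_copy_span(output, reference, match_len=2, copy_len=4):
    """Find a span in `reference` preceded by the last `match_len`
    output tokens; return up to `copy_len` candidate tokens to copy."""
    if len(output) < match_len:
        return []
    tail = output[-match_len:]
    for i in range(len(reference) - match_len + 1):
        if reference[i:i + match_len] == tail:
            return reference[i + match_len:i + match_len + copy_len]
    return []

def generate(next_token, prompt, reference, max_tokens=20):
    """Greedy decoding with reference copying. `next_token` is any
    function prefix -> token standing in for the LLM. Lossless: the
    model's token is always the one emitted, so the output matches
    plain greedy decoding exactly."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        candidates = find_copy_span(out, reference)
        if not candidates:
            out.append(next_token(out))  # normal step-by-step decoding
            continue
        for tok in candidates:
            if len(out) - len(prompt) >= max_tokens:
                break
            model_tok = next_token(out)
            out.append(model_tok)
            if model_tok != tok:
                break  # copied span diverged; stop accepting from it
    return out[len(prompt):]
```

In a real LLM the candidate tokens are verified in a single batched forward pass rather than one call each, which is where the speed-up comes from; this toy only demonstrates the losslessness of the accept/reject rule.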