My architecture/optimization overhaul proposal to shrink the gap with consumer hardware #242

kabachuha · 2024-03-20T08:08:06Z

Hi, xAI!

From what I saw in the official code, the implementation is highly sub-optimal: it uses the "vanillest" implementation of attention, not even a Flash one! I believe it would benefit greatly from existing and recent optimizations from the NLP community. Examples include:

1. XFormers Flash Attention
2. Torch 2 scaled dot product attention
3. Ring Attention (from LWM, the Large World Model), to scale the context to millions of tokens
4. Linear "Rebased" Flash Attention: eliminates the quadratic attention cost, requires less rework than switching to Mamba, and is more compatible with the existing architecture
5. Hybrid attention head sequence parallelism (e.g. FastSeq) including ZeRO (see Accelerated Transformer)
6. Layer offloading
7. Bitsandbytes/AutoGPTQ quantization
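
To make the Flash-style attention entries above a bit more concrete, here is a minimal pure-Python sketch of the idea they share: visit keys and values block by block and keep a running softmax, so the full score matrix is never materialized. This is only an illustration of the arithmetic, not how XFormers or Torch implement their kernels; the function names, the `block_size` parameter and the toy inputs are all made up for this example.

```python
import math


def naive_attention(q, k, v):
    """Reference version: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out


def blockwise_attention(q, k, v, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * dv          # unnormalized weighted sum of values
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the old maximum.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                for c in range(dv):
                    acc[c] += w * vj[c]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


# Toy inputs (hypothetical): 3 queries, 5 keys/values, head dim 2, value dim 3.
q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
k = [[0.5, 0.1], [-0.2, 0.3], [0.7, 0.7], [0.0, -1.0], [0.2, 0.2]]
v = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.5],
     [2.0, -1.0, 0.0], [-1.0, 3.0, 1.0]]

reference = naive_attention(q, k, v)
blocked = blockwise_attention(q, k, v, block_size=2)
```

The real kernels do the same rescaling trick on GPU tiles held in fast on-chip memory, which is where the speed and memory savings come from; the sketch only shows that the blockwise result matches the plain one.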

And the best thing is that all these points can be combined (except 1 with 2 and 3 with 4, which are alternatives to each other)!

With these optimizations we can hope to shrink the gap between the current hardware requirements and what consumer machines or low-end servers can offer.
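
As a rough back-of-envelope for the quantization part, the weight footprint is just parameter count times bits per parameter. The sketch below assumes a placeholder parameter count (`N_PARAMS` is an illustrative value, not a figure from this post) and ignores activations, the KV cache and the per-group scales that real quantization schemes store alongside the weights.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Illustrative placeholder, not a measured or official number.
N_PARAMS = 300e9

for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>6}: {weight_memory_gib(N_PARAMS, bits):8.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts this number by a factor of four, and layer offloading then decides how much of what remains has to sit in GPU memory at once.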

Can't wait to have it in the Transformers library 😬 Track

Edit: added links to resources

Feel free to give feedback!
