# flash-attention-cuda

Final Project for CPSC 524 Parallel Programming. Danqi Liao.

This is my CUDA C implementation of the Flash Attention paper. Specifically, I focus on the forward pass of the attention mechanism without multi-head attention. This is a work in progress, and I will be adding more features in the future.
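For reference, every implementation here computes the same quantity: `O = softmax(QKᵀ / √d) V`. Below is a minimal sketch of a naive GPU kernel for this, assuming row-major `(N × d)` matrices and one query row per thread; the kernel name and layout are illustrative assumptions, not the code in this repo.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Naive attention forward: one thread per query row.
// Q, K, V, O are (N x d), row-major. Illustrative sketch only.
__global__ void naive_attention(const float *Q, const float *K,
                                const float *V, float *O,
                                int N, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query row index
    if (i >= N) return;

    const float scale = rsqrtf((float)d);

    // Pass 1: row max of the scaled scores, for a numerically stable softmax.
    float row_max = -INFINITY;
    for (int j = 0; j < N; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        row_max = fmaxf(row_max, s * scale);
    }

    // Pass 2: accumulate softmax numerator times V directly into O,
    // tracking the denominator on the side, then normalize.
    for (int k = 0; k < d; ++k) O[i * d + k] = 0.0f;
    float denom = 0.0f;
    for (int j = 0; j < N; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        float p = expf(s * scale - row_max);
        denom += p;
        for (int k = 0; k < d; ++k) O[i * d + k] += p * V[j * d + k];
    }
    for (int k = 0; k < d; ++k) O[i * d + k] /= denom;
}
```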

For now, I have implemented the following:

- CPU implementation of the attention mechanism
- Naive GPU implementation of the attention mechanism
- Forward pass of Flash Attention without multi-head attention (see the sketch after this list)
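The flash-attention kernel differs from the naive one by never materializing the full `N × N` score matrix: it streams tiles of K and V and maintains a running row max and softmax denominator (the online softmax from the paper). Here is a hedged sketch of that per-tile rescaling update; the function name, arguments, and layout are illustrative assumptions, not this repo's kernel.

```cuda
#include <math.h>

// Online-softmax update for one query row after scoring one K/V tile
// (FlashAttention forward). Illustrative sketch only.
//   s        : scaled scores for this tile, length t
//   V_tile   : (t x d) tile of V, row-major
//   running_max, running_sum : running row max and softmax denominator
//   o_acc    : unnormalized output accumulator, length d
__device__ void flash_tile_update(const float *s, const float *V_tile,
                                  int t, int d,
                                  float *running_max, float *running_sum,
                                  float *o_acc) {
    // New running max over the old state and this tile's scores.
    float new_max = *running_max;
    for (int j = 0; j < t; ++j) new_max = fmaxf(new_max, s[j]);

    // Rescale previously accumulated state to the new max.
    float correction = expf(*running_max - new_max);
    *running_sum *= correction;
    for (int k = 0; k < d; ++k) o_acc[k] *= correction;

    // Accumulate this tile's contribution.
    for (int j = 0; j < t; ++j) {
        float p = expf(s[j] - new_max);
        *running_sum += p;
        for (int k = 0; k < d; ++k) o_acc[k] += p * V_tile[j * d + k];
    }
    *running_max = new_max;
}
```

Before the first tile, `running_max` starts at `-INFINITY` with `running_sum` and `o_acc` zeroed; after the last tile, dividing `o_acc` by `running_sum` yields the output row.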

To do (outside of the scope of this project):

- Backward pass of Flash Attention without multi-head attention
- Multi-head attention
- Options for masking, dropout, etc.
- Integration with PyTorch

## Run scripts

(Each GPU attention implementation is compared against the CPU implementation for error checking; you can comment out the CPU code if you don't want to run it.)

```sh
sbatch run-standard.sh  # naive GPU implementation
sbatch run-flash.sh     # forward flash attention
```