# flash-attention-cuda

Final Project for CPSC 524 Parallel Programming. Danqi Liao.

This is my CUDA C implementation of the Flash Attention paper. Specifically, I focus on the forward pass of the attention mechanism without multi-head attention. This is a work in progress, and I will be adding more features in the future.
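For reference, every implementation here computes the same quantity: `O = softmax(QKᵀ / √d) V`. Below is a minimal sketch of a naive GPU kernel for this, assuming row-major `(N × d)` matrices and one query row per thread; the kernel name and layout are illustrative assumptions, not the code in this repo.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Naive attention forward: one thread per query row.
// Q, K, V, O are (N x d), row-major. Illustrative sketch only.
__global__ void naive_attention(const float *Q, const float *K,
                                const float *V, float *O,
                                int N, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // query row index
    if (i >= N) return;

    const float scale = rsqrtf((float)d);

    // Pass 1: row max of the scaled scores, for a numerically stable softmax.
    float row_max = -INFINITY;
    for (int j = 0; j < N; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        row_max = fmaxf(row_max, s * scale);
    }

    // Pass 2: accumulate softmax numerator times V directly into O,
    // tracking the denominator on the side, then normalize.
    for (int k = 0; k < d; ++k) O[i * d + k] = 0.0f;
    float denom = 0.0f;
    for (int j = 0; j < N; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
        float p = expf(s * scale - row_max);
        denom += p;
        for (int k = 0; k < d; ++k) O[i * d + k] += p * V[j * d + k];
    }
    for (int k = 0; k < d; ++k) O[i * d + k] /= denom;
}
```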

For now, I have implemented the following:

- CPU implementation of the attention mechanism
- Naive GPU implementation of the attention mechanism
- Forward pass of Flash Attention without multi-head attention (see the sketch after this list)
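The flash-attention kernel differs from the naive one by never materializing the full `N × N` score matrix: it streams tiles of K and V and maintains a running row max and softmax denominator (the online softmax from the paper). Here is a hedged sketch of that per-tile rescaling update; the function name, arguments, and layout are illustrative assumptions, not this repo's kernel.

```cuda
#include <math.h>

// Online-softmax update for one query row after scoring one K/V tile
// (FlashAttention forward). Illustrative sketch only.
//   s        : scaled scores for this tile, length t
//   V_tile   : (t x d) tile of V, row-major
//   running_max, running_sum : running row max and softmax denominator
//   o_acc    : unnormalized output accumulator, length d
__device__ void flash_tile_update(const float *s, const float *V_tile,
                                  int t, int d,
                                  float *running_max, float *running_sum,
                                  float *o_acc) {
    // New running max over the old state and this tile's scores.
    float new_max = *running_max;
    for (int j = 0; j < t; ++j) new_max = fmaxf(new_max, s[j]);

    // Rescale previously accumulated state to the new max.
    float correction = expf(*running_max - new_max);
    *running_sum *= correction;
    for (int k = 0; k < d; ++k) o_acc[k] *= correction;

    // Accumulate this tile's contribution.
    for (int j = 0; j < t; ++j) {
        float p = expf(s[j] - new_max);
        *running_sum += p;
        for (int k = 0; k < d; ++k) o_acc[k] += p * V_tile[j * d + k];
    }
    *running_max = new_max;
}
```

Before the first tile, `running_max` starts at `-INFINITY` with `running_sum` and `o_acc` zeroed; after the last tile, dividing `o_acc` by `running_sum` yields the output row.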

To do (outside of the scope of this project):

- Backward pass of Flash Attention without multi-head attention
- Multi-head attention
- Options for masking, dropout, etc.
- Integration with PyTorch

## Run scripts

(Each GPU attention implementation is compared against the CPU implementation for error checking; you can comment out the CPU code if you don't want to run it.)

```sh
sbatch run-standard.sh  # naive GPU implementation
sbatch run-flash.sh     # forward flash attention
```