rocking5566 released this 15 Nov 05:57
- Reduce LDS usage when num_splits <= 8
- Use a smaller tile size to speed up small-seqlen cases
- Fine-tune block mapping
- Use a larger vector size when writing the workspace
- Speed up the combine kernel
- Fix out-of-bounds read of the block table
- Fix wrong key/value range in each split
- Do not access the dropout seed & offset device pointers in the host API
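
For readers unfamiliar with the split-KV terms above (`num_splits`, per-split key/value range, combine kernel), here is a minimal CPU sketch of the general technique in plain Python. It is not this project's kernel code: the function names `split_ranges`, `partial_attention`, and `combine` are hypothetical, and the even ceil-division partition is an assumption made for illustration. Each split attends over its own contiguous key/value range and returns a normalized output together with its log-sum-exp (LSE). The combine step then reweights the partial outputs by their LSEs.

```python
import math


def split_ranges(seqlen_k, num_splits):
    """Partition [0, seqlen_k) into contiguous, non-overlapping ranges.

    Assumption: an even ceil-division split. Trailing splits may be empty.
    """
    per_split = -(-seqlen_k // num_splits)  # ceil division
    ranges = []
    for i in range(num_splits):
        start = min(i * per_split, seqlen_k)
        end = min(start + per_split, seqlen_k)
        ranges.append((start, end))
    return ranges


def partial_attention(q, keys, values, dim_v):
    """Softmax attention of one query over a key/value slice.

    Returns (normalized output, log-sum-exp of the scores).
    An empty slice contributes a zero output with LSE = -inf.
    """
    if not keys:
        return [0.0] * dim_v, -math.inf
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim_v)]
    return out, m + math.log(total)


def combine(partials):
    """Merge per-split (output, lse) pairs into the full attention output.

    Assumes at least one split is non-empty, so the maximum LSE is finite.
    """
    m = max(lse for _, lse in partials)
    scales = [math.exp(lse - m) for _, lse in partials]  # 0.0 for empty splits
    total = sum(scales)
    dim_v = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim_v)]


def split_attention(q, keys, values, num_splits):
    """Run attention split by split, then combine the partial results."""
    dim_v = len(values[0])
    partials = [partial_attention(q, keys[s:e], values[s:e], dim_v)
                for s, e in split_ranges(len(keys), num_splits)]
    return combine(partials)
```

If a split is given the wrong key/value range (overlapping or skipping positions), the combined result no longer matches single-pass attention, which makes a direct comparison against the unsplit computation a convenient check.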