An imperative deep learning framework with customized GPU and CPU backends.
- To refactor the codebase

  Needle adopts a monolithic design, e.g., the NDArray backend for a device is contained in a single file. A modular design allows agile optimization of operators, layers, modules, etc.

- To extend the reservoir of operators

  To support self-attention, a batched matrix multiplication operator is necessary (see the batched matmul sketch after this list). `BatchNorm2d` can be more efficient if fused. Other customized operators include fused differentiable volume rendering, sparse matrix multiplication for graph embeddings, I/O kernels, etc.

- To optimize the NDArray backend

  This summary gathers a series of blog posts on maximizing the throughput of operators; also refer to *Programming Massively Parallel Processors* for more topics on CUDA. The goal is to exceed the performance of official CUDA libraries like cuBLAS with hand-crafted kernels in certain tasks (see the tiled-kernel sketch after this list).

- To incorporate `tcnn` as MLP intrinsics

- To accelerate computational graph traversal with CUDA 12
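As a minimal sketch of the batched matmul operator mentioned above, the kernel below computes one output element per thread over row-major tensors of shape (batch, M, K) and (batch, K, N), with the grid's z dimension indexing the batch. The names and launch configuration are illustrative assumptions, not the actual kernels in this repository.

```cuda
// Hypothetical naive batched matmul kernel: C[b] = A[b] @ B[b].
__global__ void batched_matmul_naive(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    int b   = blockIdx.z;                             // batch index
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C[b]
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C[b]
    if (row < M && col < N) {
        const float* a = A + (size_t)b * M * K;
        const float* bm = B + (size_t)b * K * N;
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += a[row * K + k] * bm[k * N + col];
        C[(size_t)b * M * N + (size_t)row * N + col] = acc;
    }
}

// Example launch (illustrative):
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (M + 15) / 16, batch_size);
//   batched_matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```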
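For the NDArray-backend item, one standard hand-crafted optimization is shared-memory tiling. The sketch below is a generic tiled matmul kernel under assumed names and a fixed tile size; it illustrates the technique discussed in the blog posts rather than reproducing this repository's kernels, and closing the gap to cuBLAS further requires register blocking, vectorized loads, etc.

```cuda
#define TILE 16  // assumed tile size; launch with blockDim = (TILE, TILE)

// Hypothetical tiled matmul kernel: C = A @ B, row-major, shapes (M,K) x (K,N).
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    // Sweep the K dimension one tile at a time, staging tiles of A and B in
    // shared memory to reduce global-memory traffic.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```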
This project is inspired by 10-414/714 Deep Learning Systems by Carnegie Mellon University. Switch to the branch `hw` for homework, `proj` for the course project, and `lec` for lectures.