An imperative deep learning framework with customized GPU and CPU backends.
- To refactor the codebase

  Needle adopts a monolithic design, e.g., the NDArray backend for a device is contained in a single file. A modular design allows agile optimization of operators, layers, modules, etc.

- To extend the reservoir of operators

  To support self-attention, a batched matrix multiplication operator is necessary (see the batched matmul sketch after this list). `BatchNorm2d` can be more efficient if fused. Other customized operators include fused differentiable volume rendering, sparse matrix multiplication for graph embeddings, I/O kernels, etc.

- To optimize the NDArray backend

  This summary gathers a series of blog posts on maximizing the throughput of operators; also refer to *Programming Massively Parallel Processors* for more topics on CUDA. The goal is to exceed the performance of official CUDA libraries like cuBLAS with hand-crafted kernels in certain tasks (see the tiled-kernel sketch after this list).

- To incorporate `tcnn` as MLP intrinsics

- To accelerate computational graph traversal with CUDA 12
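As a minimal sketch of the batched matmul operator mentioned above, the kernel below computes one output element per thread over row-major tensors of shape (batch, M, K) and (batch, K, N), with the grid's z dimension indexing the batch. The names and launch configuration are illustrative assumptions, not the actual kernels in this repository.

```cuda
// Hypothetical naive batched matmul kernel: C[b] = A[b] @ B[b].
__global__ void batched_matmul_naive(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    int b   = blockIdx.z;                             // batch index
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C[b]
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C[b]
    if (row < M && col < N) {
        const float* a = A + (size_t)b * M * K;
        const float* bm = B + (size_t)b * K * N;
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += a[row * K + k] * bm[k * N + col];
        C[(size_t)b * M * N + (size_t)row * N + col] = acc;
    }
}

// Example launch (illustrative):
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (M + 15) / 16, batch_size);
//   batched_matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```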
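For the NDArray-backend item, one standard hand-crafted optimization is shared-memory tiling. The sketch below is a generic tiled matmul kernel under assumed names and a fixed tile size; it illustrates the technique discussed in the blog posts rather than reproducing this repository's kernels, and closing the gap to cuBLAS further requires register blocking, vectorized loads, etc.

```cuda
#define TILE 16  // assumed tile size; launch with blockDim = (TILE, TILE)

// Hypothetical tiled matmul kernel: C = A @ B, row-major, shapes (M,K) x (K,N).
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    // Sweep the K dimension one tile at a time, staging tiles of A and B in
    // shared memory to reduce global-memory traffic.
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```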
This project is inspired by 10-414/714 Deep Learning Systems by Carnegie Mellon University. Switch to the branch `hw` for homework, `proj` for the course project, and `lec` for lectures.