This repository has been archived by the owner on Oct 4, 2024. It is now read-only.

# Needle

An imperative deep learning framework with customized GPU and CPU backends.
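As a rough illustration of what "imperative" (define-by-run) means here, the sketch below records a computation graph while operations execute and then walks it backwards for gradients. The class and method names are hypothetical, not Needle's actual API:

```python
import numpy as np

# Minimal define-by-run autodiff sketch (illustrative only; these names
# are hypothetical and do not reflect Needle's actual API).
class Tensor:
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self.parents = parents      # inputs that produced this node
        self.backward_fn = None     # propagates grad to parents

    def __mul__(self, other):
        out = Tensor(self.data * other.data, parents=(self, other))
        def backward_fn(grad):
            self.grad += grad * other.data   # d(xy)/dx = y
            other.grad += grad * self.data   # d(xy)/dy = x
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Seed the output gradient and walk the recorded graph.
        self.grad = np.ones_like(self.data)
        stack = [self]
        while stack:
            node = stack.pop()
            if node.backward_fn is not None:
                node.backward_fn(node.grad)
                stack.extend(node.parents)

x, y = Tensor(3.0), Tensor(4.0)
z = x * y
z.backward()
print(x.grad, y.grad)  # 4.0 3.0
```

A real implementation would additionally traverse the graph in reverse topological order so that shared subexpressions accumulate gradients correctly.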

## TODO

- Refactor the codebase. Needle currently uses a monolithic design: for example, the entire NDArray backend for a device lives in a single file. A modular design would allow agile optimization of operators, layers, modules, etc.
- Extend the set of operators. Supporting self-attention requires a batched matrix multiplication operator, and BatchNorm2d could be made more efficient by fusing it. Other candidate custom operators include fused differentiable volume rendering, sparse matrix multiplication for graph embeddings, and I/O kernels.
- Optimize the NDArray backend. This summary gathers a series of blog posts on maximizing operator throughput; see also *Programming Massively Parallel Processors* for more CUDA topics. The goal is to exceed the performance of official CUDA libraries such as cuBLAS with hand-crafted kernels on certain tasks.
- Incorporate tcnn as MLP intrinsics.
- Accelerate computational graph traversal with CUDA 12.
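To see why batched matrix multiplication is the key missing operator for self-attention: Q, K, and V carry a leading batch (and head) dimension, so the score and output matmuls must run once per batch element, ideally in a single fused call. A NumPy reference sketch (not Needle's API; shapes are illustrative):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention over batched inputs.

    Q, K, V: arrays of shape (batch, seq_len, d).
    """
    d = Q.shape[-1]
    # Batched matmul #1: (batch, seq, d) @ (batch, d, seq) -> (batch, seq, seq)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
    # Row-wise softmax, numerically stabilized by subtracting the row max.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Batched matmul #2: (batch, seq, seq) @ (batch, seq, d) -> (batch, seq, d)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 5, 8))
out = self_attention(Q, Q, Q)
print(out.shape)  # (2, 5, 8)
```

A dedicated batched-matmul kernel (e.g. a strided-batched GEMM) computes all batch slices in one launch instead of looping over them on the host, which is what makes attention practical on the GPU backend.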

## Acknowledgement

This project is inspired by 10-414/714 Deep Learning Systems at Carnegie Mellon University. Switch to the `hw` branch for homework, `proj` for the course project, and `lec` for lectures.

## Footnotes

1. Gunrock: a high-performance graph processing library on the GPU.
2. GraphBLAST: a high-performance linear algebra-based graph framework on the GPU.