cuBLASDx Library - API Examples

All examples, including the more advanced ones, are shipped with the cuBLASDx package.

Description

This folder demonstrates how to use the cuBLASDx APIs.

Requirements

Build

  • You may specify CUBLASDX_CUDA_ARCHITECTURES to limit the CUDA architectures used for compilation (see CMake:CUDA_ARCHITECTURES)
  • mathdx_ROOT - path to the mathDx package (XX.Y is the version of the package)
mkdir build && cd build
cmake -DCUBLASDX_CUDA_ARCHITECTURES=70-real -Dmathdx_ROOT=/opt/nvidia/mathdx/XX.Y ..
make
# Run
ctest

Examples

For detailed descriptions of the examples, please see the Examples section of the cuBLASDx documentation.

| Group | Subgroup | Example | Description |
|---|---|---|---|
| Introduction Examples | | introduction_example | cuBLASDx API introduction example |
| Simple GEMM Examples | Basic Example | simple_gemm_fp32 | Performs fp32 GEMM |
| | | simple_gemm_cfp16 | Performs complex fp16 GEMM |
| | | simple_gemm_fp8 | Performs fp8 GEMM |
| | Extra Examples | simple_gemm_leading_dimensions | Performs GEMM with non-default leading dimensions |
| | | simple_gemm_std_complex_fp32 | Performs GEMM with cuda::std::complex as data type |
| | | simple_gemm_mixed_precision | Performs a mixed precision GEMM |
| | | simple_gemm_transform | Performs GEMM with custom load and store operators |
| | | simple_gemm_custom_layout | Performs GEMM with a custom user provided CuTe layout |
| NVRTC Examples | | nvrtc_gemm | Performs GEMM, kernel is compiled using NVRTC |
| GEMM Performance | | single_gemm_performance | Benchmark for single GEMM |
| | | fused_gemm_performance | Benchmark for 2 GEMMs fused into a single kernel |
| Advanced Examples | Fusion | fused_gemm | Performs 2 GEMMs in a single kernel |
| | | gemm_fft | Perform GEMM and FFT in a single kernel |
| | | gemm_fft_fp16 | Perform GEMM and FFT in a single kernel (half-precision complex type) |
| | | gemm_fft_performance | Benchmark for GEMM and FFT fused into a single kernel |
| | Deep Learning | scaled_dot_prod_attn | Scaled dot product attention using cuBLASDx |
| | | scaled_dot_prod_attn_batched | Multi-head attention using cuBLASDx |
| | Other | multiblock_gemm | Proof-of-concept for single large GEMM using multiple CUDA blocks |
| | | batched_gemm_fp64 | Manual batching in a single CUDA block |
| | | blockdim_gemm_fp16 | BLAS execution with different block dimensions |
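
Most of the examples listed above perform a general matrix multiply, C = alpha * A * B + beta * C. As a point of orientation, the sketch below is a plain host-side C++ reference for that operation, including the leading-dimension parameters that simple_gemm_leading_dimensions varies. It does not use cuBLASDx and is not taken from the package. The function name `reference_gemm`, the use of `std::vector`, and the column-major layout are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of cuBLASDx):
//   C = alpha * A * B + beta * C
// Assumption of this sketch: all matrices are column-major.
// A is m x k, B is k x n, C is m x n. lda/ldb/ldc are leading dimensions,
// i.e. the element stride between consecutive columns, which may be larger
// than the number of rows when the matrix is padded.
template <typename T>
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    T alpha, const std::vector<T>& a, std::size_t lda,
                    const std::vector<T>& b, std::size_t ldb,
                    T beta, std::vector<T>& c, std::size_t ldc) {
    // Leading dimensions must cover at least one full column.
    assert(lda >= m && ldb >= k && ldc >= m);
    // Buffers must hold every (possibly padded) column.
    assert(a.size() >= lda * k && b.size() >= ldb * n && c.size() >= ldc * n);

    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            T acc = T(0);
            for (std::size_t i = 0; i < k; ++i) {
                acc += a[row + i * lda] * b[i + col * ldb];
            }
            c[row + col * ldc] = alpha * acc + beta * c[row + col * ldc];
        }
    }
}
```

A reference like this can be handy for checking the output of a small GEMM on the host. With a leading dimension larger than the row count (for example `lda = 3` for a 2 x 2 matrix), the padding elements in each column are skipped and never read.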