benchmark on SplitTBE module with different emb-dim
Summary:
# context
* benchmark results w/ small batch_size
```
SplitTableBatchedEmbeddingBagsCodegen-1000-16-4-1024-10 | Runtime (P90): 0.289728 ms | Memory (P90): 1 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-32-4-1024-10 | Runtime (P90): 0.304992 ms | Memory (P90): 1 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-64-4-1024-10 | Runtime (P90): 0.315392 ms | Memory (P90): 3 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-128-4-1024-10 | Runtime (P90): 0.295552 ms | Memory (P90): 5 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-256-4-1024-10 | Runtime (P90): 0.30096 ms | Memory (P90): 9 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-512-4-1024-10 | Runtime (P90): 0.317056 ms | Memory (P90): 18 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-1024-4-1024-10 | Runtime (P90): 0.31792 ms | Memory (P90): 36 MB
```
* benchmark results w/ large batch_size
```
SplitTableBatchedEmbeddingBagsCodegen-100000-16-16-131072-10 | Runtime (P90): 0.587584 ms | Memory (P90): 0.31 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-32-16-131072-10 | Runtime (P90): 0.722624 ms | Memory (P90): 0.51 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-64-16-131072-10 | Runtime (P90): 1.29395 ms | Memory (P90): 0.92 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-128-16-131072-10 | Runtime (P90): 2.73472 ms | Memory (P90): 1.7 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-256-16-131072-10 | Runtime (P90): 6.5608 ms | Memory (P90): 3.4 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-512-16-131072-10 | Runtime (P90): 14.8527 ms | Memory (P90): 6.6 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-1024-16-131072-10 | Runtime (P90): 31.055 ms | Memory (P90): 13 GB
```

# traces
* [trace files](https://fburl.com/gdrive/t8qaitoi)
* batch_size - 128
{F1848010270}
* batch_size - 32
{F1848011914}

# conclusions
* the kernel contains two major parts that run on the GPU: 1) a lightweight `direct_copy_kernel_cuda`, and 2) a heavy-lifting `split_embedding_codegen_forward_unweighted_kernel`.
* details of the `direct_copy_kernel_cuda`
{F1848017719}
* the CPU runtime for launching these two GPU kernels (~3.8 ms from the traces) is the bottleneck of this operator.

Differential Revision: D62254614
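
The two result tables can be compared with a short script. The sketch below is illustrative only and not part of the benchmark code: it assumes the second numeric field of each benchmark name is the embedding dimension (inferred from the commit title, not stated in the summary), and `parse_runtimes` and `growth_ratios` are hypothetical helper names. It reports how much the P90 runtime grows each time the dimension doubles.

```python
import re

# Result lines copied verbatim from the summary above.
SMALL_BATCH = """\
SplitTableBatchedEmbeddingBagsCodegen-1000-16-4-1024-10 | Runtime (P90): 0.289728 ms | Memory (P90): 1 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-32-4-1024-10 | Runtime (P90): 0.304992 ms | Memory (P90): 1 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-64-4-1024-10 | Runtime (P90): 0.315392 ms | Memory (P90): 3 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-128-4-1024-10 | Runtime (P90): 0.295552 ms | Memory (P90): 5 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-256-4-1024-10 | Runtime (P90): 0.30096 ms | Memory (P90): 9 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-512-4-1024-10 | Runtime (P90): 0.317056 ms | Memory (P90): 18 MB
SplitTableBatchedEmbeddingBagsCodegen-1000-1024-4-1024-10 | Runtime (P90): 0.31792 ms | Memory (P90): 36 MB
"""

LARGE_BATCH = """\
SplitTableBatchedEmbeddingBagsCodegen-100000-16-16-131072-10 | Runtime (P90): 0.587584 ms | Memory (P90): 0.31 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-32-16-131072-10 | Runtime (P90): 0.722624 ms | Memory (P90): 0.51 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-64-16-131072-10 | Runtime (P90): 1.29395 ms | Memory (P90): 0.92 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-128-16-131072-10 | Runtime (P90): 2.73472 ms | Memory (P90): 1.7 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-256-16-131072-10 | Runtime (P90): 6.5608 ms | Memory (P90): 3.4 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-512-16-131072-10 | Runtime (P90): 14.8527 ms | Memory (P90): 6.6 GB
SplitTableBatchedEmbeddingBagsCodegen-100000-1024-16-131072-10 | Runtime (P90): 31.055 ms | Memory (P90): 13 GB
"""

# Assumption: the name has five numeric fields and the second one is emb-dim.
LINE_RE = re.compile(
    r"^[A-Za-z]+-\d+-(?P<dim>\d+)-\d+-\d+-\d+ \| "
    r"Runtime \(P90\): (?P<ms>[\d.]+) ms"
)


def parse_runtimes(text):
    """Return [(emb_dim, runtime_ms), ...] in the order the lines appear."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            rows.append((int(match.group("dim")), float(match.group("ms"))))
    return rows


def growth_ratios(rows):
    """Return [(emb_dim, runtime / previous_runtime), ...] for consecutive rows."""
    return [(cur[0], cur[1] / prev[1]) for prev, cur in zip(rows, rows[1:])]


if __name__ == "__main__":
    for label, text in (("small", SMALL_BATCH), ("large", LARGE_BATCH)):
        rows = parse_runtimes(text)
        print(f"{label} batch_size")
        for dim, ratio in growth_ratios(rows):
            print(f"  emb-dim {dim:>4}: {ratio:.2f}x the previous runtime")
```

With the numbers above, the small batch_size runtimes stay within roughly 10% of each other across all dimensions, while the large batch_size runtimes grow by about 2x per doubling from emb-dim 128 onward.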