test #53

Closed
wants to merge 118 commits into from

Conversation

liligwu (Collaborator) commented Dec 14, 2023

No description provided.

jithunnair-amd and others added 30 commits January 25, 2022 14:18
* Hipify code

* Add correctness check

* Revert "Add correctness check"

This reverts commit a7f169dcc862e5cc8102a39eb3b7882dfa888f1b.

* Fix setup.py

* Add run_all.sh

* Update Zipf index generation

Update the Zipf index generation to generate unique indices in each bag
and shuffle indices to avoid spatial locality

Code reference: https://github.com/pytorch/FBGEMM/blob/7588d9d804826b428fc0e4fd418e9cc3f7a72e52/fbgemm_gpu/bench/split_table_batched_embeddings_benchmark.py#L98-L117
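As an illustration of the scheme described above, here is a minimal NumPy sketch (the function name, skew parameter, and sizes are assumptions for this example, not the benchmark's actual values):

```python
import numpy as np

rng = np.random.default_rng(42)

def gen_zipf_bag(num_rows: int, bag_size: int, skew: float = 1.15) -> np.ndarray:
    # Keep drawing Zipf-distributed row ids until the bag holds
    # `bag_size` unique indices, as described above.
    unique: set = set()
    while len(unique) < bag_size:
        unique.update(int(i) % num_rows for i in rng.zipf(skew, size=2 * bag_size))
    bag = np.fromiter(list(unique)[:bag_size], dtype=np.int64)
    # Shuffle so consecutive lookups do not land on neighboring rows,
    # avoiding artificial spatial locality.
    rng.shuffle(bag)
    return bag

indices = np.concatenate([gen_zipf_bag(10**6, 20) for _ in range(8)])
```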

* Fix ROCm version check in fbgemm_gpu's setup.py

* Fix hipification errors

Modify code to fix hipification errors.  Some ops/kernels, including
merge_pooled_embeddings, quantize_ops, and
embedding_forward_quantized_split, are currently disabled.  These ops
will be enabled in the future.

* Disable AVX512 for AMD CPUs

The AMD CPUs targeted here do not support AVX512, so it has to be disabled in ROCm builds.
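A hedged sketch of how a setup.py could gate those flags off on AMD hosts; the helper name and flag list are hypothetical, not the actual change:

```python
def cpu_is_amd() -> bool:
    # Hypothetical helper: Linux reports "AuthenticAMD" as the vendor_id
    # in /proc/cpuinfo for AMD CPUs.
    try:
        with open("/proc/cpuinfo") as f:
            return "AuthenticAMD" in f.read()
    except OSError:
        return False

extra_compile_args = ["-O3", "-mavx2", "-mavx512f"]
if cpu_is_amd():
    # Drop all AVX512 codegen flags for the ROCm/AMD build.
    extra_compile_args = [a for a in extra_compile_args if not a.startswith("-mavx512")]
```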

* Update run_all.sh

* Fix __launch_bounds__ with kWarpSize.

* fix missing '#endif' in codegen/embedding_backward_code_generator.py

* fix the dependencies import in setup.py

* debug enum cudaMemoryAdvise

* bypass the issue that both cudaMemoryAdvise and cudaMemAdvise are mapped to hipMemAdvise, in cumem_utils.cu

* Build and import successfully, but with NaN values.

* NaN values are eliminated by bypassing `res.vals[0] = hfma2(...)`

* Remove debug lines in include/fbgemm_gpu/fbgemm_cuda_utils.cuh

Note: The tests of fbgemm-gpu do not pass. They will be addressed in future commits.

Co-authored-by: Sarunya Pumma <sarunya.pumma@amd.com>
Co-authored-by: Li Li <lili.cce.jlu@gmail.com>
Co-authored-by: liligwu <li.li3@amd.com>
Rocm4.3/develop. Use SHEFL_SYNC_MACRO to replace __shfl() and __shfl_sync()
* Change hipify dependency from torch.utils.torch_hipify to hipify_torch.

* add the third_party/hipify_torch to git repo
* unify function signature of jagged_xD_to_dense (pytorch#813)

Summary:
Pull Request resolved: pytorch#813

As title

Reviewed By: jiaqizhai, jianyuh

Differential Revision: D33066551

fbshipit-source-id: 8e2fd3c21f3bde67c6b20045681c2549e3583bd3

* Daily `arc lint --take CLANGFORMAT`

Reviewed By: zertosh

Differential Revision: D33183467

fbshipit-source-id: d7c37f3522a38e85891524c544eab4fdb01270de

* Assert Tensors allocated on GPU. (pytorch#819)

Summary:
Pull Request resolved: pytorch#819

Check inputs for correctness with respect to GPU allocation and device.

Reviewed By: jspark1105, jianyuh

Differential Revision: D33167469

fbshipit-source-id: 04f638d13bde93373d64cff1428ef743300400a6

* Support batched benchmark execution and fix benchmark stats reporting (pytorch#818)

Summary:
Pull Request resolved: pytorch#818

As title: support multiple executions of benchmark scripts and report aggregated metrics.

Further, require `--bag-size` argument to conform to input data file for proper metric accounting.

Reviewed By: jianyuh

Differential Revision: D33182257

fbshipit-source-id: a6eeeb25646c00665b6d29df9389eddab7618d4e

* Direct Convolution JIT assembly for KH=2, KW = 6

Summary:
this diff has specialized codegen for convolution case where KH=2 and KW=6

## Performance results on local devserver with AVX2 instruction:
1, 16, 16,     {2, 126}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0},     {1, 1}, {0, 0}, false
Fbgemm baseline:
3.8 GOPS
This diff:
9.2 GOPS

1, 64, 64,     {2, 257}, 1, {2, 6}, {1, 2}, {0, 0, 0, 0},     {1, 1}, {0, 0}, false
Fbgemm baseline:
43.8 GOPS
This diff:
61.2 GOPS

## How to invoke indirect convolution function:
**Offline:**
1. Weights need to be transposed to (oc/8) - (kh) - (kw) - (ic/4) - 8 - 4
2. Create the convolution function based on problem size:
```
       CodeGenBase<uint8_t, int8_t, int32_t, int32_t> codeObj;
       CodeGenBase<uint8_t, int8_t, int32_t, int32_t>::jit_micro_kernel_fp fn;
       fn = codeObj.getOrCreateDirectConv<inst_set_t::avx2>(
        true, conv_p.OUT_DIM[1], conv_p.IN_DIM[1] * conv_p.IC, conv_p.stride[1] * conv_p.IC);
```
3. Compute the *col_offsets* of weight tensor
4. Make sure you have allocated the space for: output tensor (Cint32_fb, Cint8_fb), and some temporary space for input rowsum (inSum: IN_DIM[0] x IN_DIM[1], rowSum: OUT_DIM[0] x OUT_DIM[1])

**Online:**
Make sure we have:
conv_p (the problem info), Aint8 (input tensor), bBuf_tr (the transposed weight tensor), Cint32_fb (the 32-bit results after accumulation), Cint8_fb (the final quantized 8-bit output).

```
// compute direct conv row sum
directConvRowSum(conv_p, Aint8.data(), inSum, rowSum, row_offsets.data());

// kernel for direct convolution
for (int oc = 0; oc < conv_p.OC; oc += 8) {
  fn(Aint8.data(),
     bBuf_tr.data() + oc * kernel_dim * conv_p.IC,
     bBuf_tr.data(),
     Cint32_fb.data() + oc,
     conv_p.IC * conv_p.K[1],
     conv_p.OC);
}

requantizationParams_t<> reqObj = {
    Aint8_zero_point, // Aq_zero_point
    Bint8_zero_point.data(),
    C_zero_point,
    C_multiplier.data(),
    rowSum, // row_offsets
    //row_offsets.data(),
    col_offsets.data(), // col_offsets
    nullptr, // bias
    static_cast<std::uint32_t>(conv_p.OC), // ncols
    1, // groups
    nullptr};

requantizeOutputProcessingAvx2<false, false, QuantizationGranularity::TENSOR, false, false>(
    Cint8_fb.data(),
    Cint32_ref.data(),
    {0, conv_p.OUT_DIM[1] * conv_p.OUT_DIM[0], 0, conv_p.OC},
    conv_p.OC,
    conv_p.OC,
    reqObj);
```
For more details please refer to test_asmjit2.cc

Reviewed By: dskhudia

Differential Revision: D31775222

fbshipit-source-id: 294450613b0978277e75d171d6a560124c14ecda

* suppress errors in `deeplearning/fbgemm/fbgemm_gpu`

Differential Revision: D33201593

fbshipit-source-id: 251f338e03dfde1dcc4a83c4ff9df1fe27840bdb

* fix copy right header of batch_benchmark_run.py (pytorch#820)

Summary:
Pull Request resolved: pytorch#820

As title

Reviewed By: jianyuh

Differential Revision: D33213812

fbshipit-source-id: d901e87ff1047ff969c99a330aa05c8d26e1954e

* Assert Tensors allocated on GPU for generated code. (pytorch#821)

Summary:
Pull Request resolved: pytorch#821

Check inputs for correctness with respect to GPU allocation and device.

Reviewed By: jspark1105

Differential Revision: D33189944

fbshipit-source-id: 36fb5eac677466e783ef5a754c28b6d838ea09b7

* Move all fbgemm_gpu provided Python ops to fbgemm namespace from fb namespace. (pytorch#823)

Summary: Pull Request resolved: pytorch#823

Reviewed By: jianyuh

Differential Revision: D33147038

fbshipit-source-id: fdcb667dfb920b4f04b7d0b08082afabe7213cc1

* Implement generic HBC by feature. (pytorch#822)

Summary:
Pull Request resolved: pytorch#822

Implement a generic version of HBC by feature, which takes in bin_boundaries.
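A rough sketch of the binning idea (a hypothetical simplification; the real op also handles per-feature segments, weights, and recalibration):

```python
import torch

def hbc_by_feature(logits: torch.Tensor, bin_boundaries: torch.Tensor,
                   bin_ctr: torch.Tensor) -> torch.Tensor:
    # Map each raw prediction into a bin defined by bin_boundaries, then
    # replace it with that bin's observed positive rate (bin_ctr).
    # bin_ctr needs len(bin_boundaries) + 1 entries, one per bucket.
    probs = torch.sigmoid(logits)
    bin_ids = torch.bucketize(probs, bin_boundaries)
    return bin_ctr[bin_ids]
```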

Reviewed By: jianyuh

Differential Revision: D33232676

fbshipit-source-id: 99c77f6d081fdc89699948a6c9482b8806f598a3

* Benchmark for newly added generic HBC by feature. (pytorch#826)

Summary:
Pull Request resolved: pytorch#826

More benchmarking for new op, and also add "double" for benchmarking type.

Reviewed By: jianyuh

Differential Revision: D33241845

fbshipit-source-id: 38f08f5453fd8d112ff55c046a6ac091c23bc3de

* Always set dontfork on managed Tensor + new uvm clone (pytorch#824)

Summary:
Pull Request resolved: pytorch#824

Workaround for S256045.
UVM Tensors are unmapped from the process page table on fork (spawn).
The UVM fault handler then slows down the UVM CPU<->CPU copy substantially while reestablishing those mappings.
The workaround sets MADV_DONTFORK on the addresses (rounded down to page size) of UVM allocations - this prevents the removal of UVM pages from the original process page table.
Additionally, this introduces a single-threaded UVM->CPU tensor copy to
1) Avoid 8 trainers on a host concurrently occupying all threads with copy_
2) Avoid high concurrency in the fault handler of the UVM kernel driver.
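The actual workaround lives in the C++ UVM allocator; purely as an illustration of the page-rounding plus madvise call, a Python ctypes sketch (Linux-only, names assumed):

```python
import ctypes
import mmap

libc = ctypes.CDLL("libc.so.6", use_errno=True)
MADV_DONTFORK = 10  # value from <sys/mman.h> on Linux
PAGE = mmap.PAGESIZE

def set_dontfork(addr: int, length: int) -> None:
    # Round the start address down to a page boundary, then advise the
    # kernel to keep these pages out of forked children so the UVM fault
    # handler never has to rebuild the parent's mappings.
    start = addr & ~(PAGE - 1)
    length += addr - start
    if libc.madvise(ctypes.c_void_p(start), ctypes.c_size_t(length), MADV_DONTFORK) != 0:
        raise OSError(ctypes.get_errno(), "madvise(MADV_DONTFORK) failed")
```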

Reviewed By: jianyuh

Differential Revision: D33192043

fbshipit-source-id: 094f3dcd302d455efbf4e912d58ed28756cb653f

* Use kWarpSize for warp size (pytorch#827)

Summary: Pull Request resolved: pytorch#827

Reviewed By: rweyrauch

Differential Revision: D33271792

fbshipit-source-id: dc66b6950b37e5d92c10406a3891568a7500e26e

* Move fb.embedding_bag_rowwise_prune to fbgemm_gpu OSS. (pytorch#825)

Summary:
Pull Request resolved: pytorch#825

Move the fb.embedding_bag_rowwise_prune op from caffe2/fb/sparsenn to fbgemm_gpu.

Reviewed By: jianyuh

Differential Revision: D33240318

fbshipit-source-id: 4db93a1ecd9666881779eeada1e3e493aa7525e4

* Allow optional Tensor args to be empty or on GPU. (pytorch#828)

Summary: Pull Request resolved: pytorch#828

Reviewed By: jianyuh

Differential Revision: D33267641

fbshipit-source-id: b193ee5b7e9ea946a20672760c320f29b217b998

* Add output_dtype to training TBE op for CPU (pytorch#829)

Summary:
Pull Request resolved: pytorch#829

This Diff adds `output_dtype` to `split_embedding_codegen_lookup_{{ optimizer }}_function_cpu()`. Note that the CUDA version (`split_embedding_codegen_lookup_{{ optimizer }}_function()`) already has this argument (D32399931 (pytorch@7e1183c)).

Reviewed By: jianyuh

Differential Revision: D32969921

fbshipit-source-id: 695e54434dc4f65f9f4c60782c60a550e38d97a7

* fix copyright header of tensor_assert_test.cpp (pytorch#831)

Summary:
Pull Request resolved: pytorch#831

As title

Reviewed By: rweyrauch

Differential Revision: D33310866

fbshipit-source-id: 1cbdee1d7c00f0e900faac570bac330866887b1c

* Add permute_pooled_embedding_modules_test into RE (pytorch#830)

Summary:
Pull Request resolved: pytorch#830

As title

Reviewed By: rweyrauch

Differential Revision: D33303898

fbshipit-source-id: c94a14bc398ecb58b68ca15d7e79204233ac67d1

* Use all to one op to do DtoD between remote and merge (pytorch#817)

Summary:
Pull Request resolved: pytorch#817

Previously we were simply calling `Tensor.to` to launch the DtoD copy. Since PyTorch does a two-way barrier for DtoD copies, all the DtoD copies are serialized even though they are launched from different devices.

See the blue DtoD copies in the graph below.
{F686842812}

At first I went for merge_pooled_embedding directly, but I forgot that MRS models also have sequence embeddings. Covering pooled embeddings is not enough in this case.

This diff introduced a function that takes in a tuple of IValues, moves the underlying tensors to a given target device, and then outputs a vector of IValues with the underlying tensors on the same device.

For each source device, we synchronize its current stream and launch all the copies for tensors in that device. Then we synchronize the current stream on target device to wait on all the copies.

Now the copies from different devices can run in parallel.
{F686843333}
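In spirit, the copy scheduling looks like the following Python sketch (the real implementation operates on IValues in C++; this is a simplified tensor-only approximation):

```python
import torch

def copy_all_to(tensors, target: torch.device):
    outputs, by_src = [None] * len(tensors), {}
    for i, t in enumerate(tensors):
        by_src.setdefault(t.device, []).append(i)
    for src, idxs in by_src.items():
        if src == target:
            for i in idxs:
                outputs[i] = tensors[i]
            continue
        # Wait for pending work on the source, then launch all of its
        # copies; copies from different sources overlap with each other.
        torch.cuda.current_stream(src).synchronize()
        for i in idxs:
            outputs[i] = tensors[i].to(target, non_blocking=True)
    # Wait on the target so downstream ops observe completed copies.
    torch.cuda.current_stream(target).synchronize()
    return outputs
```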

Reviewed By: yinghai, jianyuh, houseroad

Differential Revision: D33065710

fbshipit-source-id: f479fa2ea20702e14419c8b87024a87d5bbb1a68

* Add MSFP option for ads hpc model numeric emulations (pytorch#832)

Summary:
Pull Request resolved: pytorch#832

Add fake conversions between MSFP and fp32 in both forward and backward pass of the hpc ads model training.

TODO: Add compute kernels that split the FC operator into gemms for column_blocks of activations and row_blocks of weights

Reviewed By: jspark1105

Differential Revision: D30942234

fbshipit-source-id: 601d671fd00622304a50651dedffd0de3ae01ae0

* Remove benchmark CMakeLists.txt (pytorch#835)

Summary:
Pull Request resolved: pytorch#835

As title. This file is no longer needed after we decided to support the setup.py-only OSS build approach.

Reviewed By: jspark1105, rweyrauch

Differential Revision: D33318121

fbshipit-source-id: 4f71b23f6e9e7e78d50fab20af53cdf9f63844ad

* Increase code reuse between FP32, FP16, INT8, INT4 embedding types for infer TBE (pytorch#833)

Summary:
Pull Request resolved: pytorch#833

We merge the implementation for {FP32, FP16, INT8, INT4} weights in inference TBE into one unified template and increase the code reuse between these implementations. This will pave the way for the future enhancements (no need to change all 4 implementations for one new feature).

Reviewed By: rweyrauch

Differential Revision: D33343450

fbshipit-source-id: 24e59c4a2df5ef3da353535eb879a2365293bc1f

* minimize functions defined in headers (pytorch#836)

Summary:
Pull Request resolved: pytorch#836

There was a lot of code in header files that didn't need to be there.
Split long source files.
Put experimental quantization functions into the experimental namespace.

Reviewed By: rweyrauch

Differential Revision: D33358916

fbshipit-source-id: cffcec344cbe565045ee2c564ce1cef529de4cf8

* add missing C10_CUDA_KERNEL_LAUNCH_CHECK (pytorch#837)

Summary:
Pull Request resolved: pytorch#837

As title

Reviewed By: rweyrauch

Differential Revision: D33359025

fbshipit-source-id: 162dd2897a5d56e7ac8ff3ba9ae5c8689961204b

* Add seq embedding kernel for infer TBE (pytorch#834)

Summary:
Pull Request resolved: pytorch#834

- Add sequence embedding support in infer TBE kernel

- TODO: "mask" solution for the duplicated embedding row access. cc jspark1105

Reviewed By: jspark1105

Differential Revision: D33341863

fbshipit-source-id: 47babe921dbaf086e2df92f4693b4718c01bcec1

* add missing new files to CMakeLists.txt (pytorch#838)

Summary:
Pull Request resolved: pytorch#838

This was missed in D33358916 (pytorch@38a6c35)

Reviewed By: colin2328

Differential Revision: D33370387

fbshipit-source-id: 72007f51afd6757690a1898098e8b6207c3c487b

* Support int32_t indices/offsets for caching handling logics (pytorch#811)

Summary:
Pull Request resolved: pytorch#811

In training, we assume the indices / offsets are int64_t for embedding (TBE), but in inference, we assume the indices / offsets are int32_t.

This Diff enables both int32_t and int64_t supports for the caching logics so that we can reuse the same functions for both training and inference, while reducing the extra overhead to convert the indices/offsets from int to long or vice versa.

Reviewed By: jspark1105

Differential Revision: D33045589

fbshipit-source-id: 4e508a1095536a629bdab8e5577db74310032b23

* Add seq embedding benchmark

Summary: 5x ~ 10x speedup at the benchmark level.

Reviewed By: jspark1105

Differential Revision: D33355933

fbshipit-source-id: 2c609ae9ec5fd4fda48dbafa13b5eb75900fdf5f

* fix warning count check in test_bounds_check (pytorch#839)

Summary:
Pull Request resolved: pytorch#839

On GPU, multiple threads in a thread block can increase the warning count for the same bounds errors in the offset array.

Reviewed By: jianyuh

Differential Revision: D33379301

fbshipit-source-id: b00520cc613bb7e15c9f8cd4bdf0c61bd4dbd83b

* fix typo in CMakeLists.txt (pytorch#840)

Summary:
Pull Request resolved: pytorch#840

Fixing a silly typo

Reviewed By: jianyuh

Differential Revision: D33380967

fbshipit-source-id: 8220cc87a2564107cb124d3f9c31b8d92cb7d1a4

* Slight perf optimization for infer TBE (pytorch#843)

Summary:
Pull Request resolved: pytorch#843

~5% perf improvement for INT4 / INT8 inference TBE on A100 GPUs.

Reviewed By: jspark1105

Differential Revision: D33388153

fbshipit-source-id: 63566e3dccd9ce4775abb3374251f9046512e131

* extract embedding input transpose out of embedding_backward_split_template.cu (pytorch#841)

Summary:
Pull Request resolved: pytorch#841

Refactoring to prepare for D33381126.
Other minor changes
* Remove unused sorted_linear_indices_run_lengths parameter from bwd kernels

Reviewed By: jianyuh

Differential Revision: D33380032

fbshipit-source-id: b880cc3745a6f6dd63319109e753a470d6c28c49

* increase parallelism in batched unary embeddings backward (pytorch#842)

Summary:
Pull Request resolved: pytorch#842

Sort indices and have each thread handle indices with the same values (called a run in the code)
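The run decomposition can be pictured with a small PyTorch snippet; this is only the host-side idea, not the CUDA kernel itself:

```python
import torch

indices = torch.tensor([3, 1, 3, 7, 1, 1, 3])
sorted_idx, _ = indices.sort()
# A "run" is a maximal stretch of equal values in the sorted indices;
# one thread can accumulate a whole run's gradients without atomics.
runs, run_lengths = torch.unique_consecutive(sorted_idx, return_counts=True)
print(runs)         # tensor([1, 3, 7])
print(run_lengths)  # tensor([3, 3, 1])
```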

Reviewed By: jianyuh

Differential Revision: D33381126

fbshipit-source-id: aec1c0be619b9072f5a1f9273b66c03e5106ca02

* use DISPATCH_TO_CUDA macro (pytorch#845)

Summary:
Pull Request resolved: pytorch#845

We should use the macro consistently or just drop it.

Reviewed By: jianyuh

Differential Revision: D33392682

fbshipit-source-id: bd99286f55fe2d6e5bab231ec65dae02f16f35c2

* Follow-up comments (pytorch#844)

Summary: Pull Request resolved: pytorch#844

Reviewed By: jspark1105

Differential Revision: D33393019

fbshipit-source-id: 1df7d8457a950a829f7ff2fe6f47595afdc9cc26

* HIP extension support for FBGEMM_GPU (pytorch#846)

Summary: Pull Request resolved: pytorch#846

Reviewed By: jspark1105

Differential Revision: D33231489

fbshipit-source-id: 6bd46ddee45c767ad25c2d52b6c05030bba94082

* correct the max_shared_bytes evaluation logic in embedding_backward_split_template.cu

* IFU from upstream commit c6df576 to main. fbgemm-gpu is built and imported. Tests do NOT pass.

Co-authored-by: Xing Liu <xingl@fb.com>
Co-authored-by: CodemodService FBSourceClangFormatLinterBot <>
Co-authored-by: Rick Weyrauch <weyrauch@fb.com>
Co-authored-by: Martin Schatz <mschatz@fb.com>
Co-authored-by: Jiyuan Zhang <jiyuanz@fb.com>
Co-authored-by: Jongsoo Park <jongsoo@fb.com>
Co-authored-by: Jason Park <jasonjk@fb.com>
Co-authored-by: Stephan Uphoff <ups@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@fb.com>
Co-authored-by: Shintaro Iwasaki <siwasaki@fb.com>
Co-authored-by: Shiyan Deng <dsy842974287@fb.com>
Co-authored-by: Summer Deng <summerdeng@fb.com>
* Added skipIfRocm and TEST_WITH_ROCM in split_table_batched_embeddings_test. Added __any_sync_fbgemm to replace __any_sync. 26 tests ran in split_table_batched_embeddings_test, 10 skipped.

* Renamed __any_sync_fbgemm to __any_sync and changed its implementation to a more generic one. Added a 'reason' message to skipIfRocm.

* Enabled use_array_for_index_remapping in test_nbit_forward_int and test_nbit_forward_fp. Enabled test_nbit_forward_pruning.

* Deleted 'assert(false)' statements that are related to the __any_sync function.
…ove @skipIfRocm for TestFused8BitRowwiseQuantizationConversion and TestFusedNBitRowwiseQuantizationConversion
…h_to_new_commit

Pointing hipify_torch to the newer commit.
* An attempt at matching upstream setup.py.

* Move hipify() to CMakeList.txt.

* Removing hipify from the python script.

* Matching upstream setup.py

* #Removing the unnecessary functions and statements in Hip.cmake. #Reworking some of the compilation option lists in CMakeLists.txt.

* Updating hipify_torch (CMake API)

* #Adding automatic detection for CUDA and ROCm. #Removing the debug code in embedding_backward_code_generator.py. #Adding 'gfx90a' to FBGEMM_ROCM_ARCH. #Minor changes to messages and indentation.
* Enable merge_pooled_embeddings op. in ROCm

* Enabling the merge pool ops.

Co-authored-by: liligwu <lili.cce.jlu@gmail.com>
======================================================================
Two test failures:
======================================================================
ERROR: test_generic_histogram_binning_calibration_by_feature (__main__.SparseOpsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "sparse_ops_test.py", line 1500, in test_generic_histogram_binning_calibration_by_feature
    data_type=st.sampled_from([torch.half, torch.float32]),
  File "/opt/conda/lib/python3.7/site-packages/hypothesis/core.py", line 1220, in wrapped_test
    raise the_error_hypothesis_found
  File "sparse_ops_test.py", line 1543, in test_generic_histogram_binning_calibration_by_feature
    bin_ctr_weight_value=0.9995,
RuntimeError: expected scalar type Long but found Int

----------------------------------------------------------------------

FAIL: test_lxu_cache_lookup (__main__.SplitTableBatchedEmbeddingsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "split_table_batched_embeddings_test.py", line 3994, in test_lxu_cache_lookup
    dtype=torch.int,
AssertionError: False is not true

----------------------------------------------------------------------
Ran 35 tests in 759.368s

FAILED (failures=1)
…CM_ARCH. # Enabling building on Pytorch 1.11.
liligwu and others added 27 commits March 9, 2023 15:40
* using different mechanism for host mapped pinned memory (pytorch#1638)

Summary:
Pull Request resolved: pytorch#1638

This diff adds another mechanism for allocating the host mapped pinned memory to reduce adverse effects on other processes running on the same host when one process is doing some large allocations.

Reviewed By: zyan0, jianyuh

Differential Revision: D43950253

fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686

* disable use_cpu test (pytorch#1635)

Summary:
This PR addresses the issue pytorch#1636

akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009

Pull Request resolved: pytorch#1635

Reviewed By: shintaro-iwasaki

Differential Revision: D44033725

Pulled By: q10

fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d

* Update API interface and reroute backend for exact_rowwise_adagrad FE when using freq based methods (pytorch#1352)

Summary:
Pull Request resolved: pytorch#1352

1. Update the interface to accommodate rowwise_adagrad_with_counter.
2. Route the backend for rowwise_adagrad to the new rowwise_adagrad_with_counter when freq-based methods (e.g., freq SGD, counter-adjusted regularization) are used.

Reviewed By: csmiler

Differential Revision: D36788395

fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443

* Remove sync point in jagged_dense_elementwise_add_jagged_output backward (pytorch#1642)

Summary:
Pull Request resolved: pytorch#1642

Remove sync point in jagged_dense_elementwise_add_jagged_output backward

Reviewed By: brad-mengchi

Differential Revision: D44039901

fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d

* Add Comprehensive Build Instructions and Isolate CPU and ROCm Builds (pytorch#1639)

Summary:
- Remove `.post0` suffix from the autogenerated package version
- Document the full FBGEMM_GPU OSS build process in a separate Markdown file
- Remove installation of packages not needed for ROCm builds
- Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare metal instances
- Update GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR

Pull Request resolved: pytorch#1639

Reviewed By: shintaro-iwasaki

Differential Revision: D44076312

Pulled By: q10

fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283

* include cstdint (pytorch#1640)

Summary:
fix build with gcc-13

Pull Request resolved: pytorch#1640

Reviewed By: shintaro-iwasaki

Differential Revision: D44044422

Pulled By: q10

fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033

* Add support for group size > 54 in group_index_select (pytorch#1611)

Summary:
Pull Request resolved: pytorch#1611

If group size is larger than 54, internally breaks the group down into
smaller groups (each subgroup size is less than or equal to 54).
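The decomposition is equivalent to simple chunking, e.g.:

```python
def split_into_subgroups(group: list, max_size: int = 54) -> list:
    # Break a large group into subgroups of at most `max_size` entries.
    return [group[i : i + max_size] for i in range(0, len(group), max_size)]

subgroups = split_into_subgroups(list(range(130)))
assert all(len(g) <= 54 for g in subgroups)
assert sum(len(g) for g in subgroups) == 130  # nothing lost
```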

Reviewed By: jianyuh

Differential Revision: D43585937

fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326

* Implement cache miss emulation in UVM_CACHING (pytorch#1637)

Summary:
Pull Request resolved: pytorch#1637

Enforce cache misses (even if trace-driven testing doesn't experience cache miss due to limited trace size) so that we can evaluate performance under cache misses.

Note that it's not exactly cache misses; enforce access to UVM by overriding lxu_cache_locations -- N / 256 requests.
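Conceptually, the override looks like this sketch (assuming -1 is the cache-miss sentinel in lxu_cache_locations; the slot rule here is illustrative):

```python
import torch

def emulate_cache_miss(lxu_cache_locations: torch.Tensor,
                       enforced_misses_per_256: int) -> torch.Tensor:
    # Force roughly N of every 256 lookups onto the UVM path by marking
    # their cache location as missing.
    slot = torch.arange(lxu_cache_locations.numel(),
                        device=lxu_cache_locations.device) % 256
    out = lxu_cache_locations.clone()
    out[slot < enforced_misses_per_256] = -1  # -1 == miss (assumed sentinel)
    return out
```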

Reviewed By: YuzeDaiMeta

Differential Revision: D42194019

fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725

* Add TensorAccessor with memcheck (pytorch#1602)

Summary:
Pull Request resolved: pytorch#1602

Illegal memory access is a common problem during GPU kernel execution.
The FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and
the CUDA runtime to detect such problems and throw an error.  However,
there are a few known issues with this approach.

(1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host.
However, due to the non-blocking, asynchronous nature of GPU kernel
execution, the error is caught on the host at a later point than where
the problematic kernel was launched.  This can cause the stack trace
to be inaccurate and make debugging more difficult.  Although the
issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`,
this can change the state of the execution and cause Heisenbugs.

(2) Not all illegal memory accesses are caught by the runtime.  This
means that the system may not always throw an error when illegal
memory access occurs.

(3) Although the runtime throws an error for illegal memory access, it
is difficult to pinpoint the specific kernel and memory buffer/address
that is causing the problem.

For all the aforementioned reasons, we attempt to catch and throw an
error as soon as possible in the kernel when illegal memory accesses
occur in FBGEMM GPU.  We introduce the `FBGEMM_GPU_MEMCHECK` flag
to enable memory checking during compile time.  We copy PyTorch's
`TensorAccessor.h` into the FBGEMM GPU and extend it to check every
memory access through the `PackedTensorAccessor`.  If an invalid memory
access occurs, we throw an error using `CUDA_KERNEL_ASSERT`.  The error
message includes the name of the tensor and the kernel that caused the
problem.

If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use
`fbgemm::PackedTensorAccessor`.  Otherwise, they will use
`at::PackedTensorAccessor`

`FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in
subsequent diffs

Reviewed By: r-barnes

Differential Revision: D43421838

fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167

* Fix compiling with Xcode 14.3 (pytorch#1648)

Summary:
Pull Request resolved: pytorch#1648

This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3. So change the workaround to only include up to 14.0.2.

Reviewed By: MatzeB

Differential Revision: D44130421

fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd

* Add support for building FBGEMM_GPU against Python 3.11 in OSS (pytorch#1646)

Summary:
- Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial
- Move the FBGEMM CI builds to run inside Docker containers
- Add support for building FBGEMM_GPU against Python 3.11 in OSS
- Move all FBGEMM_GPU nightly and release build jobs to run inside `amazonlinux:2023` Docker container
- Assuming no build errors or resource starvation, the full OSS build process now runs under 30 minutes.

Pull Request resolved: pytorch#1646

Reviewed By: shintaro-iwasaki

Differential Revision: D44157228

Pulled By: q10

fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53

* Remove magic numbers from fbgemm/Types.h (pytorch#1629)

Summary:
Pull Request resolved: pytorch#1629

Replaces magic numbers with constexpr variables

Reviewed By: sryap

Differential Revision: D43776442

fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa

* added check to avoid div 0 errors in cache report (pytorch#1645)

Summary:
Pull Request resolved: pytorch#1645

as in title

Reviewed By: jianyuh

Differential Revision: D44096435

fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672

* jagged_dense_bmm operator optimization (pytorch#1643)

Summary:
Pull Request resolved: pytorch#1643

This diff optimizes the jagged_dense_bmm operator with the following optimizations:
* tiling across thread blocks, and use GPU shared memory for thread block
* tiling across threads within a thread block, and use registers for each thread

Reviewed By: brad-mengchi

Differential Revision: D43674845

fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2

* jagged_dense_bmm: fix ROCm test failures (pytorch#1655)

Summary:
This patch fixes test failures on AMD GPUs.

1. Remove `__restrict__ `. I don't think it is needed even for CUDA, but it confuses HIPCC.
2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have `+=` operator for the type of `blockIdx.z`, causing a compilation error. We observed that this issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now. We should revisit and use `auto` later. See this for details: ROCm/hipamd@86a1634

Pull Request resolved: pytorch#1655

Test Plan: GitHub Actions' AMD CI

Reviewed By: q10, brad-mengchi

Differential Revision: D44242622

Pulled By: shintaro-iwasaki

fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174

* Support embedding dim 1024 ~ 2048 (pytorch#1656)

Summary:
Pull Request resolved: pytorch#1656

wushirong reported the failure on https://fburl.com/code/hae91ra7 .

- The embedding config is from  f418615450 .
- `max_int8_128b_rows` is 10 --> D = 1280

Our embedding dim has grown to 1024 + ?

Note that the static shared memory can only go up to 48 KB:

> Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)

in https://docs.nvidia.com/cuda/cuda-c-programming-guide/

for ptx shared mem error:
```
[2023-03-21T22:04:33.899-07:00] ptxas error   : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4E
Lm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDiv
isorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
```

Currently we reduce `InputRowsInFlight` to bypass the issue; the static shared memory used in the kernel is:
```
  typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
  __shared__ AllBuffers buffers;
```

Long term, we can change the static shared memory to dynamic shared memory, and increase the shared memory size to be 64 KB +.
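The budget arithmetic can be checked by hand; the template values below are assumptions picked to show how the buffer blows past the limit, not the kernel's exact parameters:

```python
SIZEOF_UINT4 = 16           # bytes
STATIC_SMEM_LIMIT = 0xC000  # 48 KB static shared memory per block

# Assumed template parameters, for illustration only.
WarpsPerBlock, OutputRowsPerThread, InputRowsInFlight, NumUint4LoadsPerRow = 2, 4, 32, 16

buffer_bytes = (WarpsPerBlock * OutputRowsPerThread * InputRowsInFlight
                * NumUint4LoadsPerRow * SIZEOF_UINT4)
print(hex(buffer_bytes))                 # 0x10000 -> already over budget
print(buffer_bytes > STATIC_SMEM_LIMIT)  # True, matching the ptxas error above
```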

Reviewed By: wushirong

Differential Revision: D44270081

fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a

* Containerize the remaining FBGEMM_GPU CI jobs (pytorch#1658)

Summary:
- Containerize the remaining FBGEMM_GPU CI jobs
- Add Conda cleanups to make PyTorch and CUDA installs more reliable
- Update post-install checks for PyTorch to work with ROCm
- Update the CI to continue running on jobs that fail on just a few variants
- Use PIP to install PyTorch GPU nightly as the nightly packages show up in PIP more reliably than in Conda

Pull Request resolved: pytorch#1658

Reviewed By: shintaro-iwasaki

Differential Revision: D44306708

Pulled By: q10

fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d

* Add tbe_input_combine_with_length for GPU (pytorch#1647)

Summary:
Pull Request resolved: pytorch#1647

Implement `tbe_input_combine_with_length` for GPU.  The operator takes
3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`)
and concatenates each one into a single tensor.  Implicit type casting
is also performed if the input types are different from the output
types.  `indices` and `lengths` tensors can be of type `int32_t` or
`int64_t`.  The outputs for `indices` concatenation and `lengths`
concatenation are fixed to `int32_t`.  `per_sample_weights` must be
`float`.
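A hypothetical call shape, assuming the op is registered as torch.ops.fbgemm.tbe_input_combine_with_length (check the actual signature before relying on this):

```python
import torch
import fbgemm_gpu  # registers the fbgemm ops; assumed import path

indices_list = [torch.tensor([1, 2], dtype=torch.int64, device="cuda"),
                torch.tensor([3], dtype=torch.int32, device="cuda")]
lengths_list = [torch.tensor([2], dtype=torch.int64, device="cuda"),
                torch.tensor([1], dtype=torch.int32, device="cuda")]
weights_list = [torch.tensor([1.0, 0.5], device="cuda"),
                torch.tensor([2.0], device="cuda")]

combined = torch.ops.fbgemm.tbe_input_combine_with_length(
    indices_list, lengths_list, weights_list)
# Per the notes above, combined indices/lengths come back as int32 and
# weights as float, regardless of the (int32/int64) input dtypes.
```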

Reviewed By: bangshengtang

Differential Revision: D44076452

fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f

* jagged_jagged_bmm operator optimization (pytorch#1644)

Summary:
Pull Request resolved: pytorch#1644

This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory.

Reviewed By: brad-mengchi

Differential Revision: D44029528

fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d

* Specify device to emulate_cache_miss kernel (pytorch#1660)

Summary:
Pull Request resolved: pytorch#1660

When emulate cache miss was enabled, it caused illegal memory access if we're using more than one GPU. It turns out that the previous diff didn't specify the device within the emulate_cache_miss kernel.

This diff fixes it. In addition, it cleans things up a bit (e.g., there is no need to use an index_t-based kernel launch for the emulate_cache_miss kernel, as lxu_cache_locations is always int32_t).

Reviewed By: sryap, YuzeDaiMeta

Differential Revision: D44340131

fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1

* Add C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (pytorch#1652)

Summary:
- Add C++17 support for the entire FBGEMM_GPU build
- Add C++17 support for the entire FBGEMM build
- Update FBGEMM tests and benchmarks to be C++17-compatible
- Make FBGEMM builds output more logging
- Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported

Pull Request resolved: pytorch#1652

Reviewed By: shintaro-iwasaki

Differential Revision: D44287321

Pulled By: q10

fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17

* Prune CPU/GPU TBE optimizer codegen (pytorch#1659)

Summary:
Pull Request resolved: pytorch#1659

This diff aims to reduce the build time and library size of
`//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`.

The diff modifies the build target to generate and compile only the
necessary files. This is based on the fact that CPU and GPU do not
support all optimizers in `SplitTBE`.  (Before this diff, all optimizers
were generated and compiled for both CPU and GPU.)

The following is the list of supported optimizers

| OptimType | Generated optimizer | Supported on CPU | Supported on GPU |
|---|---|---|---|
| EXACT_ADAGRAD | adagrad | x | x |
| EXACT_ROWWISE_ADAGRAD | rowwise_adagrad_with_counter | x | x |
| | rowwise_adagrad | x | x |
| EXACT_ROWWISE_WEIGHTED_ADAGRAD | rowwise_weighted_adagrad | x | x |
| EXACT_SGD | sgd | x | x |
| SGD | approx_sgd | x | x |
| ROWWISE_ADAGRAD | approx_rowwise_adagrad_with_counter | x | |
| | approx_rowwise_adagrad | x | |
| ADAM | adam | | x |
| LAMB | lamb | | x |
| LARS_SGD | lars_sgd | | x |
| PARTIAL_ROWWISE_ADAM | partial_rowwise_adam | | x |
| PARTIAL_ROWWISE_LAMB | partial_rowwise_lamb | | x |
| - | rowwise_adagrad_with_weight_decay | | |
| - | approx_rowwise_adagrad_with_weight_decay | | |

Note: x = supported

Reviewed By: jianyuh

Differential Revision: D44326540

fbshipit-source-id: 02413256b4a675f13ada8e8820820cb5112cb405

* Fix the Documentation Build Job (pytorch#1673)

Summary:
- Rewrite the documentation builds job to use the build infrastructure tooling
- Rename workflow files for consistency

Pull Request resolved: pytorch#1673

Reviewed By: shintaro-iwasaki

Differential Revision: D44472660

Pulled By: q10

fbshipit-source-id: 60434c1f7098b7efa8c750133bb22f14fc98d5dc

* Back out "Prune CPU/GPU TBE optimizer codegen" (pytorch#1675)

Summary:
Pull Request resolved: pytorch#1675

Original commit changeset: 02413256b4a6

Original Phabricator Diff: D44326540

Reviewed By: q10, jianyuh

Differential Revision: D44475251

fbshipit-source-id: 5be66944a833e03a2737fc6d1baaa5c351455b2c

* Prepare bounds_check_indices for VBE (pytorch#1633)

Summary:
Pull Request resolved: pytorch#1633

Prepare `bounds_check_indices` for variable batch size TBE (VBE).

- Update the frontend API to accept VBE args
- Update the backend logic to process VBE data

Reviewed By: jianyuh

Differential Revision: D43253703

fbshipit-source-id: 2870f0c41a96265650281a9b6362d4e6dc48009b

* Move pruning/index_remapping support to embedding inplace update files (pytorch#1667)

Summary:
Pull Request resolved: pytorch#1667

As title. This diff moves pruning/index_remapping support to embedding inplace update files.

Reviewed By: jianyuh

Differential Revision: D44409419

fbshipit-source-id: 93fc91d83502eb95cb0feca2a8a03b003c336078

* jagged_softmax forward optimization (pytorch#1661)

Summary:
Pull Request resolved: pytorch#1661

This diff optimizes jagged_softmax forward with a more efficient reduction from the cub library.

Reviewed By: brad-mengchi

Differential Revision: D44161021

fbshipit-source-id: bf2e059d14ef4d7ad311edac65155a463ba653ff

* jagged_softmax backward optimization (pytorch#1662)

Summary:
Pull Request resolved: pytorch#1662

This diff optimizes jagged_softmax backward with a more efficient reduction from the cub library.

Reviewed By: brad-mengchi

Differential Revision: D44205819

fbshipit-source-id: cd1d7a886d6ba68201dc1ad782c2e8cde7ff706b

* multi-gpu all_to_one improvements (pytorch#1674)

Summary:
Pull Request resolved: pytorch#1674

improved multi-gpu all_to_one with:
	1. new intermediate hop selection taking advantage of distinct NVLinks
	2. overlapping of intermediate hop transfers with each other and with direct-peer transfers

Reviewed By: doehyun

Differential Revision: D44285941

fbshipit-source-id: 0202083f04388b5ba60b8155809433f334993ef4

* Extract and export weights offsets/placements initialization functions (pytorch#1669)

Summary:
Pull Request resolved: pytorch#1669

Extract portions initializing the weights_placements/offsets tensors into separate functions and jit.export them.
SplitState is converted to a NamedTuple since we can't jit.script a dataclass that also holds an enum.
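A minimal sketch of why the NamedTuple conversion helps (field names are hypothetical; TorchScript can script a NamedTuple but not a dataclass holding an enum member):

```python
from typing import List, NamedTuple

import torch

class SplitState(NamedTuple):
    dev_size: int
    host_size: int
    uvm_size: int
    placements: List[int]  # EmbeddingLocation enum values carried as ints
    offsets: List[int]

class Lookup(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.register_buffer("weights_placements", torch.zeros(0, dtype=torch.int32))
        self.register_buffer("weights_offsets", torch.zeros(0, dtype=torch.int64))

    @torch.jit.export
    def initialize_placements_offsets(self, state: SplitState) -> None:
        # Exported so weight-loading code can call it on the scripted
        # module without re-running full initialization.
        self.weights_placements = torch.tensor(state.placements, dtype=torch.int32)
        self.weights_offsets = torch.tensor(state.offsets, dtype=torch.int64)
```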

Reviewed By: houseroad

Differential Revision: D44338256

fbshipit-source-id: e1c12e5956f7217d51cd190958c3764d220e521d

* Fix the ROCm Test Job (pytorch#1668)

Summary:
- Clean up the ROCm test job and re-enable ROCm testing on the rocm instances.
- Update the build scripts framework to build FBGEMM_GPU against the correct hardware target that it is intended to be tested on.  One thing that was discovered was that if FBGEMM_GPU was built with `PYTORCH_ROCM_ARCH=gfx90a` but run on `gfx908` target, the tests will fail with a segfault.  While the failure is expected, the segfault can be unfriendly and confusing for users.
- Enable correct compilation of `merge_pooled_embeddings` operator under ROCm
- Fix existing code in `jagged_tensor_ops` from PR pytorch#1661 and pytorch#1662 that break its compilation under ROCm 5.3

Pull Request resolved: pytorch#1668

Reviewed By: shintaro-iwasaki

Differential Revision: D44453594

Pulled By: q10

fbshipit-source-id: 2030cd0e00c6ff9694c2783dfd62c31cf5543da2

* Use exported functions instead of calling initialize_weights in weights loading (pytorch#1676)

Summary:
Pull Request resolved: pytorch#1676

Export a function to reset the embedding specs by target location

Reviewed By: RoshanPAN, houseroad

Differential Revision: D44338258

fbshipit-source-id: 502733e9f3a164450a02656d2822492fbf69f994

* Extract index remappings array initialization and jit.export it (pytorch#1670)

Summary:
Pull Request resolved: pytorch#1670

ATT

Reviewed By: RoshanPAN, houseroad

Differential Revision: D44338257

fbshipit-source-id: c091666c7a4d294c283f5e3774d0494089fc3478

* update hipify_torch and remove the manual mapping of the C10 macros

---------

Co-authored-by: Banit Agrawal <bagrawal@meta.com>
Co-authored-by: Sabin Devkota <devkotasabin@meta.com>
Co-authored-by: Junjie Yang <junjieyang@meta.com>
Co-authored-by: Benson Ma <bensonma415@meta.com>
Co-authored-by: Alfredo Tupone <tupone@gentoo.org>
Co-authored-by: Sarunya Pumma <sarunya@meta.com>
Co-authored-by: Doe Hyun Yoon <dhyoon@meta.com>
Co-authored-by: Matt Galloway <mattjgalloway@meta.com>
Co-authored-by: Richard Barnes <rbarnes@meta.com>
Co-authored-by: Xiao Sun <sunx@meta.com>
Co-authored-by: Rengan Xu <renganxu@meta.com>
Co-authored-by: siwasaki <siwasaki@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@meta.com>
Co-authored-by: Yue Dong <yoyoyod@meta.com>
Co-authored-by: Geet Sethi <gsethi@meta.com>
Co-authored-by: Janet Yang <qxy11@meta.com>
* using different mechanism for host mapped pinned memory (pytorch#1638)

Summary:
Pull Request resolved: pytorch#1638

This diff adds another mechanism for allocating the host mapped pinned memory to reduce adverse affect on other processes running on the same host when one process is doing some large allocations.

Reviewed By: zyan0, jianyuh

Differential Revision: D43950253

fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686

* disable use_cpu test (pytorch#1635)

Summary:
This PR addresses the issue pytorch#1636

akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009

Pull Request resolved: pytorch#1635

Reviewed By: shintaro-iwasaki

Differential Revision: D44033725

Pulled By: q10

fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d

* Update API interface and reroute backend for exact_rowwise_adagrad FE when using freq based methods (pytorch#1352)

Summary:
Pull Request resolved: pytorch#1352

1. Update interface to accomadate rowwise_adagrad_with_counter.
2. Route backend for rowwise_adagrad to the new rowwise_adagrad_with_counter when freq based methods (e.g. freq sgd, counter adjusted regularization) are used.

Reviewed By: csmiler

Differential Revision: D36788395

fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443

* Remove sync point in jagged_dense_elementwise_add_jagged_output backward (pytorch#1642)

Summary:
Pull Request resolved: pytorch#1642

Remove sync point in jagged_dense_elementwise_add_jagged_output backward

Reviewed By: brad-mengchi

Differential Revision: D44039901

fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d

* Add Comprehensive Build Instructions and Isolate CPU and ROCm Builds (pytorch#1639)

Summary:
- Remove `.post0` suffix from the autogenerated package version
- Document the full FBGEMM_GPU OSS build process in a separate Markdown file
- Remove installation of packages not needed for ROCm builds
- Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare metal instances
- Update GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR

Pull Request resolved: pytorch#1639

Reviewed By: shintaro-iwasaki

Differential Revision: D44076312

Pulled By: q10

fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283

* include cstdint (pytorch#1640)

Summary:
fix build with gcc-13

Pull Request resolved: pytorch#1640

Reviewed By: shintaro-iwasaki

Differential Revision: D44044422

Pulled By: q10

fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033

* Add support for group size > 54 in group_index_select (pytorch#1611)

Summary:
Pull Request resolved: pytorch#1611

If group size is larger than 54, internally breaks the group down into
smaller groups (each subgroup size is less than or equal to 54).

Reviewed By: jianyuh

Differential Revision: D43585937

fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326

* Implement cache miss emulation in UVM_CACHING (pytorch#1637)

Summary:
Pull Request resolved: pytorch#1637

Enforce cache misses (even if trace-driven testing doesn't experience cache miss due to limited trace size) so that we can evaluate performance under cache misses.

Note that it's not exactly cache misses; enforce access to UVM by overriding lxu_cache_locations -- N / 256 requests.

Reviewed By: YuzeDaiMeta

Differential Revision: D42194019

fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725

* Add TensorAccessor with memcheck (pytorch#1602)

Summary:
Pull Request resolved: pytorch#1602

Illegal memory access is a common problem during GPU kernel execution.
The FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and
the CUDA runtime to detect such problems and throw an error.  However,
there are a few known issues with this approach.

(1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host.
However, due to the non-blocking, asynchronous nature of GPU kernel
execution, the error is caught on the host at a later point than where
the problematic kernel was launched.  This can cause the stack trace
to be inaccurate and make debugging more difficult.  Although the
issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`,
this can change the state of the execution and cause Heisenbugs.

(2) Not all illegal memory accesses are caught by the runtime.  This
means that the system may not always throw an error when illegal
memory access occurs.

(3) Although the runtime throws an error for illegal memory access, it
is difficult to pinpoint the specific kernel and memory buffer/address
that is causing the problem.

For all the aforementioned reasons, we attempt to catch and throw an
error as soon as possible in the kernel when illegal memory accesses
occur in FBGEMM GPU.  We introduce the `FBGEMM_GPU_MEMCHECK` flag
to enable memory checking during compile time.  We copy PyTorch's
`TensorAccessor.h` into the FBGEMM GPU and extend it to check every
memory access through the `PackedTensorAccessor`.  If an invalid memory
access occurs, we throw an error using `CUDA_KERNEL_ASSERT`.  The error
message includes the name of the tensor and the kernel that caused the
problem.

If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use
`fbgemm::PackedTensorAccessor`.  Otherwise, they will use
`at::PackedTensorAccessor`

`FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in
subsequent diffs

Reviewed By: r-barnes

Differential Revision: D43421838

fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167

* Fix compiling with Xcode 14.3 (pytorch#1648)

Summary:
Pull Request resolved: pytorch#1648

This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3. So change the workaround to only include up to 14.0.2.

Reviewed By: MatzeB

Differential Revision: D44130421

fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd

* Add support for building FBGEMM_GPU against Python 3.11 in OSS (pytorch#1646)

Summary:
- Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial
- Move the FBGEMM CI builds to run inside Docker containers
- Add support for building FBGEMM_GPU against Python 3.11 in OSS
- Move all FBGEMM_GPU nightly and release build jobs to run inside `amazonlinux:2023` Docker container
- Assuming no build errors or resource starvation, the full OSS build process now runs under 30 minutes.

Pull Request resolved: pytorch#1646

Reviewed By: shintaro-iwasaki

Differential Revision: D44157228

Pulled By: q10

fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53

* Remove magic numbers from fbgemm/Types.h (pytorch#1629)

Summary:
Pull Request resolved: pytorch#1629

Replaces magic numbers with constexpr variables

Reviewed By: sryap

Differential Revision: D43776442

fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa

* added check to avoid div 0 errors in cache report (pytorch#1645)

Summary:
Pull Request resolved: pytorch#1645

as in title

Reviewed By: jianyuh

Differential Revision: D44096435

fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672

* jagged_dense_bmm operator optimization (pytorch#1643)

Summary:
Pull Request resolved: pytorch#1643

This diff optimizes the jagged_dense_bmm operator with the following optimizations:
* tiling across thread blocks, and use GPU shared memory for thread block
* tiling across threads within a thread block, and use registers for each thread

Reviewed By: brad-mengchi

Differential Revision: D43674845

fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2

* jagged_dense_bmm: fix ROCm test failures (pytorch#1655)

Summary:
This patch fixes test failures on AMD GPUs.

1. Remove `__restrict__ `. I don't think it is needed even for CUDA, but it confuses HIPCC.
2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have `+=` operator for the type of `blockIdx.z`, causing a compilation error. We observed that this issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now. We should revisit and use `auto` later. See this for details: ROCm/hipamd@86a1634

Pull Request resolved: pytorch#1655

Test Plan: GitHub Actions' AMD CI

Reviewed By: q10, brad-mengchi

Differential Revision: D44242622

Pulled By: shintaro-iwasaki

fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174

* Support embedding dim 1024 ~ 2048 (pytorch#1656)

Summary:
Pull Request resolved: pytorch#1656

wushirong reported the failure on https://fburl.com/code/hae91ra7 .

- The embedding config is from  f418615450 .
- `max_int8_128b_rows` is 10 --> D = 1280

Our embedding dim has grown to 1024 + ?

Note that the static shared memory can only go up to 48 KB:

> Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)

in https://docs.nvidia.com/cuda/cuda-c-programming-guide/

for ptx shared mem error:
```
[2023-03-21T22:04:33.899-07:00] ptxas error   : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4E
Lm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDiv
isorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
```

Currently we reduce `InputRowsInFlight` to bypass the issue (the static shared memory used in the kernel is
```
  typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
  __shared__ AllBuffers buffers;
```

Long term, we can change the static shared memory to dynamic shared memory, and increase the shared memory size to be 64 KB +.

Reviewed By: wushirong

Differential Revision: D44270081

fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a

* Containerize the remaining FBGEMM_GPU CI jobs (pytorch#1658)

Summary:
- Containerize the remaining FBGEMM_GPU CI jobs
- Add Conda cleanups to make PyTorch and CUDA installs more reliable
- Update post-install checks for PyTorch to work with ROCm
- Update the CI to continue running on jobs that fail on just a few variants
- Use PIP to install PyTorch GPU nightly as the nightly packages show up in PIP more reliably than in Conda

Pull Request resolved: pytorch#1658

Reviewed By: shintaro-iwasaki

Differential Revision: D44306708

Pulled By: q10

fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d

* Add tbe_input_combine_with_length for GPU (pytorch#1647)

Summary:
Pull Request resolved: pytorch#1647

Implement `tbe_input_combine_with_length` for GPU.  The operator takes
3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`)
and concatenates each one into a single tensor.  Implicit type casting
is also performed if the input types are different from the output
types.  `indices` and `lengths` tensors can be of type `int32_t` or
`int64_t`.  The outputs for `indices` concatenation and `lengths`
concatenation are fixed to `int32_t`.  `per_sample_weights` must be
`float`.

Reviewed By: bangshengtang

Differential Revision: D44076452

fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f

* jagged_jagged_bmm operator optimization (pytorch#1644)

Summary:
Pull Request resolved: pytorch#1644

This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory.

Reviewed By: brad-mengchi

Differential Revision: D44029528

fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d

* Specify device to emulate_cache_miss kernel (pytorch#1660)

Summary:
Pull Request resolved: pytorch#1660

When enabled emulate cache miss, it caused illegal memory access, if we're using more than one GPU. It turns out that previous diff didn't specify device within emulate_cache_miss kernel.

This diff fixes it. In addition, cleaned up a bit (e.g., no need to used index_t based kernel launch for emulate_cache_miss kernel, as lxu_cache_locations is always with int32_t.

Reviewed By: sryap, YuzeDaiMeta

Differential Revision: D44340131

fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1

* Add C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (pytorch#1652)

Summary:
- Add C++17 support for the entire FBGEMM_GPU build
- Add C++17 support for the entire FBGEMM build
- Update FBGEMM tests and benchmarks to be C++17-compatible
- Make FBGEMM builds output more logging
- Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported

Pull Request resolved: pytorch#1652

Reviewed By: shintaro-iwasaki

Differential Revision: D44287321

Pulled By: q10

fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17

* Prune CPU/GPU TBE optimizer codegen (pytorch#1659)

Summary:
Pull Request resolved: pytorch#1659

This diff aims to reduce the build time and libary size of
`//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`.

The diff modifies the build target to generate and compile only the
necessary files. This is based on the fact that CPU and GPU do not
support all optimizers in `SplitTBE`.  (Before this diff, all optimizers
were generated and compiled for both CPU and GPU.)

The following is the list of supported optimizers

|OptimType|Generated optimizer|Supported on CPU|Supported on GPU|
|EXACT_ADAGRAD|adagrad|x|x|
|EXACT_ROWWISE_ADAGRAD|rowwise_adagrad_with_counter|x|x|
||rowwise_adagrad|x|x|
|EXACT_ROWWISE_WEIGHTED_ADAGRAD|rowwise_weighted_adagrad|x|x|
|EXACT_SGD|sgd|x|x|
|SGD|approx_sgd|x|x|
|ROWWISE_ADAGRAD|approx_rowwise_adagrad_with_counter|x||
||approx_rowwise_adagrad|x||
|ADAM|adam||x|
|LAMB|lamb||x|
|LARS_SGD|lars_sgd||x|
|PARTIAL_ROWWISE_ADAM|partial_rowwise_adam||x|
|PARTIAL_ROWWISE_LAMB|partial_rowwise_lamb||x|
|-|rowwise_adagrad_with_weight_decay|||
|-|approx_rowwise_adagrad_with_weight_decay|||
Note: x = supported

Reviewed By: jianyuh

Differential Revision: D44326540

fbshipit-source-id: 02413256b4a675f13ada8e8820820cb5112cb405

* Fix the Documentation Build Job (pytorch#1673)

Summary:
- Rewrite the documentation builds job to use the build infrastructure tooling
- Rename workflow files for consistency

Pull Request resolved: pytorch#1673

Reviewed By: shintaro-iwasaki

Differential Revision: D44472660

Pulled By: q10

fbshipit-source-id: 60434c1f7098b7efa8c750133bb22f14fc98d5dc

* Back out "Prune CPU/GPU TBE optimizer codegen" (pytorch#1675)

Summary:
Pull Request resolved: pytorch#1675

Original commit changeset: 02413256b4a6

Original Phabricator Diff: D44326540

Reviewed By: q10, jianyuh

Differential Revision: D44475251

fbshipit-source-id: 5be66944a833e03a2737fc6d1baaa5c351455b2c

* Prepare bounds_check_indices for VBE (pytorch#1633)

Summary:
Pull Request resolved: pytorch#1633

Prepare `bounds_check_indices` for variable batch size TBE (VBE).

- Update the frontend API to accept VBE args
- Update the backend logic to process VBE data

Reviewed By: jianyuh

Differential Revision: D43253703

fbshipit-source-id: 2870f0c41a96265650281a9b6362d4e6dc48009b

* Move pruning/index_remapping support to embedding inplace update files (pytorch#1667)

Summary:
Pull Request resolved: pytorch#1667

As title. This diff moves pruning/index_remapping support to embedding inplace update files.

Reviewed By: jianyuh

Differential Revision: D44409419

fbshipit-source-id: 93fc91d83502eb95cb0feca2a8a03b003c336078

* jagged_softmax forward optimization (pytorch#1661)

Summary:
Pull Request resolved: pytorch#1661

This diff optimizes jagged_softmax forward with more efficient reduction from cub library.

Reviewed By: brad-mengchi

Differential Revision: D44161021

fbshipit-source-id: bf2e059d14ef4d7ad311edac65155a463ba653ff

* jagged_softmax backward optimization (pytorch#1662)

Summary:
Pull Request resolved: pytorch#1662

This diff optimizes jagged_softmax backward with more efficient reduction from cub library

Reviewed By: brad-mengchi

Differential Revision: D44205819

fbshipit-source-id: cd1d7a886d6ba68201dc1ad782c2e8cde7ff706b

* multi-gpu all_to_one improvements (pytorch#1674)

Summary:
Pull Request resolved: pytorch#1674

improved multi-gpu all_to_one with:
	1. new intermediate hop selection taking advantage of distinct NVLinks
	2. overlapping of intermediate hop transfers with each-other and with direct-peer transfers

Reviewed By: doehyun

Differential Revision: D44285941

fbshipit-source-id: 0202083f04388b5ba60b8155809433f334993ef4

* Extract and export weights offsets/placements initialization functions (pytorch#1669)

Summary:
Pull Request resolved: pytorch#1669

Extract portions initializing the weights_placements/offsets tensors into separate functions and jit.export them.
SplitState is converted to a NamedTuple since we can't jit.script a dataclass that also holds an enum.

Reviewed By: houseroad

Differential Revision: D44338256

fbshipit-source-id: e1c12e5956f7217d51cd190958c3764d220e521d

* Fix the ROCm Test Job (pytorch#1668)

Summary:
- Clean up the ROCm test job and re-enable ROCm testing on the rocm instances.
- Update the build scripts framework to build FBGEMM_GPU against the correct hardware target that it is intended to be tested on.  One thing that was discovered was that if FBGEMM_GPU is built with `PYTORCH_ROCM_ARCH=gfx90a` but run on a `gfx908` target, the tests fail with a segfault.  While the failure is expected, the segfault can be unfriendly and confusing for users.
- Enable correct compilation of `merge_pooled_embeddings` operator under ROCm
- Fix existing code in `jagged_tensor_ops` from PR pytorch#1661 and pytorch#1662 that break its compilation under ROCm 5.3

Pull Request resolved: pytorch#1668

Reviewed By: shintaro-iwasaki

Differential Revision: D44453594

Pulled By: q10

fbshipit-source-id: 2030cd0e00c6ff9694c2783dfd62c31cf5543da2

* Use exported functions instead of calling initialize_weights in weights loading (pytorch#1676)

Summary:
Pull Request resolved: pytorch#1676

Export a function to reset the embedding specs by target location

Reviewed By: RoshanPAN, houseroad

Differential Revision: D44338258

fbshipit-source-id: 502733e9f3a164450a02656d2822492fbf69f994

* Extract index remappings array initialization and jit.export it (pytorch#1670)

Summary:
Pull Request resolved: pytorch#1670

ATT

Reviewed By: RoshanPAN, houseroad

Differential Revision: D44338257

fbshipit-source-id: c091666c7a4d294c283f5e3774d0494089fc3478

* Disable COUNTER in FBGEMM test (pytorch#1683)

Summary:
Pull Request resolved: pytorch#1683

Disable FBGEMM test on COUNTER mode temporarily.

Reviewed By: sryap

Differential Revision: D44589052

fbshipit-source-id: f2af6f9e3cce75d4c599c4708055e5f52ac705e2

* update hipify_torch and remove manual mapping of C10 macros (pytorch#1682)

Summary: Pull Request resolved: pytorch#1682

Reviewed By: shintaro-iwasaki

Differential Revision: D44599348

Pulled By: q10

fbshipit-source-id: 8f968a7c21b09358eac070a35ee15d5b767ea94c

* Install NVIDIA Drivers on Instances Missing the Drivers (pytorch#1684)

Summary:
- Use the pytorch/test-infra action to install NVIDIA drivers properly if the instance is missing the drivers

Pull Request resolved: pytorch#1684

Reviewed By: shintaro-iwasaki

Differential Revision: D44603925

Pulled By: q10

fbshipit-source-id: 712bdf5c2af67c5a6f540567abcc47ed892912c1

* Clean up the linting job (pytorch#1686)

Summary:

- Clean up the linting job to use the build scripts infrastructure
- Delete the Conda prefix directory before creating a new environment, if it exists

Pull Request resolved: pytorch#1686

Reviewed By: shintaro-iwasaki

Differential Revision: D44646234

Pulled By: q10

fbshipit-source-id: d754efeadffb265c9e55bc302606fc1e60ef8b51

* reduce_to_one (pytorch#1571)

Summary:
Pull Request resolved: pytorch#1571

reduce_to_one for row-wise sharding in inference.
The approach is similar to all_to_one, but the source does not wait for the target to be ready (guarding against potential WAR and WAW dependency violations), because this reduce_to_one implementation creates a new destination tensor.

Reviewed By: xing-liu, jianyuh

Differential Revision: D34263436

fbshipit-source-id: 7b1630b395311cfd6fef124113436f87f51a6fba

* Reorganize the build scripts (pytorch#1685)

Summary: Pull Request resolved: pytorch#1685

Reviewed By: r-barnes, shintaro-iwasaki

Differential Revision: D44654808

Pulled By: q10

fbshipit-source-id: a58987b4a3970139bba72db8cecc89c0256fba76

* Prune CPU/GPU TBE optimizer codegen (pytorch#1678)

Summary:
Pull Request resolved: pytorch#1678

This diff aims to reduce the build time and library size of
`//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`.

[1/2] Update `lookup_invoker` to enable the function invoker based on
`has_cpu_support` and `has_gpu_support`
[2/2] Update the code generation part

The diff modifies the build target to generate and compile only the
necessary files. This is based on the fact that CPU and GPU do not
support all optimizers in `SplitTBE`.  (Before this diff, all optimizers
were generated and compiled for both CPU and GPU.)

The following is the list of supported optimizers

|OptimType|Generated optimizer|Supported on CPU|Supported on GPU|
|---|---|---|---|
|EXACT_ADAGRAD|adagrad|x|x|
|EXACT_ROWWISE_ADAGRAD|rowwise_adagrad_with_counter|x|x|
||rowwise_adagrad|x|x|
|EXACT_ROWWISE_WEIGHTED_ADAGRAD|rowwise_weighted_adagrad|x|x|
|EXACT_SGD|sgd|x|x|
|SGD|approx_sgd|x|x|
|ROWWISE_ADAGRAD|approx_rowwise_adagrad_with_counter|x||
||approx_rowwise_adagrad|x||
|ADAM|adam||x|
|LAMB|lamb||x|
|LARS_SGD|lars_sgd||x|
|PARTIAL_ROWWISE_ADAM|partial_rowwise_adam||x|
|PARTIAL_ROWWISE_LAMB|partial_rowwise_lamb||x|
|-|rowwise_adagrad_with_weight_decay|||
|-|approx_rowwise_adagrad_with_weight_decay|||

Reviewed By: q10

Differential Revision: D44484764

fbshipit-source-id: f04710e66498bdcbdad619d48411c2403316901c
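
A sketch of the pruning idea: drive codegen from a per-optimizer support table and emit only the backend files each optimizer needs. The table contents below are abbreviated from the matrix above, and the output filenames are hypothetical:

```
# Abbreviated from the support matrix above: name -> (cpu, gpu)
OPTIMIZERS = {
    "adagrad": (True, True),
    "rowwise_adagrad": (True, True),
    "approx_rowwise_adagrad": (True, False),
    "adam": (False, True),
    "lamb": (False, True),
}

def files_to_generate(optimizers):
    """Emit only the codegen outputs each backend needs (hypothetical names)."""
    files = []
    for name, (has_cpu_support, has_gpu_support) in optimizers.items():
        if has_cpu_support:
            files.append(f"gen_embedding_backward_{name}_cpu.cpp")
        if has_gpu_support:
            files.append(f"gen_embedding_backward_{name}_cuda.cu")
    return files

for f in files_to_generate(OPTIMIZERS):
    print(f)  # adam/lamb get no *_cpu.cpp; approx_rowwise_adagrad no *_cuda.cu
```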

* thread tiling for jagged_jagged_bmm (pytorch#1691)

Summary:
Pull Request resolved: pytorch#1691

This diff adds thread tiling optimization in jagged_jagged_bmm operator, where each thread will process a tile of elements instead of one. The implementation is similar to the one applied to jagged_dense_bmm: D43674845.

Reviewed By: brad-mengchi

Differential Revision: D44764339

fbshipit-source-id: ca4cf257bac755ab97754fdc6605072cfbfb1c4d

* tune the tile sizes for jagged_dense_bmm (pytorch#1692)

Summary:
Pull Request resolved: pytorch#1692

Tune the tile sizes based on the input tensor size. If M > N, then use larger tile size in M dimension, otherwise use larger tile size in N dimension.

Reviewed By: brad-mengchi

Differential Revision: D44791699

fbshipit-source-id: 348a66089d781e9fef141b63d7a56e6dfa5da905
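
A sketch of the M-vs-N heuristic described above; the concrete tile values are illustrative assumptions, not FBGEMM's actual constants:

```
def pick_tile_sizes(M, N, large=32, small=8):
    """Return (tile_m, tile_n): taller tiles when M > N, wider otherwise.

    The values 32/8 are illustrative, not FBGEMM's actual constants."""
    return (large, small) if M > N else (small, large)

print(pick_tile_sizes(1024, 64))  # (32, 8)
print(pick_tile_sizes(64, 1024))  # (8, 32)
```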

* Populate supported optims to match OSS Pytorch state dict (pytorch#1632)

Summary:
Pull Request resolved: pytorch#1632

ATT.

Reviewed By: jianyuh

Differential Revision: D43887969

fbshipit-source-id: 048ff61a925113b29c547abf20d7acdc4a50b8d7

* Build Scripts and README Improvements (pytorch#1695)

Summary:
- Update build scripts to print out cc, c++, and nvcc preprocessor defines
- Print out all undefined symbols in the output library after build to inspect whether or not templates have been un-instantiated
- Handle the case where `TORCH_CUDA_ARCH_LIST` is pre-defined in the environment
- Clean up the FBGEMM_GPU READMEs to consolidate all FBGEMM_GPU build instructions into `docs/BuildInstructions.md`
- Fix the build badges for FBGEMM and FBGEMM_GPU
- Add Slack contact information to the READMEs
- Remove deprecated GitHub workflows and build scripts in favor of the new scripts, which cover all the functionality of the old scripts

Pull Request resolved: pytorch#1695

Reviewed By: shintaro-iwasaki

Differential Revision: D44901368

Pulled By: q10

fbshipit-source-id: bef6045347c905a051970e4e5f8630175e0f5ef6

* Add Documentation to Work Around GCC 12 Regressions (pytorch#1697)

Summary: Pull Request resolved: pytorch#1697

Reviewed By: shintaro-iwasaki

Differential Revision: D44935915

Pulled By: q10

fbshipit-source-id: e1bdd4ebff18bd9708208a5b659ef9a93ebc866a

* Fix build instructions (pytorch#1701)

Summary:
This change fixes a missing step (cd) in the build instructions.

Pull Request resolved: pytorch#1701

Reviewed By: sryap

Differential Revision: D45011147

Pulled By: q10

fbshipit-source-id: 704ce5bd3cfbd62c31f434c830a7300e5d645024

* Fix a build error from -Wno-unused-but-set-variable (pytorch#1702)

Summary:
This project is compiled with -Wall and -Werror (see pytorch#868) and is throwing an error for the unused variable here. This code appears to be debugging code that was used to verify that the function containing it was originally implemented properly, so the most straightforward solution is simply to remove it.

Pull Request resolved: pytorch#1702

Reviewed By: sryap

Differential Revision: D45011174

Pulled By: q10

fbshipit-source-id: 2c252cfa6063789371f5fba5f642c2f4fb72455f

* Fix exception in QuantUtilsTest (pytorch#1703)

Summary:
This test mistakenly calls reserve() to set a vector's length instead of resize(). reserve() allocates memory for the specified number of elements, but does not actually increase the number of elements that can legally be stored in the vector. This test runs with ASAN enabled which is catching this illegal access and causing the test to fail.

This change fixes the code to instead call resize(); the test now passes.

Pull Request resolved: pytorch#1703

Reviewed By: sryap

Differential Revision: D45011317

Pulled By: q10

fbshipit-source-id: 2840d7bfcfb46ca1523f55e77a3834a1d561c045

* Support EXACT_ADAGRAD in `get_optimizer_state` (pytorch#1700)

Summary:
Pull Request resolved: pytorch#1700

This diff supports `get_optimizer_state` for exact_adagrad.
Previously, exact_adagrad was not supported in `get_optimizer_state`; however, this is needed for creating a fused optimizer in torchrec.

Reviewed By: r-barnes

Differential Revision: D44963975

fbshipit-source-id: e2f523dfc1e1d17a4925e7ce4a9e65829f1cf1b0

* Split the Rendering of `embedding_forward_quantized_split_template.cu` into Smaller Files (pytorch#1694)

Summary:
`embedding_forward_quantized_split_template.cu` is a very large jinja template that renders 30+ C++ templates, which are then instantiated into 600+ kernel functions.  There are three sets of jinja templates in `embedding_forward_quantized_split_template.cu`: those related to `int_nbit_split_embedding_*`, `pruned_hashmap_lookup_*`, and `pruned_array_lookup_*`.

Currently, the rendering produces a single file, which takes a large amount of time to compile.   This PR does two things at a high level.  First, it breaks up the jinja template into multiple jinja templates.  Then, it forces each of these smaller jinja templates to render multiple source files instead of a single source file.  This change will enable build parallelization and overall build time savings.

Details:

- Port improvements to `embedding_forward_quantized_split_template.cu` from D44707812
- Move the non-jinja-template code inside `embedding_forward_quantized_split_template.cu` over to `embedding_forward_template_helpers.cuh`
- Move the `pruned_hashmap_lookup_*` and `pruned_array_lookup_*` sets of jinja templates out to the non-jinja-template `embedding_forward_quantized_split_lookup.cu`, since the template-generated functions are redundant.
- Break the `int_nbit_split_embedding_*` set of jinja templates into two files, one for rendering kernel-side code (`embedding_forward_quantized_split_nbit_kernel_template.cu`) and the other for rendering host-side code (`embedding_forward_quantized_split_nbit_host_template.cu`)
- For the `int_nbit_split_embedding_*` host-side jinja template, make it render `weighted`, `unweighted`, and `unweighted nobag` variants into separate source files
- For the `int_nbit_split_embedding_*` kernel-side jinja template, make it render into N = [`weighted`, `unweighted`, and `unweighted nobag` variants ] x [ 6 embedding types ] separate source files, each containing a single C++ template kernel function.  Also generate the code to explicitly instantiate the kernel templates.  For each of the C++ templates being generated, there will be 2 {device-only bool} x [3-4] (output types) x [3-5] (cases) = 18-40 actual template instantiations
- To help with debugging missing template instantiations, print out all undefined symbols in the output library after build to inspect whether or not templates have been un-instantiated
- Update build scripts to print out `cc`, `c++`, and `nvcc` preprocessor defines
- Handle the case where `TORCH_CUDA_ARCH_LIST` is pre-defined in the environment

Pull Request resolved: pytorch#1694

Reviewed By: sryap, r-barnes

Differential Revision: D44842524

Pulled By: q10

fbshipit-source-id: 96f92e40ab2fec598aeb8c483e94997ac050aae7

* Back out "Prune CPU/GPU TBE optimizer codegen" (pytorch#1706)

Summary:
Pull Request resolved: pytorch#1706

Original commit changeset: f04710e66498

Original Phabricator Diff: D44484764

Reviewed By: q10, brad-mengchi, jianyuh, shintaro-iwasaki

Differential Revision: D45054051

fbshipit-source-id: 9d14504c76eb93b2f1b14f4c2ec4c5b807c7fc4a

* Use CUB kernel for 2D asynchronous_complete_cumsum (pytorch#1707)

Summary:
Pull Request resolved: pytorch#1707

Temporarily use the CUB kernel instead of the custom kernel for 2D
`asynchronous_complete_cumsum`

Reviewed By: q10, brad-mengchi, jianyuh

Differential Revision: D45062784

fbshipit-source-id: cebe3992ff8ebec9c0f554e729b8d79a1eced1de
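
For context, a "complete" cumsum returns one element more than its input: a leading zero followed by the running totals, applied per row in the 2D case. A plain-Python sketch of the assumed semantics:

```
def complete_cumsum_2d(rows):
    """Per-row 'complete' cumsum: a leading 0 followed by running totals,
    so each output row is one element longer than its input (sketch)."""
    out = []
    for row in rows:
        acc = [0]
        for v in row:
            acc.append(acc[-1] + v)
        out.append(acc)
    return out

print(complete_cumsum_2d([[1, 2, 3], [4, 0, 1]]))
# [[0, 1, 3, 6], [0, 4, 4, 5]]
```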

* Split the Code Generation for `embedding_backward_split_template.cu` into Smaller Files (pytorch#1705)

Summary:
`embedding_backward_split_template.cu` contains both jinja-template and non-jinja-template code, and some of the templating is unnecessary.  Furthermore, the template generates both the vanilla and `nobag` variants of unweighted into the same source file.  This PR moves the non-jinja-template code out of the template, de-duplicates code that is unnecessarily templated, and splits the generation of the code into three files per optimizer, one each for `weighted`, `unweighted nobag`, and `unweighted`.

Details:

- Migrate non-jinja-templated code out of `embedding_backward_split_template.cu` and into `embedding_backward_template_helpers.cuh`
- De-templatize `split_embedding_backward_codegen_{{ optimizer }}_{{ wdesc }}_find_long_segments` into `split_embedding_backward_codegen_find_long_segments` since there is no implementation difference between the optimizers and weighted vs unweighted
- Migrate `grad_mean_kernel` and `split_embedding_backward_codegen_find_long_segments` into a separate non-template source file to de-duplicate code generation and compilation
- Split the code generation of `embedding_backward_split_template.cu` into 3 files per optimizer, according to weighted, unweighted_nobag, and unweighted

Pull Request resolved: pytorch#1705

Reviewed By: sryap

Differential Revision: D45073273

Pulled By: q10

fbshipit-source-id: e82ea643f8e67ad5aa0b3de03562532c5735453d

* Add jagged slice op for cpu (pytorch#1690)

Summary:
Pull Request resolved: pytorch#1690

The context why this is needed is as follows
1) For really long sparse features we want to split them into multiple chunks that can be fed into the model
2) Slicing requires users to provide a per-row start point and a maximum length L.

Based on these requirements, a custom op mimicking the slice semantics of a normal tensor works best.

An example usage using pseudo code

```
input_jagged_tensor = [[1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4, 5, 6], [1], [1, 2]]
start = [0, 0, 0, 0, 0]
slice_length = 3

>> jagged_slice(input_jagged_tensor, start, slice_length)

output_jagged_tensor = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1], [1, 2]]

```

A corresponding operation for dense tensor would look like
```
dense_tensor = torch.randn((8, 10))
slice_dense_tensor = dense_tensor[:, 1:3]
```

Reviewed By: sryap

Differential Revision: D44299744

fbshipit-source-id: 44996f2f2ec5fc5f31dda4cb3bd8f0241497df66
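
A plain-Python reference of the slice semantics from the example above (lists stand in for jagged tensors):

```
def jagged_slice(jagged, start, slice_length):
    """Per-row slice of a jagged tensor, lists standing in for tensors."""
    return [row[s : s + slice_length] for row, s in zip(jagged, start)]

x = [[1, 2, 3, 4], [1, 2, 3], [1, 2, 3, 4, 5, 6], [1], [1, 2]]
print(jagged_slice(x, [0, 0, 0, 0, 0], 3))
# [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1], [1, 2]] -- short rows stay short
```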

* Move radix sort to common utilities and add the possibility to handle negative integers (pytorch#1672)

Summary:
Move the `radix_sort` implementation to common utilities, so it can be used in PyTorch in case it was not built with FBGEMM GPU.
Add the possibility to handle negative integers, which is crucial for reusing `radix_sort` in PyTorch's `sort` operation.

Details:
This PR addresses two issues:
1.  `radix_sort` is currently used in [scatter_reduce](https://github.com/dszwicht/pytorch/blob/master/aten/src/ATen/native/cpu/ScatterGatherKernel.cpp#L630) (please view this [comment](https://github.com/pytorch/pytorch/pull/82703/files#r1045360609) for more information). Until now, `radix_sort` was under the `fbgemm_gpu` subproject, which means the implementation was not available in PyTorch when it was built for CPU - that's why `radix_sort` was copy-pasted under the aten directory in PyTorch. This PR moves the `radix_sort` implementation to common utilities.
2. In GNN workloads we often sort 1D integer data with non-negative values, for example, when converting CSR to CSC format. Unfortunately, `torch.sort` for 1D data works sequentially. `radix_sort` seems to be a perfect match to accelerate the described case. However, suppose we want to do that on the PyTorch side. In that case, we have to either fall back to a regular path after detecting negative numbers in the tensor, or perform post-processing by swapping the positive and negative blocks of data (data like `[2, -1, -2, 1]` will be in the form `[1, 2, -2, -1]` after sorting, due to how the numbers are stored). Neither of these solutions is elegant. As an alternative, I propose extending the `radix_sort` algorithm by giving it the capability to work with negative numbers. This can be enabled by passing an optional parameter, `maybe_with_neg_vals`. If set to `true`, we will perform all passes (up to the most significant sign bit) and apply a special prefix sum combination in the last pass. An example of how we can reuse fbgemm in PyTorch can be found in my private fork, [here](DamianSzwichtenberg/pytorch#2) (I also provide speedup data).

The above changes have several consequences:
1. `TORCH_CHECK` was replaced with `assert` as fbgemm CPU does not have PyTorch in its dependencies.
2. `__builtin_clz` was replaced with manual implementation as `__builtin_clz` is not portable.

Additional information for reviewers:
I did perform benchmarks of `radix_sort` before and after my code modification. I didn't observe any performance drop.

Pull Request resolved: pytorch#1672

Reviewed By: sryap

Differential Revision: D44616959

Pulled By: q10

fbshipit-source-id: f34594478c94ec6610c05545feb2044b58d79d66
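
A minimal sketch of the negative-number idea in Python: biasing by the sign bit makes two's-complement order match unsigned order, so an LSD radix sort over the transformed keys sorts signed values correctly. This illustrates the spirit of the extra most-significant pass, not the exact `maybe_with_neg_vals` implementation:

```
def radix_sort_signed(xs, bits=64, radix_bits=8):
    """LSD radix sort for signed ints: bias by the sign bit so unsigned
    digit order matches signed order, sort, then undo the bias."""
    bias = 1 << (bits - 1)
    keys = [(x + bias) & ((1 << bits) - 1) for x in xs]
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:  # stable bucketing per digit
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return [k - bias for k in keys]

print(radix_sort_signed([2, -1, -2, 1]))  # [-2, -1, 1, 2]
```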

* Daily `arc lint --take CLANGFORMAT`

Reviewed By: bigfootjon

Differential Revision: D45141964

fbshipit-source-id: 58308a31522a3b1446835e358a93483b611c4b15

---------

Co-authored-by: Banit Agrawal <bagrawal@meta.com>
Co-authored-by: Sabin Devkota <devkotasabin@meta.com>
Co-authored-by: Junjie Yang <junjieyang@meta.com>
Co-authored-by: Benson Ma <bensonma415@meta.com>
Co-authored-by: Alfredo Tupone <tupone@gentoo.org>
Co-authored-by: Sarunya Pumma <sarunya@meta.com>
Co-authored-by: Doe Hyun Yoon <dhyoon@meta.com>
Co-authored-by: Matt Galloway <mattjgalloway@meta.com>
Co-authored-by: Richard Barnes <rbarnes@meta.com>
Co-authored-by: Xiao Sun <sunx@meta.com>
Co-authored-by: Rengan Xu <renganxu@meta.com>
Co-authored-by: siwasaki <siwasaki@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@meta.com>
Co-authored-by: Yue Dong <yoyoyod@meta.com>
Co-authored-by: Geet Sethi <gsethi@meta.com>
Co-authored-by: Janet Yang <qxy11@meta.com>
Co-authored-by: Wang Zhou <wangzhou@meta.com>
Co-authored-by: Jongsoo Park <jongsoo@meta.com>
Co-authored-by: Tran Le <quytranle@meta.com>
Co-authored-by: Ryan Landay <rlanday@gmail.com>
Co-authored-by: Devashish Tyagi <devashisht@meta.com>
Co-authored-by: Szwichtenberg, Damian <damian.szwichtenberg@intel.com>
Co-authored-by: generatedunixname89002005325676 <generatedunixname89002005325676@fb.com>
* using different mechanism for host mapped pinned memory (#1638)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1638

This diff adds another mechanism for allocating the host-mapped pinned memory to reduce the adverse effect on other processes running on the same host when one process is doing some large allocations.

Reviewed By: zyan0, jianyuh

Differential Revision: D43950253

fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686

* disable use_cpu test (#1635)

Summary:
This PR addresses the issue https://github.com/pytorch/FBGEMM/issues/1636

akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1635

Reviewed By: shintaro-iwasaki

Differential Revision: D44033725

Pulled By: q10

fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d

* Update API interface and reroute backend for exact_rowwise_adagrad FE when using freq based methods (#1352)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1352

1. Update interface to accommodate rowwise_adagrad_with_counter.
2. Route backend for rowwise_adagrad to the new rowwise_adagrad_with_counter when freq based methods (e.g. freq sgd, counter adjusted regularization) are used.

Reviewed By: csmiler

Differential Revision: D36788395

fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443
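
A sketch of the rerouting decision; the flag and op names below are illustrative stand-ins for the real frontend arguments:

```
def pick_rowwise_adagrad_backend(uses_freq_based_methods: bool) -> str:
    """Choose the backend op; the flag name here is an assumption."""
    if uses_freq_based_methods:  # e.g. freq-SGD, counter-adjusted regularization
        return "rowwise_adagrad_with_counter"
    return "rowwise_adagrad"

print(pick_rowwise_adagrad_backend(True))   # rowwise_adagrad_with_counter
print(pick_rowwise_adagrad_backend(False))  # rowwise_adagrad
```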

* Remove sync point in jagged_dense_elementwise_add_jagged_output backward (#1642)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1642

Remove sync point in jagged_dense_elementwise_add_jagged_output backward

Reviewed By: brad-mengchi

Differential Revision: D44039901

fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d

* Add Comprehensive Build Instructions and Isolate CPU and ROCm Builds (#1639)

Summary:
- Remove `.post0` suffix from the autogenerated package version
- Document the full FBGEMM_GPU OSS build process in a separate Markdown file
- Remove installation of packages not needed for ROCm builds
- Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare metal instances
- Update GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1639

Reviewed By: shintaro-iwasaki

Differential Revision: D44076312

Pulled By: q10

fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283

* include cstdint (#1640)

Summary:
fix build with gcc-13

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1640

Reviewed By: shintaro-iwasaki

Differential Revision: D44044422

Pulled By: q10

fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033

* Add support for group size > 54 in group_index_select (#1611)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1611

If the group size is larger than 54, the operator internally breaks the group down into
smaller groups (each subgroup size is less than or equal to 54).

Reviewed By: jianyuh

Differential Revision: D43585937

fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326
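
A sketch of the subgroup decomposition; only the 54-element limit comes from the commit, the helper itself is hypothetical:

```
MAX_GROUP_SIZE = 54  # the limit described above

def split_into_subgroups(group, max_size=MAX_GROUP_SIZE):
    """Break a large group into chunks of at most max_size elements."""
    return [group[i : i + max_size] for i in range(0, len(group), max_size)]

subgroups = split_into_subgroups(list(range(120)))
print([len(g) for g in subgroups])  # [54, 54, 12]
```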

* Implement cache miss emulation in UVM_CACHING (#1637)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1637

Enforce cache misses (even if trace-driven testing doesn't experience cache misses due to limited trace size) so that we can evaluate performance under cache misses.

Note that these are not exactly cache misses; we enforce access to UVM by overriding lxu_cache_locations for N / 256 of the requests.

Reviewed By: YuzeDaiMeta

Differential Revision: D42194019

fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725
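
A sketch of the override described above: force roughly N / 256 lookups onto the UVM path by rewriting their `lxu_cache_locations` entries. The sentinel value and stride are assumptions:

```
def emulate_cache_misses(lxu_cache_locations, stride=256, miss_sentinel=-1):
    """Force every stride-th lookup to take the UVM path (sketch).

    Rewriting a location to the sentinel makes the kernel treat that
    request as a miss, even when the trace would have hit in the cache."""
    for i in range(0, len(lxu_cache_locations), stride):
        lxu_cache_locations[i] = miss_sentinel
    return lxu_cache_locations

locations = list(range(1024))
forced = emulate_cache_misses(locations)
print(sum(loc == -1 for loc in forced))  # 4 == 1024 / 256
```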

* Add TensorAccessor with memcheck (#1602)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1602

Illegal memory access is a common problem during GPU kernel execution.
The FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and
the CUDA runtime to detect such problems and throw an error.  However,
there are a few known issues with this approach.

(1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host.
However, due to the non-blocking, asynchronous nature of GPU kernel
execution, the error is caught on the host at a later point than where
the problematic kernel was launched.  This can cause the stack trace
to be inaccurate and make debugging more difficult.  Although the
issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`,
this can change the state of the execution and cause Heisenbugs.

(2) Not all illegal memory accesses are caught by the runtime.  This
means that the system may not always throw an error when illegal
memory access occurs.

(3) Although the runtime throws an error for illegal memory access, it
is difficult to pinpoint the specific kernel and memory buffer/address
that is causing the problem.

For all the aforementioned reasons, we attempt to catch and throw an
error as soon as possible in the kernel when illegal memory accesses
occur in FBGEMM GPU.  We introduce the `FBGEMM_GPU_MEMCHECK` flag
to enable memory checking during compile time.  We copy PyTorch's
`TensorAccessor.h` into the FBGEMM GPU and extend it to check every
memory access through the `PackedTensorAccessor`.  If an invalid memory
access occurs, we throw an error using `CUDA_KERNEL_ASSERT`.  The error
message includes the name of the tensor and the kernel that caused the
problem.

If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use
`fbgemm::PackedTensorAccessor`.  Otherwise, they will use
`at::PackedTensorAccessor`

`FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in
subsequent diffs

Reviewed By: r-barnes

Differential Revision: D43421838

fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167
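
A Python sketch of the accessor idea: validate every access against the tensor size and report the tensor and kernel names on violation. The real mechanism extends PyTorch's `TensorAccessor.h` and traps via `CUDA_KERNEL_ASSERT`; the class below only mirrors the diagnostics:

```
class CheckedAccessor:
    """Bounds-checked accessor mirroring the FBGEMM_GPU diagnostics (sketch)."""

    def __init__(self, data, tensor_name, kernel_name):
        self.data = data
        self.tensor_name = tensor_name
        self.kernel_name = kernel_name

    def __getitem__(self, idx):
        if not 0 <= idx < len(self.data):
            # The CUDA version traps via CUDA_KERNEL_ASSERT instead.
            raise IndexError(
                f"{self.kernel_name}: illegal access to '{self.tensor_name}' "
                f"at index {idx} (size {len(self.data)})"
            )
        return self.data[idx]

weights = CheckedAccessor([10, 20, 30], "weights", "my_kernel")
print(weights[1])  # 20; weights[3] would raise with tensor + kernel names
```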

* Fix compiling with Xcode 14.3 (#1648)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1648

This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3. So change the workaround to only include up to 14.0.2.

Reviewed By: MatzeB

Differential Revision: D44130421

fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd

* Add support for building FBGEMM_GPU against Python 3.11 in OSS (#1646)

Summary:
- Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial
- Move the FBGEMM CI builds to run inside Docker containers
- Add support for building FBGEMM_GPU against Python 3.11 in OSS
- Move all FBGEMM_GPU nightly and release build jobs to run inside `amazonlinux:2023` Docker container
- Assuming no build errors or resource starvation, the full OSS build process now runs under 30 minutes.

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1646

Reviewed By: shintaro-iwasaki

Differential Revision: D44157228

Pulled By: q10

fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53

* Remove magic numbers from fbgemm/Types.h (#1629)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1629

Replaces magic numbers with constexpr variables

Reviewed By: sryap

Differential Revision: D43776442

fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa

* added check to avoid div 0 errors in cache report (#1645)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1645

as in title

Reviewed By: jianyuh

Differential Revision: D44096435

fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672

* jagged_dense_bmm operator optimization (#1643)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1643

This diff optimizes the jagged_dense_bmm operator with the following optimizations:
* tiling across thread blocks, using GPU shared memory for each thread block
* tiling across threads within a thread block, using registers for each thread

Reviewed By: brad-mengchi

Differential Revision: D43674845

fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2

* jagged_dense_bmm: fix ROCm test failures (#1655)

Summary:
This patch fixes test failures on AMD GPUs.

1. Remove `__restrict__`. I don't think it is needed even for CUDA, but it confuses HIPCC.
2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have `+=` operator for the type of `blockIdx.z`, causing a compilation error. We observed that this issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now. We should revisit and use `auto` later. See this for details: https://github.com/ROCm-Developer-Tools/hipamd/commit/86a1634c642daeda1e984d4124bcc2aeba5c4e19

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1655

Test Plan: GitHub Actions' AMD CI

Reviewed By: q10, brad-mengchi

Differential Revision: D44242622

Pulled By: shintaro-iwasaki

fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174

* Support embedding dim 1024 ~ 2048 (#1656)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1656

wushirong reported the failure on https://fburl.com/code/hae91ra7.

- The embedding config is from f418615450.
- `max_int8_128b_rows` is 10 --> D = 1280

Our embedding dim has grown to 1024 + ?

Note that the static shared memory can only go up to 48 KB:

> Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)

in https://docs.nvidia.com/cuda/cuda-c-programming-guide/

For the ptxas shared memory error:
```
[2023-03-21T22:04:33.899-07:00] ptxas error   : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4E
Lm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDiv
isorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
```

Currently we reduce `InputRowsInFlight` to bypass the issue; the static shared memory used in the kernel is:
```
  typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
  __shared__ AllBuffers buffers;
```

Long term, we can change the static shared memory to dynamic shared memory and increase the shared memory size to 64 KB or more.

Reviewed By: wushirong

Differential Revision: D44270081

fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a
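
The 48 KB budget can be checked with simple arithmetic over the typedef above (`sizeof(uint4)` is 16 bytes). The parameter values below are illustrative, chosen only to show how halving `InputRowsInFlight` brings the footprint back under the cap:

```
def shared_mem_bytes(warps_per_block, output_rows_per_thread,
                     input_rows_in_flight, num_uint4_loads_per_row):
    """Bytes used by the AllBuffers array above; sizeof(uint4) == 16."""
    return (warps_per_block * output_rows_per_thread *
            input_rows_in_flight * num_uint4_loads_per_row * 16)

STATIC_LIMIT = 48 * 1024  # 49152-byte static shared memory cap

print(shared_mem_bytes(4, 4, 32, 8))  # 65536 -> over the cap, ptxas error
print(shared_mem_bytes(4, 4, 16, 8))  # 32768 -> halved InputRowsInFlight fits
```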

* Containerize the remaining FBGEMM_GPU CI jobs (#1658)

Summary:
- Containerize the remaining FBGEMM_GPU CI jobs
- Add Conda cleanups to make PyTorch and CUDA installs more reliable
- Update post-install checks for PyTorch to work with ROCm
- Update the CI to continue running on jobs that fail on just a few variants
- Use PIP to install PyTorch GPU nightly as the nightly packages show up in PIP more reliably than in Conda

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1658

Reviewed By: shintaro-iwasaki

Differential Revision: D44306708

Pulled By: q10

fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d

* Add tbe_input_combine_with_length for GPU (#1647)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1647

Implement `tbe_input_combine_with_length` for GPU.  The operator takes
3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`)
and concatenates each one into a single tensor.  Implicit type casting
is also performed if the input types are different from the output
types.  `indices` and `lengths` tensors can be of type `int32_t` or
`int64_t`.  The outputs for `indices` concatenation and `lengths`
concatenation are fixed to `int32_t`.  `per_sample_weights` must be
`float`.

Reviewed By: bangshengtang

Differential Revision: D44076452

fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f
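
A plain-Python sketch of the combine-and-cast semantics, with lists standing in for tensors (the real op returns int32 tensors for indices/lengths and a float tensor for weights):

```
def tbe_input_combine_with_length(indices_list, lengths_list, weights_list):
    """Concatenate per-table inputs into one flat sequence each (sketch)."""
    indices = [int(i) for run in indices_list for i in run]
    lengths = [int(l) for run in lengths_list for l in run]
    weights = [float(w) for run in weights_list for w in run]
    return indices, lengths, weights

print(tbe_input_combine_with_length(
    [[1, 2], [3]], [[2], [1]], [[1.0, 0.5], [2.0]]))
# ([1, 2, 3], [2, 1], [1.0, 0.5, 2.0])
```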

* jagged_jagged_bmm operator optimization (#1644)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1644

This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory.

Reviewed By: brad-mengchi

Differential Revision: D44029528

fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d

* Specify device to emulate_cache_miss kernel (#1660)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1660

When emulate cache miss was enabled, it caused illegal memory accesses when using more than one GPU. It turns out that the previous diff didn't specify the device within the emulate_cache_miss kernel.

This diff fixes it. In addition, it cleans things up a bit (e.g., there is no need to use an index_t-based kernel launch for the emulate_cache_miss kernel, as lxu_cache_locations is always int32_t).

Reviewed By: sryap, YuzeDaiMeta

Differential Revision: D44340131

fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1

* Add C++17 Support to FBGEMM and FBGEMM_GPU OSS builds (#1652)

Summary:
- Add C++17 support for the entire FBGEMM_GPU build
- Add C++17 support for the entire FBGEMM build
- Update FBGEMM tests and benchmarks to be C++17-compatible
- Make FBGEMM builds output more logging
- Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1652

Reviewed By: shintaro-iwasaki

Differential Revision: D44287321

Pulled By: q10

fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17

* `CMakeLists.txt` Cleanups (#1712)

Summary:
- Re-organize and comment the `CMakeLists.txt` for FBGEMM_GPU for better clarity
- Disable verbose HIPCC warnings that are non-actionable when building the ROCm variant of FBGEMM_GPU

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1712

Reviewed By: shintaro-iwasaki

Differential Revision: D45189904

Pulled By: q10

fbshipit-source-id: 3df6ff3b957886c64bc13fc6bc7a0147b74ee783

* support indices broadcast for reorder_batched_ad_indices (#1711)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1711

This is to support request-only combined input sparse feature broadcast.

When `broadcast_indices` is enabled, the assumptions for the inputs are:
- `cat_ad_offsets` and `cat_ad_indices` only contain the offsets and indices for the combined batches, where each batch contains only one instance (potentially multiple tables)
- `reordered_cat_ad_offsets` needs to be already broadcasted, and contains `num_ads_in_batch * num_tables + 1` elements
- `batch_offsets` is also already broadcasted
- `num_indices_after_broadcast` is required to allocate the output buffer

Added test coverage for the newly added branch.

Reviewed By: r-barnes

Differential Revision: D45155887

fbshipit-source-id: 67f96d60168aa83cf24fef459addee89f06e1c6b

* Add a check that get_filelist python exec process worked (#1715)

Summary:
Add a check that the `get_filelist` Python exec process worked.
With bad parameters (python, args, ...), `get_filelist()` previously continued without noticing, warning, or erroring out, causing cmake to fail later for obscure reasons ("no sources").
This adds a safety check on the `RESULT_VARIABLE` of cmake's `execute_process()`.

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1715

Reviewed By: shintaro-iwasaki

Differential Revision: D45235231

Pulled By: q10

fbshipit-source-id: 049eae1fc5d7f42d73048e81c02c2f282d8859b0

* Fix compilation error under ROCm 5.3 (#1719)

Summary:
- Fix bug introduced by PR 1711 (D45155887), which broke compilation of FBGEMM_GPU under ROCm 5.3

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1719

Reviewed By: sryap

Differential Revision: D45238536

Pulled By: q10

fbshipit-source-id: de9d2aa01ced0a37be1ea7903a361e3a24beed8d

* Backward Split, pt. 2: Migrate `*_warp_per_row` and `*_cta_per_row` kernel templates out of `embedding_backward_split_template.cu` (#1710)

Summary:
- Migrate the definition of `split_embedding_*_backward_codegen_*_*_kernel_warp_per_row_1` from `embedding_backward_split_template.cu` over to `embedding_backward_split_kernel_warp_template.cu` and explicitly instantiate the templates separately
- Migrate the definition of `split_embedding_*_backward_codegen_*_*_kernel_cta_per_row_1` from `embedding_backward_split_template.cu` over to `embedding_backward_split_kernel_cta_template.cu` and explicitly instantiate the templates separately

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1710

Reviewed By: sryap

Differential Revision: D45205217

Pulled By: q10

fbshipit-source-id: 96b34e9389e70b64d8391f2c9d39f4009f3d65ce

* Add CLI support (M,N,K) to GEMMsBenchmark (#1721)

Summary:
Add CLI support (M,N,K) to GEMMsBenchmark

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1721

Reviewed By: sryap

Differential Revision: D45281533

Pulled By: q10

fbshipit-source-id: 0ce5b38f54877acb26421dead1d2dc63cd11a2a1

* Fix data conversion in radix sort that can cause data loss (#1718)

Summary:
Fix data conversion in `radix_sort` that can cause data loss.

Details:
When `elements_count` is passed to the internal kernel implementation, it is implicitly converted from `int64_t` to `int`. This can cause data loss, resulting in a partially sorted array. This PR fixes the issue. As a result of changing the `elements_count` type in the internal functions to `int64_t`, the `histogram` and `histogram_ps` types were also updated (to avoid generating further conversions).
This is a follow-up to https://github.com/pytorch/FBGEMM/issues/1672.
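
A minimal demonstration of the narrowing bug class fixed here (a generic C++ sketch, not FBGEMM code):

```cpp
#include <cstdint>
#include <iostream>

void kernel_entry(int elements_count) {  // old-style signature: int, not int64_t
  std::cout << "kernel sees " << elements_count << " elements\n";
}

int main() {
  const int64_t elements_count = 5'000'000'000;  // > INT32_MAX
  // Implicit int64_t -> int conversion truncates; on typical platforms
  // this prints 705032704, so the tail of the array would never be sorted.
  kernel_entry(elements_count);
  return 0;
}
```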

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1718

Reviewed By: sryap

Differential Revision: D45253811

Pulled By: q10

fbshipit-source-id: a5368a4401f05ebc471cb17107297a48f43a75c0

* support lengths broadcast for reorder_batched_ad_lengths (#1716)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1716

similar to D45155887

When `broadcast_lengths` is enabled, the lengths are copied from the only instance of each batch; this likewise facilitates request-only broadcast.

Reviewed By: r-barnes

Differential Revision: D45208736

fbshipit-source-id: 2c06cd4e9aae0c9c4e0668098de7db6f6da8c06b

* remove optional for two ops (#1722)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1722

Remove the unnecessary `optional` decorators from the two newly added sparse ops.

Reviewed By: r-barnes

Differential Revision: D45286152

fbshipit-source-id: 26109548db1acbc8fdf1a5183977eb8c64b45d41

* Prepare bounds_check_indices for VBE (#1713)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1713

Prepare bounds_check_indices for variable batch size TBE (VBE).

- Update arg names

Reviewed By: jspark1105, r-barnes

Differential Revision: D45203680

fbshipit-source-id: 396c4122058db8dd1fc9eb5f0d620e8179c3e7a9

* Add check on configs and logging (#1728)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1728

Freq-SGD requires both `weight_decay_mode=WeightDecayMode.COUNTER` and `counter_based_regularization` to be set in order to kick in. Previously we only checked the case where `weight_decay_mode` was set but no config was provided. The other missing case is when the config is provided but the user forgets to set `weight_decay_mode`. We add that check in this diff (a sketch follows below).

In addition, logging was added to print whether the counter is used **internally**, to make debugging easier.
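
A hedged sketch of the added validation (the actual code lives in the Python TBE optimizer; the names below are assumptions, transliterated to C++ for illustration):

```cpp
#include <stdexcept>

enum class WeightDecayMode { NONE, L2, DECOUPLE, COUNTER };

// Reject the half-configured state: a counter config without the
// matching weight_decay_mode means Freq-SGD silently never kicks in.
void check_counter_config(WeightDecayMode mode, bool has_counter_config) {
  if (has_counter_config && mode != WeightDecayMode::COUNTER) {
    throw std::invalid_argument(
        "counter_based_regularization is set, but weight_decay_mode is not "
        "WeightDecayMode.COUNTER; Freq-SGD will not kick in");
  }
}
```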

Reviewed By: dvksabin

Differential Revision: D45329516

fbshipit-source-id: 30389671c34a17d4baf48726f28096a670ede0b6

* Prepare transpose_embedding_input for VBE (#1717)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1717

Prepare `transpose_embedding_input` for variable batch size TBE (VBE).

- Update the frontend API with new args

Reviewed By: yuguo68

Differential Revision: D45212897

fbshipit-source-id: 5ad11a737130777fbe119aed6c7086e892752f4a

* Convert GEMMsBench timebreakdown to a runtime cli option (#1725)

Summary:
Convert timebreakdown to a runtime cli option.
Note: there is no code yet to measure packing, compute, or kernel time, so these are currently reported as 0; only total time is measured.
```
     M,      N,      K,             Type,     Packing (us),      Kernel(us),    Postproc (us),       Total (us),  GOPs
    64,    800,    320,  FBGEMM_i8_acc32,                0,                 0,                0,          218.593, 149.9
    64,    800,    320,  FBGEMM_i8_acc16,              0.0,               0.0,              0.0,            187.6, 174.7
```

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1725

Reviewed By: sryap

Differential Revision: D45361847

Pulled By: q10

fbshipit-source-id: 4f2991a6208f0a5ae780729ce19bee611720953b

* Fix error with empty row_counter_dev (#1730)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1730

In some cases, `torch.max(row_counter_dev)` fails because `row_counter_dev` is an empty tensor; see the example flow (f431977946).

Here we guard the op by first checking whether `row_counter_dev` is empty.
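
A minimal sketch of the guard pattern (illustrative ATen C++; the actual fix lives in the TBE optimizer code):

```cpp
#include <ATen/ATen.h>

// max() on an empty tensor throws; return a defined default instead.
at::Tensor max_row_counter(const at::Tensor& row_counter_dev) {
  if (row_counter_dev.numel() == 0) {
    return at::zeros({}, row_counter_dev.options());
  }
  return row_counter_dev.max();
}
```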

Reviewed By: sryap

Differential Revision: D45342010

fbshipit-source-id: 756a481c1098095f71dbb278ea84a01e89783790

* padding for fp8-rowwise quantization for varying length of 1D Tensor (#1729)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1729

As all-gather becomes expensive for tensor/sequence-parallel training, we create padded row-wise quantization/dequantization kernels for flattened tensors to convert between fp8 (stored as uint8 for GPUs <= A100) and fp32 formats.
Since the activations/grads are concatenated into a 1D tensor for all-gather, choosing a scale to fit the fp8 format's range is tricky: small elements would be quantized to zero if the scale were chosen to accommodate the largest element in the model.

Thus, we continue to use the row-wise quantization of the previous all2all kernel. Every block of size `row_dim` is quantized with the scale chosen to accommodate the largest value in the block.

Since the total length of the flattened tensor will not always be divisible by `row_dim`, we pad the 1D tensor to a multiple of `row_dim`. The padding/unpadding is handled by the quantize/dequantize kernels and is invisible to the APIs calling them.
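
A simplified sketch of padded row-wise quantization (assumptions: fp8 is emulated here as a biased uint8 code with one float scale per row; the real kernels use true fp8 bit encodings and store the scales alongside the payload):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedRows {
  std::vector<uint8_t> codes;  // nrows * row_dim, zero-padded at the tail
  std::vector<float> scales;   // one scale per row
};

QuantizedRows quantize_padded_rowwise(const std::vector<float>& flat,
                                      int64_t row_dim) {
  const int64_t n = static_cast<int64_t>(flat.size());
  const int64_t nrows = (n + row_dim - 1) / row_dim;  // round up: pad the tail
  QuantizedRows q;
  q.codes.assign(nrows * row_dim, 0);  // padding quantizes to code 0
  q.scales.reserve(nrows);
  for (int64_t r = 0; r < nrows; ++r) {
    const int64_t begin = r * row_dim;
    const int64_t end = std::min(begin + row_dim, n);
    float amax = 0.0f;
    for (int64_t i = begin; i < end; ++i) {
      amax = std::max(amax, std::fabs(flat[i]));
    }
    // Per-row scale: map the row's largest magnitude near the top of the
    // representable range (127 here stands in for the fp8 max).
    const float scale = amax > 0.0f ? 127.0f / amax : 1.0f;
    q.scales.push_back(scale);
    for (int64_t i = begin; i < end; ++i) {
      q.codes[i] = static_cast<uint8_t>(std::lround(flat[i] * scale) + 127);
    }
  }
  return q;
}
```

Dequantization reverses the bias and scale per row, then drops the padded tail so callers never see it.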

Reviewed By: rohan-varma

Differential Revision: D42721325

Privacy Context Container: L1138451

fbshipit-source-id: 33c712ba2fae709d29babee5ee4a8af6c7637b68

* Improve `TORCH_CHECK` diagnostics in files including deeplearning/fbgemm/fbgemm_gpu/codegen/embedding_forward_split_cpu.cpp (#1732)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1732

`TORCH_CHECK` produces pretty generic error messages. Using, e.g., `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.
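
For example (a generic illustration, not code from this diff; the exact failure-message wording depends on the PyTorch version):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

void validate(const at::Tensor& offsets, int64_t B) {
  // Generic: the failure message only says the condition was false.
  TORCH_CHECK(offsets.size(0) >= B + 1);
  // Named: the failure message includes both expressions and their
  // runtime values, so the mismatch is visible without a debugger.
  TORCH_CHECK_GE(offsets.size(0), B + 1);
}
```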

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(7 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402701

fbshipit-source-id: 42501350543e31455e430b240e53f8e1883eb1ba

* Improve `TORCH_CHECK` diagnostics in files including deeplearning/fbgemm/fbgemm_gpu/codegen/embedding_backward_dense_host.cpp (#1733)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1733

`TORCH_CHECK` produces pretty generic error messages. Using, e.g., `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(7 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402700

fbshipit-source-id: 275bf837341a00d1cd4642b31bf9168455fa6c77

* Build cleanups (#1731)

Summary:
- Further break up `setup_env.bash` into separate domain scripts for easier maintenance
- Update FBGEMM `CMakeLists.txt` to remove warning (https://github.com/pytorch/FBGEMM/issues/1714)

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1731

Reviewed By: sryap

Differential Revision: D45406676

Pulled By: q10

fbshipit-source-id: 3ff5a7e2486b6898cb450d268a092371da5c2717

* Improve `TORCH_CHECK` diagnostics in files including deeplearning/fbgemm/fbgemm_gpu/fb/src/split_embeddings_utils.cu (#1735)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1735

`TORCH_CHECK` produces pretty generic error messages. Using, e.g., `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(7 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402704

fbshipit-source-id: 9e9b1c1f526a398bbe50c99055187195ab751fa2

* Improve `TORCH_CHECK` diagnostics in files including deeplearning/fbgemm/fbgemm_gpu/src/split_embeddings_utils.cu (#1737)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1737

`TORCH_CHECK` produces pretty generic error messages. Using, e.g., `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(3 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402697

fbshipit-source-id: c490d39bc826eab44ec16cbcc86273f8d7258fd9

* Use volatile pointer in inclusive_sum_scan_kernel (#1739)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1739

In the multi-block cumsum case, the `inclusive_sum_scan_kernel`
implements the stream-scan technique in which each thread block has to
consume the preceding sum result from the previous block. The sum
result is passed via the `block_sums` buffer (global memory). To ensure
that the sum results are visible for inter-thread-block consumption,
the buffer has to be declared as `volatile` to prevent the compiler from
caching the results in registers. This diff adds the `volatile` keyword
to `block_sums`.
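
A minimal host-side analogue of the pattern (the real code is a CUDA kernel; the names and spin condition below are illustrative). Without `volatile`, the compiler may hoist the load out of the loop and spin forever on a stale register value:

```cpp
#include <cstddef>

float wait_for_previous_block(
    volatile const float* block_sums,  // partial sums published per block
    volatile const int* block_ready,   // set to 1 once a block's sum is visible
    std::size_t prev_block) {
  while (block_ready[prev_block] == 0) {
    // spin: each iteration re-reads memory because the pointer is volatile
  }
  return block_sums[prev_block];
}
```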

Reviewed By: q10

Differential Revision: D45435897

fbshipit-source-id: f81a25b43eda18ae1eb18bed33f595fc27ef2707

* BF16 support for HBC ops. (#1744)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1744

Adding BF16 support for HBC ops, and updates on tests.

Reviewed By: q10, sryap

Differential Revision: D45449360

fbshipit-source-id: 8321155b426143d80064f12a910c0626bdfafbba

* Use designated initializers & kernel launch checks in deeplearning/fbgemm/include/fbgemm/Utils.h (#1746)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1746

Designated initializers can make the code cleaner
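
A generic illustration with a hypothetical struct (not the actual contents of `fbgemm/Utils.h`):

```cpp
// Positional initialization leaves the reader guessing which value is which;
// designated initializers (C++20) name each field at the call site.
struct BlockingFactors {
  int mb = 0;
  int nb = 0;
  int kb = 0;
};

BlockingFactors a{24, 16, 64};                  // which value is which?
BlockingFactors b{.mb = 24, .nb = 16, .kb = 64};  // self-documenting
```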

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(1 files modified.)

Reviewed By: sryap

Differential Revision: D45464948

fbshipit-source-id: 28e38dc60b893fe7c91db0d791e069a6de87b420

* Dynamically determine platform name in FBGEMM scripts (#1742)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1742

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1738

Instead of hardcoding x86_64 when installing dependencies, let's now dynamically determine the platform name

Reviewed By: excelle08

Differential Revision: D45246996

fbshipit-source-id: d9031e76a915c2362be62c85a3c1f0786828ca8b

* Split the Rendering of `embedding_forward_split_template.cu` into Smaller Files (#1723)

Summary:
- Migrate `*_embedding_*_codegen_forward_*_kernel` out of `embedding_forward_split_template.cu` and into `embedding_forward_split_kernel_template.cu`
- Migrate `*_embedding_nobag_codegen_forward_unweighted_small_kernel` out of `embedding_forward_split_template.cu` and into `embedding_forward_split_kernel_small_template.cu`

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1723

Reviewed By: sryap

Differential Revision: D45363388

Pulled By: q10

fbshipit-source-id: 563ca610b15830aca854bc00d6a31fd6e8cb8a53

* Installation instructions for OSS (#1750)

Summary:
- Add installation instructions for OSS
- Migrate Installation, Test, and Documentation information out of the README
- Add link to GitHub Discussions in the README
- Migrate the Netlify configuration from website to TOML file in the repo so that build jobs are configurable by developers

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/1750

Reviewed By: sryap, shintaro-iwasaki

Differential Revision: D45540724

Pulled By: q10

fbshipit-source-id: beaab824cc5d441b96b89daea2a71f541e21f2ec

---------

Co-authored-by: Banit Agrawal <bagrawal@meta.com>
Co-authored-by: Sabin Devkota <devkotasabin@meta.com>
Co-authored-by: Junjie Yang <junjieyang@meta.com>
Co-authored-by: Benson Ma <bensonma415@meta.com>
Co-authored-by: Alfredo Tupone <tupone@gentoo.org>
Co-authored-by: Sarunya Pumma <sarunya@meta.com>
Co-authored-by: Doe Hyun Yoon <dhyoon@meta.com>
Co-authored-by: Matt Galloway <mattjgalloway@meta.com>
Co-authored-by: Richard Barnes <rbarnes@meta.com>
Co-authored-by: Xiao Sun <sunx@meta.com>
Co-authored-by: Rengan Xu <renganxu@meta.com>
Co-authored-by: siwasaki <siwasaki@fb.com>
Co-authored-by: Jianyu Huang <jianyuhuang@meta.com>
Co-authored-by: Yue Dong <yoyoyod@meta.com>
Co-authored-by: Geet Sethi <gsethi@meta.com>
Co-authored-by: Janet Yang <qxy11@meta.com>
Co-authored-by: Wang Zhou <wangzhou@meta.com>
Co-authored-by: Jongsoo Park <jongsoo@meta.com>
Co-authored-by: Tran Le <quytranle@meta.com>
Co-authored-by: Ryan Landay <rlanday@gmail.com>
Co-authored-by: Devashish Tyagi <devashisht@meta.com>
Co-authored-by: Szwichtenberg, Damian <damian.szwichtenberg@intel.com>
Co-authored-by: generatedunixname89002005325676 <generatedunixname89002005325676@fb.com>
Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
Co-authored-by: William Tambellini <wtambellini@sdl.com>
Co-authored-by: Jason Park <jasonjk@meta.com>
@liligwu
Collaborator Author

liligwu commented Sep 24, 2024

This is a test branch and shouldn't be merged.

@liligwu liligwu closed this Sep 24, 2024