forked from pytorch/FBGEMM
Ifu 2024 02 20 #56
Merged
Conversation
Summary: Pull Request resolved: pytorch#2303 This diff introduces a unit test to verify that TBE with UVM caching handles cache accesses correctly when the cache access offset is larger than the max int32 value. Reviewed By: jspark1105 Differential Revision: D53300526 fbshipit-source-id: 0d6c731757037b2bd05604954ec064fab5d4be4b
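The failure mode being tested can be sketched in pure Python (the row count, dim, and `linear_offset` helper below are hypothetical, for illustration only): a linear cache offset `row * dim` that exceeds 2^31 - 1 wraps negative if computed in 32-bit arithmetic, so the kernel must use 64-bit offsets.

```python
import ctypes

def linear_offset(row: int, dim: int, use_int64: bool) -> int:
    """Compute a row's linear cache offset (row * dim) in 32- or 64-bit."""
    off = row * dim
    if not use_int64:
        # Emulate int32 arithmetic: wraps around at 2**31.
        off = ctypes.c_int32(off).value
    return off

# A hypothetical cache of ~17M rows of dim 128 puts offsets past max int32.
row, dim = 17_000_000, 128
assert linear_offset(row, dim, use_int64=True) == 2_176_000_000
assert linear_offset(row, dim, use_int64=False) < 0  # int32 wraps negative
```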
Summary: Pull Request resolved: pytorch#2304 Removed unused args from `generate_cache_tbes` Reviewed By: q10 Differential Revision: D53305015 fbshipit-source-id: 12572d794e912f272e6c16dba6646d6e420ad6ec
Summary: Pull Request resolved: pytorch#2305 - Re-organize TBE tests, pt 5 Reviewed By: sryap Differential Revision: D53331779 fbshipit-source-id: c65c8565fc3bec883c70b5797549dd532e8d1add
…oise (pytorch#2306) Summary: Pull Request resolved: pytorch#2306 The original EmbeddingSpMDMNBitBenchmark runs the Ref, autovec, and asmjit kernels sequentially for each benchmark spec. This can destabilize CPU frequency because SIMD workloads may cause throttling. The new nbit-CPU-TBE benchmark mitigates such frequency instability by using one kernel to run all benchmark specs and then switching to the next kernel to run all specs. Reviewed By: sryap, helloguo Differential Revision: D53104362 fbshipit-source-id: 076f2bb9a3eeb8db264810b3c55b8c4f5b026df0
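The loop-order change can be sketched as follows (a minimal stand-in harness; the `run_benchmarks` name and the toy kernels are assumptions, not the benchmark's actual API): the kernel loop is outermost, so any frequency throttling induced by one kernel does not bleed into the timings of the next.

```python
import time

def run_benchmarks(kernels, specs):
    """Kernel-outer loop order: run ALL specs with one kernel before
    switching kernels, so SIMD-induced frequency throttling from one
    kernel does not perturb the next kernel's timings."""
    results = {}
    for name, kernel in kernels.items():   # outer loop: kernel
        for spec in specs:                 # inner loop: benchmark spec
            t0 = time.perf_counter()
            kernel(*spec)
            results[(name, spec)] = time.perf_counter() - t0
    return results

# Hypothetical stand-in kernels and specs for illustration.
kernels = {"ref": lambda n: sum(range(n)), "simd": lambda n: sum(range(n))}
specs = [(1000,), (2000,)]
timings = run_benchmarks(kernels, specs)
assert set(timings) == {(k, s) for k in kernels for s in specs}
```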
Summary: Pull Request resolved: pytorch#2277 Add early exit to sparse_segment_sum_csr_cuda op in case of empty input. Add check for invalid input size. Reviewed By: sryap, jasonjk-park Differential Revision: D52963209 fbshipit-source-id: bec0192793e9be49018d47a751aae1dcf0ac8425
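The two guards added here can be illustrated with a CPU sketch of a CSR segment sum (the Python function below is a hypothetical simplification, not the CUDA op's signature): return early when there is nothing to reduce, and reject offsets that are inconsistent with the input size.

```python
def sparse_segment_sum_csr(values, csr_offsets):
    """CPU sketch of a CSR segment sum with the two guards the diff adds:
    (1) early exit on empty input, (2) validation of the offsets size."""
    if not values or len(csr_offsets) <= 1:
        return []  # early exit: nothing to reduce
    if csr_offsets[0] != 0 or csr_offsets[-1] != len(values):
        raise ValueError("invalid CSR offsets for input size")
    # Each output segment i sums values[offsets[i]:offsets[i+1]].
    return [sum(values[csr_offsets[i]:csr_offsets[i + 1]])
            for i in range(len(csr_offsets) - 1)]

assert sparse_segment_sum_csr([], [0]) == []            # empty input
assert sparse_segment_sum_csr([1, 2, 3, 4], [0, 2, 4]) == [3, 7]
```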
Summary: Pull Request resolved: pytorch#2308 Avoid a conditional on the module output for PT2 tracing; when UVM caching is disabled, there is no need to prefetch. Reviewed By: suo, sryap Differential Revision: D53361712 fbshipit-source-id: e1d1efb07d73f0d40dce856a5b2d99775a788b08
Summary: Pull Request resolved: pytorch#2307 - Suppress infinite-recursion warning in location that is known to not cause infinite recursion, as it is an implementation of pure virtual using CRTP Reviewed By: r-barnes Differential Revision: D53359145 fbshipit-source-id: 4a06134efecbda49d353a17fe60b9a6496ee1b32
Summary: The process terminated in CI during this test, causing the job to fail. The test passes locally. Skip test_backward_dense to unblock the nightly release for CPU variants for now. We will re-enable it once it has been investigated. Pull Request resolved: pytorch#2311 Reviewed By: jspark1105 Differential Revision: D53486167 Pulled By: spcyppt fbshipit-source-id: 07f5d4292c7478bab26c875eb634a405b56cf5cf
Summary: Pull Request resolved: pytorch#2282 This follows up the work on D51865590 and D52679387 by plumbing the `uvm_cache_stats` argument passing up to the Python API level. `local_uvm_cache_stats` is now zeroed out before the prefetch step as opposed to after, to allow for the data to be passed into the forward step. This is a re-attempt of landing D51995949 with additions copied from D52670550 Reviewed By: spcyppt Differential Revision: D53033916 fbshipit-source-id: 747f81989b7deef1684a94e5f294fe1d772e2b42
…ch#2309) Summary: Pull Request resolved: pytorch#2309 **(1) This diff updates TBE to directly copy data between cache and embedding when the embedding and cache have the same type** When UVM caching is enabled in training, TBE performs a two-step data conversion when evicting/flushing the cache. That is, (1) it reads data from the cache and converts/quantizes it from the `cache_t` type to FP32, and (2) it then converts the FP32 data to the `emb_t` type and writes it to the embedding table (note: `cache_t` is the cache type and `emb_t` is the embedding table type). This two-step conversion is not necessary when `emb_t` and `cache_t` are the same; TBE can copy data between cache and embedding directly (i.e., no conversion required). **(2) This diff avoids `at::PhiloxCudaState` initialization during cache eviction/flushing when the embedding and cache have the same type** When stochastic rounding is enabled, TBE randomizes some bits when converting data from FP32 to FP16/INT8. This randomization requires a random seed (namely `at::PhiloxCudaState`) that has to be initialized on the host. Regardless of whether UVM caching is used, TBE applies stochastic rounding during the embedding table update. When UVM caching is enabled, TBE also applies stochastic rounding when evicting/flushing cache lines. For this reason, using UVM caching initializes `at::PhiloxCudaState` more times than not using UVM caching. The state of `at::PhiloxCudaState` changes each time it is initialized. This causes the results between using and not using UVM caching to differ even when the first random seed is the same. Although this behavior is semantically correct, it makes it hard to compare results between using and not using UVM caching. 
Therefore, when stochastic rounding is not required (i.e., during cache eviction/flushing when `cache_t` and `emb_t` are the same), we do not initialize `at::PhiloxCudaState` to minimize non-deterministic behaviors between using and not using UVM caching. Reviewed By: jspark1105 Differential Revision: D53395344 fbshipit-source-id: 0d259f707aac78faa22b734f4241801c7f487eda
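The flush-path change can be sketched in pure Python (the `flush_cache_line` helper and the toy quantize/dequantize pair below are hypothetical illustrations, not the kernel's actual API): when `emb_t == cache_t`, the cache line is copied verbatim; otherwise it goes through the cache_t → FP32 → emb_t round trip.

```python
def flush_cache_line(cache_line, emb_t, cache_t, quantize, dequantize):
    """Sketch of the flush path: when cache and embedding types match,
    copy directly; otherwise go through the FP32 round trip."""
    if emb_t == cache_t:
        return list(cache_line)                  # direct copy, no conversion
    fp32 = [dequantize(v) for v in cache_line]   # step 1: cache_t -> FP32
    return [quantize(v) for v in fp32]           # step 2: FP32 -> emb_t

# Hypothetical "quantization" that truncates to one decimal place.
q = lambda v: round(v, 1)
dq = lambda v: float(v)
assert flush_cache_line([1.25, 2.5], "fp16", "fp16", q, dq) == [1.25, 2.5]
assert flush_cache_line([1.25, 2.5], "int8", "fp16", q, dq) == [1.2, 2.5]
```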
…torch#2295) Summary: Pull Request resolved: pytorch#2295 - Register the 2nd step operator `qlinear_quant` into FX stack - Add FX Kernel benchmark for dynamic quantized gemm step-2 - Use `quantize_step` parameter to differentiate different stages - Separate Net modules for step-2 vs step-1 -- result: https://fb-my.sharepoint.com/:x:/g/personal/jiyuanz_meta_com/Ec94q-KgmslMtQ7nIYT4240BZUyWiK-iQvP1cBgzfgEDWg?e=DfP82U 1K x 1K: 638 cycles (5.10 us) --> 411 GB/s 2K x 2K: 1200 cycles (9.6 us) --> 873 GB/s As a reference: 1K x 1K x 1K FP16 GEMM: 20.30 us 2K x 2K x 2K FP16 GEMM: 127.80 us Reviewed By: charliezjw Differential Revision: D52136852 fbshipit-source-id: 1f967549019e4662261ccdfd72a5b6e49d72120b
Summary: Pull Request resolved: pytorch#2313 - Fix OSS ufmt lint issues Reviewed By: spcyppt Differential Revision: D53489224 fbshipit-source-id: 01294913ef30b0fd41bd0dac4e1b6d8d6eaf5995
Summary: Pull Request resolved: pytorch#2317 As title Reviewed By: q10, jasonjk-park Differential Revision: D53445651 fbshipit-source-id: a7bde01cef19051342c468c796573034dcd77013
Summary: Pull Request resolved: pytorch#2316 * The int8 output dtype is a gap for recent fbgemm use cases; set up a reasonable reference implementation first, memcpy-based. * For sequence embedding, we first unblock dispatch via a simple memcpy; it is a pure bandwidth-bound op (no dequant), so memcpy should be reasonably OK. Further optimizations, such as ILP via unrolling, AVX non-temporal instructions, and the rep instruction, are left to future iterations. Reviewed By: sryap Differential Revision: D53449813 fbshipit-source-id: 5fb35c152612e4769cb4c28dd82ab9bae0c1b776
Summary: Pull Request resolved: pytorch#2315 Add an option to provide the output length for values, if known, to avoid a device-to-host sync. For permutation cases using keyed jagged index select where batch size == len(lengths), the output length is known to be len(values). Reviewed By: sryap Differential Revision: D53461566 fbshipit-source-id: c4eb3d099e4a28351924ebe67e8e421ab4e727bb
…ytorch#2297) Summary: Pull Request resolved: pytorch#2297 result in https://fb-my.sharepoint.com/:x:/g/personal/jiyuanz_meta_com/Ec94q-KgmslMtQ7nIYT4240BZUyWiK-iQvP1cBgzfgEDWg?e=G62cgs Reviewed By: jspark1105 Differential Revision: D50590364 fbshipit-source-id: 0214a07fbb7bc1dd237f6af220eb863e71ec4472
Summary: Pull Request resolved: pytorch#2216 This adds TBE UVM caching benchmarks to support the work on D51865590 stack Reviewed By: sryap Differential Revision: D52177208 fbshipit-source-id: bc67acb2c76c332e2ee6f42e87cc082a698cd3c4
Summary: Pull Request resolved: pytorch#2320 Reviewed By: shintaro-iwasaki Differential Revision: D53549471 Pulled By: q10 fbshipit-source-id: 2b6195eba8fdf1d873e98e3a0f9c5b6c100c2fda
Summary: Pull Request resolved: pytorch#2321 as title ``` [zhuoran@devgpu003.snc8 /data/users/zhuoran/fbsource/fbcode (7932bb4ab|remote/fbsource/stable...)]$ HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hammer/modules/sequential/encoders/tests:hstu_bench -- --enable-multi-stream=true --enable_profiler=true --num-streams=3 --num-workers=3 Watchman fresh instance: new mergebase, cleared graph state, cleared dep files ⚠ Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/setup_helpers:gen_version_header to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting ⚠ Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2:substitute to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting ⚠ Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/amd_build:build_amd to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting ⚠ Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/torchgen:gen to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting ⚠ Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/setup_helpers:generate_code to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting Action failed: fbcode//deeplearning/fbgemm/fbgemm_gpu:sparse_ops_hip (hip_compile src/sparse_ops/sparse_group_index.hip (pic)) Remote command returned non-zero exit code 1 Reproduce locally: `frecli cas download-action f0569d85851723e287f08ed03c0bc831587c0a05f94c911fe0b204ddd7670d24:145` stdout: stderr: buck-out/v2/gen/fbcode/2ab98e452e15a67d/deeplearning/fbgemm/fbgemm_gpu/__sparse_ops_hip_hipify_gen__/out/src/sparse_ops/sparse_group_index.hip:11:10: fatal error: 'cuda_bf16.h' file not found #include <cuda_bf16.h> ^~~~~~~~~~~~~ 1 error generated when compiling for gfx90a. 
``` Reviewed By: nrsatish, sryap, htyu Differential Revision: D53549323 fbshipit-source-id: 73753c91cbb4c327ff6952bfa7d889ef02b8a31f
Summary: Pull Request resolved: pytorch#2312 This diff addresses an issue with `StochasticRoundingRNGState` where it was previously allocated inside a function but its address was accessed after the function had returned, leading to illegal memory access. To address this, the allocation of `StochasticRoundingRNGState` has been moved outside of the function to ensure that it remains alive for all accesses, preventing any illegal memory access issues. Reviewed By: jspark1105 Differential Revision: D53462989 fbshipit-source-id: 9b962bcdc901f6ff62388c2a02ec6ea3068844fe
Summary: Pull Request resolved: pytorch#2323 Reviewed By: spcyppt Differential Revision: D53597686 Pulled By: q10 fbshipit-source-id: c53ad045e14fd05106dcea6c12dfd0d0212e53a0
Summary: Pull Request resolved: pytorch#2324 As title Reviewed By: jspark1105 Differential Revision: D53543704 fbshipit-source-id: 2861fc84c0151e1903e181992d8a4d9d4f7ce7f2
Summary: Pull Request resolved: pytorch#2326 Original commit changeset: a7bde01cef19 Original Phabricator Diff: D53445651 Reviewed By: yjhao Differential Revision: D53622702 fbshipit-source-id: c95514aa3901b49c08c691481665cc181c5f8cb3
…ytorch#2327) Summary: This op previously didn't have an autograd registration. a) We would see this warning: ``` /data/users/dberard/fbsource/buck-out/v2/gen/fbcode/6f27a84d3075b0d5/scripts/dberard/jplusd/__jagged_plus_dense__/jagged_plus_dense#link-tree/torch/autograd/graph.py:744: UserWarning: fbgemm::jagged_dense_elementwise_add_jagged_output: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at fbcode/caffe2/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:72.) ``` (b) Sometimes we would get aot_autograd partitioner issues because this op would not show up as an op returning a tensor Previous issue: a single implementation for both CPU and Autograd was registered, which would call DenseToJaggedOp::apply(); a separate CUDA implementation was registered which did not have a backward registration. Updated implementation: - added a CPU implementation which does `jagged + dense_to_jagged(dense, offsets)` - added an AutogradFunction implementation, which: in forward, redispatches to jagged_dense_elementwise_add_jagged_output; and in backward, redispatches to jagged_to_dense. Pull Request resolved: pytorch#2327 Reviewed By: williamwen42 Differential Revision: D53650907 Pulled By: davidberard98 fbshipit-source-id: d2cf5b2fe7c171f216963525ba499099d31423fb
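The redispatch structure described above can be sketched without torch (all names below are pure-Python stand-ins for the real operators, not the fbgemm API): the forward computes `jagged + dense_to_jagged(dense, offsets)`, and the backward maps the incoming jagged gradient back to the dense layout via `jagged_to_dense`.

```python
def dense_to_jagged(dense, offsets):
    """Gather dense rows [b][i] into a flat jagged value list."""
    return [dense[b][i]
            for b in range(len(offsets) - 1)
            for i in range(offsets[b + 1] - offsets[b])]

def jagged_to_dense(values, offsets, max_len):
    """Scatter jagged values back to a padded dense [B][max_len] layout."""
    out = [[0] * max_len for _ in range(len(offsets) - 1)]
    for b in range(len(offsets) - 1):
        for i in range(offsets[b + 1] - offsets[b]):
            out[b][i] = values[offsets[b] + i]
    return out

def add_forward(jagged, dense, offsets):
    """Forward: jagged + dense_to_jagged(dense, offsets)."""
    return [a + b for a, b in zip(jagged, dense_to_jagged(dense, offsets))]

def add_backward(grad_jagged, offsets, max_len):
    """Backward: the grad wrt the jagged input is the grad itself; the grad
    wrt the dense input is jagged_to_dense(grad)."""
    return grad_jagged, jagged_to_dense(grad_jagged, offsets, max_len)

offsets = [0, 2, 3]                       # batch 0 has 2 values, batch 1 has 1
out = add_forward([1, 2, 3], [[10, 20], [30, 0]], offsets)
assert out == [11, 22, 33]
grad_jagged, grad_dense = add_backward([1, 1, 1], offsets, 2)
assert grad_dense == [[1, 1], [1, 0]]
```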
Summary: - Allow for re-running tests for only the failed test cases, to avoid OOM errors with large test suites - Re-enable the `test_backward_dense` test Pull Request resolved: pytorch#2328 Reviewed By: sryap Differential Revision: D53688642 Pulled By: q10 fbshipit-source-id: 368698a3a2a5088c796ea188faa655a778b42b6b
Summary: * We add dequant support for input tensors with fp16 scale/bias padding at the beginning of rows (from fbgemm TBE storage). Reviewed By: jspark1105 Differential Revision: D53695003 fbshipit-source-id: 3ed62d99b29a2c52953e8781880480cabc402ce4
Summary: Pull Request resolved: pytorch#2330 Reviewed By: spcyppt Differential Revision: D53777574 Pulled By: q10 fbshipit-source-id: f6228cd941c5ece168540abc0699883c0c97570f
Summary: Pull Request resolved: pytorch#2335 As title Reviewed By: jspark1105 Differential Revision: D53831943 fbshipit-source-id: 5e6574e0c2120575822ae22f6823fa709566991b
Summary: Pull Request resolved: pytorch#2329 This diff enables the titular warning flag for the directory in question. Further details are in [this workplace post](https://fb.workplace.com/permalink.php?story_fbid=pfbid02XaWNiCVk69r1ghfvDVpujB8Hr9Y61uDvNakxiZFa2jwiPHscVdEQwCBHrmWZSyMRl&id=100051201402394). This is a low-risk diff. There are **no run-time effects** and the diff has already been observed to compile locally. **If the code compiles, it works; test errors are spurious.** If the diff does not pass, it will be closed automatically. Reviewed By: palmje Differential Revision: D53530303 fbshipit-source-id: 66000f69a67e80196f16c423d4bd12c52ce047c5
…2.h (pytorch#2339) Summary: Pull Request resolved: pytorch#2339 `-Wextra-semi` or `-Wextra-semi-stmt` found an extra semi If the code compiles, this is safe to land. Reviewed By: palmje, dmm-fb Differential Revision: D53776044 fbshipit-source-id: 6aea9b08da6e17e326cdbc3411cd25191c2e7ce3
Summary: Pull Request resolved: pytorch#2336 As titled, the number of offsets is not dynamic; rather, it is the sum of the number of elements in the concatenated offset tensors. Additionally, make sure new tensors are created on the same device as the indices. Reviewed By: khabinov Differential Revision: D53862184 fbshipit-source-id: 6d98ac9d47725a325fdd55b9ca877d6f75af3779
Summary: Enable Clang compilation in OSS for fbgemm_gpu (CUDA) Pull Request resolved: pytorch#2334 Reviewed By: sryap Differential Revision: D53882764 Pulled By: q10 fbshipit-source-id: bd4a09695c365b04a43a975c919c942726f357bd
…pytorch#2331) Summary: Pull Request resolved: pytorch#2331 As titled, this diff changes the kernel to parallelize the work of _block_bucketize_sparse_features_cuda_kernel1 within a row. The context here is that for IG DV365 models, we have rows that are really long, which makes the kernel slow. I need to think more about how to improve _block_bucketize_sparse_features_cuda_kernel**2**. Publishing the changes for kernel 1 first because it is significantly faster. {F1455708370} Reviewed By: sryap Differential Revision: D53585964 fbshipit-source-id: bf44afdb24c217c9d82846392d41298982f97c5b
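The idea of splitting one very long row's work across workers (instead of assigning the whole row to a single worker) can be sketched as follows; the `bucketize_row` helper, its bucketing rule, and the thread pool are hypothetical simplifications of the CUDA kernel, for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def bucketize_row(row, num_buckets, block_size, num_workers=4):
    """Split one long row's bucketization across workers, then reduce
    the per-worker partial histograms."""
    def bucketize_chunk(chunk):
        counts = [0] * num_buckets
        for idx in chunk:
            counts[(idx // block_size) % num_buckets] += 1
        return counts
    # Chunk the row so each worker handles a contiguous slice.
    step = max(1, len(row) // num_workers)
    chunks = [row[i:i + step] for i in range(0, len(row), step)]
    with ThreadPoolExecutor(num_workers) as pool:
        partials = pool.map(bucketize_chunk, chunks)
    return [sum(c) for c in zip(*partials)]  # reduce partial histograms

counts = bucketize_row(list(range(100)), num_buckets=4, block_size=5)
assert sum(counts) == 100
```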
…h#2340) Summary: Pull Request resolved: pytorch#2340 X-link: pytorch/torchrec#1716 As title Reviewed By: jspark1105 Differential Revision: D52531661 fbshipit-source-id: 99d17e01f67b43f22c26ffbbb09393463f301fa6
Summary: Pull Request resolved: pytorch#2337 Add cache precision to the logging. Created from CodeHub with https://fburl.com/edit-in-codehub Reviewed By: sryap Differential Revision: D53867027 fbshipit-source-id: de59b16ec1e8a170ea1dc401ce7ff0533fdcfac1