
Ifu 2024 02 20 #56

Merged: 36 commits merged into main on Feb 20, 2024

Conversation

@liligwu (Collaborator) commented on Feb 20, 2024

No description provided.

sryap and others added 30 commits February 1, 2024 11:16
Summary:
Pull Request resolved: pytorch#2303

This diff introduces a unit test to verify that TBE with UVM caching
handles cache access correctly when the cache access offset exceeds
the maximum int32 value.

Reviewed By: jspark1105

Differential Revision: D53300526

fbshipit-source-id: 0d6c731757037b2bd05604954ec064fab5d4be4b
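
For context, a minimal sketch (not the actual test) of why 64-bit arithmetic matters here: with a cache laid out as rows of `D` elements, a 32-bit element-offset computation wraps once `row * D` exceeds 2^31 - 1.

```
#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical cache geometry: ~20M cached rows of 128 floats each.
  const int64_t cache_row = 20'000'000;
  const int64_t D = 128;

  // Correct offset, computed in 64 bits.
  const int64_t offset64 = cache_row * D;                 // 2,560,000,000

  // What a 32-bit offset would hold: the value wraps and goes negative,
  // which is the class of bug the new unit test exercises.
  const int32_t offset32 = static_cast<int32_t>(offset64);

  std::printf("int64 offset: %lld, truncated int32 offset: %d\n",
              static_cast<long long>(offset64), offset32);
  return 0;
}
```
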
Summary:
Pull Request resolved: pytorch#2304

Removed unused args from `generate_cache_tbes`

Reviewed By: q10

Differential Revision: D53305015

fbshipit-source-id: 12572d794e912f272e6c16dba6646d6e420ad6ec
Summary:
Pull Request resolved: pytorch#2305

- Re-organize TBE tests, pt 5

Reviewed By: sryap

Differential Revision: D53331779

fbshipit-source-id: c65c8565fc3bec883c70b5797549dd532e8d1add
…oise (pytorch#2306)

Summary:
Pull Request resolved: pytorch#2306

The original EmbeddingSpMDMNBitBenchmark runs the Ref, autovec, and asmjit kernels sequentially for each benchmark spec. This can destabilize the CPU frequency because SIMD-heavy workloads may cause throttling. The new nbit-CPU-TBE benchmark mitigates this frequency instability by running one kernel over all benchmark specs before switching to the next kernel.

Reviewed By: sryap, helloguo

Differential Revision: D53104362

fbshipit-source-id: 076f2bb9a3eeb8db264810b3c55b8c4f5b026df0
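
A minimal sketch of the loop restructuring described above (`BenchSpec`, `run_benchmarks`, and the kernel list are placeholders, not the benchmark's actual API): each kernel sweeps all specs before the next kernel starts, so SIMD-induced throttling from one kernel does not skew the next kernel's measurements.

```
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct BenchSpec { int batch; int rows; int bits; };  // placeholder spec

void run_benchmarks(
    const std::vector<std::pair<std::string, std::function<void(const BenchSpec&)>>>& kernels,
    const std::vector<BenchSpec>& specs) {
  // Old ordering (sketch): for each spec, run Ref, autovec, asmjit back to back;
  // a SIMD-heavy kernel can throttle the CPU and distort the next kernel's timing.
  //
  // New ordering: one kernel runs all specs, then the next kernel starts, so each
  // kernel is measured under a more consistent frequency state.
  for (const auto& kernel : kernels) {      // kernel.first = name, kernel.second = fn
    for (const auto& spec : specs) {
      kernel.second(spec);                  // time each (kernel, spec) pair here
    }
  }
}
```
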
Summary:
Pull Request resolved: pytorch#2277

Add early exit to sparse_segment_sum_csr_cuda op in case of empty input. Add check for invalid input size.

Reviewed By: sryap, jasonjk-park

Differential Revision: D52963209

fbshipit-source-id: bec0192793e9be49018d47a751aae1dcf0ac8425
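
A hedged host-side sketch of the two guards described above; the function name, argument names, and shapes are placeholders rather than the op's real signature.

```
#include <ATen/ATen.h>
#include <c10/util/Exception.h>

// Sketch only: the real sparse_segment_sum_csr_cuda has a different signature
// and launches a CUDA kernel; this shows just the validation and the early exit.
at::Tensor sparse_segment_sum_csr_sketch(
    const at::Tensor& csr_seg,   // segment boundaries, length = num_segments + 1
    const at::Tensor& values) {  // 2D values, shape [nnz, D]
  TORCH_CHECK(csr_seg.dim() == 1, "csr_seg must be 1D");
  TORCH_CHECK(values.dim() == 2, "values must be 2D");
  TORCH_CHECK(csr_seg.numel() >= 1, "csr_seg must have at least one element");

  const auto num_segments = csr_seg.numel() - 1;
  auto output = at::zeros({num_segments, values.size(1)}, values.options());

  // Early exit on empty input: nothing to sum, so skip the kernel launch.
  if (num_segments == 0 || values.numel() == 0) {
    return output;
  }
  // ... CUDA kernel launch that accumulates rows of `values` into `output` ...
  return output;
}
```
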
Summary:
Pull Request resolved: pytorch#2308

Avoid a conditional module output for PT2 tracing; when UVM caching is disabled, there is no need to prefetch.

Reviewed By: suo, sryap

Differential Revision: D53361712

fbshipit-source-id: e1d1efb07d73f0d40dce856a5b2d99775a788b08
Summary:
Pull Request resolved: pytorch#2307

- Suppress the infinite-recursion warning in a location that is known
not to cause infinite recursion, as it is an implementation of a
pure virtual function using CRTP

Reviewed By: r-barnes

Differential Revision: D53359145

fbshipit-source-id: 4a06134efecbda49d353a17fe60b9a6496ee1b32
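
For illustration, a generic sketch of the kind of CRTP forwarding that can trip `-Winfinite-recursion` (class names are made up, not FBGEMM's): the base implementation forwards to the derived class, which is expected to provide its own override, so the apparent self-call never actually recurses.

```
// A CRTP base that implements a "pure virtual" by forwarding to Derived.
template <typename Derived>
class KernelBase {
 public:
  int run() const {
    // The compiler cannot prove that Derived overrides run(), so it may warn
    // that this call recurses forever; in practice every concrete Derived
    // provides its own run(), so the recursion never happens.
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Winfinite-recursion"
#endif
    return static_cast<const Derived*>(this)->run();
#if defined(__clang__)
#pragma clang diagnostic pop
#endif
  }
};

class FooKernel : public KernelBase<FooKernel> {
 public:
  int run() const { return 42; }  // the forwarded call dispatches here
};
```
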
Summary:
The process terminated in CI during this test, causing the job to fail. The test passes locally.

Skip test_backward_dense for now to unblock the nightly release for CPU variants. We will re-enable it once the failure has been investigated.

Pull Request resolved: pytorch#2311

Reviewed By: jspark1105

Differential Revision: D53486167

Pulled By: spcyppt

fbshipit-source-id: 07f5d4292c7478bab26c875eb634a405b56cf5cf
Summary:
Pull Request resolved: pytorch#2282

This follows up the work in D51865590 and D52679387 by plumbing the `uvm_cache_stats` argument up to the Python API level.  `local_uvm_cache_stats` is now zeroed out before the prefetch step rather than after, so that the data can be passed into the forward step.

This is a re-attempt of landing D51995949 with additions copied from D52670550

Reviewed By: spcyppt

Differential Revision: D53033916

fbshipit-source-id: 747f81989b7deef1684a94e5f294fe1d772e2b42
…ch#2309)

Summary:
Pull Request resolved: pytorch#2309

**(1) This diff updates TBE to directly copy data between cache and
embedding when embedding and cache have the same type**

When UVM caching is enabled in training, TBE performs a two-step data
conversion when evicting/flushing the cache.  That is, (1) it reads
data from cache and converts/quantizes data from the `cache_t` type to
FP32, (2) it then converts FP32 data to the `emb_t` type and writes
the data to the embedding table (note: `cache_t` is the cache type and
`emb_t` is the embedding table type).  This two-step data conversion
is not necessary when `emb_t` and `cache_t` are the same.  TBE can
copy data between cache and embedding directly (i.e., no conversion
required).

**(2) This diff avoids `at::PhiloxCudaState` initialization during cache
eviction/flushing when embedding and cache have the same type**

When stochastic rounding is enabled, TBE randomizes some bits when
converting data from FP32 to FP16/INT8.  This randomization requires a
random seed (namely `at::PhiloxCudaState`) which has to be initialized
on the host.  Regardless of whether UVM caching is used, TBE applies
stochastic rounding during the embedding table update.  When UVM
caching is enabled, TBE also applies stochastic rounding when
evicting/flushing cache lines.  For this reason, using UVM caching
initializes `at::PhiloxCudaState` more times than not using UVM
caching.  The state of `at::PhiloxCudaState` changes each time it is
initialized.  This causes the results with and without UVM caching to
differ even when the first random seed is the same.  Although this
behavior is semantically correct, it makes it hard to compare results
between using and not using UVM caching.  Therefore, when stochastic
rounding is not required (i.e., during cache eviction/flushing when
`cache_t` and `emb_t` are the same), we do not initialize
`at::PhiloxCudaState`, which minimizes non-deterministic behavior
between using and not using UVM caching.

Reviewed By: jspark1105

Differential Revision: D53395344

fbshipit-source-id: 0d259f707aac78faa22b734f4241801c7f487eda
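
A minimal sketch of the dispatch logic this describes; `flush_cache_row`, `dequantize_elem`, and `quantize_elem_stochastic` are placeholders, not the actual TBE kernel code. When `emb_t` and `cache_t` match, the row is copied directly and no `at::PhiloxCudaState` is needed; otherwise the two-step dequantize/quantize path (with stochastic rounding) is taken.

```
#include <cstring>
#include <type_traits>

// Placeholder conversion helpers (stand-ins for the real dequantize/quantize code).
template <typename cache_t>
float dequantize_elem(const cache_t* row, int d) {
  return static_cast<float>(row[d]);
}

template <typename emb_t>
void quantize_elem_stochastic(emb_t* row, int d, float v, void* /*rng_state*/) {
  row[d] = static_cast<emb_t>(v);  // the real path applies stochastic rounding here
}

// Hypothetical per-row flush helper; the rng_state argument stands in for the
// at::PhiloxCudaState plumbing that is only needed on the conversion path.
template <typename emb_t, typename cache_t>
void flush_cache_row(emb_t* emb_row, const cache_t* cache_row, int D, void* rng_state) {
  if constexpr (std::is_same_v<emb_t, cache_t>) {
    // Same type: copy directly; no FP32 round-trip and no RNG state required.
    std::memcpy(emb_row, cache_row, sizeof(emb_t) * D);
  } else {
    // Different types: cache_t -> FP32 -> emb_t, with stochastic rounding.
    for (int d = 0; d < D; ++d) {
      const float v = dequantize_elem(cache_row, d);
      quantize_elem_stochastic(emb_row, d, v, rng_state);
    }
  }
}
```
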
…torch#2295)

Summary:
Pull Request resolved: pytorch#2295

- Register the 2nd step operator `qlinear_quant` into FX stack
- Add FX Kernel benchmark for dynamic quantized gemm step-2
- Use `quantize_step` parameter to differentiate different stages
- Separate Net modules for step-2 vs. step-1

result:
https://fb-my.sharepoint.com/:x:/g/personal/jiyuanz_meta_com/Ec94q-KgmslMtQ7nIYT4240BZUyWiK-iQvP1cBgzfgEDWg?e=DfP82U

1K x 1K: 638 cycles (5.10 us) --> 411 GB/s
2K x 2K: 1200 cycles (9.6 us) --> 873 GB/s

As a reference:
1K x 1K x 1K FP16 GEMM: 20.30 us
2K x 2K x 2K FP16 GEMM:  127.80 us

Reviewed By: charliezjw

Differential Revision: D52136852

fbshipit-source-id: 1f967549019e4662261ccdfd72a5b6e49d72120b
Summary:
Pull Request resolved: pytorch#2313

- Fix OSS ufmt lint issues

Reviewed By: spcyppt

Differential Revision: D53489224

fbshipit-source-id: 01294913ef30b0fd41bd0dac4e1b6d8d6eaf5995
Summary:
Pull Request resolved: pytorch#2317

As title

Reviewed By: q10, jasonjk-park

Differential Revision: D53445651

fbshipit-source-id: a7bde01cef19051342c468c796573034dcd77013
Summary:
Pull Request resolved: pytorch#2316

* int8 output dtype is a gap for recent FBGEMM use cases; set up a reasonable reference implementation first, memcpy-based.
* For sequence embedding, we first unblock dispatch via a simple memcpy; it is a pure bandwidth-bound op (no dequantization), so memcpy should be reasonably fast. Further optimizations, such as ILP via unrolling, AVX non-temporal instructions, and rep instructions, are left to future iterations.

Reviewed By: sryap

Differential Revision: D53449813

fbshipit-source-id: 5fb35c152612e4769cb4c28dd82ab9bae0c1b776
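
A rough sketch of the memcpy-based reference path described above; the function name and layout details are assumptions, not the actual FBGEMM kernel. For sequence embedding with an int8 output dtype, each looked-up row is copied verbatim, so there is no dequantization work in the loop.

```
#include <cstdint>
#include <cstring>
#include <vector>

// Reference sketch: gather int8 rows into an int8 output without dequantizing.
// `table` holds rows of `row_bytes` bytes each; `indices` are the lookups.
void embedding_lookup_int8_out_ref(
    const uint8_t* table, int64_t row_bytes,
    const std::vector<int64_t>& indices, uint8_t* out) {
  for (size_t i = 0; i < indices.size(); ++i) {
    // Pure bandwidth-bound copy; ILP/unrolling and non-temporal stores could be
    // layered on top in later iterations, as the commit message notes.
    std::memcpy(out + i * row_bytes, table + indices[i] * row_bytes, row_bytes);
  }
}
```
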
Summary:
Pull Request resolved: pytorch#2315

Add an option to provide the output length for values, if known, to avoid a device-to-host sync. For permutation cases using keyed jagged index select where batch size == len(lengths), the output length is known to be len(values).

Reviewed By: sryap

Differential Revision: D53461566

fbshipit-source-id: c4eb3d099e4a28351924ebe67e8e421ab4e727bb
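
A small sketch of the sync-avoidance idea; `allocate_output` and its argument names are illustrative, not the real op signature. When the caller already knows the total output length, the op can size its output without calling `.item()` on a GPU tensor, which would force a device-to-host copy.

```
#include <ATen/ATen.h>
#include <c10/util/Optional.h>

// Sketch: size the output of a jagged index-select-like op.
at::Tensor allocate_output(
    const at::Tensor& values,
    const at::Tensor& lengths,
    c10::optional<int64_t> output_length /* pass when known, e.g. len(values) */) {
  int64_t total;
  if (output_length.has_value()) {
    total = *output_length;                 // no device-to-host sync needed
  } else {
    total = lengths.sum().item<int64_t>();  // forces a D2H sync to read the value
  }
  return at::empty({total}, values.options());
}
```
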
…ytorch#2297)

Summary:
Pull Request resolved: pytorch#2297

result in https://fb-my.sharepoint.com/:x:/g/personal/jiyuanz_meta_com/Ec94q-KgmslMtQ7nIYT4240BZUyWiK-iQvP1cBgzfgEDWg?e=G62cgs

Reviewed By: jspark1105

Differential Revision: D50590364

fbshipit-source-id: 0214a07fbb7bc1dd237f6af220eb863e71ec4472
Summary:
Pull Request resolved: pytorch#2216

This adds TBE UVM caching benchmarks to support the work on D51865590 stack

Reviewed By: sryap

Differential Revision: D52177208

fbshipit-source-id: bc67acb2c76c332e2ee6f42e87cc082a698cd3c4
Summary: Pull Request resolved: pytorch#2320

Reviewed By: shintaro-iwasaki

Differential Revision: D53549471

Pulled By: q10

fbshipit-source-id: 2b6195eba8fdf1d873e98e3a0f9c5b6c100c2fda
Summary:
Pull Request resolved: pytorch#2321

as title
```
[zhuoran@devgpu003.snc8 /data/users/zhuoran/fbsource/fbcode (7932bb4ab|remote/fbsource/stable...)]$ HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hammer/modules/sequential/encoders/tests:hstu_bench -- --enable-multi-stream=true --enable_profiler=true --num-streams=3 --num-workers=3
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/setup_helpers:gen_version_header to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2:substitute to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/amd_build:build_amd to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/torchgen:gen to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/setup_helpers:generate_code to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
Action failed: fbcode//deeplearning/fbgemm/fbgemm_gpu:sparse_ops_hip (hip_compile src/sparse_ops/sparse_group_index.hip (pic))
Remote command returned non-zero exit code 1
Reproduce locally: `frecli cas download-action f0569d85851723e287f08ed03c0bc831587c0a05f94c911fe0b204ddd7670d24:145`
stdout:
stderr:
buck-out/v2/gen/fbcode/2ab98e452e15a67d/deeplearning/fbgemm/fbgemm_gpu/__sparse_ops_hip_hipify_gen__/out/src/sparse_ops/sparse_group_index.hip:11:10: fatal error: 'cuda_bf16.h' file not found
#include <cuda_bf16.h>
         ^~~~~~~~~~~~~
1 error generated when compiling for gfx90a.
```
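
For reference, one common way to make a bf16 include portable between CUDA and ROCm builds is sketched below; the actual fix in this diff may differ in detail (for example, it may be handled by the hipify step instead).

```
// cuda_bf16.h only exists in CUDA toolkits; ROCm ships hip/hip_bf16.h instead.
#if defined(__HIP_PLATFORM_AMD__) || defined(USE_ROCM)
#include <hip/hip_bf16.h>
#else
#include <cuda_bf16.h>
#endif
```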

Reviewed By: nrsatish, sryap, htyu

Differential Revision: D53549323

fbshipit-source-id: 73753c91cbb4c327ff6952bfa7d889ef02b8a31f
Summary:
Pull Request resolved: pytorch#2312

This diff addresses an issue with `StochasticRoundingRNGState` where
it was previously allocated inside a function but its address was
accessed after the function had returned, leading to illegal memory
access.  To address this, the allocation of
`StochasticRoundingRNGState` has been moved outside of the function to
ensure that it remains alive for all accesses, preventing any illegal
memory access issues.

Reviewed By: jspark1105

Differential Revision: D53462989

fbshipit-source-id: 9b962bcdc901f6ff62388c2a02ec6ea3068844fe
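
A simplified sketch of the lifetime bug and the fix; the struct contents and function names are stand-ins, not the actual FBGEMM code. The point is that the state must be owned by a scope that outlives every use of its address.

```
#include <cstdint>

struct StochasticRoundingRNGState { uint64_t s[2]; };  // simplified stand-in

// Previously (sketch of the bug): the state was a local variable inside a helper,
// and a pointer to it was kept after the helper returned, so later reads through
// that pointer touched dead stack memory (an illegal memory access).
//
// Fix (sketch): the caller owns the storage and passes a pointer in, so the state
// outlives every access made by the rounding code.
void init_stochastic_rounding_state(StochasticRoundingRNGState* state, uint64_t seed) {
  state->s[0] = seed;
  state->s[1] = seed ^ 0x9E3779B97F4A7C15ull;  // arbitrary mixing constant for the sketch
}

void flush_rows_with_stochastic_rounding(uint64_t seed) {
  StochasticRoundingRNGState state;            // lives for the whole flush
  init_stochastic_rounding_state(&state, seed);
  // ... per-row rounding code can safely read &state here ...
}
```
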
Summary: Pull Request resolved: pytorch#2323

Reviewed By: spcyppt

Differential Revision: D53597686

Pulled By: q10

fbshipit-source-id: c53ad045e14fd05106dcea6c12dfd0d0212e53a0
Summary:
Pull Request resolved: pytorch#2324

As title

Reviewed By: jspark1105

Differential Revision: D53543704

fbshipit-source-id: 2861fc84c0151e1903e181992d8a4d9d4f7ce7f2
Summary:
Pull Request resolved: pytorch#2326

Original commit changeset: a7bde01cef19

Original Phabricator Diff: D53445651

Reviewed By: yjhao

Differential Revision: D53622702

fbshipit-source-id: c95514aa3901b49c08c691481665cc181c5f8cb3
…ytorch#2327)

Summary:
This op previously didn't have an autograd registration.

(a) We would see this warning:
```
/data/users/dberard/fbsource/buck-out/v2/gen/fbcode/6f27a84d3075b0d5/scripts/dberard/jplusd/__jagged_plus_dense__/jagged_plus_dense#link-tree/torch/autograd/graph.py:744: UserWarning: fbgemm::jagged_dense_elementwise_add_jagged_output: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at fbcode/caffe2/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:72.)
```
(b) Sometimes we would get aot_autograd partitioner issues because this op would not show up as an op returning a tensor.

Previous issue: a single implementation for both CPU and Autograd was registered, which would call DenseToJaggedOp::apply(); a separate CUDA implementation was registered which did not have a backward registration.

Updated implementation:
- added a CPU implementation which does `jagged + dense_to_jagged(dense, offsets)`
- added an AutogradFunction implementation, which: in forward, redispatches to jagged_dense_elementwise_add_jagged_output; and in backward, redispatches to jagged_to_dense.

Pull Request resolved: pytorch#2327

Reviewed By: williamwen42

Differential Revision: D53650907

Pulled By: davidberard98

fbshipit-source-id: d2cf5b2fe7c171f216963525ba499099d31423fb
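
A condensed sketch of the autograd design described above, written against the C++ autograd API; `JaggedDenseAddJaggedOutput`, `dense_to_jagged_values`, and `jagged_to_dense` are placeholders, and the real code redispatches to the registered fbgemm operators rather than calling local helpers. Forward computes `jagged + dense_to_jagged(dense, offsets)`; backward passes the jagged gradient through unchanged for the jagged input and densifies it for the dense input.

```
#include <torch/torch.h>

using torch::Tensor;
using torch::autograd::AutogradContext;
using torch::autograd::variable_list;

// Placeholder helpers standing in for the real fbgemm ops.
Tensor dense_to_jagged_values(const Tensor& dense, const Tensor& offsets) {
  (void)offsets;
  return dense;  // placeholder body: the real op gathers rows according to offsets
}
Tensor jagged_to_dense(const Tensor& values, const Tensor& offsets,
                       at::IntArrayRef dense_shape) {
  (void)offsets; (void)dense_shape;
  return values;  // placeholder body: the real op scatters values into a dense tensor
}

struct JaggedDenseAddJaggedOutput
    : public torch::autograd::Function<JaggedDenseAddJaggedOutput> {
  static Tensor forward(AutogradContext* ctx, const Tensor& x_values,
                        const Tensor& offsets, const Tensor& y_dense) {
    ctx->save_for_backward({offsets});
    ctx->saved_data["dense_shape"] = y_dense.sizes().vec();
    // Forward: jagged + dense_to_jagged(dense, offsets).
    return x_values + dense_to_jagged_values(y_dense, offsets);
  }

  static variable_list backward(AutogradContext* ctx, variable_list grad_out) {
    auto offsets = ctx->get_saved_variables()[0];
    auto dense_shape = ctx->saved_data["dense_shape"].toIntVector();
    // d/dx_values is the identity; d/dy_dense densifies the jagged gradient.
    return {grad_out[0],
            Tensor(),  // offsets are not differentiable
            jagged_to_dense(grad_out[0], offsets, dense_shape)};
  }
};
```
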
Summary:
- Allow re-running only the failed test cases, to avoid OOM errors with large test suites

- Re-enable the `test_backward_dense` test

Pull Request resolved: pytorch#2328

Reviewed By: sryap

Differential Revision: D53688642

Pulled By: q10

fbshipit-source-id: 368698a3a2a5088c796ea188faa655a778b42b6b
Summary: * Add dequantization support for input tensors with fp16 scale/bias padding at the beginning of rows (the FBGEMM TBE storage format).

Reviewed By: jspark1105

Differential Revision: D53695003

fbshipit-source-id: 3ed62d99b29a2c52953e8781880480cabc402ce4
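
A rough reference sketch of the row layout being dequantized; the exact layout (fp16 scale, then fp16 bias, then the uint8 payload) is taken from the commit text, while the function names and the inline `half_to_float` helper (which flushes subnormals to zero) are assumptions for illustration.

```
#include <cstdint>
#include <cstring>

// Scalar fp16 -> fp32 conversion for the sketch; subnormals are flushed to zero.
float half_to_float(uint16_t h) {
  const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  const uint32_t exp  = (h >> 10) & 0x1Fu;
  const uint32_t man  = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0) {
    bits = sign;                                  // zero / subnormal -> +-0
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (man << 13);      // inf / NaN
  } else {
    bits = sign | ((exp + 112u) << 23) | (man << 13);
  }
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Sketch: dequantize one row stored as [fp16 scale][fp16 bias][uint8 x D].
void dequantize_row_fp16_header(const uint8_t* row, int D, float* out) {
  uint16_t scale_bits, bias_bits;
  std::memcpy(&scale_bits, row, sizeof(uint16_t));
  std::memcpy(&bias_bits, row + sizeof(uint16_t), sizeof(uint16_t));
  const float scale = half_to_float(scale_bits);
  const float bias = half_to_float(bias_bits);

  const uint8_t* qdata = row + 2 * sizeof(uint16_t);  // payload follows the header
  for (int d = 0; d < D; ++d) {
    out[d] = qdata[d] * scale + bias;
  }
}
```
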
Summary: Pull Request resolved: pytorch#2330

Reviewed By: spcyppt

Differential Revision: D53777574

Pulled By: q10

fbshipit-source-id: f6228cd941c5ece168540abc0699883c0c97570f
Summary:
Pull Request resolved: pytorch#2335

As title

Reviewed By: jspark1105

Differential Revision: D53831943

fbshipit-source-id: 5e6574e0c2120575822ae22f6823fa709566991b
Summary:
Pull Request resolved: pytorch#2329

This diff enables the titular warning flag for the directory in question. Further details are in [this workplace post](https://fb.workplace.com/permalink.php?story_fbid=pfbid02XaWNiCVk69r1ghfvDVpujB8Hr9Y61uDvNakxiZFa2jwiPHscVdEQwCBHrmWZSyMRl&id=100051201402394).

This is a low-risk diff. There are **no run-time effects** and the diff has already been observed to compile locally. **If the code compiles, it works; test errors are spurious.**

If the diff does not pass, it will be closed automatically.

Reviewed By: palmje

Differential Revision: D53530303

fbshipit-source-id: 66000f69a67e80196f16c423d4bd12c52ce047c5
…2.h (pytorch#2339)

Summary:
Pull Request resolved: pytorch#2339

`-Wextra-semi` or `-Wextra-semi-stmt` found an extra semicolon.

If the code compiles, this is safe to land.

Reviewed By: palmje, dmm-fb

Differential Revision: D53776044

fbshipit-source-id: 6aea9b08da6e17e326cdbc3411cd25191c2e7ce3
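
For context, the kind of code these flags complain about is simply a stray semicolon, for example:

```
struct Widget {
  void reset() {};   // extra ';' after a member function body: -Wextra-semi
};

int tick() {
  int ticks = 0;
  ++ticks;;          // the stray second ';' is an empty statement: -Wextra-semi-stmt
  return ticks;
}
```
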
bradleyhd and others added 6 commits February 19, 2024 09:54
Summary:
Pull Request resolved: pytorch#2336

As titled, the number of offsets is not dynamic; rather, it is the sum of the number of elements in the concatenated offset tensors.

Additionally, make sure new tensors are created on the same device as the indices.

Reviewed By: khabinov

Differential Revision: D53862184

fbshipit-source-id: 6d98ac9d47725a325fdd55b9ca877d6f75af3779
Summary:
Enable Clang compilation in OSS for fbgemm_gpu (CUDA)

Pull Request resolved: pytorch#2334

Reviewed By: sryap

Differential Revision: D53882764

Pulled By: q10

fbshipit-source-id: bd4a09695c365b04a43a975c919c942726f357bd
…pytorch#2331)

Summary:
Pull Request resolved: pytorch#2331

As titled, this diff changes the kernel to parallelize the work of _block_bucketize_sparse_features_cuda_kernel1 within a row.

The context here is that for IG DV365 models, we have rows that are really long, which makes the kernel slow.

I need to think more about how to improve _block_bucketize_sparse_features_cuda_kernel**2**. Publishing the changes for kernel 1 first because it is significantly faster.


Reviewed By: sryap

Differential Revision: D53585964

fbshipit-source-id: bf44afdb24c217c9d82846392d41298982f97c5b
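
A generic sketch of the parallelization change; the kernel names and the `bucketize_index` helper are made up, and this is not the actual `_block_bucketize_sparse_features_cuda_kernel1`. Instead of one thread walking an entire row, a whole thread block cooperates on each row with a strided loop, so very long rows no longer serialize on a single thread.

```
#include <cstdint>

__device__ __forceinline__ int64_t bucketize_index(int64_t index, int64_t block_size) {
  return index / block_size;  // placeholder for the real bucketization logic
}

// Before (sketch): one thread per row; a very long row keeps one thread busy
// while the rest of its warp idles.
__global__ void bucketize_one_thread_per_row(
    const int64_t* offsets, const int64_t* indices, int64_t* out,
    int num_rows, int64_t block_size) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) return;
  for (int64_t i = offsets[row]; i < offsets[row + 1]; ++i) {
    out[i] = bucketize_index(indices[i], block_size);
  }
}

// After (sketch): one block per row; threads stride over the row's elements,
// so long rows are processed by many threads in parallel.
__global__ void bucketize_block_per_row(
    const int64_t* offsets, const int64_t* indices, int64_t* out,
    int num_rows, int64_t block_size) {
  const int row = blockIdx.x;
  if (row >= num_rows) return;
  for (int64_t i = offsets[row] + threadIdx.x; i < offsets[row + 1]; i += blockDim.x) {
    out[i] = bucketize_index(indices[i], block_size);
  }
}
```
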
…h#2340)

Summary:
Pull Request resolved: pytorch#2340

X-link: pytorch/torchrec#1716

As title

Reviewed By: jspark1105

Differential Revision: D52531661

fbshipit-source-id: 99d17e01f67b43f22c26ffbbb09393463f301fa6
Summary:
Pull Request resolved: pytorch#2337

Add cache precision to the logging.

Created from CodeHub with https://fburl.com/edit-in-codehub

Reviewed By: sryap

Differential Revision: D53867027

fbshipit-source-id: de59b16ec1e8a170ea1dc401ce7ff0533fdcfac1
@liligwu self-assigned this on Feb 20, 2024
@liligwu merged commit 2d727fb into main on Feb 20, 2024
47 of 58 checks passed