
Profile with kineto #79

Closed
wants to merge 16 commits into from
Conversation

amirakb89

This PR reports kernel time using Kineto (the PyTorch profiler) so that the CPU overhead of launching tiny kernels does not inflate the measured time.
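For context, a minimal sketch of the idea (not the benchmark code in this PR; `fn` is a placeholder callable that launches the kernels to be measured): Kineto's per-kernel device times are summed, so the reported number reflects GPU time rather than host-side launch overhead.

```
# Minimal sketch, assuming a CUDA-capable setup; not the code in this PR.
import torch
from torch.profiler import ProfilerActivity, profile

def gpu_kernel_time_us(fn, *args, warmup: int = 5, iters: int = 20) -> float:
    """Average self CUDA kernel time per iteration, in microseconds."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn(*args)
        torch.cuda.synchronize()
    # Sum self device time over all recorded kernels; CPU launch overhead
    # does not contribute to these numbers.
    total_us = sum(e.self_cuda_time_total for e in prof.key_averages())
    return total_us / iters
```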

r-barnes and others added 16 commits December 16, 2024 16:42
Reviewed By: jwfromm

Differential Revision: D67293350

fbshipit-source-id: 76ee573031729fd918cbdc0133c14f3fbbe3decf
Summary:
X-link: facebookresearch/FBGEMM#592

X-link: facebookresearch/FBGEMM#568

Pull Request resolved: pytorch#3488

- Break up D66310520 into backend and frontend diffs

Reviewed By: leitian

Differential Revision: D66986498

fbshipit-source-id: 1779a9a2a4611eda1298afc0e840839c7da46b10
Summary:
Pull Request resolved: pytorch#3484

X-link: facebookresearch/FBGEMM#565

Use 3D grid to reduce the risk of running into grid size overflow in
generate_vbe_metadata
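As a rough illustration of the idea (an assumed helper, not the actual generate_vbe_metadata launch code), a large flat block count can be factored across three grid dimensions so that no single dimension exceeds its limit:

```
# Rough sketch with assumed limits; the kernel recomputes the flat block id from
# (blockIdx.x, blockIdx.y, blockIdx.z) and exits early when it exceeds the total.
def to_3d_grid(total_blocks: int, max_x: int = 2**31 - 1, max_yz: int = 65535):
    grid_x = min(total_blocks, max_x)
    grid_y = min((total_blocks + grid_x - 1) // grid_x, max_yz)
    grid_z = (total_blocks + grid_x * grid_y - 1) // (grid_x * grid_y)
    assert grid_z <= max_yz, "block count exceeds what a 3D grid can address"
    return grid_x, grid_y, grid_z
```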

Reviewed By: r-barnes

Differential Revision: D66948760

fbshipit-source-id: 505d9b72e0d74d1707e4aa0ab9af48f26cf18b4a
…3454)

Summary:
Pull Request resolved: pytorch#3454

X-link: facebookresearch/FBGEMM#538

This is 2/2 of enabling bounds check V2 for APS FM. Following APS principles, we would like to surface the V2 switch up to the APS user config, so this diff extends the existing BoundsCheckMode with V2 counterparts and passes the version flag into the operator.

This diff enables V2 via a backward-compatible update of the modes with a V2 prefix, which makes it intuitive for users to switch; see the sketch below.
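A hedged sketch of what the backward-compatible extension could look like (member names and values are illustrative, not the exact FBGEMM definitions):

```
# Illustrative only; the real enum and operator plumbing live in FBGEMM.
from enum import IntEnum

class BoundsCheckMode(IntEnum):
    # existing (v1) modes
    FATAL = 0
    WARNING = 1
    IGNORE = 2
    NONE = 3
    # V2 counterparts added in a backward-compatible way
    V2_FATAL = 4
    V2_WARNING = 5
    V2_IGNORE = 6

def split_mode(mode: BoundsCheckMode):
    """Map a possibly-V2 mode to (base mode, use_v2 flag) for the operator call."""
    use_v2 = mode.name.startswith("V2_")
    base = BoundsCheckMode[mode.name[3:]] if use_v2 else mode
    return base, use_v2
```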

More context can be found in https://docs.google.com/document/d/1hEhk2isMOXuWPyQJxiOzNq0ivfECsZUT7kT_IBmou_I/edit?tab=t.0#heading=h.q89rllowo3eb

Reviewed By: sryap

Differential Revision: D66512098

fbshipit-source-id: d2181a82462ca1c2c93360d4108766edeb38d000
Summary:
Pull Request resolved: pytorch#3444

X-link: facebookresearch/FBGEMM#530

This diff adds support for true dynamic M as is found in grouped_gemm. To do so, we add a new `zero_start_index_M` argument that must be provided by the user and indicates the number of non-zero M in each tensor. One nice thing about this approach is that we can now do a single kernel call to set up the gemm arguments.

We make `zero_start_index_M` optional as it requires fixed N and K. When N and K vary across groups, we use the previous static-shape approach.
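A hedged sketch of the calling convention described above (shapes, dtypes, and the row counts are illustrative; the fp8 quantization step and the actual operator entry point are omitted):

```
# Illustrative tensors only; not the exact operator signature in this diff.
import torch

G, max_M, N, K = 4, 128, 256, 512            # fixed N and K across all groups
x = [torch.randn(max_M, K, device="cuda", dtype=torch.bfloat16) for _ in range(G)]
w = [torch.randn(N, K, device="cuda", dtype=torch.bfloat16) for _ in range(G)]

# Number of valid (non-padding) rows in each group's activation tensor; rows at
# or beyond this index are ignored by the kernel.
zero_start_index_M = torch.tensor([128, 64, 96, 32], device="cuda", dtype=torch.int64)

# The grouped gemm op consumes zero_start_index_M so that all GEMM arguments can
# be set up in a single kernel call while M stays truly dynamic per group.
```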

Reviewed By: bradleyhd, jiawenliu64

Differential Revision: D66682886

fbshipit-source-id: 9c4554dba9becf33fcc87cd1b01266fead716916
Summary:
Pull Request resolved: pytorch#3509

X-link: facebookresearch/FBGEMM#593

When calculating num_threads and groups_per_thread to distribute work, the rounding accumulates and effectively expands the input space.

For example (the new UT), when the input tensor is (1, 2^31 - 8):
```
a.numel: 2147483640
num_threads: 46341
groups_per_thread: 1449
num_groups: 67108864
num_threads * groups_per_thread = 67148109 > num_groups
```

In the kernel, memory access uses input_start = num_threads * groups_per_thread * pid, so when pid is large we end up reading data outside the input.
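The arithmetic below reproduces the example (a group size of 32 and a square-root thread heuristic are assumed to match the numbers above; the kernel's actual host-side logic may differ) and shows where the overshoot comes from:

```
# Assumed constants chosen to reproduce the UT numbers; illustrative only.
import math

numel = 2**31 - 8                                        # 2147483640
group_size = 32                                          # assumed
num_groups = math.ceil(numel / group_size)               # 67108864
num_threads = math.ceil(math.sqrt(numel))                # 46341 (assumed heuristic)
groups_per_thread = math.ceil(num_groups / num_threads)  # 1449

covered = num_threads * groups_per_thread                # 67148109 > num_groups
print(covered - num_groups)                              # 39245 groups past the real input

# A thread whose start index lands in that excess range reads past the tensor
# unless its per-thread bounds are clamped to num_groups before loading.
```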

Reviewed By: jwfromm

Differential Revision: D67369392

fbshipit-source-id: 62c28fe3a94911a10921e233ff5ae42097e9dbb4
…3508)

Summary:
Pull Request resolved: pytorch#3508

X-link: facebookresearch/FBGEMM#589

Reland D66990975 with a fix for the NaN issue observed during the LLaMa4 17B model run with the fp8_rowwise FFN.

Specifically, offset was not properly updated when loading/storing data.

Reviewed By: jwfromm

Differential Revision: D67303282

fbshipit-source-id: 334d32019424de6daff4261b1d5ebe3c977fdabd
Summary:
X-link: facebookresearch/FBGEMM#597

Pull Request resolved: pytorch#3516

Added pyper configuration for mx4 group size.

Reviewed By: irobert0126, renganxu

Differential Revision: D67407064

fbshipit-source-id: a23765777879491836fcb9f1a00ba8f1e1b26b76
…orch#3512)

Summary:
X-link: facebookresearch/FBGEMM#596

Pull Request resolved: pytorch#3512

Reviewed By: avikchaudhuri

Differential Revision: D67381311

fbshipit-source-id: 345264f99d6f4b77508b4ea95fe20b3482ad1f04
Summary:
X-link: facebookresearch/FBGEMM#599

- Fix the CMake minimum version in conda install
- Fix issue with missing `librhash.so.0` when installing `gcc`
- Fix build issues with bazel, and upgrade bazel version to latest

Pull Request resolved: pytorch#3514

Reviewed By: spcyppt

Differential Revision: D67435456

Pulled By: q10

fbshipit-source-id: 2fe53c59251df3633771b2b6b0d97c15a33df7b6
Summary:
Pull Request resolved: pytorch#3519

X-link: facebookresearch/FBGEMM#601

For extremely large inputs, we found that boundary check values were large enough to cause integer overflow. This resulted in Triton triggering masking for all loads and stores, which led to garbage outputs.

This diff fixes the issue by more carefully doing int64 upcasting for super large tensors. After this change, all super large tests pass.
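A hedged sketch of the upcast pattern on a generic copy kernel (not the kernel touched in this diff): offsets are promoted to int64 before the bound comparison, so the mask cannot wrap around for tensors with more than 2^31 elements.

```
# Generic illustration of int64 offset upcasting in Triton; not this diff's kernel.
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, numel, BLOCK: tl.constexpr):
    pid = tl.program_id(0).to(tl.int64)              # upcast before multiplying
    offs = pid * BLOCK + tl.arange(0, BLOCK).to(tl.int64)
    mask = offs < numel                              # bound check stays in int64
    vals = tl.load(src_ptr + offs, mask=mask)
    tl.store(dst_ptr + offs, vals, mask=mask)
```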

Reviewed By: qchip

Differential Revision: D67495115

fbshipit-source-id: dcea639a7343d5782823f103a0572870aa496b05
Summary:
X-link: facebookresearch/FBGEMM#602

Pull Request resolved: pytorch#3520

In the diff D66465811 we introduced a bulk initialization function `_insert_all_kv` for ssd tensors. However, large tensors take a long time to fully initialize, and ideally this can happen in the background so it doesn't increase TTFB of the training jobs.

This change does exactly that: it moves the initialization to a separate thread, allowing other initialization in the training job, like reading data, to happen concurrently. To avoid pushing synchronization into user space, this change introduces a getter and setter for ssd_db, which ensure initialization has fully completed before the weights are used.
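A minimal sketch of that pattern (illustrative names, not the FBGEMM implementation): initialization runs on a background thread, and the ssd_db getter blocks until it has finished so callers never observe a partially initialized store.

```
# Illustrative only; names and structure differ from the actual implementation.
import threading

class LazySsdDb:
    def __init__(self, init_fn):
        self._ssd_db = None
        self._ready = threading.Event()
        # Bulk initialization (e.g. the _insert_all_kv work) runs off the main thread.
        self._thread = threading.Thread(target=self._init, args=(init_fn,), daemon=True)
        self._thread.start()

    def _init(self, init_fn):
        self._ssd_db = init_fn()
        self._ready.set()

    @property
    def ssd_db(self):
        # Synchronization is hidden here instead of in user code: block only if
        # initialization has not completed by the first use of the weights.
        self._ready.wait()
        return self._ssd_db

    @ssd_db.setter
    def ssd_db(self, value):
        self._ssd_db = value
        self._ready.set()
```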

Reviewed By: duduyi2013, drdarshan, jiayulu

Differential Revision: D67480511

fbshipit-source-id: 6faf54621fc6e26a9791ac23e48aa7890329077a
Summary:
X-link: facebookresearch/FBGEMM#605

- Set the --plat-name explicitly to `manylinux_2_28`

Pull Request resolved: pytorch#3521

Reviewed By: spcyppt

Differential Revision: D67538191

Pulled By: q10

fbshipit-source-id: b2f8cc0b81c7e46bd2e380c03a6fa68da11786d6
Summary:
Pull Request resolved: pytorch#3342

X-link: facebookresearch/FBGEMM#436

A new optional optimizer state `row_counter` is added to Adam to perform bias correction per embedding row. `row_counter` serves as the iteration counter for a row (an index): it is incremented whenever the row occurs and is used to do the bias correction.

Without rowwise bias correction (existing Adam),
```
m_hat_t = m_t / (1.0 - powf(beta1, iter));
v_hat_t = v_t / (1.0 - powf(beta2, iter));
```

With rowwise bias correction enabled,
```
// when index `idx` occurs
_row_counter = row_counter[idx] + 1;
m_hat_t = m_t / (1.0 - powf(beta1, _row_counter));
v_hat_t = v_t / (1.0 - powf(beta2, _row_counter));
```

This request is from IG to allow all the models to be scaled on sparse features with expected 1.5% NE on Stories.

-------

**__The functionality is not set by default.__** Frontend: D64848802

To enable the bias correction, `use_rowwise_bias_correction` needs to be set to True through extra_optimizer_config.
```
extra_optimizer_config = UserEnabledConfigDefinition(
    use_rowwise_bias_correction=True,
)
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (E, D, M, compute_device) for (E, D, M) in zip(Es, Ds, managed)
    ],
    optimizer=OptimType.ADAM,
    extra_optimizer_config=extra_optimizer_config,
    ...
)
```
------
**__Performance from Kineto__** (unweighted)
```
                   Baseline* |  default** | enabled***
forward  | cpu  |   2.293 s  |   2.188 s  |   2.043 s
         | cuda |  12.512 ms |  12.539 ms |  12.547 ms
backward | cpu  |  69.861 ms |  66.546 ms |  65.880 ms
         | cuda | 103.429 ms | 103.395 ms | 103.130 ms
```
\* Baseline: before changes
\** default: default setting; use_bias_correction = False
\*** enabled: use_bias_correction = True

Reviewed By: sryap

Differential Revision: D64808460

fbshipit-source-id: 9706bcc4601b370f4d67c81b833fb1cd46377a6c

@avbokovoy avbokovoy left a comment

Could you re-target your branch so it contains only the diff you introduced, but not the changes from the upstream?

```
@@ -1330,6 +1330,10 @@ def nbit_device( # noqa C901
    def _kineto_trace_handler(p: profile, phase: str) -> None:
        p.export_chrome_trace(
            trace_url.format(tbe_type=tbe_type, phase=phase, ospid=os.getpid())
        #print(p.key_averages())
```

Let's remove dead code

@amirakb89 amirakb89 closed this Dec 30, 2024