Optimize CPU and Memory performance for Resize linear mode parser #3731
base: develop
Thank you for your contribution!
I have left comments on the tidy errors and on the concerns I have with the changes.
Please also build MIGraphX and run make check to ensure that all our test cases pass.
Some of the ONNX verify tests for resize are failing with your changes. These need to pass to ensure there is no loss of functionality between the old and new methods.
[==========] 286 tests ran
[ FAILED ] 2 tests failed
[ FAILED ] resize_upsample_linear_ac_test
[ FAILED ] resize_upsample_linear_test
The two files that seem to break are found at
test/onnx/verify/resize_upsample_linear_test.cpp
test/onnx/verify/resize_upsample_linear_ac_test.cpp
Also, please ensure your changes pass the format check, as outlined here:
https://github.com/ROCm/AMDMIGraphX/actions/runs/12454557312/job/34765947690?pr=3731
@TedThemistokleous, Why is this PR showing up on |
External contributor, I think? I believe it should be the same repo the commit is going to.
0e22cd0
to
a300e32
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #3731      +/-   ##
===========================================
- Coverage    92.16%   92.15%   -0.01%
===========================================
  Files          515      515
  Lines        21978    21977       -1
===========================================
- Hits         20256    20254       -2
- Misses        1722     1723       +1
a300e32
to
006eae7
Thank you for taking a stab at cleaning up this function. However, this PR needs two additional things (besides all test cases passing):
- Code comments for calc_neighbor_points. This is not the fault of this PR, but the comments were and still are entirely missing. The function is very hard for a reviewer to understand, and I am not sure what exactly it is doing. The documentation needs to be fixed now, since the function is being rewritten.
- A unit test that specifically tests this extremely complex function. It should have been there much earlier, but it can be added now in this PR.
Here's the explanation of the new algorithm:

The 1st dimension of vvv_ind is n_dims; in this example, n_dims = 4.

What the original calc_neighbor_points() algorithm is trying to do is compose a new vector of size (2^n_dims * out_elements), in which each element is a vector of n_dims integers. In the current example, with out_elements = 16 and n_dims = 4, we get a new vector as below:

vec_ind{} = {

Let's rewrite vvv_ind in a different pattern that is easier to follow, where each capital letter is a vector of 16 elements, like A = {0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1}.

AB

The original calc_neighbor_points() works recursively: it appends elements vertically and expands horizontally. Each 16-element vector is transposed, notated as A':

Recursion:
Pass 1:
Pass 2:
Pass 3:
Pass 4 (the last pass): with the crafted vector of 2^4 * 16 elements, each element being a vector of 4 integers used as the index, get the linear index from the shape via in_s.index(idx).

Since the 2nd dimension is hardcoded to 2, we can treat this dimension as binary. The final result (before in_s.index(idx)) is actually equivalent to counting from 0 up to 2^n_dims, converting each n_dims-bit value to binary, using each bit to index into the 2nd dimension of vvv_ind (hi or low), using the bit's position to index into the 1st dimension of vvv_ind, and looping over out_elements so that all elements in A (and the other capital letters) are visited.

Taking Pass 3 (before the final in_s.index(idx)) to explain:

vec_ind{} Pass 3:
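The bit-composition idea described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: candidate_table, linear_index and neighbor_points are hypothetical names, linear_index is a row-major stand-in for in_s.index(idx) on a packed shape, and the choice of mapping dimension 0 to the most significant bit is an assumption about ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for vvv_ind: candidates[d][b][e] is the low (b == 0)
// or high (b == 1) input index along dimension d for output element e.
using candidate_table = std::vector<std::vector<std::vector<std::size_t>>>;

// Row-major linear index; assumed stand-in for in_s.index(idx) on a packed shape.
std::size_t linear_index(const std::vector<std::size_t>& lens,
                         const std::vector<std::size_t>& idx)
{
    std::size_t result = 0;
    for(std::size_t d = 0; d < lens.size(); ++d)
        result = result * lens[d] + idx[d];
    return result;
}

// For every value in [0, 2^n_dims) and every output element, pick low or high
// per dimension from the bits of the value, then flatten to a linear index.
std::vector<std::size_t> neighbor_points(const candidate_table& candidates,
                                         const std::vector<std::size_t>& lens)
{
    const std::size_t n_dims = lens.size();
    assert(n_dims > 0 && candidates.size() == n_dims);
    assert(n_dims < std::numeric_limits<std::size_t>::digits);
    const std::size_t out_elements = candidates[0][0].size();
    const std::size_t corners      = std::size_t{1} << n_dims;

    std::vector<std::size_t> result;
    result.reserve(corners * out_elements);
    std::vector<std::size_t> idx(n_dims); // reused for every output entry
    for(std::size_t val = 0; val < corners; ++val)
    {
        for(std::size_t e = 0; e < out_elements; ++e)
        {
            for(std::size_t d = 0; d < n_dims; ++d)
            {
                // Assumption: dimension 0 corresponds to the most significant bit.
                const std::size_t bit = (val >> (n_dims - 1 - d)) & 1u;
                idx[d]                = candidates[d][bit][e];
            }
            result.push_back(linear_index(lens, idx));
        }
    }
    return result;
}
```

Because each output entry is computed directly from val and e, no intermediate vector of 2^n_dims * out_elements index vectors has to be built, which is where a memory saving would plausibly come from in a scheme like this.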
cc2d6a0
to
92ef252
src/onnx/parse_resize.cpp
Outdated
dim.push_back(i);
return dim;
});
throw std::runtime_error("Shape dimension " + std::to_string(n_bits) + " exceeds " +
Use the MIGRAPHX_THROW macro to throw the exception. Also, prefix the message with the ONNX operator name (usually they make it all uppercase, like RESIZE:).
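As a rough illustration of the message convention requested here, the sketch below builds an operator-prefixed message. It does not use MIGraphX itself: resize_dim_error and check_dims are hypothetical helpers, max_bits is an assumed limit introduced only for the example, and std::runtime_error stands in for the spot where the message would be handed to MIGRAPHX_THROW inside the library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds an error message prefixed with the operator
// name in uppercase, following the convention the reviewer describes.
std::string resize_dim_error(std::size_t n_bits, std::size_t max_bits)
{
    return "RESIZE: Shape dimension " + std::to_string(n_bits) + " exceeds " +
           std::to_string(max_bits);
}

// Hypothetical check. std::runtime_error is a stand-in so the sketch compiles
// on its own; in MIGraphX the message would be passed to MIGRAPHX_THROW.
void check_dims(std::size_t n_bits, std::size_t max_bits)
{
    if(n_bits > max_bits)
        throw std::runtime_error(resize_dim_error(n_bits, max_bits));
}
```

Keeping the message construction in one place makes the prefix easy to apply consistently across every error path of the parser.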
Will fix in next push.
return dim;
});
std::bitset<std::numeric_limits<std::size_t>::digits> bits_val = val;
std::vector<std::size_t> indices(n_bits);
Is there a commit missing? This should be std::array<std::size_t, std::numeric_limits<std::size_t>::digits> indices;.
Maybe I missed your point. Are you suggesting a fixed-size array of std::numeric_limits<std::size_t>::digits elements instead of a vector? Actually, indices doesn't have to be 64 or 32 entries long; n_bits entries are enough.
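The two positions in this thread can be compared with a small sketch. It assumes the motivation for std::array is to avoid a heap allocation per call, which the thread does not state explicitly; bits_to_indices and max_bits are hypothetical names. The array has fixed capacity, but only the first n_bits entries are written and meant to be read, so n_bits still bounds the work.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <limits>

constexpr std::size_t max_bits = std::numeric_limits<std::size_t>::digits;

// Hypothetical helper: expands the low n_bits bits of val into 0/1 entries.
// The returned array lives on the stack and has fixed capacity max_bits;
// entries at positions >= n_bits stay zero and are not meant to be used.
std::array<std::size_t, max_bits> bits_to_indices(std::size_t val, std::size_t n_bits)
{
    assert(n_bits <= max_bits);
    const std::bitset<max_bits> bits_val(val);
    std::array<std::size_t, max_bits> indices{};
    for(std::size_t d = 0; d < n_bits; ++d)
        indices[d] = bits_val[d] ? 1 : 0;
    return indices;
}
```

A std::vector<std::size_t>(n_bits) uses exactly as many entries as needed but allocates on each construction, whereas the fixed-size array trades a few unused entries for no allocation.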
@pfultz2 I need your comment here; I'd like to make these changes in one shot.
@pfultz2 I have not updated this part. I need your comments.
Re-write calc_neighbor_points() by composing index from binary bits instead of recursion.

With the optimized calc_neighbor_points(), the required CPU time is reduced by about 90% and peak memory utilization is significantly reduced.

Perf. comparison on VM w/ 12-Core EPYC 9V64 + 128 GB mem:

n_dim   out_elements   New t-CPU (us)   Old t-CPU (us)   t-CPU Ratio
-----   ------------   --------------   --------------   -----------
4       786,432        170,377          1,878,299        0.0907
4       1,572,864      383,125          4,009,335        0.0956
4       3,145,728      784,388          7,670,960        0.1023
4       6,291,456      1,567,753        15,095,017       0.1039
4       12,582,912     3,139,452        29,622,921       0.1060
4       25,165,824     6,266,153        58,332,233       0.1074
4       50,331,648     12,517,674       116,923,368      0.1071
4       100,663,296    25,011,425       OOM Kill         N/A

Signed-off-by: Colin Xu <Colin.Xu@amd.com>
Revise based on reviewer comments. Signed-off-by: Colin Xu <Colin.Xu@amd.com>
Revise implementation based on reviewer comments. Update performance comparison accordingly.

+-------+--------------+----------------+----------------+-------------+
| n_dim | out_elements | New t-CPU (us) | Old t-CPU (us) | t-CPU Ratio |
+-------+--------------+----------------+----------------+-------------+
| 4     | 786432       | 120405         | 1494350        | 0.0806      |
| 4     | 1572864      | 282763         | 3826060        | 0.0739      |
| 4     | 3145728      | 650957         | 7941436        | 0.0820      |
| 4     | 6291456      | 1304652        | 14869059       | 0.0877      |
| 4     | 12582912     | 2608523        | 29432326       | 0.0886      |
| 4     | 25165824     | 5175560        | 58848631       | 0.0879      |
| 4     | 50331648     | 10486676       | 118005802      | 0.0889      |
| 4     | 100663296    | 21141464       | OOM Kill       | N/A         |
+-------+--------------+----------------+----------------+-------------+

Signed-off-by: Colin Xu <Colin.Xu@amd.com>
Revise based on reviewer comments. Rebase to develop HEAD. Signed-off-by: Colin Xu <Colin.Xu@amd.com>
92ef252
to
772862d
@pfultz2 let me know if the latest version resolves all review comments.
Re-write calc_neighbor_points() by composing index from binary bits instead of recursion.
With the optimized calc_neighbor_points(), the required CPU time is reduced by about 90% and peak memory utilization is significantly reduced.
Perf. comparison on VM w/ 12-Core EPYC 9V64 + 128 GB mem: