
fewer intrinsics in _mm_movemask_ps #653

Closed
wants to merge 1 commit

Conversation

shinjiogaki

No description provided.

@aqrit
Contributor

aqrit commented Nov 10, 2024

An alternative is to extract the bits using umulh:

// vqmovn_s32 saturates, so each 32-bit lane's sign bit survives as bit 15 of the narrowed lane
// (_mm_movemask_ps takes a __m128, so reinterpret via vreinterpretq_s32_m128)
uint64_t x = vget_lane_s64(vreinterpret_s64_s16(vqmovn_s32(vreinterpretq_s32_m128(a))), 0);
const uint64_t mask = 0x8000800080008000ULL; // sign bits at 15, 31, 47, 63
const uint64_t magic = 0x0002000400080010ULL;
// the widening multiply gathers lane i's sign bit into product bit 64 + i
return (uint8_t)(((__uint128_t)(x & mask) * (__uint128_t)magic) >> 64);

I don't know which is faster.
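
For reference, here is an untested, self-contained sketch of the same umulh idea as a standalone function, using a plain float32x4_t instead of sse2neon's __m128 wrapper (the function name is illustrative, not from this PR):

#include <arm_neon.h>
#include <stdint.h>

static inline int movemask_ps_umulh(float32x4_t a)
{
    // Saturation preserves each 32-bit lane's sign, so after the narrow
    // the four sign bits sit at bits 15, 31, 47 and 63 of x.
    int16x4_t n = vqmovn_s32(vreinterpretq_s32_f32(a));
    uint64_t x = vget_lane_u64(vreinterpret_u64_s16(n), 0);
    const uint64_t mask = 0x8000800080008000ULL;
    // The multiply moves lane i's sign bit to product bit 64 + i
    // (15+49, 31+34, 47+19, 63+4); all other cross terms land either
    // below bit 64 or above bit 71, outside the byte kept by the cast.
    const uint64_t magic = 0x0002000400080010ULL;
    return (uint8_t)(((__uint128_t)(x & mask) * magic) >> 64);
}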

@howjmay
Collaborator

howjmay commented Nov 12, 2024

@shinjiogaki are you interested in giving this a try to see whether it brings a performance improvement?

@aqrit
Contributor

aqrit commented Nov 13, 2024

edit: this is wrong... shrn keeps the low two bytes of each lane, not the top two, so with a shift of 15 the sign bit (at bit 16 after the shift) is dropped
I haven't tested it, but I'm pretty sure the following would be the best __aarch64__ implementation:

uint64_t x = vget_lane_s64(vreinterpret_s64_s16(vshrn_n_s32(vreinterpretq_s32_m128(a), 15)), 0);
return (x * 0x0F001F003F008000ULL) >> 60; // per the edit above, the >>15 narrow loses the sign bit

magic number was adapted from https://zeux.io/2022/09/02/vpexpandb-neon-z3/
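
Building on the self-correction above, a hedged, untested sketch of what the fixed variant could look like: shifting by 16 instead of 15 keeps each 32-bit lane's sign bit as bit 15 of the narrowed lane, and a masked 64-bit multiply then gathers the four bits into bits 60..63. The function name and the magic constant here are derived for this sketch, not taken from the PR:

#include <arm_neon.h>
#include <stdint.h>

static inline int movemask_ps_shrn16(float32x4_t a)
{
    // After the shift-and-narrow, lane i's sign bit is at bit 16*i + 15.
    int16x4_t n = vshrn_n_s32(vreinterpretq_s32_f32(a), 16);
    uint64_t x = vget_lane_u64(vreinterpret_u64_s16(n), 0);
    // Mask the four sign bits, then multiply so that bit 16*i + 15 lands
    // at bit 60 + i (per-lane shifts of 45, 30, 15, 0); all other cross
    // terms stay below bit 60 or overflow past bit 63 and are discarded.
    return (int)(((x & 0x8000800080008000ULL) * 0x0000200040008001ULL) >> 60);
}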
