Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add block key masking #11

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Add block key masking #11

wants to merge 1 commit into from

Conversation

raphaelthegreat
Copy link
Owner

@raphaelthegreat raphaelthegreat commented Mar 13, 2024

Having so much state in the block key leads to cases where a block is compiled multiple times even though the actual emitted code is the same. Not every block uses all of the state provided by the key. To address this each time a block is compiled a mask of used registers is also emitted. This matches well with the switch to the lookup table as the mask is inherently location specific.

Block key state is accessed through methods that automatically mask the appropriate member, so this wont break in the future if we forget to manually mask during block generation. It leads to about a 5x reduction in compiled blocks (from 6000 to 1500 during first 10 seconds if 3D Land intro in Surround Sound mode). The max lookup iteration also dropped from 15 to 7 meaning the dispatcher is now faster too.

Initially I wanted to optimize the key matching using AVX2 intrinsics like so, which generates very good asm, but it seems to cause crashes most of the time launching which I'm not sure why

bool KeyCompare(const Key& lhs, const Key& rhs, u256 mask_lo, u256 mask_hi) {
    const u256* lhsp = (const u256*)&lhs;
    const u256* rhsp = (const u256*)&rhs;
    const u256 l = _mm256_cmpeq_epi16(_mm256_and_si256(lhsp[0], mask_lo), rhsp[0]);
    const u256 r = _mm256_cmpeq_epi16(_mm256_and_si256(lhsp[1], mask_hi), rhsp[1]);
    return (_mm256_movemask_epi8(l) & _mm256_movemask_epi8(r)) == std::numeric_limits<u32>::max();
}

@wheremyfoodat
Copy link
Collaborator

Can you post the generated code here? AVX/2 are more lax on memory alignment requirements but you can still get some crashes from that if the compiler decides to use vmovaps or similar

};

union KeyMask {
std::array<u16, 32> words{};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit scary and I'm not sure if it's 100% standard-legal, might be better to use a constructor?

@raphaelthegreat
Copy link
Owner Author

Here's a godbolt link with a demo https://godbolt.org/z/xP59G38bE

@raphaelthegreat
Copy link
Owner Author

Ignore the fact it has the entire robin map headers, wanted to compare against the map lookup

@wheremyfoodat
Copy link
Collaborator

wheremyfoodat commented Mar 13, 2024

I do not see anything wrong with that and it seems to run fine. Can you try marking the keys as alignas(32) maybe? Though the godbolt link doesn't have any alignment-sensitive instructions...

@raphaelthegreat
Copy link
Owner Author

Can you test locally on windows in case this is a linux specific issue?

@wheremyfoodat
Copy link
Collaborator

Works here, tested with several games
image

@wheremyfoodat
Copy link
Collaborator

wheremyfoodat commented Mar 13, 2024

Locally generated asm. I'd look out in case that vmovdqu is a vmovdqa on your setup
image

@raphaelthegreat
Copy link
Owner Author

Thanks for conforming, will checkout what asm this generates here for clang xD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants