Add block key masking #11
base: master
Can you post the generated code here? AVX and AVX2 are more lax about memory alignment, but you can still get crashes if the compiler decides to use vmovaps or a similar aligned instruction.
};

union KeyMask {
    std::array<u16, 32> words{};
This feels a bit scary and I'm not sure it's 100% standard-legal. Might be better to use a constructor?
Here's a godbolt link with a demo: https://godbolt.org/z/xP59G38bE
Ignore the fact that it includes the entire robin map headers; I wanted to compare against the map lookup.
I don't see anything wrong with that, and it seems to run fine. Can you try marking the keys as alignas(32), maybe? Though the godbolt link doesn't have any alignment-sensitive instructions...
Can you test locally on Windows in case this is a Linux-specific issue?
Thanks for confirming, will check out what asm this generates here for clang xD
Having so much state in the block key leads to cases where a block is compiled multiple times even though the emitted code is identical, because not every block uses all of the state provided by the key. To address this, each time a block is compiled, a mask of the registers it used is emitted as well. This fits well with the switch to the lookup table, as the mask is inherently location-specific.
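To illustrate the matching rule described above, here is a minimal sketch. The names `KeyWords`, `CompiledBlock` and `Matches` are hypothetical and not taken from the PR, and the 32-word layout is only assumed from the `std::array<u16, 32>` visible in the diff. A block is reused when the current state agrees with the stored key on every bit covered by the block's mask:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u16 = std::uint16_t;

// Assumed layout: 32 state words, mirroring std::array<u16, 32> in the diff.
using KeyWords = std::array<u16, 32>;

// Hypothetical cache entry for one compiled block.
struct CompiledBlock {
    KeyWords key{};   // state the block was compiled with
    KeyWords mask{};  // set bits mark the state the block actually read
};

// Returns true when `current` agrees with the stored key on all masked bits.
// State outside the mask is ignored, so it cannot force a recompile.
inline bool Matches(const CompiledBlock& block, const KeyWords& current) {
    for (std::size_t i = 0; i < current.size(); ++i) {
        if (((current[i] ^ block.key[i]) & block.mask[i]) != 0) {
            return false;
        }
    }
    return true;
}
```

With this rule, two states that differ only in registers the block never read map to the same compiled block.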
Block key state is accessed through methods that automatically mask the appropriate member, so this won't break in the future if we forget to mask manually during block generation. It leads to about a 4x reduction in compiled blocks (from 6000 to 1500 during the first 10 seconds of the 3D Land intro in Surround Sound mode). The max lookup iteration count also dropped from 15 to 7, meaning the dispatcher is now faster too.
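A rough sketch of the accessor idea, assuming a hypothetical `TrackedKey` wrapper (the class and method names are illustrative and not the PR's actual API): every read goes through a method that also records the word in the mask, so the mask cannot drift out of sync with what the generator consumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u16 = std::uint16_t;

// Hypothetical wrapper used while generating a single block.
class TrackedKey {
public:
    explicit TrackedKey(const std::array<u16, 32>& state) : state_(state) {}

    // Reading a word marks it as used in the mask as a side effect.
    u16 Read(std::size_t index) {
        mask_[index] = 0xFFFF;
        return state_[index];
    }

    // Mask of every word read so far.
    const std::array<u16, 32>& Mask() const { return mask_; }

    // Copy of the state with all unread words zeroed.
    std::array<u16, 32> MaskedKey() const {
        std::array<u16, 32> out{};
        for (std::size_t i = 0; i < out.size(); ++i) {
            out[i] = static_cast<u16>(state_[i] & mask_[i]);
        }
        return out;
    }

private:
    std::array<u16, 32> state_;
    std::array<u16, 32> mask_{};
};
```

This sketch tracks usage per word; a real key could track individual bits or fields instead.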
Initially I wanted to optimize the key matching using AVX2 intrinsics, which generates very good asm, but it crashes on most launches and I'm not sure why.
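For reference, a sketch of what an alignment-safe AVX2 match could look like. This is not the code from the godbolt link, and the cause of the crash has not been established; it only assumes the alignment concern raised in review. `MatchesMasked` and `KeyWords` are hypothetical names. Using `_mm256_loadu_si256` rather than dereferencing a `__m256i*` avoids promising 32-byte alignment to the compiler, and a scalar fallback is used when AVX2 is not enabled at compile time:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

using u16 = std::uint16_t;
using KeyWords = std::array<u16, 32>;  // 64 bytes, i.e. two 256-bit vectors

// True when `current` and `stored` agree on every bit set in `mask`.
inline bool MatchesMasked(const KeyWords& current, const KeyWords& stored,
                          const KeyWords& mask) {
#if defined(__AVX2__)
    for (std::size_t i = 0; i < current.size(); i += 16) {
        // Unaligned loads: no 32-byte alignment requirement on the arrays.
        const __m256i c = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(current.data() + i));
        const __m256i s = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(stored.data() + i));
        const __m256i m = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(mask.data() + i));
        const __m256i diff = _mm256_xor_si256(c, s);
        // testz returns 1 only when (diff & m) is all zero.
        if (!_mm256_testz_si256(diff, m)) {
            return false;
        }
    }
    return true;
#else
    // Portable fallback with the same semantics.
    for (std::size_t i = 0; i < current.size(); ++i) {
        if (((current[i] ^ stored[i]) & mask[i]) != 0) {
            return false;
        }
    }
    return true;
#endif
}
```

If aligned loads are preferred, the key storage would need `alignas(32)` everywhere it is allocated, including inside containers.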