-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE]: Support fast insert operations for single set/map #475
Comments
Not sure if this is too hacky but couldn't we just instantiate the map with |
Even if all the hash values are distinct, you could still end up with collisions when you do a modulo with the size the storage, right? |
@jrhemstad is correct, which is why the test involves checking for the max of the hash. In my vision, the test for whether it is okay to skip using atomics is that the hash function's I think that it's not a good idea to introduce new functions as above because they will not be easily found. I think that a better way could be to add the For hasher's which do not implement |
That's a brilliant idea. I don't see any big issues for now apart from the "hacky" fact that using thread scope is equivalent to no atomics involved in hash table internals. |
Using the scope to eliminate atomics requires the caller to know whether or not atomics can be eliminated. I don't think that is a good idea. I don't think that we should require the caller to know about the internals of how the hashing is implemented. It will not be discoverable by users and, if we should change how hashing works in the future, it will break clients. I would prefer for the hash function to know whether or not it is perfect and what its maximum is and then let the hash table determine whether or not atomics should be used. This is a better separation of concerns. |
Yes, that's my point. The philosophy is that cuco provides only the core functionality needed for doing hash table building/probing but exposes a set of customization points for users to tune their implementation. Like in your case you can introspect the data, which gives you the ability to define a custom hasher (customization point object), and also lets you determine whether or not atomic updates are required ( I totally see the benefit of your approach, and I'm still impressed by the implementation. The question I'm asking myself is if this should go directly into cuco or rather a separate utility/helper repo. |
I also don't know. But maybe we don't have to solve everything all at once. cuCo already provides multiple hash function. What about providing the "significant bits" hasher? If we can get it working and it proves to be pretty fast, maybe add it to the rest of the hash functions? |
I'm interested in this idea. This hasher is also introspecting the data in order to determine the significant bits, right? If we can press this implementation into a hasher that is consumable by cuco we can give it a try. Maybe open a GH so we can track this discussion separately. |
@esoha-nvidia Daniel and I discussed this offline and we realized that in the case of perfect hashing, a normal hash table like vector[identity_hash(key)] = payload; // insert
auto const found = vector[identity_hash(query_key)]; // find On top of non-atomic insertions, it also gets rid of the expensive probing mechanism thus should be more effective than any hash tables. |
It would be, in my opinion, nicer if cuCo could determine for itself if the hashing is perfect and then do all that behind the scenes. The advantage would be that the user could swap out a regular hash for a perfect hash with ease, by just selecting a different hash function, instead of needing to rewrite so much. Also, users might still want things like In the same way that cuCo has optimizations when the key is of a certain bitwidth, it could also have optimizations for when the hash has certain properties (like being perfect, for example). |
This would be nice to have if we could figure out a way other than doing data introspection during hash construction. A perfect hash function not only affects the hasher itself but also changes the way that probes the container. With that, instead of fiddling around hasher and doing STL-unfriendly operations, I'm considering exposing a new probing scheme called, e.g.
Totally makes sense It would be valuable if you could benchmark the speedups brought by the device vector lookup table (#475 (comment)) in your case. It represents the SOL performance of perfect hashing IMO |
That's a great idea! The signature could look like this: template <class Hash>
class perfect_probing {
static constexpr int32_t cg_size = 1; // There's no point in using a CG size larger than 1
// ...
}; The probing iterator returned by the scheme's A user must ensure that they pass a perfect hash function for the given storage capacity; otherwise, behavior is undefined. With this probing scheme, we eliminate the unnecessary complex probing code path and also omit the expensive modulo operator. |
Perfect hashing without data introspection? That should be impossible unless your hash function is the identity function, right? Anyway, it is not the concern of cuCo which hash function the user chooses. cuCo should be able to work with any valid hash function! A new probing scheme seems like a good solution. Should it also affect the equality operation? Under perfect hashing, you don't need to check for equality, right? The three things that the user must do in order to elide the use of atomics:
The user can ensure the first two but what about the third? Maybe cuCo can provide a function for this? BenchmarksHere we compare a groupby operation with and without perfect hashing: The groupby operation in SQL, which uses hashing, has improved from 7.448ms to 5.351ms, an improvement of 2.097ms which is 39% faster overall. Examining just the insert, we see that it has gone from 5.060ms to 3.072ms, which is basically the vast majority of that 2.097ms. (The rest is just noise in measurement.) This is a 64.7% improvement in hash insertion time. Here is it for lookup: The version with atomics has perfect hashing disabled and takes 943.444us. The one without atomics has perfect hashing enabled and takes 817.397us. Here the advantage comes not from eliding the atomics, which are not used during lookup, but the ability to elide the equality operator. The equality operators requires a dereference and reading of additional memory when cuCo is used in "row hasher" mode, which is what my code is does and also what cuDF does. This is a more modest gain of just 15%. |
BTW, one problem with this approach is that it is a trick. The need for this is going to be non-obvious to users. And you're depending on the fact that Just put in each probing scheme a method It goes without saying that all this must be a branch at compile time, right? The insert kernel should be templated on whether or not there is perfect hashing. You can't do that branch in the kernel or it will rob you of the performance gains! |
Agreed. cuco should take care of the thread scope internally for perfect hashing. Also, considering |
Right, that's another thing we have to figure out. In cuco we pass the
Currently, cuco requires There's one more problem which I'm trying to wrap my head around. If the user specifies |
Oh, one more thing I forgot to mention: How do we handle heterogeneous insert/lookup (blog post) in this case? We should disable it altogether, right? There are so many things to look out for to get this feature going -.- |
|
I don't think that you need key equal during the lookup. If the slot is not empty, it's a match. Heterogeneous lookup should work fine. The data is unaffected by the removal of the use of atomics, right? We still read and write the same bits, we just do it without atomics because we have confidence that there will be no collisions. |
💡 Oh, perfect hashing is even more perfect than I thought |
Yes it is! Perfect hashing means that |
I added a proposal for the perfect hashing probing scheme in #487. Let me know if I missed anything. |
Closing this as surpassed by #487 |
Is your feature request related to a problem? Please describe.
Currently, cuco hash tables always use atomic CAS to for table insertions.
This could introduce unnecessary costs in the case of perfect hashing where we know the hash values of all inputs are distinct and hash collisions would never happen.
Describe the solution you'd like
Inspired by @esoha-nvidia's comments on perfect hashing, we could pass an
is_perfect
/is_perfect_hashing
argument toinsert
functions thus andis_perfect == true
, the input element is copied to the target bucket directly. Otherwise, it's atomically CASed to the desired bucket.Describe alternatives you've considered
Apart from adding an extra parameter
is_perfect
to existing APIs, we could also introduce a new API instead.The host-bulk API could look like:
The device API:
Additional context
Would perfect hashing be a valid use case for multset/multimap as well? To avoid jumping into the black hole directly, we probably just focus on
static_set
/static_map
for now.The text was updated successfully, but these errors were encountered: