Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fix: native hashed-hash for MutableDictionaryArray #1564

Merged
merged 4 commits into from
Sep 6, 2023

Conversation

aldanor
Copy link
Contributor

@aldanor aldanor commented Sep 5, 2023

Here's yet another bugfix / refactor of mutable dictionary arrays (due to incorrect solution in #1561)... tested it on some arrays of 100+M size, this time it seems like it actually works correctly.

It was previously correct on small maps, but as soon as it would resize itself, hash consistency would vanish.

There's a somewhat flimsy part (or flimsy invariant rather) in assuming that hash of a u64 hash using HashHasherNative will be identical to the hash itself if it's built in an endian-dependent way. I've tested it on M1 (LE) and it's correct, I'm guessing it will be correct on BE as well.

Not using the hash_hasher::HashHasher directly because it's too slow on little-endian platforms as it flips the endian needlessly (for our use case, that is). We can bypass that step entirely (except for internal hashmap resizing), which avoids huge slowdown.

// @ritchie46

Fixes #1562

fn write(&mut self, value: &[u8]) {
#[cfg(target_endian = "little")]
{
for (index, item) in value.iter().enumerate().take(8) {
Copy link
Contributor Author

@aldanor aldanor Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also sort of an overkill tbh... because we know that value.len() == 8 always (or rather, "almost surely"...)

@codecov
Copy link

codecov bot commented Sep 5, 2023

Codecov Report

Patch coverage: 89.18% and project coverage change: -0.01% ⚠️

Comparison is base (87ab844) 83.02% compared to head (4ba4f1f) 83.02%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1564      +/-   ##
==========================================
- Coverage   83.02%   83.02%   -0.01%     
==========================================
  Files         391      391              
  Lines       42786    42809      +23     
==========================================
+ Hits        35523    35542      +19     
- Misses       7263     7267       +4     
Files Changed Coverage Δ
src/array/dictionary/value_map.rs 89.62% <89.18%> (-1.95%) ⬇️

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aldanor
Copy link
Contributor Author

aldanor commented Sep 6, 2023

(Forgot to clarify: the added test currently fails on main.)

@aldanor aldanor changed the title Native hashed-hash for MutableDictionaryArray Fix: native hashed-hash for MutableDictionaryArray Sep 6, 2023
@ritchie46
Copy link
Collaborator

@aldanor I believe we can make it a little bit simple by having an IdHashMap

type IdHashmap = hashbrown::HashMap<K, V, IdBuildHasher>;
type IdBuildHasher = BuildHasherDefault<IdHasher>;

where IdHasher is only ever called with u64

pub struct IdHasher {
    hash: u64,
}

impl Hasher for IdHasher {
    fn finish(&self) -> u64 {
        self.hash
    }
    #[inline]
    fn write_u64(&mut self, i: u64) {
        self.hash = i;
    }
}

And then we have a Key type:

// hash + value
#[derive(Eq, Copy, Clone)]
struct Key<T: Copy> {
    hash: u64,
    value: T,
}

impl<T: Copy> Hash for Key<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.hash)
    }
}

impl<T: Copy + PartialEq> PartialEq for Key<T> {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value
    }
}

And then we are responsible for initializing the Key with the hashed value and the rest is done by the hash map.

@aldanor
Copy link
Contributor Author

aldanor commented Sep 6, 2023

@ritchie46 Oh, very good point indeed, I totally forgot about the fact that write_u64() exists.

value == T is not going to work as is though, indexed values can have lifetimes (e.g. strings), and now you will need to not only link the 3 types (or rather, families of types, e.g. AsRef<str>) "things we can push", "things we can index", "connecting things we can push to things we can index", but also "things we can push or index connecting to an owned type that we can store aside that is hashable and equality-comparable to things we can push and index". The trait hierarchy is already pretty ugly... (look at AsIndexed, Indexable stopgaps).

@aldanor
Copy link
Contributor Author

aldanor commented Sep 6, 2023

(updated with passthrough hasher that uses write_u64())

@ritchie46
Copy link
Collaborator

value == T is not going to work as is though, indexed values can have lifetimes (e.g. strings),

I think we can do with using the index. That's what we do in polars for strings.

#[derive(Copy, Clone)]
struct Key {
    pub(super) hash: u64,
    pub(super) idx: u32,
}

And then the comparison is something like this:

    table.raw_entry_mut().from_hash(h, |key| {
        // first compare the hash before we incur the cache miss
        key.hash == h && {
            let idx = key.idx as usize;
            unsafe { keys.get_unchecked_release(idx).as_deref() == key_val }
        }
    })

@aldanor
Copy link
Contributor Author

aldanor commented Sep 6, 2023

@ritchie46 That's almost exactly what we already have here :) (aside from additional hash equality check which I can add too if it improves things; and with key being of K type)

// it's actually quite hilarious, I think, that I had to go through all circles of hell in the last prs here, introduce more bugs while fixing previous bugs, then optimize it while fixing them and finally arrive to pretty much precisely the polars solution 😆

@ritchie46
Copy link
Collaborator

@ritchie46 That's almost exactly what we already have here :) (aside from additional hash equality check which I can add too if it improves things; and with key being of K type)

// it's actually quite hilarious, I think, that I had to go through all circles of hell in the last prs here, introduce more bugs while fixing previous bugs, then optimize it while fixing them and finally arrive to pretty much precisely the polars solution 😆

I see now. Aside from the naming very similar indeed. :) I have left one comment about the assert. Other than that I think this is good to go. 👍

@ritchie46 ritchie46 merged commit 767834e into jorgecarleitao:main Sep 6, 2023
24 of 25 checks passed
@aldanor aldanor deleted the feature/mutable-dict-fixes branch September 6, 2023 16:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MutableDictionaryArray - another rewrite needed
2 participants