Allows concatenation to ignore validation and adds method on table to… #63

Merged
akoumjian merged 5 commits into main on Sep 17, 2024

Conversation

akoumjian
Contributor

… separate valid from invalid rows

@akoumjian akoumjian requested a review from moeyensj September 17, 2024 14:37
Comment on lines +967 to +970
mask = np.zeros(num_rows, dtype=bool)
for name, validator in self._column_validators.items():
    indices, _ = validator.failures(self.table.column(name))
    mask[indices.to_numpy()] = True
Member

Numpy just makes things so easy sometimes. I was curious to see how you were going to do this with pyarrow compute functions.

Contributor Author

I implemented it both ways and this appeared the more efficient (both compute and memory) of the two.
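
For readers following the thread, here is a minimal sketch of the NumPy mask-building approach from the commented lines. It is not quivr's code: `failing_indices`, the `columns` dict, and the lambda validators are hypothetical stand-ins for `validator.failures(...)` and the table's columns, chosen only so the example runs on its own.

```python
import numpy as np


def failing_indices(values, predicate):
    """Hypothetical stand-in for validator.failures(): row indices that fail."""
    values = np.asarray(values)
    return np.flatnonzero(~predicate(values))


# Hypothetical columns and per-column validators (assumptions for illustration).
columns = {
    "x": np.array([1, -2, 3, 4]),
    "y": np.array([0.5, 0.1, 2.0, 0.2]),
}
validators = {
    "x": lambda v: v > 0,      # x must be positive
    "y": lambda v: v <= 1.0,   # y must not exceed 1.0
}

num_rows = 4

# Start with "no failures", then mark every row that fails any validator.
mask = np.zeros(num_rows, dtype=bool)
for name, predicate in validators.items():
    mask[failing_indices(columns[name], predicate)] = True

# The mask and its negation separate invalid rows from valid ones.
invalid_rows = np.flatnonzero(mask)
valid_rows = np.flatnonzero(~mask)
```

Assigning `True` through integer indices is idempotent, so a row that fails several validators is still flagged once, and only a single boolean array of length `num_rows` is allocated regardless of the number of validators.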

@@ -1116,7 +1140,7 @@ def _encode_attr_dict(cls, attrs: dict[str, Any]) -> dict[bytes, bytes]:
             result[k.encode("utf8")] = descriptor.to_bytes(pytyped)
         return result
 
-    def apply_mask(self, mask: pa.BooleanArray | np.ndarray[bool, Any] | list[bool]) -> Self:
+    def apply_mask(self, mask: pa.BooleanArray | npt.NDArray[np.bool_] | list[bool]) -> Self:
Member

Nice, there are already tests for these three types:

quivr/test/test_tables.py

Lines 1051 to 1080 in f7c92ae

def test_apply_mask_numpy():
    values = Pair.from_kwargs(x=[1, 2, 3], y=[4, 5, 6])
    mask = np.array([True, False, True])
    have = values.apply_mask(mask)
    np.testing.assert_array_equal(have.x, [1, 3])


def test_apply_mask_pylist():
    values = Pair.from_kwargs(x=[1, 2, 3], y=[4, 5, 6])
    mask = [True, False, True]
    have = values.apply_mask(mask)
    np.testing.assert_array_equal(have.x, [1, 3])


def test_apply_mask_pyarrow():
    values = Pair.from_kwargs(x=[1, 2, 3], y=[4, 5, 6])
    mask = pa.array([True, False, True], pa.bool_())
    have = values.apply_mask(mask)
    np.testing.assert_array_equal(have.x, [1, 3])


def test_apply_mask_wrong_size():
    values = Pair.from_kwargs(x=[1, 2, 3], y=[4, 5, 6])
    mask = [True, False]
    with pytest.raises(ValueError):
        values.apply_mask(mask)
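
As a rough illustration of what the updated annotation expresses, the sketch below types a mask parameter with `npt.NDArray[np.bool_]`, NumPy's alias for an `ndarray` whose dtype is `np.bool_`. The helper `normalize_mask` is hypothetical and is not how quivr implements `apply_mask`; the `pa.BooleanArray` branch is left out so the example depends only on NumPy.

```python
from __future__ import annotations

import numpy as np
import numpy.typing as npt


def normalize_mask(
    mask: npt.NDArray[np.bool_] | list[bool], num_rows: int
) -> npt.NDArray[np.bool_]:
    """Hypothetical helper: coerce a mask to a 1-D boolean array of the right length."""
    arr = np.asarray(mask, dtype=np.bool_)
    if arr.ndim != 1 or arr.shape[0] != num_rows:
        raise ValueError(
            f"mask has {arr.size} elements but the table has {num_rows} rows"
        )
    return arr


x = np.array([1, 2, 3])

# Both accepted input forms select the same rows.
kept_from_array = x[normalize_mask(np.array([True, False, True]), len(x))]
kept_from_list = x[normalize_mask([True, False, True], len(x))]
```

The length check mirrors the behavior exercised by `test_apply_mask_wrong_size`, where a two-element mask against a three-row table is expected to raise `ValueError`.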

@akoumjian akoumjian merged commit 8aafd94 into main Sep 17, 2024
2 checks passed