clib.conversion._to_numpy: Add tests for pandas.Series with pandas string dtype #3607

Merged
merged 8 commits into from
Nov 15, 2024

Conversation

@seisman seisman commented Nov 10, 2024

Description of proposed changes

Add tests for pandas.Series with string dtype. Six cases are tested:

  1. dtype=None
  2. dtype=np.str_
  3. dtype="U10"
  4. dtype="string[python]"
  5. dtype="string[pyarrow]"
  6. dtype="string[pyarrow_numpy]"

None of these can be converted to np.str_ directly. Cases 4-6 can be fixed by 01ba317, and cases 1-3 can be fixed by dac7e8e.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: x = pd.Series(["abc", "defg", "12345"], dtype=None)

In [4]: x.dtype
Out[4]: dtype('O')

In [5]: np.ascontiguousarray(x)
Out[5]: array(['abc', 'defg', '12345'], dtype=object)

In [6]: x = pd.Series(["abc", "defg", "12345"], dtype=np.str_)

In [7]: x.dtype
Out[7]: dtype('O')

In [8]: np.ascontiguousarray(x)
Out[8]: array(['abc', 'defg', '12345'], dtype=object)

In [9]: x = pd.Series(["abc", "defg", "12345"], dtype="U10")

In [10]: x.dtype
Out[10]: dtype('O')

In [11]: x = pd.Series(["abc", "defg", "12345"], dtype="string[python]")

In [12]: x.dtype
Out[12]: string[python]

In [13]: str(x.dtype)
Out[13]: 'string'

In [14]: np.ascontiguousarray(x)
Out[14]: array(['abc', 'defg', '12345'], dtype=object)

In [15]: x = pd.Series(["abc", "defg", "12345"], dtype="string[pyarrow]")

In [16]: x.dtype
Out[16]: string[pyarrow]

In [17]: str(x.dtype)
Out[17]: 'string'

In [18]: np.ascontiguousarray(x)
Out[18]: array(['abc', 'defg', '12345'], dtype=object)

In [19]: x = pd.Series(["abc", "defg", "12345"], dtype="string[pyarrow_numpy]")

In [20]: x.dtype
Out[20]: string[pyarrow_numpy]

In [21]: str(x.dtype)
Out[21]: 'string'

In [22]: np.ascontiguousarray(x)
Out[22]: array(['abc', 'defg', '12345'], dtype=object)
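
The session above shows that all three pandas string dtypes report `str(x.dtype) == 'string'` yet come out of `np.ascontiguousarray` as object arrays. One way to get `np.str_` instead is to look up a target NumPy dtype by that name. The following is a minimal sketch of that idea only: the `dtypes` mapping and the `to_str_array` helper are hypothetical names used for illustration, not the code from 01ba317.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from a dtype's string name to a target NumPy dtype.
dtypes = {"string": np.str_}


def to_str_array(data):
    """Return a C-contiguous array, mapping the 'string' dtype name to np.str_."""
    # Objects without a dtype attribute fall back to "", which maps to None,
    # so NumPy infers the dtype as usual.
    vec_dtype = str(getattr(data, "dtype", ""))
    return np.ascontiguousarray(data, dtype=dtypes.get(vec_dtype))


x = pd.Series(["abc", "defg", "12345"], dtype="string[python]")
result = to_str_array(x)
print(result.dtype.kind)  # kind "U" means a NumPy Unicode string array
```

The example uses `string[python]` so that it runs without pyarrow installed; the pyarrow-backed variants report the same `'string'` name in the session above.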

@seisman seisman added the maintenance Boring but important stuff for the core devs label Nov 10, 2024
@seisman seisman added this to the 0.14.0 milestone Nov 10, 2024
@seisman seisman added needs review This PR has higher priority and needs review. skip-changelog Skip adding Pull Request to changelog labels Nov 10, 2024
@@ -1475,7 +1475,7 @@ def virtualfile_from_vectors(
 # 2 columns contains coordinates like longitude, latitude, or datetime string
 # types.
 for col, array in enumerate(arrays[2:]):
-    if pd.api.types.is_string_dtype(array.dtype):
+    if np.issubdtype(array.dtype, np.str_):

@seisman seisman Nov 10, 2024

The check replaced here was originally added in #684 to support string arrays with np.object_ dtype. It is no longer necessary after dac7e8e, because the arrays have already been processed by the _to_numpy function when vectors_to_arrays is called at line 1471.
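
To make the difference between the two checks concrete, here is a small sketch comparing how they treat an object dtype and a fixed-width Unicode dtype. The variable names are illustrative assumptions; only `pd.api.types.is_string_dtype` and `np.issubdtype` come from the diff.

```python
import numpy as np
import pandas as pd

object_dtype = np.dtype(object)
unicode_dtype = np.array(["abc", "defg"]).dtype  # fixed-width Unicode dtype

# Given a bare dtype (not an array), pandas accepts object dtype as a
# possible string dtype.
pandas_accepts_object = pd.api.types.is_string_dtype(object_dtype)

# NumPy's check only matches genuine Unicode string dtypes.
numpy_accepts_object = np.issubdtype(object_dtype, np.str_)
numpy_accepts_unicode = np.issubdtype(unicode_dtype, np.str_)

print(pandas_accepts_object, numpy_accepts_object, numpy_accepts_unicode)
```

Once every string column has already been converted to `np.str_` upstream, the stricter NumPy check is enough to pick out the string columns.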

@@ -1506,9 +1506,9 @@ def virtualfile_from_vectors(
     strings = string_arrays[0]
 elif len(string_arrays) > 1:
     strings = np.array(
-        [" ".join(vals) for vals in zip(*string_arrays, strict=True)]
+        [" ".join(vals) for vals in zip(*string_arrays, strict=True)],
+        dtype=np.str_,

Specifying dtype is not necessary here, but I think it's good to state explicitly that we expect an np.str_ array.

@seisman seisman marked this pull request as ready for review November 10, 2024 08:53
@seisman seisman added needs review This PR has higher priority and needs review. and removed needs review This PR has higher priority and needs review. labels Nov 10, 2024
@michaelgrund michaelgrund added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Nov 12, 2024
@@ -175,6 +179,11 @@ def _to_numpy(data: Any) -> np.ndarray:
 else:
     vec_dtype = str(getattr(data, "dtype", ""))
     array = np.ascontiguousarray(data, dtype=dtypes.get(vec_dtype))

+# Check if a np.object_ array can be converted to np.str_.

This is necessary to support a pd.Series of strings such as:

x = pd.Series(["abc", "defg", "12345"], dtype=None)
x = pd.Series(["abc", "defg", "12345"], dtype=np.str_)
x = pd.Series(["abc", "defg", "12345"], dtype="U10")
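
The diff excerpt shows only the comment for this step, not the code under it. The sketch below is an assumption about what such a fallback could look like: a hypothetical `object_to_str` helper that attempts the conversion and keeps the original array if NumPy raises.

```python
import numpy as np


def object_to_str(array):
    """Try to convert an object array to np.str_; leave other arrays untouched.

    Hypothetical helper for illustration only, not the code from dac7e8e.
    """
    if array.dtype == np.object_:
        try:
            return np.ascontiguousarray(array, dtype=np.str_)
        except (TypeError, ValueError):
            # Keep the object array if the conversion is rejected.
            return array
    return array


obj = np.array(["abc", "defg", "12345"], dtype=object)
converted = object_to_str(obj)
numbers = object_to_str(np.array([1.0, 2.0]))

print(converted.dtype.kind, numbers.dtype)
```

All three Series listed above yield `dtype('O')` in the session in the PR description, so they would reach the object-array branch of a fallback like this.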

     )
-strings = np.asanyarray(a=strings, dtype=np.str_)

Again, this line was added in PR #684 and is no longer needed after dac7e8e.

@seisman seisman requested a review from weiji14 November 15, 2024 03:14
@seisman seisman merged commit 3d08919 into main Nov 15, 2024
17 of 21 checks passed
@seisman seisman deleted the to_numpy/pandas_string branch November 15, 2024 04:02
@seisman seisman removed the final review call This PR requires final review and approval from a second reviewer label Nov 15, 2024
Labels
maintenance Boring but important stuff for the core devs skip-changelog Skip adding Pull Request to changelog
3 participants