Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyarrow: Check compatibility of pyarrow.array with string type #2933

Merged
merged 30 commits into from
Nov 15, 2024

Conversation

weiji14
Copy link
Member

@weiji14 weiji14 commented Dec 30, 2023

Description of proposed changes

Check that pyarrow.array objects of string type can be passed to PyGMT modules/functions.

This PR also adds type hints to the docstring of the pygmt.Figure.text's text parameter.

Preview at https://pygmt-dev--2933.org.readthedocs.build/en/2933/api/generated/pygmt.Figure.text.html

TODO:

  • Initial compatibility check of pyarrow.array by adding parametrized tests to pygmt.text
  • Add parametrized test to test_clib_virtualfile_from_vectors.py
  • Check that pa.StringArray can be passed into pygmt.Figure.text

Addresses #2800 (comment), part of #2800.

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

Ensure that pyarrow.array objects with string type can be read by pygmt.text.
@weiji14 weiji14 added the maintenance Boring but important stuff for the core devs label Dec 30, 2023
@weiji14 weiji14 added this to the 0.11.0 milestone Dec 30, 2023
@weiji14 weiji14 self-assigned this Dec 30, 2023
@seisman seisman modified the milestones: 0.11.0, 0.12.0 Feb 1, 2024
@seisman seisman removed this from the 0.12.0 milestone Feb 26, 2024
Convert a pyarrow.StringArray via a Python list to a ctypes array in the strings_to_ctypes_array function. Updated docstrings and type hints in `clib.Session.put_strings` method and `clib.conversion.strings_to_ctypes_array` function. Added two parametrized unit tests to ensure that pyarrow.StringArray can be passed into the clib methods.
@weiji14 weiji14 changed the title pyarrow: Check compatibility of pyarrow.array with string type pyarrow: Support pyarrow array with string type as input Oct 11, 2024
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
@@ -263,14 +267,15 @@ def sequence_to_ctypes_array(
return (ctype * size)(*sequence)


def strings_to_ctypes_array(strings: Sequence[str]) -> ctp.Array:
def strings_to_ctypes_array(strings: Sequence[str] | pa.StringArray) -> ctp.Array:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to figure out how to fix the mypy type errors for an optional import 🤔

pygmt/clib/conversion.py:298: error: Item "Sequence[str]" of "Sequence[str] | Any" has no attribute "to_pylist"  [union-attr]
pygmt/clib/session.py:1004: error: Item "Sequence[str]" of "Sequence[str] | Any" has no attribute "dtype"  [union-attr]

pygmt/clib/conversion.py Outdated Show resolved Hide resolved
@weiji14 weiji14 added the run/benchmark Trigger the benchmark workflow in PRs label Oct 11, 2024
So that the pyarrow.StringArray type hint will show up on Readthedocs and the PyGMT docs page.
@weiji14 weiji14 added this to the 0.14.0 milestone Oct 11, 2024
Copy link

codspeed-hq bot commented Oct 11, 2024

CodSpeed Performance Report

Merging #2933 will not alter performance

Comparing pyarrow/string (ccf4eff) with main (40edcf7)

Summary

✅ 99 untouched benchmarks

🆕 8 new benchmarks
⁉️ 2 dropped benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main pyarrow/string Change
⁉️ test_put_strings 3.5 ms N/A N/A
🆕 test_put_strings[pyarrow] N/A 3.5 ms N/A
🆕 test_put_strings[str] N/A 3.5 ms N/A
🆕 test_virtualfile_from_vectors_one_string_or_object_column[object] N/A 8.4 ms N/A
🆕 test_virtualfile_from_vectors_one_string_or_object_column[pyarrow] N/A 8.5 ms N/A
🆕 test_virtualfile_from_vectors_one_string_or_object_column[str] N/A 8.4 ms N/A
⁉️ test_text_multiple_lines_of_text 14.2 ms N/A N/A
🆕 test_text_multiple_lines_of_text[list] N/A 14.2 ms N/A
🆕 test_text_multiple_lines_of_text[numpy] N/A 14.2 ms N/A
🆕 test_text_multiple_lines_of_text[pyarrow] N/A 14.3 ms N/A

@weiji14 weiji14 removed the run/benchmark Trigger the benchmark workflow in PRs label Oct 11, 2024
.github/workflows/ci_docs.yml Outdated Show resolved Hide resolved
pygmt/clib/conversion.py Outdated Show resolved Hide resolved
weiji14 and others added 2 commits October 11, 2024 20:40
Faster for longer string arrays.

Co-authored-by: Dongdong Tian <seisman.info@gmail.com>
Still need to work on Duration and GeoArrow geometry types.
@weiji14 weiji14 added the run/benchmark Trigger the benchmark workflow in PRs label Oct 11, 2024
Accidentally removed from `test_virtualfile_from_vectors_one_string_or_object_column` during merge conflict handling.
@seisman
Copy link
Member

seisman commented Nov 15, 2024

With PR #3619 and #3608 merged, pyarrow arrays with string type now can be correctly converted to numpy string array. So, we don't need to add support of pyarrow arrays in the low-level functions like Session.put_strings. Changes in pygmt/_typing.py, pygmt/clib/conversion.py and pygmt/clib/session.py can be reverted, so that this PR only focuses on adding tests for pyarrow string arrays.

@@ -8,12 +8,31 @@
from pygmt import clib
from pygmt.exceptions import GMTCLibError
from pygmt.helpers import GMTTempFile
from pygmt.helpers.testing import skip_if_no
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to revert changes in this PR, since the low-level function Session.put_strings only accepts numpy.str_ and everything else should be converted to numpy.str_ by the _to_numpy function before calling put_strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, just to document things, the way we are intending pyarrow.string inputs to go through high-level modules like fig.text is:

  1. pygmt.Figure.text -> clib.Session.virtualfile_in -> clib.Session.virtualfile_from_vectors
  2. In clib.Session.virtualfile_from_vectors:
    • First call clib.conversion.vectors_to_arrays -> clib.conversion._to_numpy -> np.ascontiguous_array to convert pyarrow.string to np.str_
    • With a list[np.str_], this is then passed to clib.Session.put_strings

For the record, the original intention in this PR, mostly in commit d379e46, was to let clib.Session.put_strings handle both np.str_ and pyarrow.string.

@weiji14 weiji14 changed the title pyarrow: Support pyarrow array with string type as input pyarrow: Check compatibility of pyarrow.array with string type Nov 15, 2024
@@ -137,3 +143,37 @@ def test_open_virtual_file():
bounds = "\t".join([f"<{col.min():.0f}/{col.max():.0f}>" for col in data.T])
expected = f"<matrix memory>: N = {shape[0]}\t{bounds}\n"
assert output == expected


@pytest.mark.benchmark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was moved to test_clib_virtualfile_from_vectors in #3512.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved in 6ad6eb9

Include pyarrow.StringArray besides Sequence[str] or np.ndarray.
The test_virtualfile_from_vectors_one_string_or_object_column unit test was moved to test_clib_virtualfile_from_vectors.py. Xref #3512
Comment on lines 61 to 66
pytest.param(
getattr(pa, "array", None),
{}, # pa.string()
marks=skip_if_no(package="pyarrow"),
id="pyarrow",
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible solution to avoid the complicated param is:

try:
    import pyarrow as pa
    pa_array = pa.array
except ImportError:
    pa_array = None

@pytest.mark.parametrize(
    ("array_func", "dtype"),
    [
        pytest.param(np.array, {"dtype": np.str_}, id="str"),
        pytest.param(np.array, {"dtype": np.object_}, id="object"),
        pytest.param(pa_array, {}, marks=skip_if_no(package="pyarrow"), id="pyarrow"),
    ]
)

Copy link
Member Author

@weiji14 weiji14 Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done at 85d2bb5 8172102

weiji14 and others added 2 commits November 15, 2024 15:15
@weiji14 weiji14 marked this pull request as ready for review November 15, 2024 02:33
@seisman
Copy link
Member

seisman commented Nov 15, 2024

It's likely that pytest-codspeed v3.0 breaks our Benchmark workflow. For comparison, see the CI job on the main branch yesterday (https://github.com/GenericMappingTools/pygmt/actions/runs/11830721578/job/32964716012) and this PR (https://github.com/GenericMappingTools/pygmt/actions/runs/11849193208/job/33022039738?pr=2933).

Edit: Will explore the issue in a separate PR.

@seisman seisman added skip-changelog Skip adding Pull Request to changelog and removed run/benchmark Trigger the benchmark workflow in PRs skip-changelog Skip adding Pull Request to changelog labels Nov 15, 2024
@seisman seisman merged commit d982275 into main Nov 15, 2024
20 of 27 checks passed
@seisman seisman deleted the pyarrow/string branch November 15, 2024 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants