Support Arrow PyCapsule Interface #3568

Open
kylebarron opened this issue Sep 4, 2024 · 7 comments
Labels
enhancement needs-info vega: vegafusion Requires upstream/integration action w/ `vegafusion`

Comments

@kylebarron

kylebarron commented Sep 4, 2024

What is your suggestion?

👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring the use of pyarrow, but I'm also excited about the prospect of an ecosystem that can share data only by the presence of dunder methods, where producer and consumer don't have to have prior knowledge of each other.

This would allow Altair to work out of the box with any Arrow-based object that supports this interface.
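
A minimal sketch of what "sharing data only by the presence of dunder methods" can look like on the consumer side. The `__arrow_c_stream__(requested_schema=None)` method comes from the PyCapsule Interface specification; the `ArrowStreamExportable` protocol class, `can_export_arrow_stream`, and the two toy classes are hypothetical names used purely for illustration, and the toy producer does not return a real PyCapsule.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    """Hypothetical name: anything exposing the PyCapsule stream dunder."""

    def __arrow_c_stream__(self, requested_schema: Any = None) -> Any: ...


class ToyProducer:
    """Stand-in for an Arrow-backed library object (illustration only)."""

    def __arrow_c_stream__(self, requested_schema: Any = None) -> Any:
        # A real producer would return a PyCapsule named "arrow_array_stream".
        raise NotImplementedError("placeholder for a real Arrow export")


class PlainObject:
    """An object with no Arrow export support."""


def can_export_arrow_stream(obj: Any) -> bool:
    # The consumer needs no knowledge of the producer's library:
    # it only checks that the dunder method is present.
    return isinstance(obj, ArrowStreamExportable)
```

The check never imports the producer's library, which is the point of the protocol: the consumer only asks whether the method exists.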

I've been working to promote the PyCapsule Interface across the ecosystem, with many libraries having adopted support so far.

Given that altair already has an optional dependency on pyarrow, the easiest implementation would be a simple addition in here:

altair/altair/utils/data.py

Lines 417 to 434 in 5207768

def arrow_table_from_dfi_dataframe(dfi_df: DataFrameLike) -> pa.Table:
    """Convert a DataFrame Interchange Protocol compatible object to an Arrow Table."""
    import pyarrow as pa

    # First check if the dataframe object has a method to convert to arrow.
    # Give this preference over the pyarrow from_dataframe function since the object
    # has more control over the conversion, and may have broader compatibility.
    # This is the case for Polars, which supports Date32 columns in direct conversion
    # while pyarrow does not yet support this type in from_dataframe
    for convert_method_name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
        convert_method = getattr(dfi_df, convert_method_name, None)
        if callable(convert_method):
            result = convert_method()
            if isinstance(result, pa.Table):
                return result

    pi = import_pyarrow_interchange()
    return pi.from_dataframe(dfi_df)

to first call

if hasattr(dfi_df, "__arrow_c_stream__"):
    # pa.table() will automatically check for `__arrow_c_stream__` and call that
    # todo: add pyarrow version check; I forget which version added support for the PyCapsule Interface
    return pa.table(dfi_df)
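
One way the version check mentioned in the todo could be sketched. The minimum version used here (14.0) is an assumption that should be verified against the pyarrow release notes, and `parse_major_minor` and `pyarrow_supports_pycapsule` are hypothetical helper names, not part of Altair or pyarrow.

```python
from typing import Tuple

# Assumption: PyCapsule Interface support first shipped in pyarrow 14.0.
# Verify against the pyarrow release notes before relying on this value.
ASSUMED_MIN_PYARROW = (14, 0)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '14.0.1'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def pyarrow_supports_pycapsule(
    version: str, minimum: Tuple[int, int] = ASSUMED_MIN_PYARROW
) -> bool:
    """Hypothetical helper: True if ``version`` is at least ``minimum``."""
    return parse_major_minor(version) >= minimum
```

In practice this would be called as `pyarrow_supports_pycapsule(pa.__version__)` before taking the `pa.table(dfi_df)` branch.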

This would also allow you to remove your polars-specific code, because polars implements the Arrow PyCapsule interface (pola-rs/polars#17676) as of Polars 1.3.

Alternatively, this interface would enable you to accept Arrow input data without a pyarrow dependency, if that's attractive.

I figure @MarcoGorelli also has opinions about this given #3445. Narwhals also supports PyCapsule Interface export: narwhals-dev/narwhals#786.

Have you considered any alternative solutions?

Altair already supports the DataFrame Interchange Protocol, but that is not a direct replacement for the PyCapsule Interface. The PyCapsule Interface is much easier to implement for Arrow-based libraries and allows zero-copy data exchange with very little overhead. There are many libraries that would implement the PyCapsule Interface without wanting to go through the trouble of implementing the DataFrame Interchange Protocol.

Also relevant is that vegafusion is planning to adopt this, notwithstanding a Rust technical issue vega/vegafusion#501

@MarcoGorelli
Contributor

Hi @kylebarron !

I'm certainly interested in doing what I can to facilitate this

> This would also allow you to remove your polars-specific code, because polars implements the Arrow PyCapsule interface (pola-rs/polars#17676) as of Polars 1.3.

Could you show how this would work please? There's only one teeny-tiny Polars-specific piece of code in Altair, and it's not clear to me how the Arrow C Interface would address it, but I might be missing something

@kylebarron
Author

kylebarron commented Sep 4, 2024

I'm referring to these lines:

altair/altair/utils/data.py

Lines 421 to 431 in 5207768

    # First check if the dataframe object has a method to convert to arrow.
    # Give this preference over the pyarrow from_dataframe function since the object
    # has more control over the conversion, and may have broader compatibility.
    # This is the case for Polars, which supports Date32 columns in direct conversion
    # while pyarrow does not yet support this type in from_dataframe
    for convert_method_name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
        convert_method = getattr(dfi_df, convert_method_name, None)
        if callable(convert_method):
            result = convert_method()
            if isinstance(result, pa.Table):
                return result

Those may not be solely for Polars, but a primary goal of the PyCapsule Interface is to standardize the method name by which one library exports data to others. So instead of checking for all these possible names, you can use __arrow_c_stream__ under the hood.

So essentially:

-for convert_method_name in ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow"):
-    convert_method = getattr(dfi_df, convert_method_name, None)
-    if callable(convert_method):
-        result = convert_method()
-        if isinstance(result, pa.Table):
-            return result
+if hasattr(dfi_df, "__arrow_c_stream__"):
+    return pa.table(dfi_df)

@MarcoGorelli
Contributor

MarcoGorelli commented Sep 4, 2024

Polars input wouldn't go down those lines anyway, as it would already have been handled in the Narwhals path 😉 (and it wouldn't involve any conversion to pyarrow)

Isn't PyCapsule Interface adoption a bit too recent for it to be used here as the only way to convert to pyarrow? It would cut off support for oldish versions of Ibis / DuckDB for which the current code works fine. But using it if it's available, in addition to the current code but before the interchange protocol, sounds like a good idea 👍

@kylebarron
Author

> Polars input wouldn't go down those lines anyway, as it would already have been handled in the Narwhals path 😉 (and it wouldn't involve any conversion to pyarrow)

Ah, I hadn't noticed that.

> Isn't PyCapsule Interface adoption a bit too recent for it to be used here as the only way to convert to pyarrow? It would cut off support for oldish versions of Ibis / DuckDB for which the current code works fine. But using it if it's available, in addition to the current code but before the interchange protocol, sounds like a good idea 👍

Yes, I should've been clearer: at some point in the future, once you're OK with the pyarrow version constraint, you can remove those lines of code.

For the time being, I'd suggest it as an addition, not a replacement, to those existing checks.
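
The ordering agreed on above (PyCapsule first, then the existing method-name checks, then the interchange protocol) can be sketched as follows. This is an illustration only: `choose_conversion_path` is a hypothetical helper that reports which path would be tried first, and it simplifies the real Altair loop, which also verifies that the result is a `pa.Table` before returning it.

```python
from typing import Any

# Method names taken from the existing Altair loop quoted in this thread.
LEGACY_METHOD_NAMES = ("arrow", "to_arrow", "to_arrow_table", "to_pyarrow")


def choose_conversion_path(obj: Any) -> str:
    """Hypothetical helper: name the first conversion path that applies."""
    # 1. Prefer the standardized PyCapsule export when the object offers it.
    if hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule"
    # 2. Otherwise keep the existing per-library method names, so older
    #    releases without PyCapsule support keep working.
    for name in LEGACY_METHOD_NAMES:
        if callable(getattr(obj, name, None)):
            return f"method:{name}"
    # 3. Last resort: the DataFrame Interchange Protocol.
    return "interchange"
```

Keeping step 2 in place is what preserves support for older Ibis / DuckDB releases; it could be dropped later once the version constraints allow.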

@dangotbanned
Member

@MarcoGorelli @jonmmease
Following these 2 upstream PRs, do you have any opinions on whether this issue is resolved or requires additional work?

@MarcoGorelli
Contributor

😄 yup, was discussing some things related to this today: duckdb/duckdb#15536

@dangotbanned
Member

dangotbanned commented Jan 3, 2025

> 😄 yup, was discussing some things related to this today: duckdb/duckdb#15536

Thanks @MarcoGorelli! Interesting timing 😉

I'm gonna have a read through the doc as I'm also surprised by the behavior in duckdb

Update

Reading through these gives me the impression you'd need control over requesting a new stream (for this to be suitable in EDA).

I can't help but think that this is too low-level to find its way into altair.

@dangotbanned dangotbanned added the vega: vegafusion Requires upstream/integration action w/ `vegafusion` label Jan 8, 2025