[arrow-cast] Support cast from Numeric (Int, UInt, etc) to Utf8View #6719

Closed
wants to merge 9 commits into from

Contributor

@tlm365 tlm365 commented Nov 12, 2024

Which issue does this PR close?

Closes #6714.
Related to #6373.

Rationale for this change

Add support for casting from numeric types (Int/Float/Decimal) to string view (Utf8View).

What changes are included in this PR?

The cast logic and corresponding unit tests.

Are there any user-facing changes?

No.

Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 12, 2024
Contributor Author

tlm365 commented Nov 12, 2024

@Omega359 During the code review, I found that it is possible to support casting all numeric values (Int/Float/Decimal) to string view. I thought it would be better to implement this in one PR instead of creating a new issue for each data type, so I implemented it here. What do you think? If that is fine, could you please update the issue description?

@tlm365 tlm365 force-pushed the cast_numeric_to_string_view branch from f4dfcda to 2a937fe Compare November 12, 2024 02:05
let nulls = array.nulls();
for i in 0..array.len() {
    match nulls.map(|x| x.is_null(i)).unwrap_or_default() {
        false => builder.append_value(formatter.value(i).try_to_string()?),
Contributor

@tustvold tustvold Nov 12, 2024

It would be more efficient to use the std::fmt::Write support, as is done for StringArray above.

As written, this will allocate for every value, which will be very expensive.
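
To illustrate the point about per-value allocation, here is a minimal sketch using only the standard library. `ToyStringBuilder`, `format_allocating`, and `format_in_place` are hypothetical names invented for this illustration and are not part of arrow-rs; the sketch only assumes that a builder keeps all string bytes in one shared buffer and that values can be formatted into it through `std::fmt::Write`.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for a string builder (not the arrow-rs type):
/// all rows share one byte buffer, and `offsets[i]` is the end of row `i`.
#[derive(Default)]
struct ToyStringBuilder {
    values: String,
    offsets: Vec<usize>,
}

impl ToyStringBuilder {
    /// Format `v` directly into the shared buffer, so no temporary
    /// `String` is created for the value.
    fn append_display<T: std::fmt::Display>(&mut self, v: T) -> std::fmt::Result {
        write!(self.values, "{}", v)?;
        self.offsets.push(self.values.len());
        Ok(())
    }

    /// Return row `i` as a slice of the shared buffer.
    fn row(&self, i: usize) -> &str {
        let start = if i == 0 { 0 } else { self.offsets[i - 1] };
        &self.values[start..self.offsets[i]]
    }
}

/// The pattern being reviewed: `to_string()` allocates one `String` per value.
fn format_allocating(input: &[i64]) -> Vec<String> {
    input.iter().map(|v| v.to_string()).collect()
}

/// The suggested direction: format each value in place via `std::fmt::Write`.
fn format_in_place(input: &[i64]) -> ToyStringBuilder {
    let mut builder = ToyStringBuilder::default();
    for v in input {
        builder
            .append_display(v)
            .expect("writing into a String does not fail");
    }
    builder
}
```

Both functions produce the same text for each row; the difference is that the second one grows a single buffer instead of creating and dropping a `String` per value.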

Contributor Author

@tustvold Thanks so much for reviewing.

It would be more efficient to use the std::fmt::Write support, as is done for StringArray above.
As written, this will allocate for every value, which will be very expensive.

I get it now. Will try to implement it. TYSM ❤️

Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
@@ -1462,6 +1464,9 @@ pub fn cast_with_options(
        (BinaryView, _) => Err(ArrowError::CastError(format!(
            "Casting from {from_type:?} to {to_type:?} not supported",
        ))),
        (from_type, Utf8View) if from_type.is_primitive() => {
Contributor

I believe this also fixes the Timestamp -> Utf8View issue. It would be good to have tests for temporal -> Utf8View added to cover this case.

Contributor Author

After reviewing the code, I realized that the Timestamp -> Utf8View cast is not supported yet.

The main issue is that the current implementation of formatter.format.write applies only to types that implement DisplayIndex, whereas the temporal data types are implemented on top of DisplayIndexState.

I think this issue deserves a separate PR to handle the temporal -> string view casting.

Contributor

I'll file another PR today to cover the temporal -> Utf8View case unless someone beats me to it.

@tlm365 tlm365 marked this pull request as draft November 15, 2024 01:11
Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com>
@tlm365 tlm365 force-pushed the cast_numeric_to_string_view branch from 1a6868a to 74de9bc Compare November 15, 2024 05:21
@tlm365 tlm365 marked this pull request as ready for review November 16, 2024 02:12
@alamb alamb changed the title [arrow-cast] Support cast numeric to string view [arrow-cast] Support cast from Decimal (numeric) to Utf8View Nov 22, 2024
@alamb alamb changed the title [arrow-cast] Support cast from Decimal (numeric) to Utf8View [arrow-cast] Support cast from Numeric (Int, UInt, etc) to Utf8View` Nov 22, 2024
Contributor

@alamb alamb left a comment

Looks great to me -- thank you @tlm365 , @Omega359 and @tustvold

@@ -484,6 +484,13 @@ impl<T: ByteViewType + ?Sized, V: AsRef<T::Native>> Extend<Option<V>>
/// ```
pub type StringViewBuilder = GenericByteViewBuilder<StringViewType>;

impl std::fmt::Write for StringViewBuilder {
Contributor

This is also what is contemplated by #6373 (i.e. I think this PR fixes that ticket as well).

Contributor

@alamb alamb left a comment

I think there is a problem with the Write implementation.

Here is a PR showing the problem: tlm365#1

@@ -484,6 +484,13 @@ impl<T: ByteViewType + ?Sized, V: AsRef<T::Native>> Extend<Option<V>>
/// ```
pub type StringViewBuilder = GenericByteViewBuilder<StringViewType>;

impl std::fmt::Write for StringViewBuilder {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        self.append_value(s);
Contributor

@alamb alamb Nov 22, 2024

I was writing some tests for this, and it turns out this behaves differently from StringBuilder (GenericStringBuilder):

https://docs.rs/arrow/latest/arrow/array/builder/type.GenericStringBuilder.html#example-incrementally-writing-strings-with-stdfmtwrite

Specifically, for StringBuilder, calling write_str doesn't complete the row 🤔

I made a PR showing the problem: tlm365#1
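
The mismatch described above can be sketched with two toy builders that use only the standard library. `RowPerWrite` and `RowOnAppend` are hypothetical types written for this illustration, not arrow-rs code: the first mirrors the `write_str` in this PR (each call appends a whole row), and the second follows the pattern shown in the linked GenericStringBuilder example, where writes accumulate and `append_value` finishes the row.

```rust
use std::fmt::Write;

/// Hypothetical builder mirroring the impl in this PR:
/// every `write_str` call appends a complete row.
#[derive(Default)]
struct RowPerWrite {
    rows: Vec<String>,
}

impl Write for RowPerWrite {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        self.rows.push(s.to_string());
        Ok(())
    }
}

/// Hypothetical builder following the linked GenericStringBuilder example:
/// `write_str` only extends the in-progress row, `append_value` completes it.
#[derive(Default)]
struct RowOnAppend {
    rows: Vec<String>,
    in_progress: String,
}

impl RowOnAppend {
    /// Append `s` to the in-progress row and finish that row.
    fn append_value(&mut self, s: &str) {
        self.in_progress.push_str(s);
        self.rows.push(std::mem::take(&mut self.in_progress));
    }
}

impl Write for RowOnAppend {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        self.in_progress.push_str(s);
        Ok(())
    }
}

/// Issue the same two writes against both builders and return their rows.
fn demo() -> (Vec<String>, Vec<String>) {
    let mut per_write = RowPerWrite::default();
    per_write.write_str("foo").unwrap();
    per_write.write_str("bar").unwrap();

    let mut on_append = RowOnAppend::default();
    on_append.write_str("foo").unwrap();
    on_append.write_str("bar").unwrap();
    on_append.append_value(""); // explicitly finish the row

    (per_write.rows, on_append.rows)
}
```

With the first builder, a formatter that emits a value through several `write_str` calls would split that value across several rows, which is why the two behaviors cannot be swapped freely.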

Contributor

I am working on a potential solution so we can unblock this PR

Contributor

alamb commented Nov 22, 2024

Here is a proposal for how to implement std::fmt::Write: #6777

@alamb alamb changed the title [arrow-cast] Support cast from Numeric (Int, UInt, etc) to Utf8View` [arrow-cast] Support cast from Numeric (Int, UInt, etc) to Utf8View Nov 22, 2024
Add documentation examples for `StringViewBuilder::write_str`
@tlm365 tlm365 marked this pull request as draft November 23, 2024 01:30
Comment on lines +41 to +55
pub(crate) fn value_to_string_view(
    array: &dyn Array,
    options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
    let mut builder = StringViewBuilder::with_capacity(array.len());
    let formatter = ArrayFormatter::try_new(array, &options.format_options)?;
    let nulls = array.nulls();
    for i in 0..array.len() {
        match nulls.map(|x| x.is_null(i)).unwrap_or_default() {
            true => builder.append_null(),
            false => formatter.value(i).write(&mut builder)?,
        }
    }
    Ok(Arc::new(builder.finish()))
}
Contributor

I have been dreaming about this PR and how to unblock it.

I think it is currently stalled by trying to figure out how to handle std::fmt::Write

Here is an alternate proposal:

  1. Change this implementation to avoid reallocating on each row, but still copy
  2. Remove the std::fmt::Write implementation (and we can sort that out in a different PR)

So the first point might look something like this:

Suggested change
-pub(crate) fn value_to_string_view(
-    array: &dyn Array,
-    options: &CastOptions,
-) -> Result<ArrayRef, ArrowError> {
-    let mut builder = StringViewBuilder::with_capacity(array.len());
-    let formatter = ArrayFormatter::try_new(array, &options.format_options)?;
-    let nulls = array.nulls();
-    for i in 0..array.len() {
-        match nulls.map(|x| x.is_null(i)).unwrap_or_default() {
-            true => builder.append_null(),
-            false => formatter.value(i).write(&mut builder)?,
-        }
-    }
-    Ok(Arc::new(builder.finish()))
-}
+pub(crate) fn value_to_string_view(
+    array: &dyn Array,
+    options: &CastOptions,
+) -> Result<ArrayRef, ArrowError> {
+    let mut builder = StringViewBuilder::with_capacity(array.len());
+    let formatter = ArrayFormatter::try_new(array, &options.format_options)?;
+    let nulls = array.nulls();
+    // buffer to avoid reallocating on each value
+    // TODO: replace with write to builder after https://github.com/apache/arrow-rs/issues/6373
+    let mut buffer = String::new();
+    for i in 0..array.len() {
+        match nulls.map(|x| x.is_null(i)).unwrap_or_default() {
+            true => builder.append_null(),
+            false => {
+                // write to buffer first and then copy into target array
+                buffer.clear();
+                formatter.value(i).write(&mut buffer)?;
+                builder.append_value(&buffer)
+            }
+        }
+    }
+    Ok(Arc::new(builder.finish()))
+}
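
The buffer-reuse idea in the suggestion can be tried in isolation with the standard library alone. In this sketch, `ToyViewBuilder` and `cast_to_strings` are hypothetical names standing in for a builder that exposes only `append_value` and `append_null`; nothing here is the arrow-rs API. It assumes the builder copies each value into its own storage, so the scratch `String` can be cleared and reused on every row.

```rust
use std::fmt::Write;

/// Hypothetical builder exposing only `append_value` / `append_null`.
/// Values are copied into one shared buffer and addressed by (offset, len).
#[derive(Default)]
struct ToyViewBuilder {
    data: String,
    views: Vec<Option<(usize, usize)>>,
}

impl ToyViewBuilder {
    /// Copy `s` into the shared buffer and record where it lives.
    fn append_value(&mut self, s: &str) {
        self.views.push(Some((self.data.len(), s.len())));
        self.data.push_str(s);
    }

    /// Record a null row.
    fn append_null(&mut self) {
        self.views.push(None);
    }

    /// Return row `i`, or `None` if that row is null.
    fn get(&self, i: usize) -> Option<&str> {
        self.views[i].map(|(offset, len)| &self.data[offset..offset + len])
    }
}

/// Format each value into one reusable scratch buffer, then copy it
/// into the builder. The scratch buffer keeps its capacity across rows.
fn cast_to_strings(input: &[Option<i64>]) -> Result<ToyViewBuilder, std::fmt::Error> {
    let mut builder = ToyViewBuilder::default();
    let mut buffer = String::new();
    for value in input {
        match value {
            None => builder.append_null(),
            Some(v) => {
                buffer.clear(); // drop the previous row's text, keep the allocation
                write!(buffer, "{}", v)?;
                builder.append_value(&buffer);
            }
        }
    }
    Ok(builder)
}
```

This still copies every value once, which matches the trade-off stated in the proposal: no reallocation per row, but the formatted bytes are written twice.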

Contributor

This will work nicely (well, once it compiles) and would unblock downstream issues. 👍🏻

Contributor

I am trying to work down my review queue / reduce the number of things I am trying to keep track of and we have also gone around too much with this PR on you @tlm365 -- for that I apologize

Here is a PR that will hopefully solve #6714:

Contributor Author

tlm365 commented Dec 2, 2024

Oops, so sorry everyone for the late update 🥹

I am trying to work down my review queue / reduce the number of things I am trying to keep track of and we have also gone around too much with this PR on you @tlm365 -- for that I apologize

I should apologize, thank you for completing it @alamb ❤️.

Contributor

alamb commented Dec 2, 2024

Let's call it a good team effort!

Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Numeric -> Utf8View casting
4 participants