Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement std::fmt::Write for StringViewBuilder to permit non contiguous data to be written #6373

Open
alamb opened this issue Sep 9, 2024 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Sep 9, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

DataFusion has an optimized version of concat(col1, ...) for StringView added by @devanbenz in apache/datafusion#12224 that uses a StringViewArrayBuilder which is similar but not the same as StringViewBuilder in arrow: https://github.com/apache/datafusion/blob/9bc39a0522840ed90de2a4d23157de2e192cd00f/datafusion/functions/src/string/common.rs#L464-L536

The major differences are:

  1. You can call write to incrementally build up each string and then call append_offset to create each string. StringBuilder requires each input to be a single contiguous string to call
  2. You can avoid creating the null buffer and pass it in to the finish

Describe the solution you'd like
I would like the APIs in arrow-rs to be sufficiently complete that we could use the arrow-rs versions rather than our own custom versions in DataFusion

Describe alternatives you've considered
StringBuilder allows this like

use std::fmt::Write;
use arrow_array::builder::GenericStringBuilder;
let mut builder = GenericStringBuilder::<i32>::new();

// Write data in multiple `write!` calls
write!(builder, "foo").unwrap();
write!(builder, "bar").unwrap();
// The next call to append_value finishes the current string
// including all previously written strings.
builder.append_value("baz");

let array = builder.finish();
assert_eq!(array.value(0), "foobarbaz");

I think it would be cool to try and follow the same API for StringViewBuilder -- though note that API may be more complicated to ensure we don't get any performance regressions

Additional context

  1. See Improve GenericStringBuilder documentation #6372 for better docs of what StringBuilder does
  2. Ability to append non contiguous strings to StringBuilder #6347 lists more background
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Sep 9, 2024
@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2024

cc @devanbenz in case you are interested in trying to figure out what we need to add to the upstream arrow code to get rid of the custom implementation in DataFusion

@devanbenz
Copy link

Just now getting eyes on this FYI

@alamb
Copy link
Contributor Author

alamb commented Sep 19, 2024

No rush -- I think we have plenty of other things going on -- I would say this is a nice to have in my mind

@alamb
Copy link
Contributor Author

alamb commented Nov 22, 2024

BTW @tlm365 made a PR here #6719 that needed it for some other reasons, However, it seems like it needs a bit more effort 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants