Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple GC for view array types #5885

Merged
merged 10 commits into from
Jun 14, 2024
78 changes: 78 additions & 0 deletions arrow-array/src/array/byte_view_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
phantom: Default::default(),
}
}

/// Returns a buffer compact version of this array
XiangpengHao marked this conversation as resolved.
Show resolved Hide resolved
///
/// The original array will *not* be modified
///
/// # Garbage Collection
///
/// Before GC:
/// ```text
/// ┌──────┐
/// │......│
/// │......│
/// ┌────────────────────┐ ┌ ─ ─ ─ ▶ │Data1 │ Large buffer
/// │ View 1 │─ ─ ─ ─ │......│ with data that
/// ├────────────────────┤ │......│ is not referred
/// │ View 2 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
/// └────────────────────┘ │......│ View 2
/// │......│
/// 2 views, refer to │......│
/// small portions of a └──────┘
/// large buffer
/// ```
///
/// After GC:
///
/// ```text
/// ┌────────────────────┐ ┌─────┐ After gc, only
/// │ View 1 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│ data that is
/// ├────────────────────┤ ┌ ─ ─ ─ ▶ │Data2│ pointed to by
/// │ View 2 │─ ─ ─ ─ └─────┘ the views is
/// └────────────────────┘ left
///
///
/// 2 views
/// ```
/// This method will compact the data buffers by recreating the view array and only include the data
/// that is pointed to by the views.
///
/// Note that it will copy the array regardless of whether the original array is compact.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// Use with caution as this can be an expensive operation, only use it when you are sure that the view
/// array is significantly smaller than when it is originally created, e.g., after filtering or slicing.
pub fn gc(&self) -> Self {
let mut builder = GenericByteViewBuilder::<T>::with_capacity(self.len());

for v in self.iter() {
builder.append_option(v);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable (and correct) first version

However, it could likely be made faster by special casing inline views -- the code here is going to be doing several checks on length only to copy the bytes back into a u128

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how the checks can be avoided; even if the view is inlined, we need to change the offset and the block id, so we need to copy that u128 from the original view buffer to the new buffer anyway.

Copy link
Contributor

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I was imagining that special casing copying inlined views could potentially be faster than converting to &str and then checking

But I don't think we should make any changes unless we have benchmarks supporting that hypothesis

}

builder.finish()
}
}

impl<T: ByteViewType + ?Sized> Debug for GenericByteViewArray<T> {
Expand Down Expand Up @@ -645,4 +695,32 @@ mod tests {

StringViewArray::new(views, buffers, None);
}

#[test]
fn test_gc() {
let mut array = StringViewArray::from_iter_values(["while my guitar gently weeps"]);
XiangpengHao marked this conversation as resolved.
Show resolved Hide resolved
// shrink the view
let mut view = ByteView::from(array.views[0]);
view.length = 15;
let new_views = ScalarBuffer::from(vec![view.into()]);
array.views = new_views;
let compacted = array.gc();
assert_ne!(array.buffers[0].as_ptr(), compacted.buffers[0].as_ptr());
assert_eq!(array.value(0), compacted.value(0));

// test compact on array containing null
let mut array =
StringViewArray::from_iter([None, Some("I don't know why nobody told you")]);

let mut view = ByteView::from(array.views[1]);
view.length = 15;
let new_views = ScalarBuffer::from(vec![array.views[0], view.into()]);
array.views = new_views;

let compacted = array.gc();

assert_ne!(array.buffers[0].as_ptr(), compacted.buffers[0].as_ptr());
assert_eq!(array.value(0), compacted.value(0));
assert_eq!(array.value(1), compacted.value(1));
}
}
Loading