Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds partial cast support for run-end encoded arrays #6752

Closed
wants to merge 11 commits into from

Conversation

RyanMarcus
Copy link

@RyanMarcus RyanMarcus commented Nov 18, 2024

Which issue does this PR close?

Helps address part of " Support REE in cast kernels" in #3520

Rationale for this change

Adds support for (limited) casting/conversion of REE arrays.

What changes are included in this PR?

  • Adds casting to change the type of the run ends in an REE array (e.g., convert from storing each run end as a Int16 to a Int32)
  • Adds casting from REE arrays of all primitive types to a "flat" array (which is a PrimitiveArray under the hood).

We've added these specific casts because we needed them for our own project, and figured we could contribute them back.

Are there any user-facing changes?

Currently, can_cast_type basically returns false for anything involving an REE array. This PR implements a subset of important casts, but not all of them. Users may be surprised that, for example, a REE array of Int32 can be converted to an array of Int32, but an REE array of Utf8 can't be converted to a Utf8 array (yet).

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 18, 2024
@RyanMarcus RyanMarcus marked this pull request as ready for review November 18, 2024 21:30
@RyanMarcus RyanMarcus marked this pull request as draft November 18, 2024 23:23
…e values to primitive arrays

Added copyright header
@RyanMarcus RyanMarcus marked this pull request as ready for review November 18, 2024 23:33
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry a bit about the amount of codegen this approach will lead to, as it relies on downcasting the values. This also prevents it generalising to nested types.

I wonder if you considered using MutableArrayData or one of the existing selection kernels (e.g. take)?

@RyanMarcus
Copy link
Author

RyanMarcus commented Nov 19, 2024

I know you just gave it as an example, but take would be significantly slower since it would require materializing an index array, which can't be trivially computed from the run ends (unlike for dictionaries, where the values are the indices).

I took a brief look at the other existing kernels and didn't see anything that looked promising. If you can point me to something specific I'm happy to take a look!

I'm not currently familiar with MutableArrayData, let me take a look and get back to you. Edit: Just looked at MutableArrayData which seems like a perfect candidate for Copy types. Let me try to change my implementation to that, I agree it has potential to cut down a lot on codegen.

One mitigating factor to the extra codegen is the fact that this is limited to the cast subcrate. We could also add an arrow-cast-ree or similar, if that's a concern.

* Added `extend_n` (with dummy implementation) to MutableArrayData
@RyanMarcus
Copy link
Author

As you suspected @tustvold , MutableArrayData enabled an implementation that was much smaller! And should reuse all the existing generics codegen from transform.

In order to make this efficient, we need a extend_n function on MutableArrayData that copies the same bits many times. Here is a dummy (working) implementation:

    pub fn extend_n(&mut self, index: usize, start: usize, end: usize, n: usize) {
        for _ in 0..n {
            self.extend(index, start, end);
        }
    }

I can see that MutableArrayData uses some kind of dynamic dispatch magic for each call to extend which I don't fully understand, but I understand enough to know that my current extend_n is no good. I'll mark this PR as a draft until I figure it out.

@RyanMarcus RyanMarcus marked this pull request as draft November 19, 2024 01:55
@RyanMarcus
Copy link
Author

RyanMarcus commented Nov 19, 2024

I've added a more efficient extend_n function that works by passing the count parameter n to each closure. But I realized that in the worst case scenario, where each run length is 1, this approach will still do interpretation for every value.

A simple benchmark converting a REE array of i32s to a primitive array:

With PrimitiveBuilder:
cast run end to flat    time:   [34.395 ms 34.453 ms 34.519 ms]

With MutableArrayData:
cast run end to flat    time:   [48.710 ms 48.869 ms 49.067 ms]

... so the avoiding the interpretation overhead seems to cause a 30% speedup. Does a 30% performance improvement in the worst case justify the extra codegen, @tustvold ?

Perhaps a reasonable middle ground would be to use the PrimitiveBuilder for primitive types, but use MutableArrayData for non-primitive types?

@RyanMarcus RyanMarcus marked this pull request as ready for review November 19, 2024 19:25
@tustvold
Copy link
Contributor

tustvold commented Nov 19, 2024

What is the average run length in that case?

We could specialise primitives, and there are ways we can reduce codegen - e.g. only specialize i32 run ends and only specialise on ArrowNativeType not ArrowPrimitiveType like we do for dictionaries.

However, given how few people use RunArray, there are relatively few scenarios it makes sense, we need to err on the side of keeping codegen manageable

@RyanMarcus
Copy link
Author

RyanMarcus commented Nov 19, 2024

My benchmark is a worst-case scenario, so every run length is 1, and thus the average is 1 as well. Not a realistic scenario, but illustrative of the worst case.

If we increase all run lengths to 10 (which is the average in my application, at least) and keeping the logical data size the same, the results are:

With PrimitiveBuilder:
cast run end to flat    time:   [11.740 ms 11.778 ms 11.818 ms]

With MutableArrayData:
cast run end to flat    time:   [21.837 ms 21.917 ms 22.000 ms]

Both approaches get faster, but the relative gap is larger.

The current version hits the compromise you mentioned: the specialized kernel is used for {i/u/f}{8/16/32/64}, and the interpretation-powered one is used for the rest.

I am not fully confident I correctly modified all of the MutableArrayData to handle the extend_n function, but the tests at least pass. If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.

@tustvold
Copy link
Contributor

If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.

Let's swap it back

@@ -759,6 +763,12 @@ pub fn cast_with_options(
"Casting from type {from_type:?} to dictionary type {to_type:?} not supported",
))),
},
(RunEndEncoded(re_t, _dt), _) => match re_t.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all 3 or could we just handle int32 and cast Int8 and Int16 to this. This is what we do for DictionaryArray?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine copying Int16 data up to Int32 would have a pretty large performance penalty in this case. For DictionaryArray, each access to the values is potentially random and therefore a cache miss, but for REE we will always read the run ends and the values sequentially. So my guess is that an extra data copy like that would 2x the runtime. Does that make sense or should I write a bench?

Copy link
Contributor

@tustvold tustvold Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine copying Int16 data up to Int32 would have a pretty large performance penalty in this case

I think we have to be pragmatic here, ultimately Int8 and Int16 are somewhat useless, only able to support arrays of maximum length 128 and 32768 respectively. I'm honestly not entirely sure why they were standardised...

Another option might be to fallback to ArrayData for the Int8, Int16 and Int64 cases. We really need to try to keep this change small

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, the standard is only Int16, Int32, and Int64, no Int8 (which is reflected in the code). I think Int16 / the smaller integer types are actually the most common (this is also the case in Parquet's dictionary run length encoding).

I didn't realize that even this amount of extra codegen was an issue. If I am literally the only user of this feature, we simply shouldn't do this at all. Honestly, this won't even solve my problem, since a 30%-2x slowdown on some types means I will just have to implement the full thing in my own code anyway. Luckily in my project, we are significantly less constrained by code size.

Probably the "right thing" to do is somehow split up cast into pieces, so people can opt into what they need, either with feature flags or with more subcrates, but I think that's a pretty large refactor.

Copy link
Contributor

@tustvold tustvold Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also the case in Parquet's dictionary run length encoding

Parquet is run length, this is run end, and consequently the primitive type constrains the maximum length of the array, as opposed to the maximum length of a run.

I'm sorry that we can't just add more codegen, but it has been a perennial issue for us that the full arrow specification is a combinatorial explosion, and so we have to pick some compute-optimised versions, and accept that space-optimised variants may incur a slight performance penalty. We should probably do a better job documenting this

code size.

It's actually build time that is the biggest pain point, there was a time when half the build time of the entire workspace was just the dictionary comparison kernels 😅

cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
let ree_array = array
.as_any()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using AsArray may be more concise

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean this trait? I don't think there's an as_run_array or similar. I can add it, but the overall line count will go up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a good addition, happy for that to be a separate PR though, there are many other places that could use AsArray though in this diff

// Potentially convert to a new value or run end type
(_, DataType::RunEndEncoded(re_t, dt)) => {
let values = cast_with_options(ree_array.values(), dt.data_type(), cast_options)?;
let re = PrimitiveArray::<K>::new(ree_array.run_ends().inner().clone(), None);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enables converting REE arrays of one type to another, for example from Int32 runs and Float64 data to Int16 runs and Float32 data. See test_run_end_to_run_end for an example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let re = PrimitiveArray::<K>::new(ree_array.run_ends().inner().clone(), None);
let re = ree_array.run_ends().clone();

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even just pass re_array.run_ends() to the cast kernel directly...

)))?,
};

Ok(result.slice(ree_array.run_ends().offset(), ree_array.run_ends().len()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this slicing needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we construct a PrimitiveArray from the run ends buffer, we have to construct the PrimitiveArray from the inner buffer, which doesn't have the offset information. So we have to re-slice it here. Added some code to test_run_end_to_run_end to confirm this.

arrow-cast/src/cast/runend.rs Show resolved Hide resolved
arrow-cast/src/cast/runend.rs Outdated Show resolved Hide resolved
result = PrimitiveArray::<T>::new(result.values().clone(), nb);
}

// TODO: this slice could be optimized by only copying the relevant parts of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be as simple as clamping the run end in the loop to be in the sliced range

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was having a hard time getting the logic right for a case like this:

run ends: [5, 10]
slice offset: 3 len: 5

... because you have to take 2 of the first value and then 3 of the second value. I can give it a shot if you want but I think it'll just be a source of bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tested but I think this should work. My concern with the slicing logic, is such things have a storied history of subtle bugs.

let min_end = arr.offset();
let max_end = arr.offset() + arr.length();
let last = min_end;
for (run_end, val) in arr.run_ends().values().iter().zip(null_buffer.iter()) {
    let run_end = clamp(run_end.as_usize(), min_end, max_end);
    let run_length = run_end - last;
    if val {
        nbb.append_n_non_nulls(run_length);
    } else {
        nbb.append_n_nulls(run_length);
    }

    last = run_end;
}

@RyanMarcus
Copy link
Author

See #6752 (comment) -- I don't think this is worthwhile if minimizing codegen is such a high priority. If someone else is looking at this PR, feel free to use my kernels for your REE arrays!

@RyanMarcus RyanMarcus closed this Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants