-
Notifications
You must be signed in to change notification settings - Fork 841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for deserializing list-encoded JSON structs [#6558] #6643
base: main
Are you sure you want to change the base?
Conversation
8fc7592
to
7181f92
Compare
Sorry this is on my list to look into, but I've struggled to find time to look into it yet... I appreciate your patience and restraint from repeatedly tagging me (unlike some people are wont to do). |
Are there any changes I could make to this PR to make it easier to review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @jagill and i very much apologize for the delay in reviewing -- we are always looking for more help:
TLDR thoughts:
- Are you planning / willing to implement serializing support too (aka support in json_writer)? That would then allow us to verify that data can be round-tripped
- I believe this PR slows down performance (notes below)
API
I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would true be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.
I agree that a struct is nice for the reasons you mention
Errors
I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route.
I think improved error messages are always better, but we can do them as a follow on PR
I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions.
💯
Performance
One requirement was that benchmarks do not regress. My running of benchmarks have been inconclusive (see https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are often small regressions or improvements in the single-digit % range whenever I switch between master and this PR. I suspect they are statistical but I wanted to note these.
I am not sure what benchmarks you ran -- I think the most relevant ones are in arrow/src/json_reader.rs
I ran them like cargo bench --bench json_reader
Here are three back to back runs I did on my test bench:
++ critcmp master json-struct-from-list
group json-struct-from-list master
----- --------------------- ------
large_bench_primitive 1.17 4.2±0.05ms ? ?/sec 1.00 3.6±0.01ms ? ?/sec
small_bench_list 1.03 14.5±0.05µs ? ?/sec 1.00 14.1±0.17µs ? ?/sec
small_bench_primitive 1.00 7.2±0.04µs ? ?/sec 1.00 7.2±0.02µs ? ?/sec
group json-struct-from-list master
----- --------------------- ------
large_bench_primitive 1.16 4.2±0.03ms ? ?/sec 1.00 3.6±0.01ms ? ?/sec
small_bench_list 1.03 14.5±0.05µs ? ?/sec 1.00 14.0±0.13µs ? ?/sec
small_bench_primitive 1.00 7.2±0.03µs ? ?/sec 1.01 7.2±0.03µs ? ?/sec
++ critcmp master json-struct-from-list
group json-struct-from-list master
----- --------------------- ------
large_bench_primitive 1.15 4.2±0.01ms ? ?/sec 1.00 3.7±0.01ms ? ?/sec
small_bench_list 1.00 14.4±0.10µs ? ?/sec 1.00 14.4±0.08µs ? ?/sec
small_bench_primitive 1.00 7.1±0.03µs ? ?/sec 1.06 7.6±0.05µs ? ?/sec
++ popd
~/datafusion-benchmarking
It shows a pretty consistent slowdown in large_bench_primitive -- I left some suggestions on how to improve this
Again, sorry for the delay in reviewing
@@ -71,42 +75,70 @@ impl ArrayDecoder for StructArrayDecoder { | |||
.then(|| BooleanBufferBuilder::new(pos.len())); | |||
|
|||
for (row, p) in pos.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance wise, this code now has another conditional in the hot loop (each row)
for (row, p) in pos.iter().enumerate() {
if self.struct_parse_mode {
..
}
}
I suspect the code would be fast (and easier to understand / cleaner) if you hoisted the check, something like
if self.struct_parse_mode {
for (row, p) in pos.iter().enumerate() {
..
}
else {
for (row, p) in pos.iter().enumerate() {
..
}
}
arrow-json/src/reader/mod.rs
Outdated
@@ -170,12 +170,30 @@ mod struct_array; | |||
mod tape; | |||
mod timestamp_array; | |||
|
|||
/// Specifies what is considered valid JSON when parsing StructArrays. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a link to documentation / prior art for this type of JSON encoding of structs? Maybe to trino / presto docs that describe it in more full detail that we can add to this doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly, the documentation of the format of the returned data is a little lacking. They describe that the top-level row is a list with one entry per column (https://prestodb.io/docs/current/develop/client-protocol.html#important-queryresults-attributes, https://trino.io/docs/current/develop/client-protocol.html#important-queryresults-attributes), but do not specify how any structural elements are serialized (or non-structural things like Datetimes etc). If it helps, I could add this to the Presto documentation.
7181f92
to
c3b5fc9
Compare
Ok, I've hoisted the conditional outside the loop. It also had the effect (as you predicted) that the gnarly match statement for the delimiters is a bit easier to understand.
I was not planning to, because there was no use-case requiring it. But I'm certainly willing to, both for symmetry and for round-tripping. Is that something you'd want in this PR, or a follow-up?
When I check, I'm consistently seeing no difference between the main branch and either of the variants:
(btw, I didn't know about critcmp; thanks for the pointer!) However, I'm not running in an isolated environment, so the differences I saw in the past seemed to be related to other processes. I'm running on a M2 Mac and haven't tried on with x86 or other architectures, and haven't been able to isolate from the helpful random processes that Macos spins up, so I'll trust your benchmarks. |
Thank you very much again @jagill for this contribution. Sorry for the delay in reviewing / that this PR is somewhat stuck From my perspective having partial support for a feature (list encoded json structs) is non ideal and will increase the maintenance burden if we are not careful. I think it runs the risk of being a little used / unknown feature. So in order to move this forward I think we need:
|
5449cba
to
38d3d88
Compare
@alamb: With this update I've
One thing that makes me sad is that the Writer has an
I chose 3 but in theory I'm open to any of these (or other suggestions). In practice, I think that 1 would involve too much yak shaving, and we can always do the breaking change combining the two options later. I'm also happy to add more documentation for rational or examples, with the following caveats:
|
Awesoe -- thank you @jagill - I will try and review this PR later today or tomorrow |
arrow-json/src/lib.rs
Outdated
/// Specifies what is considered valid JSON when parsing StructArrays. | ||
/// | ||
/// If a struct with fields `("a", Int32)` and `("b", Utf8)`, it could be represented as | ||
/// a JSON object (`{"a": 1, "b": "c"}`) or a JSON list (`[1, "c"]`). This enum controls | ||
/// which form(s) the Reader will accept. | ||
/// | ||
/// For objects, the order of the key does not matter. | ||
/// For lists, the entries must be the same number and in the same order as the struct fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current doc here seems specific to reading/decoding; should also mention write since same enum is used as write option.
Could also put extra information such as justification for ListOnly (i.e. condensed version assuming schema is known separately, as you've explained in PR)
arrow-json/src/reader/mod.rs
Outdated
@@ -195,6 +197,7 @@ impl ReaderBuilder { | |||
coerce_primitive: false, | |||
strict_mode: false, | |||
is_field: false, | |||
struct_mode: StructMode::ObjectOnly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct_mode: StructMode::ObjectOnly, | |
struct_mode: Default::default(), |
Considering we already have default defined for it
arrow-json/src/reader/mod.rs
Outdated
@@ -235,6 +238,7 @@ impl ReaderBuilder { | |||
coerce_primitive: false, | |||
strict_mode: false, | |||
is_field: true, | |||
struct_mode: StructMode::ObjectOnly, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
struct_mode: StructMode::ObjectOnly, | |
struct_mode: Default::default(), |
arrow-json/src/reader/mod.rs
Outdated
/// Set the [`StructMode`] for the reader, which determines whether | ||
/// structs can be represented by JSON objects, lists, or either. | ||
/// | ||
/// For example, if the RecordBatch Schema is | ||
/// `[("a", Int32), ("r", Struct([("b", Boolean), ("c", Utf8)]))]` | ||
/// then [`StructMode::ObjectOnly`] would read rows of the form | ||
/// `{"a": 1, "r": {"b": true, "c": "cat"}}` | ||
/// while ['StructMode::ListOnly'] would read rows of the form | ||
/// `[1, [true, "cat"]]` | ||
/// | ||
/// JSON data of this form is returned by | ||
/// [Presto](https://prestodb.io/docs/current/develop/client-protocol.html#important-queryresults-attributes) | ||
/// and [Trino](https://trino.io/docs/current/develop/client-protocol.html#important-queryresults-attributes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this documentation is better consolidated within StructMode
docstring, to avoid duplication. e.g. can just simplify this to something like:
/// [`StructMode`] defines how structs can be decoded from JSON; for more details refer to the enum documentation.
arrow-json/src/writer/mod.rs
Outdated
/// Set whether to write structs as JSON Objects or Lists. | ||
/// | ||
/// For example, a writer (with [`LineDelimited`] format) writing the schema | ||
/// `[("a", Int32), ("m": Struct<b: Boolean, c: Utf8>)] would write with | ||
/// `StructMode::ObjectOnly`: | ||
/// | ||
/// ```json | ||
/// {"a": 1, "m": {"b": true, "c": "cat"}} | ||
/// ``` | ||
/// | ||
/// With `StructMode::ListOnly`: | ||
/// | ||
/// ```json | ||
/// [1, [true, "cat"]] | ||
/// ``` | ||
/// | ||
/// Map columns are not affected by this option. | ||
/// | ||
/// Default is to use `ObjectOnly`. If this is set to `ListOnly`, nulls will | ||
/// be written explicitly regardless of the `explicit_nulls` setting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here for writer docstring, which is similar to the read version
arrow-json/src/writer/encoder.rs
Outdated
for field_encoder in &mut self.encoders { | ||
let is_null = is_some_and(field_encoder.nulls.as_ref(), |n| n.is_null(idx)); | ||
if is_null && !self.explicit_nulls { | ||
if is_null && drop_nulls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if is_null && drop_nulls { | |
if drop_nulls && is_null { |
I wonder if this makes any difference, as drop_nulls
should hopefully short-circuit; but maybe compiler can already figure this out 😅
let field_name = match tape.get(cur_idx) { | ||
TapeElement::String(s) => tape.get_string(s), | ||
_ => return Err(tape.error(cur_idx, "field name")), | ||
if self.struct_mode == StructMode::ObjectOnly { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be a match, just in case StructMode
is expanded with more variants in future
} | ||
} | ||
} else { | ||
for (row, p) in pos.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice for the ListOnly
branch, self.strict_mode
isn't being considered unlike above; is this noteworthy in a similar way to how explicit_nulls
only applies for ObjectOnly
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, strict mode is implied for Lists, since you can't skip a field and know what field was skipped (since you have no field names). I've updated the documentation in with_strict_mode
to make this explicit. It just occurs to me that strict_mode
and struct_mode
are very similar names.
@@ -2343,4 +2374,219 @@ mod tests { | |||
.unwrap() | |||
); | |||
} | |||
|
|||
#[test] | |||
fn test_struct_decoding_list_length() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth adding a test for decoding to wrong datatype?
38d3d88
to
3adabed
Compare
@Jefffrey I addressed your comments; thanks for the feedback! |
Currently, a StructArray can only be deserialized from or serialized to a JSON object (e.g. `{a: 1, b: "c"}`), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g. `[1, "c"]`) because this is more compact, and the schema is known. This PR adds the ability to encode and decode JSON lists from and to StructArrays, if StructMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior. Some notes/questions/points for discussion: 1. I've made a JsonParseMode struct instead of a bool flag for two reasons. One is that it's self-descriptive (what would `true` be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone. 2. I kept the error messages as similar to the old messages as possible. I considered having more specific error messages (like "Encountered a '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but wanted to hear opinions before I went that route. 3. I'm not attached to any name/code-style/etc, so happy to modify to fit local conventions. Fixes apache#6558
3adabed
to
e6e4031
Compare
Which issue does this PR close?
Closes #6558.
Rationale for this change
Currently, a StructArray can only be deserialized from a JSON object (e.g.
{a: 1, b: "c"}
), but some services (e.g. Presto and Trino) encode ROW types as JSON lists (e.g.[1, "c"]
) because this is more compact, and the schema is known.Arrow-json cannot currently deserialize these.
What changes are included in this PR?
This PR adds the ability to parse JSON lists into StructArrays, if the StructParseMode is set to ListOnly. In ListOnly mode, object-encoded structs raise an error. Setting to ObjectOnly (the default) has the original parsing behavior.
Are there any user-facing changes?
Users may set the
StructParsingMode
enum toListOnly
to parse list-style structs. The associated enum,variants, and method have been documented. I'm happy to update any other documentation.
Discussion topics
true
be?), and the other is that it allows a future Mixed mode that could deserialize either. The latter isn't currently requested by anyone.