Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fix the inferred nullability when converting a nested parquet schema to arrow #1565

Conversation

jhorstmann
Copy link
Contributor

This allows the parquet_read example to correctly read the nested data attached to issue #1556 and also makes several test assertions match the comments above.

…to arrow

This allows the `parquet_read` example to correctly read the nested data
attached to issue jorgecarleitao#1556 and also makes several test assertions match the
comments above.
@jhorstmann
Copy link
Contributor Author

@ritchie46 or @sundy-li as the parquet expert, could you take a look at this PR? Or should I rather re-open this against pola-rs/nano-arrow?

@codecov
Copy link

codecov bot commented Oct 7, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (fb7b5fe) 83.06% compared to head (780b987) 83.08%.
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1565      +/-   ##
==========================================
+ Coverage   83.06%   83.08%   +0.02%     
==========================================
  Files         391      391              
  Lines       42889    42929      +40     
==========================================
+ Hits        35626    35668      +42     
+ Misses       7263     7261       -2     
Files Coverage Δ
src/io/parquet/read/schema/convert.rs 95.03% <100.00%> (+0.35%) ⬆️

... and 7 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sundy-li sundy-li merged commit dd80c89 into jorgecarleitao:main Oct 7, 2023
@ritchie46
Copy link
Collaborator

Or should I rather re-open this against pola-rs/nano-arrow?

@jhorstmann could you please do so? That would get the patched in polars as well.

jhorstmann added a commit to jhorstmann/polars that referenced this pull request Oct 9, 2023
The `parquet_to_arrow_schema` function tries to set the `nullable` flag of
`Field` according to the parquet repetition levels. That flag is then used
via the `InitNested` `enum` to calculate the level at which data is valid.

In the following parquet message type, neither the list nor its elements
should be marked as `nullable`, the definition level for the `REPEATED`
group only indicates an empty list but not any nulls.

```
message eventlog {
  REQUIRED group events (LIST) {
    REPEATED group array {
      REQUIRED BYTE_ARRAY event_name (STRING);
      REQUIRED INT64 event_time (TIMESTAMP(MILLIS,true));
    }
  }
}
```

The nullability of fields was asserted in multiple tests, but those
assertions did not match the comments.

This is a port of jorgecarleitao/arrow2#1565 to the nano-arrow crate.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants