Add check for empty schema in parquet::schema::types::from_thrift_helper #6990

Open
wants to merge 3 commits into
base: main

etseidl
Contributor

@etseidl etseidl commented Jan 17, 2025

Which issue does this PR close?

Closes #6988.

Rationale for this change

Reading a file with an empty schema will fail in parquet::schema::types::from_thrift_helper because the root node in the schema is mistaken for a leaf node.

What changes are included in this PR?

Adds a check in from_thrift_helper for a root node with no children, and exits early if one is detected.
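To illustrate the idea, here is a minimal, self-contained sketch of the early exit. It is a simplified model, not the parquet crate's code: `Element`, `Node`, and `from_flat` are hypothetical stand-ins for the Thrift schema element, the schema type, and `from_thrift_helper`. It assumes the schema arrives as a depth-first flattened list in which a missing or zero child count normally marks a leaf.

```rust
/// Hypothetical stand-in for one flattened Thrift schema element.
#[derive(Debug, Clone)]
struct Element {
    name: String,
    /// `None` or `Some(0)` normally marks a leaf (primitive) column.
    num_children: Option<usize>,
    /// Leaves carry a physical type; groups do not.
    physical_type: Option<&'static str>,
}

/// Hypothetical stand-in for the reconstructed schema tree.
#[derive(Debug, PartialEq)]
enum Node {
    Leaf { name: String, physical_type: &'static str },
    Group { name: String, fields: Vec<Node> },
}

/// Rebuilds the tree from the depth-first list.
/// Returns the index of the next unread element together with the node.
fn from_flat(elements: &[Element], index: usize, is_root: bool) -> Result<(usize, Node), String> {
    let element = elements
        .get(index)
        .ok_or_else(|| format!("index {index} out of bounds"))?;

    // The check under discussion: a root with no children is an empty
    // group, not a leaf, so return early before the leaf branch runs.
    if is_root && matches!(element.num_children, None | Some(0)) {
        return Ok((index + 1, Node::Group { name: element.name.clone(), fields: vec![] }));
    }

    match element.num_children {
        // Leaf: must have a physical type. Without the early exit above,
        // an empty root would land here and fail.
        None | Some(0) => {
            let physical_type = element
                .physical_type
                .ok_or_else(|| format!("leaf '{}' has no physical type", element.name))?;
            Ok((index + 1, Node::Leaf { name: element.name.clone(), physical_type }))
        }
        // Group: consume exactly `n` children, depth first.
        Some(n) => {
            let mut fields = Vec::with_capacity(n);
            let mut next = index + 1;
            for _ in 0..n {
                let (after, child) = from_flat(elements, next, false)?;
                fields.push(child);
                next = after;
            }
            Ok((next, Node::Group { name: element.name.clone(), fields }))
        }
    }
}
```

In this model, calling `from_flat` on a single childless element with `is_root` set to `true` yields an empty group, while the same element with `is_root` set to `false` is treated as a leaf and rejected for lacking a physical type.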

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label on Jan 17, 2025

#[test]
// https://github.com/apache/arrow-rs/issues/6988
fn test_roundtrip_empty_schema() {
Contributor Author

This is @samgqroberts's reproducer from #6988

Contributor

@alamb alamb left a comment

Thank you @etseidl and @samgqroberts 🙏

I have some suggestions for improving the tests, but I also think we can do that in a follow-on PR (or never).

Comment on lines 3417 to 3418
let empty_fields: Vec<Field> = vec![];
let empty_schema = Arc::new(Schema::new(empty_fields));
Contributor

Very minor: you can use Schema::empty as well:

Suggested change
- let empty_fields: Vec<Field> = vec![];
- let empty_schema = Arc::new(Schema::new(empty_fields));
+ let empty_schema = Arc::new(Schema::empty());

Comment on lines 3433 to 3436
// read from parquet
let bytes = Bytes::from(parquet_bytes);
let result = ParquetRecordBatchReaderBuilder::try_new(bytes);
result.unwrap();
Contributor

I think we should also:

  1. Check that the schema is correctly read back
  2. Verify that the reader works correctly (reading back no batches)

For example, something like:

Suggested change
- // read from parquet
- let bytes = Bytes::from(parquet_bytes);
- let result = ParquetRecordBatchReaderBuilder::try_new(bytes);
- result.unwrap();
+ // read from parquet
+ let bytes = Bytes::from(parquet_bytes);
+ let reader = ParquetRecordBatchReaderBuilder::try_new(bytes).unwrap();
+ assert_eq!(reader.schema(), &empty_batch.schema());
+ let batches: Vec<_> = reader.build().unwrap().collect::<ArrowResult<Vec<_>>>().unwrap();
+ assert_eq!(batches.len(), 0);

(I ran this locally and it passed, so I think we could also make these changes in a follow-on PR)

@etseidl
Contributor Author

etseidl commented Jan 17, 2025

Thanks for the review @alamb. I made the changes you suggested.

Labels
parquet Changes to the parquet crate
Development

Successfully merging this pull request may close these issues.

RecordBatch with no columns cannot be roundtripped through Parquet
2 participants