Secor Avro parquet write invalid parquet file #1718

Open
richiesgr opened this issue Nov 23, 2020 · 0 comments

Hi,
I'm using Secor to write Avro messages as Parquet files on GCS, and these Parquet files should then be queried by BigQuery.
At first we didn't notice any problem: the same files are read by Spark without any issue and everything works perfectly. Querying with BigQuery, however, produces an error like this:
Error while reading table: , error message: Read less values than expected: Actual: 29333, Expected: 33827. Row group: 0, Column: , File:
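
For reference, the Spark-side check that succeeds is just reading the files back; a minimal PySpark sketch (the GCS path is a placeholder for our Secor output):

```python
# Minimal sketch of the Spark read that works without errors.
# "gs://my-bucket/secor-output/" is a placeholder path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-secor-parquet").getOrCreate()

df = spark.read.parquet("gs://my-bucket/secor-output/")
df.printSchema()
print(df.count())  # completes cleanly, no "Read less values than expected"
```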

After investigating with parquet-tools, I figured out that the Parquet metadata records the total number of unique values for each column, e.g. from parquet-tools:
page 0: DLE:BIT_PACKED RLE:BIT_PACKED [more]... CRC:[PAGE CORRUPT] VC:547
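
For anyone who wants to reproduce this inspection programmatically rather than with parquet-tools, PyArrow exposes the footer counts at column-chunk level (the file name is a placeholder, and note PyArrow does not surface per-page VC the way parquet-tools does):

```python
# Sketch: read per-column value counts from the Parquet footer with PyArrow.
# "part-00000.parquet" is a placeholder; column(0) is just the first column.
import pyarrow.parquet as pq

md = pq.ParquetFile("part-00000.parquet").metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)
    print(f"row group {rg}: {col.path_in_schema} num_values={col.num_values}")
```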

So the VC value indicates that the total number of unique values in the file is 547.
But when I run a Spark SQL query like SELECT COUNT(DISTINCT column) FROM ... I get 421, meaning the number in the metadata is incorrect.
So what is not a problem for Spark to read is a blocking problem for BigQuery, because BigQuery relies on these values and finds them inconsistent.
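
Roughly, the comparison I'm making (path and column name are placeholders):

```python
# Sketch of the count comparison: what Spark actually sees vs. the 547
# reported in the page metadata. Path and column name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("count-check").getOrCreate()

df = spark.read.parquet("gs://my-bucket/secor-output/")
df.agg(
    F.count("my_column").alias("total_values"),
    F.countDistinct("my_column").alias("distinct_values"),  # 421 in our case
).show()
```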
Do you have any idea what could cause this?
Is there something that can be configured in the Parquet writer?
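
For context, our secor.properties selects the Parquet writer roughly like this (the factory class is the one Secor ships for Avro-to-Parquet; treat this as a sketch of our setup, not definitive defaults):

```properties
# Sketch of the relevant secor.properties entry from our setup.
secor.file.reader.writer.factory=com.pinterest.secor.io.impl.AvroParquetFileReaderWriterFactory
```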

Thanks
