[parquet] add parquet writer [#2044] #2053
Conversation
Force-pushed from 0c5c1bc to 82f00bc
Hi @mr-majkel, thanks for putting this together! Please do add a sample parquet data file. @anjakefala please make sure we are testing loading and saving of .parquet.

For now, we don't have to deal with complex data types, unless you want to. Having any saver is better than no saver. For dealing with complex types in anytype columns, I'd say it's reasonable to investigate the first row of data and use the types of the actual values to inform the schema. Properly typed columns within parquet would be ideal, but serializing to JSON would be acceptable if it's compatible with the pandas serialization, and we change the VisiData loader to handle them as well. (In fact we might want to support this pandas serialization compatibility as an option regardless.) Let us know if you want to tackle this now and edit the PR to include them; otherwise we can merge this PR and open a new one later.

As for black formatting, the changes are pretty minimal so I'm not worried about them. The main thing is that it will make it seem like you are the main person on the parquet loader, which after this PR is not entirely false.

Thanks again, and please send me an email with your address so I can send you some VisiData stickers!
Hi @saulpw and @anjakefala, I have added a sample test set for parquet (based on the sample data for tsv). I have also added .vd tests for the loader, which seem to pass when running through dev/test.sh 😄. I think that adding "any" saver is a good first step. Let me create an issue for saving more complex data columns and start working on it.
Thanks, @mr-majkel! I'll let @anjakefala merge when she's had a chance to review.
Thank you so much @mr-majkel! I think we may not be able to test the parquet saver: there are mild differences between the binary files even when saving the same file. Maybe it is recording the date of modification? I am going to remove the parquet saver test, and then this seems good to go!
Hi @anjakefala @saulpw,
this PR adds a writer for parquet files (and solves [#2044]). It follows the same logic as the arrow writer, and has been tested to work with the basic column types.
However, while working on it, I discovered that the arrow (and thus parquet) writers do not properly handle more complex column types (lists, dicts, structs).

The loader passes anything more complex as an anytype column, which, when using pyarrow, means translating the column values with the `.to_py()` method. Complex types (like lists or structs) therefore get handled properly when loading, and VisiData lets you "unfurl" such columns. However, when trying to write them as arrow or parquet, these types throw exceptions in pyarrow because of the mismatch between the schema and the actual values: in the schema, we try to cast anytype to string, which fails for these columns.
A simple parquet data set to test it out can be created like this:
Overall I have a couple of questions:
Edit: there are some changes in other lines of the parquet loader that were caused by black; let me know if you want me to get rid of them.