
[parquet] add parquet writer [#2044] #2053

Merged · 3 commits · Oct 18, 2023

Conversation

mr-majkel
Contributor

@mr-majkel mr-majkel commented Oct 13, 2023

Hi @anjakefala @saulpw,

this PR adds a writer for parquet files (and solves [#2044]). It follows the same logic as the arrow writer, and has been tested to work with the basic column types.

However, while working on it, I discovered that the arrow (and thus parquet) writers do not properly handle more complex column types (lists, dicts, structs).

The loader passes anything more complex as an anytype column, which, in the case of pyarrow, means translating column values with the .to_py() method. As a result, complex types (like lists or structs) are handled properly on load (and VisiData allows you to "unfurl" such columns).

However, when trying to write them back as arrow or parquet, these types throw exceptions in pyarrow because of the mismatch between the schema and the actual values. In the schema we try to cast anytype to string, which fails for these columns.

A simple parquet data set to test it out can be created like this:

import pyarrow as pa
import pyarrow.parquet as pq

# a string column, a list column, and a struct column
names = pa.array(["John", "Anna", "Sue", "Dave", "Barb"], type=pa.string())
wishlists = pa.array([["train"], ["doll", "car"], ["crayons", "coloring book"],
                      ["sword"], ["soldiers", "tanks"]],
                     type=pa.list_(pa.string()))
meta = pa.array([{"money": 6.2, "color": "green"},
                 {"money": 5.0, "color": "red"},
                 {"money": 7.1, "color": "black"},
                 {"money": 4.0, "color": "grey"},
                 {"money": 7.15, "color": ""}],
                type=pa.struct([("money", pa.float32()), ("color", pa.string())]))

birthdays = pa.table([names, wishlists, meta],
                     names=["names", "wishlists", "meta"])

pq.write_table(birthdays, "test_path.parquet")

Overall I have a couple of questions:

  • there is no sample data for parquet, and no .vd in tests; should I add them?
  • what do we do about the complex column types?
    • I guess the ideal solution would be to somehow figure out the schema of an anytype column in VisiData, so that it is properly represented in the on-disk format (arrow or parquet)
    • this way, reading a proper arrow or parquet file into VisiData and saving it back would yield the same column types as in the original
    • another way (which I think is what pandas does) is serializing to strings (or JSON?)
    • should it be handled in this PR? should I open an issue for complex type handling in the arrow writer?
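For illustration, the JSON route could look something like this sketch (the helper name is hypothetical, not code from this PR): complex cells are serialized to JSON strings so they fit the string-typed schema, while scalars pass through.

```python
import json

def stringify_cell(value):
    # hypothetical helper: serialize lists/dicts to JSON so they fit a
    # string-typed arrow/parquet column; pass other values through as str
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    if value is None:
        return None
    return str(value)

print(stringify_cell(["doll", "car"]))   # ["doll", "car"]
print(stringify_cell({"money": 5.0}))    # {"money": 5.0}
print(stringify_cell(7))                 # 7
```

The trade-off is that the round-trip loses the arrow types unless the loader learns to parse these strings back.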

Edit: there are some changes to other lines of the parquet loader that were caused by black; let me know if you want me to get rid of them.

@CLAassistant

CLAassistant commented Oct 13, 2023

CLA assistant check
All committers have signed the CLA.

@saulpw
Owner

saulpw commented Oct 13, 2023

Hi @mr-majkel, thanks for putting this together!

Please do add a sample parquet data file. @anjakefala please make sure we are testing loading and saving of .parquet.

For now, we don't have to deal with complex data types, unless you want to. Having any saver is better than no saver.

For dealing with complex types in anytype columns, I'd say it's reasonable to investigate the first row of data, and use the types of the actual values to inform the schema. Properly typed columns within parquet would be ideal, but serializing to JSON would be acceptable if it's compatible with the pandas serialization, and we change the VisiData loader to handle them as well. (In fact we might want to support this pandas serialization compatibility as an option regardless.) Let us know if you want to tackle this now and edit the PR to include them; otherwise we can merge this PR and open a new one later.

As for black formatting, the changes are pretty minimal so I'm not worried about them. The main thing is that it will make it seem like you are the main person on the parquet loader, which after this PR is not entirely false.

Thanks again and please send me an email with your address so I can send you some VisiData stickers!

@mr-majkel
Contributor Author

Hi @saulpw and @anjakefala,

I have added a sample test set for parquet (based on the sample data for tsv). I have also added .vd tests for the loader, which seem to pass when run through dev/test.sh 😄.

I think that adding "any" saver is a good first step. Let me then create an issue for saving more complex column types and start working on it.

Owner

@saulpw saulpw left a comment


Thanks, @mr-majkel! I'll let @anjakefala merge when she's had a chance to review.

@anjakefala
Collaborator

Thank you so much @mr-majkel!

I think we may not be able to test the parquet saver... It seems there are small differences between the binary files, even when you save the same file. Maybe it is recording the modification date?

I am going to remove the parquet saver test, and then this looks good to go!

@anjakefala anjakefala merged commit e9c0d94 into saulpw:develop Oct 18, 2023
12 checks passed