Alternative output types with better compression and cross platform : .csv.gz & .parquet #20

Open
dfalster opened this issue Jun 30, 2023 · 1 comment
Labels
enhancement New feature or request

dfalster commented Jun 30, 2023

Our current plain-text output in AusTraits 4.2.1 is 260 MB, most of which (230 MB) is traits.csv. We use an *.rds file type for smaller distribution, which compresses to just 12 MB. But this only works for R users.

@cboettig pointed out two alternative compressed file types that are possibly better: .csv.gz and .parquet

Both are cross-platform and offer compression comparable to .rds. E.g.

  • traits.csv, which is 230 MB, could be compressed to 10.8 MB as traits.csv.gz, or 10.2 MB as traits.parquet

This will be particularly important when/if exporting wide format (see #19), which is

  • 4.2 GB as .csv
  • 72 MB as .csv.gz
  • 26 MB as .parquet

The Apache .parquet format (see https://en.wikipedia.org/wiki/Apache_Parquet) is rapidly emerging as a new standard, and is accessible in R via the arrow package.
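As a rough sketch of how such a size comparison could be reproduced, the snippet below writes one toy data frame in each of the four formats and reports the file sizes. The data frame, its column names, and the output paths are made up for illustration and are not the AusTraits build; real compression ratios will depend on the actual data.

```r
library(arrow)

# Toy stand-in for traits.csv; column names and values are illustrative only
traits <- data.frame(
  taxon_name = rep(c("Banksia serrata", "Acacia longifolia"), each = 5000),
  trait_name = rep(c("leaf_area", "plant_height"), times = 5000),
  value      = as.character(round(runif(10000), 3)),
  stringsAsFactors = FALSE
)

# Hypothetical output location (a temporary directory)
out_dir <- tempfile("austraits_demo_")
dir.create(out_dir)
paths <- file.path(out_dir, paste0("traits", c(".csv", ".csv.gz", ".rds", ".parquet")))
names(paths) <- c("csv", "csv.gz", "rds", "parquet")

# Plain csv
write.csv(traits, paths[["csv"]], row.names = FALSE)

# Writing through a gzfile() connection gzip-compresses the csv
con <- gzfile(paths[["csv.gz"]], "w")
write.csv(traits, con, row.names = FALSE)
close(con)

# R-only format, then Parquet via the arrow package
saveRDS(traits, paths[["rds"]])
write_parquet(traits, paths[["parquet"]])

# File sizes in bytes, one entry per format
sizes <- file.size(paths)
names(sizes) <- names(paths)
print(sizes)
```

The .csv.gz file here is written with base R only, so producing it needs no extra dependency; only the Parquet output requires arrow.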

@dfalster dfalster changed the title laternative output: .csv.gz & .parquet alternative output types with better compression and cross platform : .csv.gz & .parquet Jun 30, 2023
@dfalster

Note also, parquet can be read just like csv:

library(arrow)
library(dplyr)

# read_parquet() loads the whole file into memory as a data frame
x <- read_parquet("export/as_wide/austraits_wide.parquet")
y <- x %>%
  filter(taxon_name == "Banksia serrata")

You can also filter before reading. It's slower to read, but saves RAM by loading only the subset you want:

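# open_dataset() only scans the file; nothing is loaded yet
x <- arrow::open_dataset("export/as_wide/austraits_wide.parquet")

# the filter is applied when collect() pulls the matching rows into memory
y <- x %>%
  filter(taxon_name == "Banksia serrata") %>%
  collect()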

You can do this with csv too:

x <- arrow::open_dataset("export/as_wide/austraits_wide.csv.gz", format = "csv")
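To try the filter-before-read pattern without the full wide export, here is a minimal self-contained sketch on a toy Parquet file. The trait columns (leaf_area, plant_height) and their values are hypothetical, and the added select() step is an extra not mentioned above: it restricts which columns are loaded, in the same way filter() restricts rows.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the wide export; columns other than taxon_name are hypothetical
wide <- data.frame(
  taxon_name   = c("Banksia serrata", "Banksia serrata", "Acacia longifolia"),
  leaf_area    = c(1200, 1350, 400),
  plant_height = c(8, 10, 5)
)
path <- tempfile(fileext = ".parquet")
write_parquet(wide, path)

# Opening the dataset does not load any rows
ds <- open_dataset(path)

# Rows and columns are narrowed first; collect() then loads only that subset
y <- ds %>%
  filter(taxon_name == "Banksia serrata") %>%
  select(taxon_name, leaf_area) %>%
  collect()

print(y)
```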

@dfalster dfalster added the enhancement New feature or request label Aug 30, 2023
@yangsophieee yangsophieee changed the title alternative output types with better compression and cross platform : .csv.gz & .parquet Alternative output types with better compression and cross platform : .csv.gz & .parquet Sep 13, 2023