diff --git a/inst/WORDLIST b/inst/WORDLIST
index 288f396..f9dab00 100644
--- a/inst/WORDLIST
+++ b/inst/WORDLIST
@@ -1,15 +1,25 @@
 BROTLI
+BSON
 CMD
 DuckDB
+ENUM
 GZIP
 Gzip
+INTSXP
+JSON
+LGLSXP
 LLC
 LZ
 LZO
+MILLIS
 MacBook
 ORCID
 PBC
+POSIXct
+REALSXP
 RLE
+STRSXP
+UUID
 ZSTD
 Zstd
 codec
diff --git a/man/nanoparquet-types.Rd b/man/nanoparquet-types.Rd
index 210d2de..8e01500 100644
--- a/man/nanoparquet-types.Rd
+++ b/man/nanoparquet-types.Rd
@@ -8,8 +8,48 @@ How nanoparquet maps R types to Parquet types.
 }
 \section{R's data types}{
 When writing out a data frame, nanoparquet maps R's data types to Parquet
-logical types. This is how the mapping is performed.
+logical types. The following table is a summary of the mapping. For the
+details see below.\tabular{llcl}{
+ R type \tab Parquet type \tab Default \tab Notes \cr
+ character \tab STRING (BYTE_ARRAY) \tab x \tab I.e. STRSXP. Converted to UTF-8. \cr
+ " \tab BYTE_ARRAY \tab \tab \cr
+ " \tab FIXED_LEN_BYTE_ARRAY \tab \tab \cr
+ " \tab ENUM \tab \tab \cr
+ " \tab UUID \tab \tab \cr
+ Date \tab DATE \tab x \tab \cr
+ difftime \tab INT64 \tab x \tab If not hms::hms. Arrow metadata marks it as Duration(NS). \cr
+ factor \tab STRING \tab x \tab Arrow metadata marks it as a factor. \cr
+ " \tab ENUM \tab \tab \cr
+ hms::hms \tab TIME(true, MILLIS) \tab x \tab Sub-millisecond precision is lost. \cr
+ integer \tab INT(32, true) \tab x \tab I.e. INTSXP. \cr
+ " \tab INT64 \tab \tab \cr
+ " \tab INT96 \tab \tab \cr
+ " \tab DECIMAL (INT32) \tab \tab \cr
+ " \tab DECIMAL (INT64) \tab \tab \cr
+ " \tab INT(8, *) \tab \tab \cr
+ " \tab INT(16, *) \tab \tab \cr
+ " \tab INT(32, signed) \tab \tab \cr
+ list \tab BYTE_ARRAY \tab \tab Must be a list of raw vectors. Missing values are \code{NULL}. \cr
+ " \tab FIXED_LEN_BYTE_ARRAY \tab \tab Must be a list of raw vectors of the same length. Missing values are \code{NULL}. \cr
+ logical \tab BOOLEAN \tab x \tab I.e. LGLSXP. \cr
+ numeric \tab DOUBLE \tab x \tab I.e. REALSXP.
\cr
+ " \tab INT96 \tab \tab \cr
+ " \tab FLOAT \tab \tab \cr
+ " \tab DECIMAL (INT32) \tab \tab \cr
+ " \tab DECIMAL (INT64) \tab \tab \cr
+ " \tab INT(*, *) \tab \tab \cr
+ " \tab FLOAT16 \tab \tab \cr
+ POSIXct \tab TIMESTAMP(true, MICROS) \tab x \tab Sub-microsecond precision is lost. \cr
+}
+
+
+The non-default mappings can be selected via the \code{schema} argument. E.g.
+to write out a factor column called 'name' as \code{ENUM}, use
+
+\if{html}{\out{
}}\preformatted{write_parquet(..., schema = parquet_schema(name = "ENUM"))
+}\if{html}{\out{
}}
+The detailed mapping rules are listed below, in order of preference.
 These rules will likely change until nanoparquet reaches version 1.0.0.
 \enumerate{
 \item Factors (i.e. vectors that inherit the \emph{factor} class) are converted
@@ -70,9 +110,44 @@ non-default mappings are:
 \section{Parquet's data types}{
 When reading a Parquet file nanoparquet also relies on logical types and
 the Arrow metadata (if present, see below) in addition to the low level
-data types. The exact rules are below.
+data types. The following table summarizes the mappings. See more details
+below.\tabular{lll}{
+ Parquet type \tab R type \tab Notes \cr
+ \emph{Logical types} \tab \tab \cr
+ BSON \tab character \tab \cr
+ DATE \tab Date \tab \cr
+ DECIMAL \tab numeric \tab REALSXP, potentially losing precision. \cr
+ ENUM \tab character \tab \cr
+ FLOAT16 \tab numeric \tab REALSXP \cr
+ INT(8, *) \tab integer \tab \cr
+ INT(16, *) \tab integer \tab \cr
+ INT(32, *) \tab integer \tab Large unsigned values may overflow! \cr
+ INT(64, *) \tab numeric \tab REALSXP \cr
+ INTERVAL \tab list(raw) \tab Missing values are \code{NULL}. \cr
+ JSON \tab character \tab \cr
+ LIST \tab \tab Not supported. \cr
+ MAP \tab \tab Not supported. \cr
+ STRING \tab factor \tab If Arrow metadata says it is a factor. Also UTF8. \cr
+ " \tab character \tab Otherwise. Also UTF8. \cr
+ TIME \tab hms::hms \tab Also TIME_MILLIS and TIME_MICROS. \cr
+ TIMESTAMP \tab POSIXct \tab Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS. \cr
+ UUID \tab character \tab In \code{00112233-4455-6677-8899-aabbccddeeff} form. \cr
+ UNKNOWN \tab \tab Not supported. \cr
+ \emph{Primitive types} \tab \tab \cr
+ BOOLEAN \tab logical \tab \cr
+ BYTE_ARRAY \tab factor \tab If Arrow metadata says it is a factor. \cr
+ " \tab list(raw) \tab Otherwise. Missing values are \code{NULL}. \cr
+ DOUBLE \tab numeric \tab REALSXP \cr
+ FIXED_LEN_BYTE_ARRAY \tab list(raw) \tab Missing values are \code{NULL}.
\cr
+ FLOAT \tab numeric \tab REALSXP \cr
+ INT32 \tab integer \tab \cr
+ INT64 \tab numeric \tab REALSXP \cr
+ INT96 \tab POSIXct \tab \cr
+}
-These rules will likely change until nanoparquet reaches version 1.0.0.
+
+The exact rules are below. These rules will likely change until nanoparquet
+reaches version 1.0.0.
 \enumerate{
 \item The \code{BOOLEAN} type is read as a logical vector (\code{LGLSXP}).
 \item The \code{STRING} logical type and the \code{UTF8} converted type is read as
@@ -96,7 +171,7 @@ precision.
 \item The \code{ENUM} logical type is read as a character vector.
 \item The \code{UUID} logical type is read as a character vector that uses
 the \code{00112233-4455-6677-8899-aabbccddeeff} form.
-\item The \code{FLOAT16} logical type is read as a real vector (\code{READLSXP}).
+\item The \code{FLOAT16} logical type is read as a real vector (\code{REALSXP}).
 \item \code{BYTE_ARRAY} is read as a \emph{factor} object if the file was
 written by Arrow and the original data type of the column was a factor.
 (See 'The Arrow metadata below.)
diff --git a/tools/types.Rmd b/tools/types.Rmd
index 8b4ac0a..5a22f0c 100644
--- a/tools/types.Rmd
+++ b/tools/types.Rmd
@@ -1,8 +1,48 @@
 # R's data types
 When writing out a data frame, nanoparquet maps R's data types to Parquet
-logical types. This is how the mapping is performed.
-
+logical types. The following table is a summary of the mapping. For the
+details see below.
+
+R type | Parquet type | Default | Notes
+:--------|:------------------------|:-------:|:----------------------------------------
+character| STRING (BYTE_ARRAY) | x | I.e. STRSXP. Converted to UTF-8.
+" | BYTE_ARRAY | |
+" | FIXED_LEN_BYTE_ARRAY | |
+" | ENUM | |
+" | UUID | |
+Date | DATE | x |
+difftime | INT64 | x | If not hms::hms. Arrow metadata marks it as Duration(NS).
+factor | STRING | x | Arrow metadata marks it as a factor.
+" | ENUM | |
+hms::hms | TIME(true, MILLIS) | x | Sub-millisecond precision is lost.
+integer | INT(32, true) | x | I.e. INTSXP.
+" | INT64 | |
+" | INT96 | |
+" | DECIMAL (INT32) | |
+" | DECIMAL (INT64) | |
+" | INT(8, *) | |
+" | INT(16, *) | |
+" | INT(32, signed) | |
+list | BYTE_ARRAY | | Must be a list of raw vectors. Missing values are `NULL`.
+" | FIXED_LEN_BYTE_ARRAY | | Must be a list of raw vectors of the same length. Missing values are `NULL`.
+logical | BOOLEAN | x | I.e. LGLSXP.
+numeric | DOUBLE | x | I.e. REALSXP.
+" | INT96 | |
+" | FLOAT | |
+" | DECIMAL (INT32) | |
+" | DECIMAL (INT64) | |
+" | INT(*, *) | |
+" | FLOAT16 | |
+POSIXct | TIMESTAMP(true, MICROS) | x | Sub-microsecond precision is lost.
+
+The non-default mappings can be selected via the `schema` argument. E.g.
+to write out a factor column called 'name' as `ENUM`, use
+```r
+write_parquet(..., schema = parquet_schema(name = "ENUM"))
+```
+
+The detailed mapping rules are listed below, in order of preference.
 These rules will likely change until nanoparquet reaches version 1.0.0.
 1. Factors (i.e. vectors that inherit the *factor* class) are converted
@@ -61,9 +101,44 @@ non-default mappings are:
 When reading a Parquet file nanoparquet also relies on logical types and
 the Arrow metadata (if present, see below) in addition to the low level
-data types. The exact rules are below.
-
-These rules will likely change until nanoparquet reaches version 1.0.0.
+data types. The following table summarizes the mappings. See more details
+below.
+
+Parquet type | R type | Notes
+:--------------------|:----------|:---------------------------------------------
+*Logical types* | |
+BSON | character |
+DATE | Date |
+DECIMAL | numeric | REALSXP, potentially losing precision.
+ENUM | character |
+FLOAT16 | numeric | REALSXP
+INT(8, *) | integer |
+INT(16, *) | integer |
+INT(32, *) | integer | Large unsigned values may overflow!
+INT(64, *) | numeric | REALSXP
+INTERVAL | list(raw) | Missing values are `NULL`.
+JSON | character |
+LIST | | Not supported.
+MAP | | Not supported.
+STRING | factor | If Arrow metadata says it is a factor. Also UTF8.
+" | character | Otherwise. Also UTF8.
+TIME | hms::hms | Also TIME_MILLIS and TIME_MICROS.
+TIMESTAMP | POSIXct | Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS.
+UUID | character | In `00112233-4455-6677-8899-aabbccddeeff` form.
+UNKNOWN | | Not supported.
+*Primitive types* | |
+BOOLEAN | logical |
+BYTE_ARRAY | factor | If Arrow metadata says it is a factor.
+" | list(raw) | Otherwise. Missing values are `NULL`.
+DOUBLE | numeric | REALSXP
+FIXED_LEN_BYTE_ARRAY | list(raw) | Missing values are `NULL`.
+FLOAT | numeric | REALSXP
+INT32 | integer |
+INT64 | numeric | REALSXP
+INT96 | POSIXct |
+
+The exact rules are below. These rules will likely change until nanoparquet
+reaches version 1.0.0.
 1. The `BOOLEAN` type is read as a logical vector (`LGLSXP`).
 1. The `STRING` logical type and the `UTF8` converted type is read as
@@ -87,7 +162,7 @@ These rules will likely change until nanoparquet reaches version 1.0.0.
 1. The `ENUM` logical type is read as a character vector.
 1. The `UUID` logical type is read as a character vector that uses
 the `00112233-4455-6677-8899-aabbccddeeff` form.
-1. The `FLOAT16` logical type is read as a real vector (`READLSXP`).
+1. The `FLOAT16` logical type is read as a real vector (`REALSXP`).
 1. `BYTE_ARRAY` is read as a *factor* object if the file was
 written by Arrow and the original data type of the column was a factor.
 (See 'The Arrow metadata below.)