diff --git a/inst/WORDLIST b/inst/WORDLIST
index 288f396..f9dab00 100644
--- a/inst/WORDLIST
+++ b/inst/WORDLIST
@@ -1,15 +1,25 @@
BROTLI
+BSON
CMD
DuckDB
+ENUM
GZIP
Gzip
+INTSXP
+JSON
+LGLSXP
LLC
LZ
LZO
+MILLIS
MacBook
ORCID
PBC
+POSIXct
+REALSXP
RLE
+STRSXP
+UUID
ZSTD
Zstd
codec
diff --git a/man/nanoparquet-types.Rd b/man/nanoparquet-types.Rd
index 210d2de..8e01500 100644
--- a/man/nanoparquet-types.Rd
+++ b/man/nanoparquet-types.Rd
@@ -8,8 +8,48 @@ How nanoparquet maps R types to Parquet types.
}
\section{R's data types}{
When writing out a data frame, nanoparquet maps R's data types to Parquet
-logical types. This is how the mapping is performed.
+logical types. The following table is a summary of the mapping. For the
+details see below.\tabular{llcl}{
+ R type \tab Parquet type \tab Default \tab Notes \cr
+ character \tab STRING (BYTE_ARRAY) \tab x \tab I.e. STRSXP. Converted to UTF-8. \cr
+ " \tab BYTE_ARRAY \tab \tab \cr
+ " \tab FIXED_LEN_BYTE_ARRAY \tab \tab \cr
+ " \tab ENUM \tab \tab \cr
+ " \tab UUID \tab \tab \cr
+ Date \tab DATE \tab x \tab \cr
+ difftime \tab INT64 \tab x \tab If not hms::hms. Arrow metadata marks it as Duration(NS). \cr
+ factor \tab STRING \tab x \tab Arrow metadata marks it as a factor. \cr
+ " \tab ENUM \tab \tab \cr
+ hms::hms \tab TIME(true, MILLIS) \tab x \tab Sub-millisecond precision is lost. \cr
+ integer \tab INT(32, true) \tab x \tab I.e. INTSXP. \cr
+ " \tab INT64 \tab \tab \cr
+ " \tab INT96 \tab \tab \cr
+ " \tab DECIMAL (INT32) \tab \tab \cr
+ " \tab DECIMAL (INT64) \tab \tab \cr
+ " \tab INT(8, *) \tab \tab \cr
+ " \tab INT(16, *) \tab \tab \cr
+ " \tab INT(32, signed) \tab \tab \cr
+ list \tab BYTE_ARRAY \tab \tab Must be a list of raw vectors. Missing values are \code{NULL}. \cr
+ " \tab FIXED_LEN_BYTE_ARRAY \tab \tab Must be a list of raw vectors of the same length. Missing values are \code{NULL}. \cr
+ logical \tab BOOLEAN \tab x \tab I.e. LGLSXP. \cr
+ numeric \tab DOUBLE \tab x \tab I.e. REALSXP. \cr
+ " \tab INT96 \tab \tab \cr
+ " \tab FLOAT \tab \tab \cr
+ " \tab DECIMAL (INT32) \tab \tab \cr
+ " \tab DECIMAL (INT64) \tab \tab \cr
+ " \tab INT(*, *) \tab \tab \cr
+ " \tab FLOAT16 \tab \tab \cr
+ POSIXct \tab TIMESTAMP(true, MICROS) \tab x \tab Sub-microsecond precision is lost. \cr
+}
+
+
+The non-default mappings can be selected via the \code{schema} argument. E.g.
+to write out a factor column called 'name' as \code{ENUM}, use
+
+\if{html}{\out{<div class="sourceCode r">}}\preformatted{write_parquet(..., schema = parquet_schema(name = "ENUM"))
+}\if{html}{\out{</div>}}
+The detailed mapping rules are listed below, in order of preference.
These rules will likely change until nanoparquet reaches version 1.0.0.
\enumerate{
\item Factors (i.e. vectors that inherit the \emph{factor} class) are converted
@@ -70,9 +110,44 @@ non-default mappings are:
\section{Parquet's data types}{
When reading a Parquet file nanoparquet also relies on logical types and
the Arrow metadata (if present, see below) in addition to the low level
-data types. The exact rules are below.
+data types. The following table summarizes the mappings. See more details
+below.\tabular{lll}{
+ Parquet type \tab R type \tab Notes \cr
+ \emph{Logical types} \tab \tab \cr
+ BSON \tab character \tab \cr
+ DATE \tab Date \tab \cr
+ DECIMAL \tab numeric \tab REALSXP, potentially losing precision. \cr
+ ENUM \tab character \tab \cr
+ FLOAT16 \tab numeric \tab REALSXP \cr
+ INT(8, *) \tab integer \tab \cr
+ INT(16, *) \tab integer \tab \cr
+ INT(32, *) \tab integer \tab Large unsigned values may overflow! \cr
+ INT(64, *) \tab numeric \tab REALSXP \cr
+ INTERVAL \tab list(raw) \tab Missing values are \code{NULL}. \cr
+ JSON \tab character \tab \cr
+ LIST \tab \tab Not supported. \cr
+ MAP \tab \tab Not supported. \cr
+ STRING \tab factor \tab If Arrow metadata says it is a factor. Also UTF8. \cr
+ " \tab character \tab Otherwise. Also UTF8. \cr
+ TIME \tab hms::hms \tab Also TIME_MILLIS and TIME_MICROS. \cr
+ TIMESTAMP \tab POSIXct \tab Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS. \cr
+ UUID \tab character \tab In \code{00112233-4455-6677-8899-aabbccddeeff} form. \cr
+ UNKNOWN \tab \tab Not supported. \cr
+ \emph{Primitive types} \tab \tab \cr
+ BOOLEAN \tab logical \tab \cr
+ BYTE_ARRAY \tab factor \tab If Arrow metadata says it is a factor. \cr
+ " \tab list(raw) \tab Otherwise. Missing values are \code{NULL}. \cr
+ DOUBLE \tab numeric \tab REALSXP \cr
+ FIXED_LEN_BYTE_ARRAY \tab list(raw) \tab Missing values are \code{NULL}. \cr
+ FLOAT \tab numeric \tab REALSXP \cr
+ INT32 \tab integer \tab \cr
+ INT64 \tab numeric \tab REALSXP \cr
+ INT96 \tab POSIXct \tab \cr
+}
-These rules will likely change until nanoparquet reaches version 1.0.0.
+
+The exact rules are below. These rules will likely change until nanoparquet
+reaches version 1.0.0.
\enumerate{
\item The \code{BOOLEAN} type is read as a logical vector (\code{LGLSXP}).
\item The \code{STRING} logical type and the \code{UTF8} converted type is read as
@@ -96,7 +171,7 @@ precision.
\item The \code{ENUM} logical type is read as a character vector.
\item The \code{UUID} logical type is read as a character vector that uses the
\code{00112233-4455-6677-8899-aabbccddeeff} form.
-\item The \code{FLOAT16} logical type is read as a real vector (\code{READLSXP}).
+\item The \code{FLOAT16} logical type is read as a real vector (\code{REALSXP}).
\item \code{BYTE_ARRAY} is read as a \emph{factor} object if the file was written
by Arrow and the original data type of the column was a factor.
(See 'The Arrow metadata below.)
diff --git a/tools/types.Rmd b/tools/types.Rmd
index 8b4ac0a..5a22f0c 100644
--- a/tools/types.Rmd
+++ b/tools/types.Rmd
@@ -1,8 +1,48 @@
# R's data types
When writing out a data frame, nanoparquet maps R's data types to Parquet
-logical types. This is how the mapping is performed.
-
+logical types. The following table is a summary of the mapping. For the
+details see below.
+
+R type | Parquet type | Default | Notes
+:--------|:------------------------|:-------:|:----------------------------------------
+character| STRING (BYTE_ARRAY) | x | I.e. STRSXP. Converted to UTF-8.
+" | BYTE_ARRAY | |
+" | FIXED_LEN_BYTE_ARRAY | |
+" | ENUM | |
+" | UUID | |
+Date | DATE | x |
+difftime | INT64 | x | If not hms::hms. Arrow metadata marks it as Duration(NS).
+factor | STRING | x | Arrow metadata marks it as a factor.
+" | ENUM | |
+hms::hms | TIME(true, MILLIS) | x | Sub-millisecond precision is lost.
+integer | INT(32, true) | x | I.e. INTSXP.
+" | INT64 | |
+" | INT96 | |
+" | DECIMAL (INT32) | |
+" | DECIMAL (INT64) | |
+" | INT(8, *) | |
+" | INT(16, *) | |
+" | INT(32, signed) | |
+list | BYTE_ARRAY | | Must be a list of raw vectors. Missing values are `NULL`.
+" | FIXED_LEN_BYTE_ARRAY | | Must be a list of raw vectors of the same length. Missing values are `NULL`.
+logical | BOOLEAN | x | I.e. LGLSXP.
+numeric | DOUBLE | x | I.e. REALSXP.
+" | INT96 | |
+" | FLOAT | |
+" | DECIMAL (INT32) | |
+" | DECIMAL (INT64) | |
+" | INT(*, *) | |
+" | FLOAT16 | |
+POSIXct | TIMESTAMP(true, MICROS) | x | Sub-microsecond precision is lost.
+
+The non-default mappings can be selected via the `schema` argument. E.g.
+to write out a factor column called 'name' as `ENUM`, use
+```r
+write_parquet(..., schema = parquet_schema(name = "ENUM"))
+```
+
+The detailed mapping rules are listed below, in order of preference.
These rules will likely change until nanoparquet reaches version 1.0.0.
1. Factors (i.e. vectors that inherit the *factor* class) are converted
@@ -61,9 +101,44 @@ non-default mappings are:
When reading a Parquet file nanoparquet also relies on logical types and
the Arrow metadata (if present, see below) in addition to the low level
-data types. The exact rules are below.
-
-These rules will likely change until nanoparquet reaches version 1.0.0.
+data types. The following table summarizes the mappings. See more details
+below.
+
+Parquet type | R type | Notes
+:--------------------|:----------|:---------------------------------------------
+*Logical types* | |
+BSON | character |
+DATE | Date |
+DECIMAL | numeric | REALSXP, potentially losing precision.
+ENUM | character |
+FLOAT16 | numeric | REALSXP
+INT(8, *) | integer |
+INT(16, *) | integer |
+INT(32, *) | integer | Large unsigned values may overflow!
+INT(64, *) | numeric | REALSXP
+INTERVAL | list(raw) | Missing values are `NULL`.
+JSON | character |
+LIST | | Not supported.
+MAP | | Not supported.
+STRING | factor | If Arrow metadata says it is a factor. Also UTF8.
+" | character | Otherwise. Also UTF8.
+TIME | hms::hms | Also TIME_MILLIS and TIME_MICROS.
+TIMESTAMP | POSIXct | Also TIMESTAMP_MILLIS and TIMESTAMP_MICROS.
+UUID | character | In `00112233-4455-6677-8899-aabbccddeeff` form.
+UNKNOWN | | Not supported.
+*Primitive types* | |
+BOOLEAN | logical |
+BYTE_ARRAY | factor | If Arrow metadata says it is a factor.
+" | list(raw) | Otherwise. Missing values are `NULL`.
+DOUBLE | numeric | REALSXP
+FIXED_LEN_BYTE_ARRAY | list(raw) | Missing values are `NULL`.
+FLOAT | numeric | REALSXP
+INT32 | integer |
+INT64 | numeric | REALSXP
+INT96 | POSIXct |
+
+The exact rules are below. These rules will likely change until nanoparquet
+reaches version 1.0.0.
1. The `BOOLEAN` type is read as a logical vector (`LGLSXP`).
1. The `STRING` logical type and the `UTF8` converted type is read as
@@ -87,7 +162,7 @@ These rules will likely change until nanoparquet reaches version 1.0.0.
1. The `ENUM` logical type is read as a character vector.
1. The `UUID` logical type is read as a character vector that uses the
`00112233-4455-6677-8899-aabbccddeeff` form.
-1. The `FLOAT16` logical type is read as a real vector (`READLSXP`).
+1. The `FLOAT16` logical type is read as a real vector (`REALSXP`).
1. `BYTE_ARRAY` is read as a *factor* object if the file was written
by Arrow and the original data type of the column was a factor.
(See 'The Arrow metadata below.)