Commit

Bumped version.

jorgecarleitao committed Aug 28, 2021
1 parent 881816b commit 1d310ee

Showing 5 changed files with 71 additions and 22 deletions.
6 changes: 3 additions & 3 deletions .github_changelog_generator
```diff
@@ -1,7 +1,7 @@
-since-tag=v0.2.0
-future-release=v0.3.0
+since-tag=v0.3.0
+future-release=v0.4.0
 pr-wo-labels=false
-add-sections={"features":{"prefix":"**Enhancements:**","labels":["enhancement"]}, "documentation":{"prefix":"**Documentation updates:**","labels":["documentation"]}}
+add-sections={"features":{"prefix":"**Enhancements:**","labels":["enhancement"]}, "documentation":{"prefix":"**Documentation updates:**","labels":["documentation"]}, "testing":{"prefix":"**Testing updates:**","labels":["testing"]}}
 enhancement-label=**New features:**
 enhancement-labels=feature
 base=CHANGELOG.md
```
32 changes: 32 additions & 0 deletions CHANGELOG.md
```diff
@@ -1,5 +1,37 @@
 # Changelog
 
+## [v0.4.0](https://github.com/jorgecarleitao/parquet2/tree/v0.4.0) (2021-08-28)
+
+[Full Changelog](https://github.com/jorgecarleitao/parquet2/compare/v0.3.0...v0.4.0)
+
+**Breaking changes:**
+
+- Make `write_*` return the number of written bytes. [\#45](https://github.com/jorgecarleitao/parquet2/issues/45)
+- move `HybridRleDecoder` from `read::levels` to `encoding::hybrid_rle` [\#41](https://github.com/jorgecarleitao/parquet2/issues/41)
+- Simplified split of page buffer [\#37](https://github.com/jorgecarleitao/parquet2/pull/37) ([jorgecarleitao](https://github.com/jorgecarleitao))
+- Simplified API to get page iterator [\#36](https://github.com/jorgecarleitao/parquet2/pull/36) ([jorgecarleitao](https://github.com/jorgecarleitao))
+
+**New features:**
+
+- Added support to write to async writers. [\#35](https://github.com/jorgecarleitao/parquet2/pull/35) ([jorgecarleitao](https://github.com/jorgecarleitao))
+
+**Fixed bugs:**
+
+- Fixed edge case of a small bitpacked. [\#43](https://github.com/jorgecarleitao/parquet2/pull/43) ([jorgecarleitao](https://github.com/jorgecarleitao))
+- Fixed error in decoding RLE-hybrid. [\#40](https://github.com/jorgecarleitao/parquet2/pull/40) ([jorgecarleitao](https://github.com/jorgecarleitao))
+
+**Enhancements:**
+
+- Removed requirement of "Seek" on write. [\#44](https://github.com/jorgecarleitao/parquet2/pull/44) ([jorgecarleitao](https://github.com/jorgecarleitao))
+
+**Documentation updates:**
+
+- Added guide to read [\#38](https://github.com/jorgecarleitao/parquet2/pull/38) ([jorgecarleitao](https://github.com/jorgecarleitao))
+
+**Testing updates:**
+
+- Made tests deserializer use the correct decoder. [\#46](https://github.com/jorgecarleitao/parquet2/pull/46) ([jorgecarleitao](https://github.com/jorgecarleitao))
+
 ## [v0.3.0](https://github.com/jorgecarleitao/parquet2/tree/v0.3.0) (2021-08-09)
 
 [Full Changelog](https://github.com/jorgecarleitao/parquet2/compare/v0.2.0...v0.3.0)
```
2 changes: 1 addition & 1 deletion Cargo.toml
```diff
@@ -1,6 +1,6 @@
 [package]
 name = "parquet2"
-version = "0.3.0"
+version = "0.4.0"
 license = "Apache-2.0"
 description = "Safe implementation of parquet IO."
 authors = ["Jorge C. Leitao <jorgecarleitao@gmail.com>", "Apache Arrow <dev@arrow.apache.org>"]
```
40 changes: 26 additions & 14 deletions README.md
```diff
@@ -2,26 +2,28 @@
 
 This is a re-write of the official [`parquet` crate](https://crates.io/crates/parquet) with performance, parallelism and safety in mind.
 
-Checkout [the guide](https://jorgecarleitao.github.io/parquet2/) for details on how to use
-this crate to read parquet.
+Checkout [the guide](https://jorgecarleitao.github.io/parquet2/) for details
+on how to use this crate to read parquet.
 
 The five main differentiators in comparison with `parquet` are:
 * it uses `#![forbid(unsafe_code)]`
 * delegates parallelism downstream
 * decouples reading (IO intensive) from computing (CPU intensive)
 * it is faster (10-20x when reading to arrow format)
-* Is integration-tested against pyarrow 3 and (py)spark 3
+* supports `async` read and write.
+* It is integration-tested against pyarrow and (py)spark 3
 
 The overall idea is to offer the ability to read compressed parquet pages
 and a toolkit to decompress them to their favourite in-memory format.
 
-This allows this crate's iterators to perform _minimal_ CPU work, thereby maximizing throughput.
+This allows this crate's iterators to perform _minimal_ CPU work,
+thereby maximizing throughput.
 It is up to the consumers to decide whether they want to take advantage of this
 through parallelism at the expense of memory usage (e.g. decompress and deserialize
 pages in threads) or not.
 
-This crate cannot be used directly to read parquet (except metadata). To read data from parquet,
-checkout [arrow2](https://github.com/jorgecarleitao/arrow2).
+This crate cannot be used directly to read parquet (except metadata).
+To read data from parquet, checkout [arrow2](https://github.com/jorgecarleitao/arrow2).
 
 ## Functionality implemented
```
```diff
@@ -49,8 +51,9 @@ of them. They are:
 * [Delta length byte array](https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
 * [Delta strings](https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-strings-delta_byte_array--7)
 
-Delta-encodings are still experimental, as I have been unable to generate large pages encoded
-with them from spark, thereby hindering robust integration tests.
+Delta-encodings are still experimental, as I have been unable to
+generate large pages encoded with them from spark, thereby hindering
+robust integration tests.
 
 #### Encoding
```
```diff
@@ -90,12 +93,17 @@ before. This is only needed once (per change in the `integration-tests/integrati
 
 ## How to implement page readers
 
-The in-memory format used to consume parquet pages strongly influences how the pages should be deserialized. As such, this crate does not commit to a particular in-memory format. Consumers are responsible for converting pages to their target in-memory format.
+The in-memory format used to consume parquet pages strongly influences
+how the pages should be deserialized. As such, this crate does
+not commit to a particular in-memory format. Consumers are responsible
+for converting pages to their target in-memory format.
 
-This git repository contains a serialization to a simple in-memory format in `integration`, that is
+This git repository contains a serialization to a simple in-memory
+format in `integration`, that is
 used to validate integration with other implementations.
 
-There is also an implementation for the arrow format [here](https://github.com/jorgecarleitao/arrow2).
+There is also an implementation for the arrow format
+[here](https://github.com/jorgecarleitao/arrow2).
```
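
For context, the consumer-side conversion described above might look like the following sketch. `DecodedPage`, its fields, and `max_def_level` are hypothetical stand-ins, not parquet2 types; the point is only that the consumer owns the mapping from decoded levels and values to its target in-memory format:

```rust
/// Hypothetical stand-in for a decoded page; not a parquet2 type.
struct DecodedPage {
    /// decoded definition levels, one per value slot
    def_levels: Vec<u32>,
    /// decoded non-null values, in order
    values: Vec<i64>,
}

/// The consumer picks the target in-memory format; here a plain
/// `Vec<Option<i64>>`, but it could equally be an arrow array.
fn deserialize(page: &DecodedPage, max_def_level: u32) -> Vec<Option<i64>> {
    let mut values = page.values.iter();
    page.def_levels
        .iter()
        .map(|&level| {
            // for a flat schema, a slot is non-null exactly when its
            // definition level is maximal
            if level == max_def_level {
                values.next().copied()
            } else {
                None
            }
        })
        .collect()
}
```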

### Higher Parallelism

````diff
@@ -116,7 +124,9 @@ for column in columns {
 let columns_from_all_groups = handles.join_all();
 ```
 
-this will read the file as quickly as possible in the main thread and send CPU-intensive work to other threads, thereby maximizing IO reads (at the cost of storing multiple compressed pages in memory; buffering is also an option here).
+this will read the file as quickly as possible in the main thread and send CPU-intensive work to other threads, thereby maximizing IO reads
+(at the cost of storing multiple compressed pages in memory;
+buffering is also an option here).
````
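
Spelled out, that pattern could look like the sketch below: IO stays on the main thread, and one spawned thread per column does the CPU work. `read_compressed_pages` and `decompress_and_deserialize` are hypothetical placeholders for the crate's page-reading and decoding steps, not parquet2 APIs:

```rust
use std::thread;

// Hypothetical placeholder for the IO-bound step: seek + read the
// compressed pages of one column chunk. Not a parquet2 function.
fn read_compressed_pages(_column: usize) -> Vec<Vec<u8>> {
    vec![vec![0u8; 1024]; 4]
}

// Hypothetical placeholder for the CPU-bound steps: decompress,
// decode and deserialize. Not a parquet2 function.
fn decompress_and_deserialize(pages: Vec<Vec<u8>>) -> Vec<i64> {
    pages.iter().map(|p| p.len() as i64).collect()
}

fn main() {
    // main thread performs IO only and hands pages off immediately
    let handles: Vec<_> = (0..3)
        .map(|column| {
            let pages = read_compressed_pages(column);
            // CPU-intensive work runs on a separate thread per column
            thread::spawn(move || decompress_and_deserialize(pages))
        })
        .collect();

    // joining preserves column order via the handle order
    let columns_from_all_groups: Vec<Vec<i64>> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();
    println!("deserialized {} columns", columns_from_all_groups.len());
}
```

The trade-off named in the paragraph is visible here: every compressed page of a column is held in memory while its thread works; a bounded channel would cap that.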

## Decoding flow

```diff
@@ -128,8 +138,10 @@ Generally, a parquet file is read as follows:
 
 This is IO-intensive, requires parsing thrift, and seeking within a file.
 
-Once a compressed page is loaded into memory, it can be decompressed, decoded and deserialized into a specific in-memory format. All of these operations are CPU-intensive
-and are thus left to consumers to perform, as they may want to send this work to threads.
+Once a compressed page is loaded into memory, it can be decompressed, decoded
+and deserialized into a specific in-memory format. All of these
+operations are CPU-intensive and are thus left to consumers to perform,
+as they may want to send this work to threads.
 
 `read -> compressed page -> decompressed page -> decoded bytes -> deserialized`
```
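
As a sketch of how those stages compose (all types and functions below are hypothetical placeholders chosen to mirror the flow, not parquet2's API):

```rust
// Hypothetical stage types mirroring the flow above; not parquet2 types.
struct CompressedPage(Vec<u8>);
struct DecompressedPage(Vec<u8>);

// IO-bound: produce the next compressed page (in parquet2 this is
// driven by a page iterator over a column chunk).
fn read_page() -> CompressedPage {
    CompressedPage(vec![0u8; 64])
}

// CPU-bound from here on; each step is a pure transformation, so a
// consumer may run it on any thread.
fn decompress(page: CompressedPage) -> DecompressedPage {
    // stand-in for e.g. snappy/gzip decompression
    DecompressedPage(page.0)
}

fn decode(page: &DecompressedPage) -> Vec<i64> {
    // stand-in for e.g. hybrid RLE / bitpacking decoding of levels and values
    page.0.iter().map(|&b| b as i64).collect()
}

fn main() {
    // read -> compressed page -> decompressed page -> decoded -> deserialized
    let decoded = decode(&decompress(read_page()));
    println!("decoded {} values", decoded.len());
}
```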
13 changes: 9 additions & 4 deletions guide/src/README.md
```diff
@@ -3,7 +3,8 @@
 Parquet2 is a rust library to interact with the
 [parquet format](https://en.wikipedia.org/wiki/Apache_Parquet), welcome to its guide!
 
-This guide describes on how to efficiently and safely read and write to and from parquet.
+This guide describes on how to efficiently and safely read and write
+to and from parquet.
 Before starting, there are two concepts to introduce in the context of this guide:
 
 * IO-bound operations: perform either disk reads or network calls (e.g. s3)
```
````diff
@@ -17,18 +18,22 @@ operations are not.
 
 ## Metadata
 
-The starting point of reading a parquet file is reading its metadata (at the end of the file).
-To do so, we offer two functions, `parquet2::read::read_metadata`, for sync reads:
+The starting point of reading a parquet file is reading its
+metadata (at the end of the file).
+To do so, we offer two functions for `sync` and `async`:
 
+#### Sync
+
+`parquet2::read::read_metadata` for `sync` reads:
+
 ```rust,no_run,noplayground
 {{#include ../../examples/read_metadata.rs:metadata}}
 ```
 
-and `parquet2::read::read_metadata_async`, for async reads (using `tokio::fs` as example):
+#### Async
+
+and `parquet2::read::read_metadata_async`, for async reads
+(using `tokio::fs` as example):
 
 ```rust
 {{#include ../../examples/read_metadata_async/src/main.rs}}
````
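
For orientation, a sketch of calling the two functions named above. The signatures here are assumptions: `read_metadata` is assumed to accept any `Read + Seek` reader, `read_metadata_async` a futures-style async reader (so the tokio file goes through `tokio-util`'s compat adapter), and `row_groups` is assumed to be a public field of the returned metadata:

```rust
use std::fs::File;

// Sync: assumes `read_metadata` accepts any `Read + Seek`.
fn read_sync() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = File::open("data.parquet")?;
    let metadata = parquet2::read::read_metadata(&mut reader)?;
    println!("row groups: {}", metadata.row_groups.len());
    Ok(())
}

// Async: assumes `read_metadata_async` takes a futures-style
// `AsyncRead + AsyncSeek`; the tokio file is adapted with the
// `tokio-util` crate ("compat" feature).
async fn read_async() -> Result<(), Box<dyn std::error::Error>> {
    use tokio_util::compat::TokioAsyncReadCompatExt;
    let mut reader = tokio::fs::File::open("data.parquet").await?.compat();
    let metadata = parquet2::read::read_metadata_async(&mut reader).await?;
    println!("row groups: {}", metadata.row_groups.len());
    Ok(())
}
```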
