docs(python): Improved user guide for cloud functionality (#11646)
stinodego authored Oct 10, 2023
1 parent 35f3ff9 commit 8ece165
Showing 18 changed files with 313 additions and 102 deletions.
61 changes: 59 additions & 2 deletions CONTRIBUTING.md
@@ -151,8 +151,65 @@ The most important components of Polars documentation are the [user guide](https

### User guide

The user guide is maintained in the `docs` folder.
Further contributing information will be added shortly.
The user guide is maintained in the `docs/user-guide` folder. Before creating a PR, first raise an issue to discuss what you feel is missing or could be improved.

#### Building and serving the user guide

The user guide is built using [MkDocs](https://www.mkdocs.org/). Install the dependencies for building the user guide by running `make requirements` in the root of the repo.

Run `mkdocs serve` to build and serve the user guide so you can view it locally and see updates as you make changes.
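
For example, from the root of the repo:

```shell
$ make requirements
$ mkdocs serve
```

By default, MkDocs serves the site at `http://127.0.0.1:8000`.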

#### Creating a new user guide page

Each user guide page is based on a `.md` markdown file. This file must be listed in `mkdocs.yml`, as sketched below.
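
A hypothetical sketch of such a listing, assuming the page lives at `docs/user-guide/io/cloud-storage.md` (the exact nesting must match the existing `nav` tree in `mkdocs.yml`):

```yaml
nav:
  - User guide:
      - IO:
          - user-guide/io/cloud-storage.md
```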

#### Adding a shell code block

To add a code block for commands that are run in a shell, with tabs for Python and Rust, use the following format:

````
=== ":fontawesome-brands-python: Python"
```shell
$ pip install fsspec
```
=== ":fontawesome-brands-rust: Rust"
```shell
$ cargo add aws_sdk_s3
```
````

#### Adding a code block

The snippets for Python and Rust code blocks are in the `docs/src/python/` and `docs/src/rust/` directories, respectively. To add a code snippet with Python or Rust code to a `.md` page, use the following format:

```
{{code_block('user-guide/io/cloud-storage','read_parquet',[read_parquet,read_csv])}}
```

- The first argument is the shared path of the snippet files, referring to either or both of `docs/src/python/user-guide/io/cloud-storage.py` and `docs/src/rust/user-guide/io/cloud-storage.rs`.
- The second argument is the name given at the start and end of each snippet in the `.py` or `.rs` file.
- The third argument is a list of links to functions in the API docs. Each element of the list must have a corresponding entry in `docs/_build/API_REFERENCE_LINKS.yml` (see the example entry below).
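
For example, the simplest form of an entry in `API_REFERENCE_LINKS.yml` maps a function name to its API documentation URL:

```yaml
python:
  read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
```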

If both the `.py` and `.rs` snippet files exist, then every snippet named in the second argument to `code_block` above must be present in both files, or the build will fail. If a snippet is not needed in one language, add an empty snippet to the corresponding `.py` or `.rs` file (see the example below).
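
For example, the Rust snippet file for cloud storage (shown further down in this commit) fills in Python-only snippets with empty placeholders:

```
# --8<-- [start:scan_pyarrow_dataset]
# --8<-- [end:scan_pyarrow_dataset]
```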

Each snippet is formatted as follows:

```python
# --8<-- [start:read_parquet]
import polars as pl

df = pl.read_parquet("file.parquet")
# --8<-- [end:read_parquet]
```

The snippet is delimited by `--8<-- [start:<snippet_name>]` and `--8<-- [end:<snippet_name>]`. The snippet name must match the name given in the second argument to `code_block` above.

#### Linting

Before committing, install `dprint` (see above) and run `dprint fmt` from the `docs` directory to lint the markdown files.

### API reference

13 changes: 10 additions & 3 deletions docs/_build/API_REFERENCE_LINKS.yml
@@ -12,8 +12,7 @@ python:
write_csv: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_csv.html
read_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_json.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
read_ipc: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_ipc.html
min: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.min.html
max: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.max.html
value_counts: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.value_counts.html
@@ -65,6 +64,7 @@ python:
write_database:
name: write_database
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_database.html
read_database_uri: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_database_uri.html
read_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html
write_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html
scan_parquet: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html
@@ -73,6 +73,7 @@ python:
write_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_ndjson.html
write_json: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_json.html
scan_ndjson: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_ndjson.html
scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html
from_arrow:
name: from_arrow
link: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.from_arrow.html
@@ -197,7 +198,7 @@ rust:
feature_flags: ['json']
read_ndjson:
name: JsonLineReader
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson_core/ndjson/struct.JsonLineReader.html
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/ndjson/core/struct.JsonLineReader.html
feature_flags: ['json']
write_json:
name: JsonWriter
@@ -223,6 +224,12 @@ rust:
name: scan_parquet
link: https://pola-rs.github.io/polars/docs/rust/dev/polars/prelude/struct.LazyFrame.html#method.scan_parquet
feature_flags: ['parquet']
read_ipc:
name: IpcReader
link: https://pola-rs.github.io/polars/docs/rust/dev/polars_io/prelude/struct.IpcReader.html
feature_flags: ['ipc']
scan_pyarrow_dataset: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_pyarrow_dataset.html

min: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.min
max: https://pola-rs.github.io/polars/docs/rust/dev/polars/series/struct.Series.html#method.max
struct:
14 changes: 0 additions & 14 deletions docs/src/python/user-guide/io/aws.py

This file was deleted.

63 changes: 63 additions & 0 deletions docs/src/python/user-guide/io/cloud-storage.py
@@ -0,0 +1,63 @@
"""
# --8<-- [start:read_parquet]
import polars as pl
source = "s3://bucket/*.parquet"
df = pl.read_parquet(source)
# --8<-- [end:read_parquet]
# --8<-- [start:scan_parquet]
import polars as pl
source = "s3://bucket/*.parquet"
storage_options = {
"aws_access_key_id": "<secret>",
"aws_secret_access_key": "<secret>",
"aws_region": "us-east-1",
}
df = pl.scan_parquet(source, storage_options=storage_options)
# --8<-- [end:scan_parquet]
# --8<-- [start:scan_parquet_query]
import polars as pl
source = "s3://bucket/*.parquet"
df = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id", "value").collect()
# --8<-- [end:scan_parquet_query]
# --8<-- [start:scan_pyarrow_dataset]
import polars as pl
import pyarrow.dataset as ds
dset = ds.dataset("s3://my-partitioned-folder/", format="parquet")
(
pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("foo") == "a")
.select(["foo", "bar"])
.collect()
)
# --8<-- [end:scan_pyarrow_dataset]
# --8<-- [start:write_parquet]
import polars as pl
import s3fs
df = pl.DataFrame({
"foo": ["a", "b", "c", "d", "d"],
"bar": [1, 2, 3, 4, 5],
})
fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"
# write parquet
with fs.open(destination, mode='wb') as f:
df.write_parquet(f)
# --8<-- [end:write_parquet]
"""
32 changes: 22 additions & 10 deletions docs/src/python/user-guide/io/database.py
@@ -1,32 +1,44 @@
"""
# --8<-- [start:read]
# --8<-- [start:read_uri]
import polars as pl
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_database(query=query, connection_uri=connection_uri)
# --8<-- [end:read]
pl.read_database_uri(query=query, uri=uri)
# --8<-- [end:read_uri]
# --8<-- [start:read_cursor]
import polars as pl
from sqlalchemy import create_engine
conn = create_engine("sqlite:///test.db")
query = "SELECT * FROM foo"
pl.read_database(query=query, connection=conn.connect())
# --8<-- [end:read_cursor]
# --8<-- [start:adbc]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_database(query=query, connection_uri=connection_uri, engine="adbc")
pl.read_database_uri(query=query, uri=uri, engine="adbc")
# --8<-- [end:adbc]
# --8<-- [start:write]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})
df.write_database(table_name="records", connection_uri=connection_uri)
df.write_database(table_name="records", uri=uri)
# --8<-- [end:write]
# --8<-- [start:write_adbc]
connection_uri = "postgres://username:password@server:port/database"
uri = "postgres://username:password@server:port/database"
df = pl.DataFrame({"foo": [1, 2, 3]})
df.write_database(table_name="records", connection_uri=connection_uri, engine="adbc")
df.write_database(table_name="records", uri=uri, engine="adbc")
# --8<-- [end:write_adbc]
"""
24 changes: 24 additions & 0 deletions docs/src/python/user-guide/io/json.py
@@ -0,0 +1,24 @@
# --8<-- [start:setup]
import polars as pl

# --8<-- [end:setup]

"""
# --8<-- [start:read]
df = pl.read_json("docs/data/path.json")
# --8<-- [end:read]
# --8<-- [start:readnd]
df = pl.read_ndjson("docs/data/path.json")
# --8<-- [end:readnd]
"""

# --8<-- [start:write]
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_json("docs/data/path.json")
# --8<-- [end:write]

# --8<-- [start:scan]
df = pl.scan_ndjson("docs/data/path.json")
# --8<-- [end:scan]
File renamed without changes.
@@ -1,5 +1,5 @@
"""
# --8<-- [start:bucket]
# --8<-- [start:read_parquet]
use aws_sdk_s3::Region;

use aws_config::meta::region::RegionProviderChain;
@@ -28,5 +28,18 @@ async fn main() {

println!("{:?}", df);
}
# --8<-- [end:bucket]
# --8<-- [end:read_parquet]

# --8<-- [start:scan_parquet]
# --8<-- [end:scan_parquet]

# --8<-- [start:scan_parquet_query]
# --8<-- [end:scan_parquet_query]

# --8<-- [start:scan_pyarrow_dataset]
# --8<-- [end:scan_pyarrow_dataset]

# --8<-- [start:write_parquet]
# --8<-- [end:write_parquet]

"""
File renamed without changes.
20 changes: 0 additions & 20 deletions docs/user-guide/io/aws.md

This file was deleted.

51 changes: 51 additions & 0 deletions docs/user-guide/io/cloud-storage.md
@@ -0,0 +1,51 @@
# Cloud storage

Polars can read from and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

=== ":fontawesome-brands-python: Python"

```shell
$ pip install fsspec s3fs adlfs gcsfs
```

=== ":fontawesome-brands-rust: Rust"

```shell
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full
```

## Reading from cloud storage

Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.

{{code_block('user-guide/io/cloud-storage','read_parquet',['read_parquet','read_csv','read_ipc'])}}

This eager query downloads the file to a buffer in memory and creates a `DataFrame` from there. Polars uses `fsspec` to manage this download internally for all cloud storage providers.
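
As a minimal sketch (assuming a hypothetical bucket and credentials available in the environment), the same eager pattern works for a CSV file:

```python
import polars as pl

# hypothetical bucket path; the eager read downloads the whole file
# into an in-memory buffer via fsspec before parsing it
source = "s3://bucket/file.csv"

df = pl.read_csv(source)
```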

## Scanning from cloud storage with query optimisation

Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide further details beyond the source URL, such as authentication credentials or the storage region. Polars looks for these as environment variables, but we can also set them manually by passing a `dict` as the `storage_options` argument.

{{code_block('user-guide/io/cloud-storage','scan_parquet',['scan_parquet'])}}

This query creates a `LazyFrame` without downloading the file. In the `LazyFrame` we have access to file metadata such as the schema. Polars uses the `object_store.rs` library internally to manage the interface with the cloud storage providers, so no extra dependencies are required in Python to scan a cloud Parquet file.
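
A minimal sketch (again with a hypothetical bucket) showing that metadata such as the schema is available before any data is collected:

```python
import polars as pl

# scanning creates a LazyFrame without downloading the data
lf = pl.scan_parquet("s3://bucket/*.parquet")

# file metadata such as the schema is already accessible
print(lf.schema)
```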

If we create a lazy query with [predicate and projection pushdowns](../lazy/optimizations.md), the query optimiser will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling `collect`.

{{code_block('user-guide/io/cloud-storage','scan_parquet_query',[])}}

## Scanning with PyArrow

We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets, such as those using Hive partitioning.

We first create a PyArrow dataset and then create a `LazyFrame` from the dataset.

{{code_block('user-guide/io/cloud-storage','scan_pyarrow_dataset',['scan_pyarrow_dataset'])}}

## Writing to cloud storage

We can write a `DataFrame` to cloud storage in Python using `s3fs` for S3, `adlfs` for Azure Blob Storage and `gcsfs` for Google Cloud Storage. In this example, we write a Parquet file to S3.

{{code_block('user-guide/io/cloud-storage','write_parquet',['write_parquet'])}}
